First published: 6 May 2026 · Last updated: 6 May 2026
Why Every SEO Team Needs an AI Crawler Policy in 2026
Until late 2024, the assumption was that AI crawlers were a sub-problem of regular crawl management. Block the obvious bad actors, allow Googlebot and Bingbot, move on. That stance no longer holds. Three things changed. First, AI engines are now a meaningful traffic surface in their own right. Even if direct click-through from AI citations is small (1 to 5 percent of comparable classical SERP traffic), the brand mention impact and the downstream branded search lift are significant for B2B and high-consideration consumer categories. Second, the training-versus-search split means publishers can now opt out of model training without disappearing from AI answers. Third, the bots are no longer optional infrastructure: they generate real bandwidth, real log noise, and (for ecommerce sites) real cart/checkout junk hits when misconfigured. A documented AI crawler policy, encoded in `robots.txt` and reviewed quarterly, is now part of technical SEO hygiene at the same level as canonical tags or hreflang.
The Training Bot vs Search Bot Distinction (2026's Most Important Robots.txt Change)
This is the single concept that drives every decision below.
Training Crawlers
- What they take: raw page content, ingested into training datasets
- How it shows up: embedded in model weights, may be regurgitated without attribution
- You get cited? Rarely. Training does not preserve URLs.
- Examples: GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider
Search/Retrieval Crawlers
- What they take: indexed snapshots used at retrieval time
- How it shows up: as a clickable citation card in the AI answer
- You get cited? Yes. URL is preserved and shown to the user.
- Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User
The Default 2026 robots.txt Pattern (Block Training, Allow Search)
This is the configuration we apply to the majority of SEO clients by default unless they have a specific reason for a different policy. Drop it into your existing `robots.txt`, after your standard Googlebot/Bingbot rules.
```
# === AI TRAINING CRAWLERS: BLOCKED ===
# Do not absorb our content into model training datasets.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# === AI SEARCH/RETRIEVAL CRAWLERS: ALLOWED ===
# Let AI engines retrieve and cite us in live answers.
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: PerplexityBot-User
Allow: /
```
A few notes on what this does and does not do. The `Allow: /` lines for search bots are not strictly required when nothing else in the file restricts those bots (with no `Disallow` in play, leaving them out is functionally identical), but we keep them for two reasons: they document intent for the next engineer who reads the file, and they preempt accidental over-blocking from a `User-agent: *` `Disallow` rule elsewhere in the file. A crawler follows the most specific `User-agent` group that matches it, so if your `robots.txt` has a `User-agent: *` block with broad `Disallow` rules, the explicit `Allow` group for each AI search bot ensures they still get the content paths you want them to see.
The Full Crawler Reference Table
Here is the complete 2026 reference for every AI crawler a Singapore commercial site is likely to encounter, plus the legitimate Googlebot/Bingbot baseline for context.
| User-agent | Vendor | Purpose | Default action |
|---|---|---|---|
| GPTBot | OpenAI | Training | Block |
| OAI-SearchBot | OpenAI | ChatGPT Search index | Allow |
| ChatGPT-User | OpenAI | Live fetch in chat | Allow |
| ClaudeBot | Anthropic | Training | Block |
| anthropic-ai | Anthropic | Legacy training | Block |
| Claude-SearchBot | Anthropic | Claude search index | Allow |
| Claude-User | Anthropic | Live fetch in chat | Allow |
| PerplexityBot | Perplexity | Perplexity index | Allow |
| PerplexityBot-User | Perplexity | Live fetch in answer | Allow |
| Google-Extended | Google | Opts in/out of Gemini training | Block |
| Googlebot | Google | Classical + AI Overviews | Allow |
| Bingbot | Microsoft | Bing + ChatGPT/Copilot fallback | Allow |
| CCBot | Common Crawl | Third-party LLM training data | Block |
| Bytespider | ByteDance | Doubao + TikTok training | Block |
| Amazonbot | Amazon | Alexa + Rufus training | Block |
| Applebot-Extended | Apple | Apple Intelligence training | Block |
| Meta-ExternalAgent | Meta | Llama training | Block |
| Applebot | Apple | Siri + Spotlight (search) | Allow |
Two notes on this table. First, the "Default action" column is a sensible default, not a universal answer. The Block vs Allow Decision Matrix section below covers when to deviate. Second, the user-agent strings here are the canonical published values as of Q2 2026. Vendors do change them. Re-check the official documentation pages once a quarter if you maintain a sensitive site.
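If you would rather keep the table as the source of truth and generate the rules from it, a short script is enough. The sketch below is one possible approach, not a required workflow: the `POLICY` mapping and the `render_robots_txt` helper are our own illustrative names, and the mapping copies only the AI-specific rows from the table above.

```python
# Hypothetical policy mapping: user-agent token -> default action.
# Tokens are copied from the reference table; the structure is an assumption.
POLICY = {
    "GPTBot": "block",
    "ClaudeBot": "block",
    "anthropic-ai": "block",
    "Google-Extended": "block",
    "CCBot": "block",
    "Bytespider": "block",
    "Amazonbot": "block",
    "Applebot-Extended": "block",
    "Meta-ExternalAgent": "block",
    "OAI-SearchBot": "allow",
    "ChatGPT-User": "allow",
    "Claude-SearchBot": "allow",
    "Claude-User": "allow",
    "PerplexityBot": "allow",
}


def render_robots_txt(policy):
    """Render one User-agent group per bot, blocked bots first."""
    directive = {"block": "Disallow: /", "allow": "Allow: /"}
    groups = []
    for action in ("block", "allow"):
        agents = sorted(a for a, act in policy.items() if act == action)
        for agent in agents:
            # Each bot gets its own User-agent line followed by its rule.
            groups.append(f"User-agent: {agent}\n{directive[action]}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(POLICY))
```

Append the output to your existing `robots.txt` after the Googlebot/Bingbot rules. Keeping the mapping in version control also gives you a diff to review at each quarterly check.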
The Block vs Allow Decision Matrix
The default policy above suits most commercial Singapore sites. Five categories of site should deviate.
A common mistake: treating the decision as binary across the whole site. You can use path-specific rules to allow AI access to your blog and marketing pages while blocking it from `/account/`, `/checkout/`, `/admin/` or any user-data area. Use the same `Disallow: /path/` syntax under each AI bot's `User-agent` block.
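Path-specific rules are easy to get subtly wrong, so it helps to test them before deploying. The sketch below uses Python's standard-library `urllib.robotparser` to check a sample rule set. The rule set, the `example.com` URLs and the `is_allowed` helper are illustrative assumptions, and real crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: a search bot kept out of user-data areas only,
# and a training bot blocked from the whole site.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Disallow: /account/
Disallow: /checkout/
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    checks = [
        ("OAI-SearchBot", "https://example.com/blog/post"),
        ("OAI-SearchBot", "https://example.com/checkout/cart"),
        ("GPTBot", "https://example.com/blog/post"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Running the same checks against your production file before and after an edit is a cheap way to catch a rule that blocks more than intended.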
Verifying AI Crawler Activity in Your Server Logs
Robots.txt is a request, not an enforcement layer. The well-behaved AI bots (GPTBot, ClaudeBot, PerplexityBot, the OpenAI/Anthropic search variants) honour it. Verifying that they actually do, and identifying any that ignore it, is a job for your server logs.
Pull 30 days of server access logs
Filter for user agents containing "GPT", "Claude", "Perplexity", "Bytespider", "CCBot", "Anthropic", "OAI-Search", "Meta-External". Cloudflare, Fastly and most CDNs expose these in their analytics dashboard already.
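If your CDN does not surface these numbers, the filter is simple to run yourself. The sketch below assumes access logs in the common "combined" format, where the user agent is the last quoted field on each line. The sample lines, IP addresses and the `count_ai_bot_hits` helper are made up for illustration.

```python
import re
from collections import Counter

# Substrings from the filter list above; matching is case-insensitive.
AI_BOT_MARKERS = [
    "GPT", "Claude", "Perplexity", "Bytespider",
    "CCBot", "Anthropic", "OAI-Search", "Meta-External",
]

# In combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Fabricated sample lines using reserved documentation IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [06/May/2026:10:00:00 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [06/May/2026:10:00:05 +0800] "GET /blog/ HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.12 - - [06/May/2026:10:00:09 +0800] "GET /blog/ HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.0.2.10 - - [06/May/2026:10:00:12 +0800] "GET /pricing HTTP/1.1" 200 700 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]


def count_ai_bot_hits(log_lines):
    """Count requests per marker; the first matching marker wins."""
    hits = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines without a quoted user-agent field
        user_agent = match.group(1).lower()
        for marker in AI_BOT_MARKERS:
            if marker.lower() in user_agent:
                hits[marker] += 1
                break
    return hits


if __name__ == "__main__":
    for marker, count in count_ai_bot_hits(SAMPLE_LOG).most_common():
        print(marker, count)
```

Note that broad markers group several bots together: "GPT" catches both GPTBot and ChatGPT-User, and "Claude" catches all three Anthropic crawlers. Swap in the full user-agent tokens if you need training and search traffic counted separately.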
Verify IP origin against published ranges
OpenAI, Anthropic and Perplexity each publish IP ranges for their bots. A request claiming to be GPTBot from outside OpenAI's published range is a spoofer, not a real OpenAI crawler. Cloudflare's "Verified Bots" list does this verification automatically.
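Where no CDN does the verification for you, the check itself is a CIDR membership test. The sketch below uses Python's standard `ipaddress` module. The `PUBLISHED_RANGES` values are placeholders from the reserved documentation ranges, not real vendor ranges, and `is_verified` is a hypothetical helper name; load the actual ranges from each vendor's published list.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), not real vendor data.
# Replace with the CIDR blocks each vendor publishes for its bots.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_verified(bot, client_ip):
    """Return True if client_ip falls inside a range listed for bot."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(bot, [])
    )


if __name__ == "__main__":
    # A request claiming to be GPTBot from inside the listed range.
    print(is_verified("GPTBot", "192.0.2.44"))
    # The same claim from an address outside it: treat as a spoofer.
    print(is_verified("GPTBot", "203.0.113.9"))
```

A bot with no entry in the mapping is treated as unverified, which is the safer default when a new user agent appears in your logs.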
Cross-reference against your robots.txt policy
For every bot you have set to "Block", confirm hits dropped to near-zero in the 7 days after deployment. If GPTBot is still hitting `/` in volume after a `Disallow: /` rule, your robots.txt is malformed (or being ignored by a non-OpenAI spoofer using the GPTBot user agent).
Track citation impact in AI engines
If you blocked a search bot in the audit and citations dropped on Perplexity/ChatGPT/Claude in the following 30 days, your blocking decision is working as intended. If you blocked training but want to verify search visibility is unaffected, run the multi-engine baseline test from our GEO playbook.
For sites with no log access and no CDN-level analytics, the practical proxy is monitoring AI engine citations for your top buyer-intent queries. If citations drop on a specific engine after a robots.txt change, that engine's crawler was the one being affected.
Edge Cases and Anti-Patterns We See on Singapore Audits
A few patterns recur on the technical SEO audits we run for prospective clients. None are catastrophic on their own, but they accumulate.
Blocking PerplexityBot accidentally via overly broad `User-agent: *` rules. A `Disallow: /search/` or `Disallow: /api/` under the universal user-agent block applies to PerplexityBot too. If you want PerplexityBot in but want a path blocked for everyone else, repeat the path under PerplexityBot's specific `User-agent` block with `Allow:`.
Trusting unverified bots. Spoofers crawl Singapore commercial sites claiming to be GPTBot or PerplexityBot to look benign. The user-agent string is trivial to forge. Verify by IP origin (Cloudflare, Akamai, Fastly all do this), or block the spoofers at the WAF layer.
Forgetting `Google-Extended`. Several Singapore sites we audited had blocked GPTBot and ClaudeBot but still allowed Google-Extended, the user agent that controls whether Google can use your content for Gemini training. If your policy is "no AI training", Google-Extended belongs in the block list.
Treating robots.txt as a security boundary. It is not. Stealth crawlers, fine-tuning pipelines using third-party datasets, and any bot operator who chooses to ignore robots.txt will continue to access your content. For real enforcement, you need WAF rules, bot-management services (Cloudflare Bot Management, DataDome, HUMAN), or authentication.
Forgetting to set the `User-agent` line above each `Disallow`. A common copy-paste error. Each bot's rules need their own `User-agent:` declaration. Without it, the `Disallow` falls under whichever previous user agent was declared, which is rarely what you want.
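The first anti-pattern above is the easiest to reproduce locally. The sketch below, again using Python's standard-library `urllib.robotparser`, shows a broad `User-agent: *` rule catching PerplexityBot, and a bot-specific group restoring access. The rule sets, the `example.com` URL and the `can_crawl` helper are illustrative assumptions, and Python's parser is only an approximation of how each vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# A broad rule under the universal group only.
BROAD_ONLY = """\
User-agent: *
Disallow: /search/
"""

# The same file with a PerplexityBot-specific group added.
WITH_OVERRIDE = BROAD_ONLY + """
User-agent: PerplexityBot
Allow: /search/
"""

URL = "https://example.com/search/results"


def can_crawl(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # With only the universal group, PerplexityBot inherits the block.
    print(can_crawl(BROAD_ONLY, "PerplexityBot", URL))
    # With its own group, PerplexityBot follows that group instead.
    print(can_crawl(WITH_OVERRIDE, "PerplexityBot", URL))
    # Bots without their own group still fall under the universal rule.
    print(can_crawl(WITH_OVERRIDE, "SomeOtherBot", URL))
```

One consequence worth remembering: once a bot has its own group, it stops reading the `User-agent: *` rules altogether, so any universal `Disallow` you still want applied to that bot has to be repeated inside its group.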
For the schema and crawl-budget hygiene that pairs with this AI crawler policy, our schema markup deep-dive covers the structured-data layer, and our Core Web Vitals guide covers the performance layer.
Frequently Asked Questions
Should I block GPTBot in 2026?
Block GPTBot if you want your content kept out of OpenAI's training datasets. Blocking GPTBot does not affect ChatGPT Search visibility because ChatGPT Search uses a separate crawler (OAI-SearchBot) and the Bing index. The default we apply to most Singapore commercial clients is to block GPTBot but allow OAI-SearchBot and ChatGPT-User, which protects training-data IP while preserving ChatGPT citation visibility.
What is the difference between ClaudeBot and Claude-SearchBot?
ClaudeBot is Anthropic's training crawler: it ingests pages into datasets used to train future Claude models. Claude-SearchBot is the retrieval crawler that powers Claude's web search feature: it indexes pages so Claude can cite them in live answers. There is also Claude-User, which fetches a specific page when a user asks Claude to visit a URL in conversation. Block ClaudeBot to opt out of training, allow Claude-SearchBot and Claude-User to stay visible inside Claude's answers.
Does PerplexityBot crawl my site by default?
Yes, unless your robots.txt blocks it. PerplexityBot is well-behaved and honours `Disallow` rules. If you want to be cited inside Perplexity answers, ensure no `Disallow: /` rule sits under a `User-agent: PerplexityBot` block, and ensure your top-of-file `User-agent: *` rules are not accidentally restricting paths Perplexity needs. We see overly broad `Disallow` rules cost Perplexity visibility on roughly 1 in 4 Singapore sites we audit.
Will blocking AI crawlers hurt my Google rankings?
No. Googlebot is a separate user agent from Google-Extended. Blocking Google-Extended opts your content out of Gemini model training but does not affect classical Google search rankings or Google AI Overviews retrieval. Googlebot continues to crawl your site for the regular search index. Confirm this by reviewing Google's published documentation on the Google-Extended user agent.
How often should I update my AI crawler robots.txt rules?
Quarterly at minimum. Vendors add new bots regularly: ClaudeBot was renamed and split into three crawlers in 2025, OAI-SearchBot only became prominent in late 2024, Meta-ExternalAgent is a 2025 addition. If you maintain a sensitive site (health, finance, regulated content, paywalled news), monthly is more appropriate. Subscribe to OpenAI, Anthropic and Google's official crawler documentation pages to catch announced changes.
Can I block AI crawlers using meta tags instead of robots.txt?
Partially. The `noai` and `noimageai` meta directives, plus `data-nosnippet`, give per-page control for crawlers that honour them. Coverage is inconsistent: Google-Extended honours `noai` on a page-by-page basis, OpenAI publishes its own per-page directive support, others do not yet. For site-wide policy, robots.txt remains the reliable layer in 2026. Use meta directives for page-specific exceptions on top of a robots.txt baseline.
