GPTBot, ClaudeBot and PerplexityBot: A 2026 AI Crawler Configuration Guide

Jim Ng
The 11 AI crawlers a Singapore site will see in 2026, grouped by purpose

Training
  • GPTBot (OpenAI): ingests content for model training. `User-agent: GPTBot`
  • ClaudeBot (Anthropic): training data collection. `User-agent: ClaudeBot`
  • Google-Extended (Google): opts content in/out of Gemini training. `User-agent: Google-Extended`
  • CCBot (Common Crawl): third-party dataset feeding many LLMs. `User-agent: CCBot`
  • anthropic-ai (Anthropic): legacy training crawler, still active. `User-agent: anthropic-ai`
  • Bytespider (ByteDance): feeds Doubao + TikTok AI. `User-agent: Bytespider`

If you have read our multi-engine GEO playbook, you know each AI engine retrieves and cites differently. This article is the technical layer beneath it: how each engine actually accesses your site, which user agent to allow, which to block, and what the resulting robots.txt should look like for a Singapore SEO setup in 2026.

The headline shift in 2026 is that the major LLM vendors have split their crawlers into training fleets and search fleets. OpenAI did it first with GPTBot (training) and OAI-SearchBot (search). Anthropic followed with the ClaudeBot, Claude-SearchBot, Claude-User trio. This split matters because you can now express a nuanced position in robots.txt that was impossible 12 months ago: do not train on my content, but do cite me when a user asks.

Why Every SEO Team Needs an AI Crawler Policy in 2026

Until late 2024, the assumption was that AI crawlers were a sub-problem of regular crawl management. Block the obvious bad actors, allow Googlebot and Bingbot, move on. That stance no longer holds. Three things changed. First, AI engines are now a meaningful traffic surface in their own right. Even if direct click-through from AI citations is small (1 to 5 percent of comparable classical SERP traffic), the brand mention impact and the downstream branded search lift are significant for B2B and high-consideration consumer categories. Second, the training-versus-search split means publishers can now opt out of model training without disappearing from AI answers. Third, the bots are no longer optional infrastructure: they generate real bandwidth, real log noise, and (for ecommerce sites) real cart/checkout junk hits when misconfigured. A documented AI crawler policy, encoded in `robots.txt` and reviewed quarterly, is now part of technical SEO hygiene at the same level as canonical tags or hreflang.

The Training Bot vs Search Bot Distinction (2026's Most Important Robots.txt Change)

This is the single concept that drives every decision below.
The two purposes an AI vendor crawls your site for, and how they use the data

Training Crawlers

Goal: ingest your text into the next model version
  • What they take: raw page content, ingested into training datasets
  • How it shows up: embedded in model weights, may be regurgitated without attribution
  • You get cited? Rarely. Training does not preserve URLs.
  • Examples: GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider

Search/Retrieval Crawlers

Goal: index your page to cite in live answers
  • What they take: indexed snapshots used at retrieval time
  • How it shows up: as a clickable citation card in the AI answer
  • You get cited? Yes. URL is preserved and shown to the user.
  • Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User
The practical decision: blocking training crawlers protects your content from being absorbed into models with no attribution. Blocking search crawlers makes you invisible inside the AI answer surface. Most publishers want the first and not the second. A handful of edge cases want both blocked (paywalled news, premium research, regulated PII), and a smaller handful want both allowed (high-volume content brands betting on training-set inclusion as a long-term authority play). For everyone in between, the "block training, allow search" pattern is the default.

The Default 2026 robots.txt Pattern (Block Training, Allow Search)

This is the configuration we apply to the majority of SEO clients by default unless they have a specific reason for a different policy. Drop it into your existing `robots.txt`, after your standard Googlebot/Bingbot rules.

```
# === AI TRAINING CRAWLERS: BLOCKED ===
# Do not absorb our content into model training datasets.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# === AI SEARCH/RETRIEVAL CRAWLERS: ALLOWED ===
# Let AI engines retrieve and cite us in live answers.

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /
```

A few notes on what this does and does not do. The `Allow: /` lines for search bots are not strictly required when the file has no catch-all rules (a bot with no matching group and no `User-agent: *` group is allowed everywhere), but we keep them for two reasons: they document intent for the next engineer who reads the file, and they preempt accidental over-blocking from a `User-agent: *` `Disallow` rule elsewhere in the file. A crawler that finds a group naming it follows that group and ignores the `User-agent: *` group, so if your `robots.txt` has a `User-agent: *` block with broad `Disallow` rules, the explicit `Allow` for AI search bots ensures they still get the content paths you want them to see. The flip side is that any path you do want closed to these bots, such as `/checkout/`, has to be repeated inside their own groups.
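One way to sanity-check a policy like this before deploying it is to parse it locally. The sketch below uses Python's standard-library `urllib.robotparser`. The shortened `ROBOTS_TXT` string, the `can_fetch` helper and the `example.com` URLs are illustrative assumptions, and Python's parser only approximates how each vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative policy: a broad catch-all group plus two AI bot groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


results = {
    "GPTBot /blog/": can_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/blog/"),
    "OAI-SearchBot /blog/": can_fetch(ROBOTS_TXT, "OAI-SearchBot", "https://example.com/blog/"),
    "OAI-SearchBot /search/": can_fetch(ROBOTS_TXT, "OAI-SearchBot", "https://example.com/search/"),
    "Bingbot /search/": can_fetch(ROBOTS_TXT, "Bingbot", "https://example.com/search/"),
}

for check, allowed in results.items():
    print(f"{check}: {'allowed' if allowed else 'blocked'}")
```

In this sketch OAI-SearchBot can still fetch `/search/` even though the catch-all group disallows it, because a bot with its own group does not inherit the `User-agent: *` rules. Bingbot, which has no group of its own here, falls back to the catch-all and is blocked from `/search/`.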

The Full Crawler Reference Table

Here is the complete 2026 reference for every AI crawler a Singapore commercial site is likely to encounter, plus the legitimate Googlebot/Bingbot baseline for context.
Every AI crawler user agent string, vendor, and recommended default action (2026)
| User-agent | Vendor | Purpose | Default action |
|---|---|---|---|
| GPTBot | OpenAI | Training | Block |
| OAI-SearchBot | OpenAI | ChatGPT Search index | Allow |
| ChatGPT-User | OpenAI | Live fetch in chat | Allow |
| ClaudeBot | Anthropic | Training | Block |
| anthropic-ai | Anthropic | Legacy training | Block |
| Claude-SearchBot | Anthropic | Claude search index | Allow |
| Claude-User | Anthropic | Live fetch in chat | Allow |
| PerplexityBot | Perplexity | Perplexity index | Allow |
| Perplexity-User | Perplexity | Live fetch in answer | Allow |
| Google-Extended | Google | Opts in/out Gemini training | Block |
| Googlebot | Google | Classical + AI Overviews | Allow |
| Bingbot | Microsoft | Bing + ChatGPT/Copilot fallback | Allow |
| CCBot | Common Crawl | Third-party LLM training data | Block |
| Bytespider | ByteDance | Doubao + TikTok training | Block |
| Amazonbot | Amazon | Alexa + Rufus training | Block |
| Applebot-Extended | Apple | Apple Intelligence training | Block |
| Meta-ExternalAgent | Meta | Llama training | Block |
| Applebot | Apple | Siri + Spotlight (search) | Allow |

Two notes on this table. First, the "Default action" column is a sensible default, not a universal answer. The decision matrix in the next section covers when to deviate. Second, the user-agent strings here are the canonical published values as of Q2 2026. Vendors do change them. Re-check the official documentation pages once a quarter if you maintain a sensitive site.
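Because the strings change, it can help to keep the policy as data and generate the `robots.txt` groups from it, so a quarterly update is a one-line edit. The following is a minimal sketch of that idea, not a tool we ship: `AI_CRAWLER_POLICY` is a hypothetical mapping holding a subset of the rows above, and `render_ai_rules` is an illustrative helper name.

```python
# Hypothetical policy map: a subset of the reference table, keyed by user-agent token.
AI_CRAWLER_POLICY = {
    "GPTBot": "block",
    "OAI-SearchBot": "allow",
    "ChatGPT-User": "allow",
    "ClaudeBot": "block",
    "Claude-SearchBot": "allow",
    "Google-Extended": "block",
    "CCBot": "block",
}


def render_ai_rules(policy: dict[str, str]) -> str:
    """Render one robots.txt group per crawler, separated by blank lines."""
    groups = []
    for user_agent, action in policy.items():
        if action not in ("block", "allow"):
            raise ValueError(f"unknown action for {user_agent}: {action!r}")
        rule = "Disallow: /" if action == "block" else "Allow: /"
        groups.append(f"User-agent: {user_agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


rendered = render_ai_rules(AI_CRAWLER_POLICY)
print(rendered)
```

The output is meant to be appended to an existing `robots.txt` after the standard Googlebot/Bingbot rules. Each crawler gets its own group so that no rule silently falls under the wrong user agent.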

The Block vs Allow Decision Matrix

The default policy above suits most commercial Singapore sites. Five categories of site should deviate.

When to deviate from "block training, allow search" defaults
| Site type | Recommended policy | Why |
|---|---|---|
| Brand-new commercial site (under 6 months, low authority) | Allow everything | Visibility now matters more than model-training opt-out. Re-evaluate at month 12. |
| Established commercial site (most clients) | Block training, allow search (default) | Protects IP, preserves AI citation visibility. |
| News, premium research, paywalled content | Block training and search (or licence to vendors) | Each cited paragraph competes with your subscription model. |
| User-generated content (forums, reviews) | Block training, allow search, block both on user pages | Privacy and consent: your users did not opt their content into model training. |
| Health, finance, legal regulated content | Block training, allow search, monitor citations | Hallucinated paraphrases of regulated content are a liability surface. Track what gets cited. |
| Singapore Government/statutory body sites | Allow search, training case-by-case | Public information benefits from AI distribution. Training opt-out is a separate policy choice. |

A common mistake: treating the decision as binary across the whole site. You can use path-specific rules to allow AI access to your blog and marketing pages while blocking it from `/account/`, `/checkout/`, `/admin/` or any user-data area. Use the same `Disallow: /path/` syntax under each AI bot's `User-agent` block.
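As a rough illustration of path-specific rules, the sketch below defines one group for a single AI search bot and checks it with Python's standard-library `urllib.robotparser`. The `PATH_RULES` group, the sample paths and the `example.com` host are assumptions for the example. The `Disallow` lines are listed before `Allow: /` because Python's parser applies the first matching rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical group: open the public site to one AI search bot,
# but keep it out of account, checkout and admin areas.
PATH_RULES = """\
User-agent: OAI-SearchBot
Disallow: /account/
Disallow: /checkout/
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(PATH_RULES.splitlines())


def allowed(path: str) -> bool:
    """Return True if OAI-SearchBot may fetch the given path."""
    return parser.can_fetch("OAI-SearchBot", f"https://example.com{path}")


checks = {
    path: allowed(path)
    for path in ("/blog/ai-crawlers/", "/account/orders", "/checkout/", "/admin/login")
}

for path, ok in checks.items():
    print(f"{path}: {'allowed' if ok else 'blocked'}")
```

The same three `Disallow` lines would need to be repeated under every other AI bot group that should stay out of those areas.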

Verifying AI Crawler Activity in Your Server Logs

Robots.txt is a request, not an enforcement layer. The well-behaved AI bots (GPTBot, ClaudeBot, PerplexityBot, the OpenAI/Anthropic search variants) honour it. Verifying that they actually do, and identifying any that ignore it, is a server log job.

The 4-step quarterly AI crawler audit
1. Pull 30 days of server access logs
   Filter for user agents containing "GPT", "Claude", "Perplexity", "Bytespider", "CCBot", "Anthropic", "OAI-Search", "Meta-External". Cloudflare, Fastly and most CDNs expose these in their analytics dashboard already.

2. Verify IP origin against published ranges
   OpenAI, Anthropic and Perplexity each publish IP ranges for their bots. A request claiming to be GPTBot from outside OpenAI's published range is a spoofer, not a real OpenAI crawler. Cloudflare's "Verified Bots" list does this verification automatically.

3. Cross-reference against your robots.txt policy
   For every bot you have set to "Block", confirm hits dropped to near-zero in the 7 days after deployment. If GPTBot is still hitting `/` in volume after a `Disallow: /` rule, your robots.txt is malformed (or being ignored by a non-OpenAI spoofer using the GPTBot user agent).

4. Track citation impact in AI engines
   If you blocked a search bot in the audit and citations dropped on Perplexity/ChatGPT/Claude in the following 30 days, your blocking decision is working as intended. If you blocked training but want to verify search visibility is unaffected, run the multi-engine baseline test from our GEO playbook.
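Steps 1 and 2 of the audit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production log pipeline: it assumes Combined Log Format with an IP address in the first field, the user-agent strings in `SAMPLE_LOG` are simplified, and `PUBLISHED_RANGES` holds placeholder documentation addresses that must be replaced with the ranges each vendor actually publishes.

```python
import ipaddress
import re
from collections import Counter
from typing import Optional, Tuple

# Substrings from step 1, lower-cased for case-insensitive matching.
AI_AGENT_MARKERS = ("gpt", "claude", "perplexity", "bytespider", "ccbot",
                    "anthropic", "oai-search", "meta-external")

# HYPOTHETICAL range (an RFC 5737 documentation block), keyed by marker.
PUBLISHED_RANGES = {"gpt": [ipaddress.ip_network("192.0.2.0/24")]}

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$'
)


def classify(line: str) -> Optional[Tuple[str, str]]:
    """Return (marker, status) for an AI-crawler hit, or None for other traffic."""
    match = LOG_LINE.match(line.strip())
    if not match:
        return None
    ua = match["ua"].lower()
    marker = next((m for m in AI_AGENT_MARKERS if m in ua), None)
    if marker is None:
        return None
    ranges = PUBLISHED_RANGES.get(marker)
    if ranges is None:
        return marker, "unchecked"  # no range list loaded for this vendor
    ip = ipaddress.ip_address(match["ip"])
    status = "verified" if any(ip in net for net in ranges) else "suspect"
    return marker, status


# Made-up sample lines with simplified user-agent strings.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Apr/2026:10:00:00 +0800] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [01/Apr/2026:10:00:05 +0800] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [01/Apr/2026:10:00:09 +0800] "GET /pricing HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.9 - - [01/Apr/2026:10:00:12 +0800] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/125.0"',
]

counts = Counter(hit for hit in map(classify, SAMPLE_LOG) if hit)
for (marker, status), hits in sorted(counts.items()):
    print(f"{marker:12} {status:10} {hits}")
```

A "suspect" row is a request that claims an AI crawler's user agent from outside the loaded ranges, which is the spoofer case described in step 2. "Unchecked" rows simply mean no range list was supplied for that vendor.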

For sites with no log access and no CDN-level analytics, the practical proxy is monitoring AI engine citations for your top buyer-intent queries. If citations drop on a specific engine after a robots.txt change, that engine's crawler was the one being affected.

Edge Cases and Anti-Patterns We See on Singapore Audits

A few patterns recur on the technical SEO audits we run for prospective clients. None are catastrophic on their own, but they accumulate.

Blocking PerplexityBot accidentally via overly broad `User-agent: *` rules. A `Disallow: /search/` or `Disallow: /api/` under the universal user-agent block applies to PerplexityBot too whenever PerplexityBot has no group of its own. If you want PerplexityBot in but want a path blocked for everyone else, give PerplexityBot its own `User-agent` block and list the path there with `Allow:`.

Trusting unverified bots. Spoofers crawl Singapore commercial sites claiming to be GPTBot or PerplexityBot to look benign. The user-agent string is trivial to forge. Verify by IP origin (Cloudflare, Akamai, Fastly all do this), or block the spoofers at the WAF layer.

Forgetting `Google-Extended`. Several Singapore sites we audited had blocked GPTBot and ClaudeBot but still allowed Google-Extended, the user agent that controls whether Google can use your content for Gemini training. If your policy is "no AI training", Google-Extended belongs in the block list.

Treating robots.txt as a security boundary. It is not. Stealth crawlers, fine-tuning pipelines using third-party datasets, and any bot operator who chooses to ignore robots.txt will continue to access your content. For real enforcement, you need WAF rules, bot-management services (Cloudflare Bot Management, DataDome, HUMAN), or authentication.

Forgetting to set the `User-agent` line above each `Disallow`. A common copy-paste error. Each bot's rules need their own `User-agent:` declaration. Without it, the `Disallow` falls under whichever previous user agent was declared, which is rarely what you want.
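Two of these anti-patterns, a forgotten `Google-Extended` and a missing `User-agent` line, can be caught with a quick local check. The sketch below assumes a "no AI training" policy and uses Python's standard-library `urllib.robotparser`; `TRAINING_BOTS`, `FLAWED_ROBOTS_TXT` and `unblocked_training_bots` are illustrative names, and the flawed file is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Training crawlers named in this guide's block list (subset).
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
                 "CCBot", "Bytespider"]

# Invented file showing two mistakes: the second Disallow has lost its
# "User-agent: ClaudeBot" line, and Google-Extended is not listed at all.
FLAWED_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
# ClaudeBot (User-agent line lost in a copy-paste)
Disallow: /
"""


def unblocked_training_bots(robots_txt: str) -> list[str]:
    """Return the training bots that may still fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in TRAINING_BOTS
            if parser.can_fetch(bot, "https://example.com/")]


still_allowed = unblocked_training_bots(FLAWED_ROBOTS_TXT)
print("Training bots not blocked:", ", ".join(still_allowed))
```

Here the orphaned `Disallow: /` is absorbed into the GPTBot group, so every training bot except GPTBot is reported as still allowed.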

For the schema and crawl-budget hygiene that pairs with this AI crawler policy, our schema markup deep-dive covers the structured-data layer, and our Core Web Vitals guide covers the performance layer.

Frequently Asked Questions

Should I block GPTBot in 2026?

Block GPTBot if you want your content kept out of OpenAI's training datasets. Blocking GPTBot does not affect ChatGPT Search visibility because ChatGPT Search uses a separate crawler (OAI-SearchBot) and the Bing index. The default we apply to most Singapore commercial clients is to block GPTBot but allow OAI-SearchBot and ChatGPT-User, which protects training-data IP while preserving ChatGPT citation visibility.

What is the difference between ClaudeBot and Claude-SearchBot?

ClaudeBot is Anthropic's training crawler: it ingests pages into datasets used to train future Claude models. Claude-SearchBot is the retrieval crawler that powers Claude's web search feature: it indexes pages so Claude can cite them in live answers. There is also Claude-User, which fetches a specific page when a user asks Claude to visit a URL in conversation. Block ClaudeBot to opt out of training, allow Claude-SearchBot and Claude-User to stay visible inside Claude's answers.

Does PerplexityBot crawl my site by default?

Yes, unless your robots.txt blocks it. PerplexityBot is well-behaved and honours `Disallow` rules. If you want to be cited inside Perplexity answers, ensure no `Disallow: /` rule sits under a `User-agent: PerplexityBot` block, and ensure your top-of-file `User-agent: *` rules are not accidentally restricting paths Perplexity needs. We see overly broad `Disallow` rules cost Perplexity visibility on roughly 1 in 4 Singapore sites we audit.

Will blocking AI crawlers hurt my Google rankings?

No. Googlebot is a separate user agent from Google-Extended. Blocking Google-Extended opts your content out of Gemini model training but does not affect classical Google search rankings or Google AI Overviews retrieval. Googlebot continues to crawl your site for the regular search index. Confirm this by reviewing Google's published documentation on the Google-Extended user agent.

How often should I update my AI crawler robots.txt rules?

Quarterly at minimum. Vendors add new bots regularly: ClaudeBot was renamed and split into three crawlers in 2025, OAI-SearchBot only became prominent in late 2024, Meta-ExternalAgent is a 2025 addition. If you maintain a sensitive site (health, finance, regulated content, paywalled news), monthly is more appropriate. Subscribe to OpenAI, Anthropic and Google's official crawler documentation pages to catch announced changes.

Can I block AI crawlers using meta tags instead of robots.txt?

Partially. The `noai` and `noimageai` meta directives, plus the `data-nosnippet` HTML attribute, give per-page control for crawlers that honour them. Coverage is inconsistent: `noai` and `noimageai` are not part of any formal standard, Google-Extended is controlled through robots.txt rather than a meta directive, and support among other vendors varies, so check each vendor's documentation before relying on them. For site-wide policy, robots.txt remains the reliable layer in 2026. Use meta directives for page-specific exceptions on top of a robots.txt baseline.

Jim Ng, Founder of Best SEO Singapore
Founder of Best Marketing Agency and Best SEO Singapore. Started in 2019 cold-calling 70 businesses a day, scaled to 14, then leaned out to a 9-person AI-first team serving 146+ clients across 43 industries. Acquired Singapore Florist in 2024 and grew it to #1 rankings for competitive keywords. Every SEO strategy ships with his personal review.
