Question 1

What is the difference between a training crawler and a user-fetch agent?

Accepted Answer

Training crawlers like GPTBot and ClaudeBot harvest pages in bulk to build model training corpora, so blocking them keeps your content out of future models. User-fetch agents like ChatGPT-User and Claude-User retrieve one specific page live because a human asked the assistant about it, much like that person opening your page in a browser. Many sites block training crawlers but allow user fetches, since the latter can drive real visits and citations.
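The distinction can be made concrete with a small lookup. This is only a sketch: the `BOT_CATEGORIES` map and the `classify_user_agent` helper are hypothetical names, the categories come from the bot names mentioned in this FAQ, and the matching is a plain substring test on a self-declared header.

```python
# Hypothetical category map, built only from bot names mentioned in this FAQ.
BOT_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
}


def classify_user_agent(user_agent: str) -> str:
    """Return 'training', 'user-fetch', or 'unknown' for a User-Agent header."""
    ua = user_agent.lower()
    for token, category in BOT_CATEGORIES.items():
        # Case-insensitive substring match on the bot's product token.
        if token.lower() in ua:
            return category
    return "unknown"


# Illustrative header values, not copied from any vendor's documentation.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

Because the header is self-declared, a result from a function like this is a hint about intent, not proof of identity.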

Question 2

Does blocking these bots in robots.txt actually work?

Accepted Answer

For the major vendors, yes — OpenAI, Anthropic, Google and Meta document that their named crawlers honor robots.txt, and observed behavior matches. But robots.txt is a convention, not enforcement: bots flagged as unreliable in this dataset have been repeatedly observed ignoring it or crawling under generic user agents. For those, the dependable defense is blocking by user-agent and verified IP range at your CDN, WAF or server config.

Question 3

What is Google-Extended if it isn't a real crawler?

Accepted Answer

Google-Extended is a robots.txt control token, not a distinct bot. Googlebot crawls your site once for everything; the Google-Extended directive tells Google not to use that crawled content for training Gemini models. This design means you can opt out of AI training without losing search indexing — disallowing Google-Extended has no effect on your search ranking, whereas disallowing Googlebot itself would remove you from search. Applebot-Extended works the same way for Apple Intelligence.
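As a rough illustration of how a compliant parser reads such a file, the sketch below uses Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are assumptions for the example, and the snippet shows only how the rules are matched per token, not how Google processes the directive internally.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: opt out of the Google-Extended token, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"  # placeholder URL
extended_allowed = parser.can_fetch("Google-Extended", url)
googlebot_allowed = parser.can_fetch("Googlebot", url)

print(extended_allowed)   # False: the token is disallowed
print(googlebot_allowed)  # True: falls through to the wildcard group
```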

Question 4

Should I block AI crawlers at all?

Accepted Answer

It is a genuine trade-off, not an obvious yes. Blocking training crawlers protects content you monetize directly: paywalled journalism, premium documentation, licensable data. But answer engines increasingly drive discovery, so being absent from ChatGPT search and Perplexity citations is becoming as costly as being absent from Google. A common middle path is to allow search and user-fetch agents (OAI-SearchBot, Claude-SearchBot, ChatGPT-User) while disallowing pure training collectors (GPTBot, CCBot, Bytespider), which this generator's per-bot checkboxes express directly.
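The middle path described here maps onto a very small robots.txt. The sketch below is an assumption about what such output could look like, not the generator's actual implementation; `build_robots_txt` is a hypothetical helper name.

```python
def build_robots_txt(blocked: list[str]) -> str:
    """Emit one Disallow group per blocked bot, then allow everyone else."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    # Bots not listed above (e.g. search and user-fetch agents) stay allowed.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


# Block the pure training collectors named in the answer above.
robots = build_robots_txt(["GPTBot", "CCBot", "Bytespider"])
print(robots)
```

Leaving OAI-SearchBot, Claude-SearchBot and ChatGPT-User out of the blocked list means they fall under the wildcard group and remain allowed.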

Question 5

Can a scraper just fake its user-agent to get around all this?

Accepted Answer

Trivially — the user-agent is a self-declared request header. That is why the detector's verdict comes with a spoofing caveat and why serious enforcement is two-layered: robots.txt for compliant bots, and IP verification for the rest. Major vendors publish their crawler IP ranges (OpenAI, Anthropic and Google all do), so a request claiming to be GPTBot from an IP outside OpenAI's published blocks is an impersonator you can drop at the firewall.
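The IP check can be sketched with Python's standard `ipaddress` module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not real vendor data, and `PUBLISHED_RANGES` and `is_impersonator` are hypothetical names; a real deployment would load each vendor's currently published list instead.

```python
import ipaddress

# Placeholder ranges from reserved documentation blocks, NOT real vendor data.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_impersonator(claimed_bot: str, remote_ip: str) -> bool:
    """True if the request claims a known bot but comes from outside its ranges."""
    ranges = PUBLISHED_RANGES.get(claimed_bot)
    if ranges is None:
        # No published ranges on file for this bot, so this check cannot decide.
        return False
    addr = ipaddress.ip_address(remote_ip)
    return not any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


print(is_impersonator("GPTBot", "192.0.2.10"))   # False: inside the listed block
print(is_impersonator("GPTBot", "203.0.113.5"))  # True: outside the listed block
```

In practice this check would usually run at the CDN, WAF or firewall rather than in application code, with the range list refreshed on a schedule.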

Vendor	OpenAI
Purpose	Training data collection for GPT models
Respects robots.txt	Yes (documented)

LLM Crawler Detector

Bot	Vendor	robots.txt compliance
GPTBot	OpenAI	respects
OAI-SearchBot	OpenAI	respects
ChatGPT-User	OpenAI	respects
ClaudeBot	Anthropic	respects
Claude-User	Anthropic	respects
Claude-SearchBot	Anthropic	respects
anthropic-ai	Anthropic	respects
PerplexityBot	Perplexity	respects
Perplexity-User	Perplexity	unreliable
Google-Extended	Google	respects
Applebot-Extended	Apple	respects
Bytespider	ByteDance	unreliable
CCBot	Common Crawl	respects
cohere-ai	Cohere	unreliable
Meta-ExternalAgent	Meta	respects
Meta-ExternalFetcher	Meta	unreliable
Amazonbot	Amazon	respects
FacebookBot	Meta	respects
Diffbot	Diffbot	unreliable
TimpiBot	Timpi	respects