LLM Crawler Detector
Is that user agent an AI crawler? GPTBot, ClaudeBot and friends — plus robots.txt rules.
UA strings are trivially spoofed — for production verification also check the source IP against the vendor's published ranges.
AI crawler detected — OpenAI. Training data collection for GPT models. Documented as respecting robots.txt.
| Vendor | OpenAI |
| Purpose | Training data collection for GPT models |
| Respects robots.txt | Yes (documented) |
robots.txt generator — check a bot to block it
| Block | Bot | Vendor | robots.txt |
|---|---|---|---|
| [ ] | GPTBot | OpenAI | respects |
| [ ] | OAI-SearchBot | OpenAI | respects |
| [ ] | ChatGPT-User | OpenAI | respects |
| [ ] | ClaudeBot | Anthropic | respects |
| [ ] | Claude-User | Anthropic | respects |
| [ ] | Claude-SearchBot | Anthropic | respects |
| [ ] | anthropic-ai | Anthropic | respects |
| [ ] | PerplexityBot | Perplexity | respects |
| [ ] | Perplexity-User | Perplexity | unreliable |
| [ ] | Google-Extended | Google | respects |
| [ ] | Applebot-Extended | Apple | respects |
| [x] | Bytespider | ByteDance | unreliable |
| [ ] | CCBot | Common Crawl | respects |
| [x] | cohere-ai | Cohere | unreliable |
| [ ] | Meta-ExternalAgent | Meta | respects |
| [ ] | Meta-ExternalFetcher | Meta | unreliable |
| [ ] | Amazonbot | Amazon | respects |
| [ ] | FacebookBot | Meta | respects |
| [x] | Diffbot | Diffbot | unreliable |
| [ ] | TimpiBot | Timpi | respects |
Generated robots.txt (3 blocked)
    # AI crawler policy — generated locally at forg.pro/tools/llm-crawler-detector
    User-agent: Bytespider
    Disallow: /

    User-agent: cohere-ai
    Disallow: /

    User-agent: Diffbot
    Disallow: /

    # Explicitly allowed AI crawlers
    User-agent: GPTBot
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: Claude-User
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    User-agent: anthropic-ai
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Perplexity-User
    Allow: /

    User-agent: Google-Extended
    Allow: /

    User-agent: Applebot-Extended
    Allow: /

    User-agent: CCBot
    Allow: /

    User-agent: Meta-ExternalAgent
    Allow: /

    User-agent: Meta-ExternalFetcher
    Allow: /

    User-agent: Amazonbot
    Allow: /

    User-agent: FacebookBot
    Allow: /

    User-agent: TimpiBot
    Allow: /
robots.txt is a request, not enforcement. Bots flagged unreliable should be blocked by UA/IP at your CDN or WAF if you need a guarantee.
How it works
A growing share of your server traffic is AI: training crawlers harvesting content for model corpora, search indexers building answer-engine results, and live user-fetch agents retrieving a page because someone asked a chatbot about it. Each announces itself with a user-agent token (GPTBot, ClaudeBot, PerplexityBot, Bytespider), and each has different behavior, different value to you, and a different answer to the question "does it honor robots.txt?" This tool identifies them and generates the policy file to govern them.
Paste any user-agent string from your access logs and the detector matches it against an embedded dataset of twenty AI crawlers, returning the vendor, the crawl purpose, and a documented-compliance flag. The purpose distinction matters most: a hit from GPTBot means your content is being collected for training, while ChatGPT-User means a human is reading your page through an assistant right now — closer to a browser visit than a scrape. Sites that conflate the two end up blocking the traffic they actually wanted.
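The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the `Crawler` record, the `detect` function, the three-entry `CRAWLERS` excerpt, and the short purpose labels are all assumptions made for the example, and the sample user-agent string is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Crawler:
    token: str             # substring looked for in the user-agent header
    vendor: str
    purpose: str           # short label, assumed for this sketch
    respects_robots: bool  # documented-compliance flag


# Hypothetical three-entry excerpt; the tool's embedded dataset has twenty.
CRAWLERS = [
    Crawler("GPTBot", "OpenAI", "training", True),
    Crawler("ChatGPT-User", "OpenAI", "user fetch", True),
    Crawler("Bytespider", "ByteDance", "training", False),
]


def detect(user_agent: str) -> Optional[Crawler]:
    """Return the first crawler whose token appears in the UA, else None."""
    ua = user_agent.lower()
    for crawler in CRAWLERS:
        if crawler.token.lower() in ua:
            return crawler
    return None


sample = "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
hit = detect(sample)
if hit is not None:
    print(hit.vendor, hit.purpose, hit.respects_robots)
```

Case-insensitive substring matching is a deliberate simplification here: it catches the token wherever it sits in the header, but it says nothing about whether the request is genuine.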
The robots.txt generator turns policy into deployable text. Every bot in the dataset gets an allow/block checkbox; the output lists a User-agent / Disallow: / stanza per blocked bot and explicit allows for the rest, ready to copy into the file at your domain root. Three crawlers ship pre-checked for blocking (Bytespider, cohere-ai and Diffbot), all bots with documented histories of ignoring robots directives, reflecting the practical consensus rather than a neutral default.
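As a rough sketch of that generation step, the function below emits one Disallow stanza per blocked bot followed by explicit allows for the rest. The name `generate_robots_txt` and its parameters are hypothetical, chosen for this example; the tool's own implementation may differ.

```python
def generate_robots_txt(bots, blocked):
    """Build robots.txt text: blocked bots first, then explicit allows."""
    lines = ["# AI crawler policy"]
    for bot in bots:
        if bot in blocked:
            lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines.append("# Explicitly allowed AI crawlers")
    for bot in bots:
        if bot not in blocked:
            lines += [f"User-agent: {bot}", "Allow: /", ""]
    # Drop the trailing blank line and end the file with one newline.
    return "\n".join(lines).rstrip("\n") + "\n"


# Short excerpt of the bot list, for illustration only.
bots = ["GPTBot", "ClaudeBot", "Bytespider", "cohere-ai", "Diffbot"]
policy = generate_robots_txt(bots, blocked={"Bytespider", "cohere-ai", "Diffbot"})
print(policy)
```

Keeping the bot list ordered and passing the blocked set separately mirrors the checkbox model: toggling a bot changes only set membership, not the order of the output.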
The tool is equally honest about the limits of the mechanism. robots.txt is a request that compliant crawlers honor and non-compliant ones ignore, and any scraper can claim any user-agent string. For bots flagged unreliable, the dependable enforcement layer is your CDN or WAF, matching user-agent plus verified source IP — OpenAI, Anthropic and Google all publish their crawler IP ranges precisely so you can distinguish the real GPTBot from an impersonator. Detection and generation both run entirely in your browser; nothing you paste is transmitted anywhere.
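The IP half of that check might look like the following sketch, using Python's standard `ipaddress` module. The CIDR block is a reserved documentation range used as a placeholder, not a real vendor range, and `PLACEHOLDER_RANGES` and `is_verified` are hypothetical names; in practice you would load the ranges each vendor actually publishes.

```python
import ipaddress

# Placeholder data: 192.0.2.0/24 is a reserved documentation block,
# NOT a real OpenAI range. Substitute the vendor's published ranges.
PLACEHOLDER_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
}


def is_verified(token, source_ip, ranges=PLACEHOLDER_RANGES):
    """True if source_ip falls inside any range listed for the claimed token."""
    ip = ipaddress.ip_address(source_ip)
    return any(
        ip in ipaddress.ip_network(cidr) for cidr in ranges.get(token, [])
    )


print(is_verified("GPTBot", "192.0.2.44"))   # inside the placeholder block
print(is_verified("GPTBot", "203.0.113.9"))  # outside it: treat as impersonator
```

A token with no listed ranges returns False, so an unknown or unverifiable claim is treated the same as a failed check.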
Frequently asked questions
What is the difference between a training crawler and a user-fetch agent?
Training crawlers like GPTBot and ClaudeBot harvest pages in bulk to build model training corpora, so blocking them keeps your content out of future models. User-fetch agents like ChatGPT-User and Claude-User fetch a specific page live because a human asked the assistant about it, much like that person opening your page in a browser. Many sites block training crawlers but allow user fetches, since the latter can drive real visits and citations.
Does blocking these bots in robots.txt actually work?
For the major vendors, yes — OpenAI, Anthropic, Google and Meta document that their named crawlers honor robots.txt, and observed behavior matches. But robots.txt is a convention, not enforcement: bots flagged as unreliable in this dataset have been repeatedly observed ignoring it or crawling under generic user agents. For those, the dependable defense is blocking by user-agent and verified IP range at your CDN, WAF or server config.
What is Google-Extended if it isn't a real crawler?
Google-Extended is a robots.txt control token, not a distinct bot. Googlebot crawls your site once for everything; the Google-Extended directive tells Google not to use that crawled content for training Gemini models. This design means you can opt out of AI training without losing search indexing — disallowing Google-Extended has no effect on your search ranking, whereas disallowing Googlebot itself would remove you from search. Applebot-Extended works the same way for Apple Intelligence.
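The token-level separation can be checked with Python's standard `urllib.robotparser`. This only demonstrates how a generic robots.txt parser matches the two tokens independently; it is not a model of Google's internal handling, and the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# A policy that opts out of AI training use only.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/article"  # placeholder URL
search_allowed = parser.can_fetch("Googlebot", page)
training_allowed = parser.can_fetch("Google-Extended", page)
print(search_allowed, training_allowed)  # True False
```

Because no stanza names Googlebot and there is no wildcard group, the parser falls through to its default of allowing the fetch, while the Google-Extended group is disallowed.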
Should I block AI crawlers at all?
It is a genuine trade-off, not an obvious yes. Blocking training crawlers protects content you monetize directly — paywalled journalism, premium documentation, licensable data. But answer engines increasingly drive discovery: being absent from ChatGPT search and Perplexity citations is the new being absent from Google. A common middle path is allowing search/user-fetch agents (OAI-SearchBot, Claude-SearchBot, ChatGPT-User) while disallowing pure training collectors (GPTBot, CCBot, Bytespider), which this generator's per-bot checkboxes express directly.
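One way to express that middle path is to tag each bot with a purpose and derive the blocklist from the tags. The sketch below covers only the six bots named in this answer; the `PURPOSE` mapping and the `middle_path_blocklist` helper are hypothetical names introduced for illustration.

```python
# Purpose tags for the six bots named above (labels assumed for this sketch).
PURPOSE = {
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "ChatGPT-User": "user-fetch",
    "GPTBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
}


def middle_path_blocklist(purpose_by_bot):
    """Block pure training collectors; leave search and user-fetch agents alone."""
    return sorted(
        bot for bot, purpose in purpose_by_bot.items() if purpose == "training"
    )


blocklist = middle_path_blocklist(PURPOSE)
for bot in blocklist:
    print(f"User-agent: {bot}\nDisallow: /\n")
```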
Can a scraper just fake its user-agent to get around all this?
Trivially — the user-agent is a self-declared request header. That is why the detector's verdict comes with a spoofing caveat and why serious enforcement is two-layered: robots.txt for compliant bots, and IP verification for the rest. Major vendors publish their crawler IP ranges (OpenAI, Anthropic and Google all do), so a request claiming to be GPTBot from an IP outside OpenAI's published blocks is an impersonator you can drop at the firewall.
Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.