
LLM Crawler Detector

Is that user agent an AI crawler? GPTBot, ClaudeBot and friends — plus robots.txt rules.

100% client-side · nothing leaves your browser · instant results
Matched locally

UA strings are trivially spoofed — for production verification also check the source IP against the vendor's published ranges.

GPTBot

AI crawler detected

OpenAI. Training data collection for GPT models. Documented as respecting robots.txt.

Vendor: OpenAI
Purpose: Training data collection for GPT models
Respects robots.txt: Yes (documented)

robots.txt generator — check a bot to block it

Block  Bot                   Vendor        robots.txt
[ ]    GPTBot                OpenAI        respects
[ ]    OAI-SearchBot         OpenAI        respects
[ ]    ChatGPT-User          OpenAI        respects
[ ]    ClaudeBot             Anthropic     respects
[ ]    Claude-User           Anthropic     respects
[ ]    Claude-SearchBot      Anthropic     respects
[ ]    anthropic-ai          Anthropic     respects
[ ]    PerplexityBot         Perplexity    respects
[ ]    Perplexity-User       Perplexity    unreliable
[ ]    Google-Extended       Google        respects
[ ]    Applebot-Extended     Apple         respects
[x]    Bytespider            ByteDance     unreliable
[ ]    CCBot                 Common Crawl  respects
[x]    cohere-ai             Cohere        unreliable
[ ]    Meta-ExternalAgent    Meta          respects
[ ]    Meta-ExternalFetcher  Meta          unreliable
[ ]    Amazonbot             Amazon        respects
[ ]    FacebookBot           Meta          respects
[x]    Diffbot               Diffbot       unreliable
[ ]    TimpiBot              Timpi         respects

Generated robots.txt (3 blocked)

# AI crawler policy — generated locally at forg.pro/tools/llm-crawler-detector

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

# Explicitly allowed AI crawlers

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: FacebookBot
Allow: /

User-agent: TimpiBot
Allow: /

robots.txt is a request, not enforcement. Bots flagged unreliable should be blocked by UA/IP at your CDN or WAF if you need a guarantee.

100% client-side compute
0 uploads (verify in devtools)
96 free tools in the directory
0 network requests per keystroke

How it works

A growing share of your server traffic is AI: training crawlers harvesting content for model corpora, search indexers building answer-engine results, and live user-fetch agents retrieving a page because someone asked a chatbot about it. Each announces itself with a user-agent token (GPTBot, ClaudeBot, PerplexityBot, Bytespider), and each has different behavior, different value to you, and a different answer to the question "does it honor robots.txt". This tool identifies them and generates the policy file to govern them.

Paste any user-agent string from your access logs and the detector matches it against an embedded dataset of twenty AI crawlers, returning the vendor, the crawl purpose, and a documented-compliance flag. The purpose distinction matters most: a hit from GPTBot means your content is being collected for training, while ChatGPT-User means a human is reading your page through an assistant right now — closer to a browser visit than a scrape. Sites that conflate the two end up blocking the traffic they actually wanted.
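The matching step described above might look like the following minimal sketch. The `Crawler` record, the `CRAWLERS` list and the `detect` function are illustrative names rather than the tool's actual code, and the case-insensitive substring match is an assumption about how tokens are compared. The entries are a small excerpt of the table on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Crawler:
    token: str             # user-agent token to look for
    vendor: str
    respects_robots: bool  # True = "respects", False = "unreliable" in the table


# Excerpt only; the page's dataset lists twenty crawlers.
CRAWLERS = [
    Crawler("GPTBot", "OpenAI", True),
    Crawler("ChatGPT-User", "OpenAI", True),
    Crawler("ClaudeBot", "Anthropic", True),
    Crawler("Bytespider", "ByteDance", False),
]


def detect(user_agent: str) -> Optional[Crawler]:
    """Return the first crawler whose token appears in the UA string, else None."""
    ua = user_agent.lower()
    for crawler in CRAWLERS:
        # Assumption: a case-insensitive substring match is good enough here.
        if crawler.token.lower() in ua:
            return crawler
    return None


hit = detect("Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)")
print(hit.vendor if hit else "not a known AI crawler")
```

A match only says what the request claims to be; as noted at the top of the page, the string itself is trivially spoofed.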

The robots.txt generator turns policy into deployable text. Every bot in the dataset gets an allow/block checkbox; the output lists a User-agent / Disallow: / stanza per blocked bot and explicit allows for the rest, ready to copy into the file at your domain root. Three crawlers ship pre-checked for blocking (Bytespider, cohere-ai and Diffbot, all flagged unreliable for documented histories of ignoring robots directives), reflecting the practical consensus rather than a neutral default.
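As a rough illustration of that output shape, the sketch below builds the same kind of file from a list of bots and a set of blocked tokens. `build_robots_txt` is a hypothetical helper, not the tool's own generator, and it assumes blocked stanzas come first, followed by the explicit allows, as in the generated file shown above.

```python
def build_robots_txt(bots: list[str], blocked: set[str]) -> str:
    """Emit a Disallow stanza per blocked bot, then explicit allows for the rest."""
    lines: list[str] = []
    for bot in bots:
        if bot in blocked:
            lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines += ["# Explicitly allowed AI crawlers", ""]
    for bot in bots:
        if bot not in blocked:
            lines += [f"User-agent: {bot}", "Allow: /", ""]
    # Collapse the trailing blank line into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


print(build_robots_txt(["GPTBot", "Bytespider", "ClaudeBot"], {"Bytespider"}))
```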

The tool is equally honest about the limits of the mechanism. robots.txt is a request that compliant crawlers honor and non-compliant ones ignore, and any scraper can claim any user-agent string. For bots flagged unreliable, the dependable enforcement layer is your CDN or WAF, matching user-agent plus verified source IP: vendors such as OpenAI and Google publish their crawler IP ranges precisely so you can distinguish the real GPTBot from an impersonator. Detection and generation both run entirely in your browser; nothing you paste is transmitted anywhere.
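The source-IP check can be sketched with Python's standard `ipaddress` module. Everything specific here is an assumption: `PUBLISHED_RANGES` and `is_verified` are hypothetical names, and the CIDR block is a reserved documentation range (RFC 5737) standing in for whatever the vendor actually publishes. A real deployment would load the vendor's current list instead.

```python
import ipaddress

# Placeholder data: 192.0.2.0/24 is a documentation-only range, NOT a real
# OpenAI range. Replace with the ranges the vendor publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
}


def is_verified(token: str, source_ip: str) -> bool:
    """True if source_ip falls inside a range listed for the claimed token."""
    ip = ipaddress.ip_address(source_ip)
    networks = (ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES.get(token, []))
    return any(ip in network for network in networks)


# A request claiming to be GPTBot from outside the listed ranges fails the check.
print(is_verified("GPTBot", "192.0.2.10"))
print(is_verified("GPTBot", "203.0.113.5"))
```

Tokens with no listed ranges always fail in this sketch, which errs on the side of treating unverifiable claims as unverified.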

Frequently asked questions

What is the difference between a training crawler and a user-fetch agent?

Training crawlers like GPTBot and ClaudeBot harvest pages in bulk to build model training corpora; blocking them keeps your content out of future models. User-fetch agents like ChatGPT-User and Claude-User fetch a specific page live because a human asked the assistant about it, much like that person opening your page in a browser. Many sites block training crawlers but allow user fetches, since the latter can drive real visits and citations.

Does blocking these bots in robots.txt actually work?

For the major vendors, yes — OpenAI, Anthropic, Google and Meta document that their named crawlers honor robots.txt, and observed behavior matches. But robots.txt is a convention, not enforcement: bots flagged as unreliable in this dataset have been repeatedly observed ignoring it or crawling under generic user agents. For those, the dependable defense is blocking by user-agent and verified IP range at your CDN, WAF or server config.

What is Google-Extended if it isn't a real crawler?

Google-Extended is a robots.txt control token, not a distinct bot. Googlebot crawls your site once for everything; the Google-Extended directive tells Google not to use that crawled content for training Gemini models. This design means you can opt out of AI training without losing search indexing — disallowing Google-Extended has no effect on your search ranking, whereas disallowing Googlebot itself would remove you from search. Applebot-Extended works the same way for Apple Intelligence.
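To see why the opt-out leaves search crawling alone, the sketch below feeds a one-stanza robots.txt to Python's standard `urllib.robotparser` and asks about both tokens. It only models how a standards-following parser reads the file, not Google's internal handling of the token, and `may_fetch` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def may_fetch(robots_txt: str, token: str, path: str = "/") -> bool:
    """Ask the standard-library parser whether `token` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, path)


print(may_fetch(ROBOTS_TXT, "Google-Extended"))  # the opt-out group applies to this token
print(may_fetch(ROBOTS_TXT, "Googlebot"))        # no group names Googlebot, so it is unaffected
```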

Should I block AI crawlers at all?

It is a genuine trade-off, not an obvious yes. Blocking training crawlers protects content you monetize directly — paywalled journalism, premium documentation, licensable data. But answer engines increasingly drive discovery: being absent from ChatGPT search and Perplexity citations is the new being absent from Google. A common middle path is allowing search/user-fetch agents (OAI-SearchBot, Claude-SearchBot, ChatGPT-User) while disallowing pure training collectors (GPTBot, CCBot, Bytespider), which this generator's per-bot checkboxes express directly.
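The middle path can be written down as a rule over crawl purpose rather than a hand-picked list. In this sketch the `PURPOSE` mapping and `middle_path_policy` function are hypothetical; the purpose labels follow the grouping in the answer above and cover only the six bots it names.

```python
# Purpose labels taken from the grouping in the answer above (excerpt only).
PURPOSE = {
    "GPTBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "ChatGPT-User": "user-fetch",
}


def middle_path_policy(purpose_by_bot: dict[str, str]) -> str:
    """Disallow pure training collectors; allow search and user-fetch agents."""
    lines: list[str] = []
    for bot, purpose in purpose_by_bot.items():
        rule = "Disallow: /" if purpose == "training" else "Allow: /"
        lines += [f"User-agent: {bot}", rule, ""]
    return "\n".join(lines).rstrip() + "\n"


print(middle_path_policy(PURPOSE))
```

Keeping the purpose label next to each token makes the trade-off explicit: changing the policy means changing one condition rather than re-ticking individual bots.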

Can a scraper just fake its user-agent to get around all this?

Trivially: the user-agent is a self-declared request header. That is why the detector's verdict comes with a spoofing caveat and why serious enforcement is two-layered: robots.txt for compliant bots, and IP verification for the rest. Several major vendors publish their crawler IP ranges (OpenAI and Google among them), so a request claiming to be GPTBot from an IP outside OpenAI's published blocks is an impersonator you can drop at the firewall.

Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.
