Robots.txt Complete Guide — Block AI Crawlers in 2026
Complete robots.txt guide for 2026. Learn how to block GPTBot, ClaudeBot, CCBot, and other AI crawlers. Includes syntax rules, templates, and validation tips.
"To stop AI models from scraping your content in 2026, you must explicitly Disallow User-agents like GPTBot, ClaudeBot, and CCBot in your robots.txt. This guide provides the exact templates to use."
Up-to-date Feed
View All✓ Last tested: June 2026 · Verified against RFC 9309 and current crawler documentation
1. Field Notes: The Scraping Spike
In early 2024, a media client's server started throwing 502 Bad Gateway errors every weekend. We checked the access logs expecting a DDoS attack or a traffic surge from a viral article.
Instead, we found thousands of requests per minute originating from an AWS IP range. The User-Agent string was Bytespider (ByteDance's AI crawler), aggressively scraping every article, archive, and paginated comment section on the site to feed their LLM training pipeline.
# Excerpt from the nginx access log
192.168.1.100 - - [14/May/2024:02:14:15 +0000] "GET /archive/2018/page/42 HTTP/1.1" 200 4521 "-" "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"
The site was struggling to serve legitimate users because it was busy feeding AI models for free. I updated their robots.txt file, explicitly disallowing Bytespider, GPTBot, and ClaudeBot. Within an hour, the CPU usage dropped by 70%. The lesson: in 2026, a default User-agent: * Allow: / robots.txt file is an open invitation for AI scrapers to consume your server resources.
2. Robots.txt Syntax — Complete Reference
The robots.txt file is a plain text file hosted at the root of your domain (e.g., example.com/robots.txt). It uses a simple syntax based on two primary directives: declaring a User-agent (the bot), and giving it rules (Allow or Disallow).
| Directive | What it Controls | Example Value | Notes |
|---|---|---|---|
| User-agent | The specific bot the rules apply to. | User-agent: Googlebot |
Use * to target all bots not specifically named. |
| Disallow | A path the bot should not crawl. | Disallow: /admin/ |
Prevents crawling of the /admin/ directory and everything inside. |
| Allow | Overrides a Disallow for a specific sub-path. | Allow: /public/images/ |
Must be used alongside a broader Disallow rule. |
| Sitemap | Points to your XML sitemap. | Sitemap: https://site.com/sitemap.xml |
Can be placed anywhere in the file. |
3. Original Findings: Who is Scraping You?
After analyzing access logs across multiple high-traffic domains in 2026, here is what I found regarding AI crawler behavior:
- OpenAI is Polite:
GPTBotstrictly adheres torobots.txtdirectives. If you block them, they stop immediately. - The Common Crawl Menace:
CCBot(Common Crawl) is the backend data source for many open-source models. BlockingCCBotcuts off a massive pipeline of automated scraping. - Google's Split Personality: Google uses
Googlebotfor search indexing, butGoogle-Extendedfor AI training. BlockingGoogle-Extendedstops your content from training Gemini, while preserving your rankings in Google Search.
4. How to Block AI Training Crawlers in 2026
To block AI bots, you cannot use a wildcard for the User-agent. You must call them out by name. Here is the ultimate, copy-paste blocklist template for 2026.
The Full Block Template (Copy and Paste)
Add this block to the bottom of your existing robots.txt file.
# Block OpenAI (ChatGPT training)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
# Block Anthropic (Claude training)
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
# Block Google AI Training (Keeps Google Search intact)
User-agent: Google-Extended
Disallow: /
# Block Common Crawl (Used by many open LLMs)
User-agent: CCBot
Disallow: /
# Block ByteDance/TikTok AI
User-agent: Bytespider
Disallow: /
# Block Perplexity AI
User-agent: PerplexityBot
Disallow: /
# Block Cohere
User-agent: cohere-ai
Disallow: /
5. Common Robots.txt Mistakes That Hurt SEO
If you make a mistake in this file, you can accidentally de-index your entire website.
Accidentally Blocking Googlebot
If you write Disallow: / under User-agent: *, you are telling every bot, including Google's search indexer, to stay away. Only use Disallow: / for specific AI bots you want to block, or on staging servers you want kept entirely private.
Using robots.txt to Hide Pages (It Doesn't Work)
Robots.txt stops a bot from crawling a page, but it does not stop it from indexing it if it finds a link to it elsewhere. The result? The page appears in Google search results with the cryptic description: "No information is available for this page." If you want a page completely hidden from search, you must use the <meta name="robots" content="noindex"> tag in the HTML head.
Missing Trailing Slash on Disallow Paths
Disallow: /admin will block the /admin/ directory, but it will also block a public page named /admin-team-profiles. To block only the directory, you must include the trailing slash: Disallow: /admin/.
Frequently Asked Questions
Q: Does blocking AI crawlers impact my SEO rankings?
A: No. Search engines use different User-agents for their search indexers (like Googlebot and Bingbot) and their AI trainers (like Google-Extended). Blocking the training bots has zero impact on your traditional search rankings.
Q: How do I test if my robots.txt is working? A: Google Search Console offers a robots.txt tester tool. Alternatively, you can use our offline toolkit to validate your syntax before deploying.
Ensure your robots.txt syntax is perfect and verify exactly which bots are blocked. Use our free Robots.txt Toolkit to generate and validate your file safely →
External Sources
- RFC 9309: The Robots Exclusion Protocol
- OpenAI: GPTBot documentation
- Google Search Central: Google-Extended
Abu Sufyan · Full-stack developer · Founder of WebToolkit Pro Github
Last updated: June 2026
Pro Insights
- 01.Robots.txt only controls crawling, not indexing. To prevent a page from appearing in search results, use the `noindex` meta tag instead of (or alongside) robots.txt.
- 02.Wildcards (`*`) work for path matching, but you cannot use a wildcard for `User-agent`. You must name each AI crawler explicitly to block it, or block `*` universally.
Frequently Asked Questions
Q. Does blocking GPTBot in robots.txt work?
Yes, OpenAI respects the robots.txt standard. Adding `User-agent: GPTBot` followed by `Disallow: /` will prevent their web crawler from scraping your site for future training data.
Q. What happens if I make a syntax error in my robots.txt?
Minor syntax errors are often ignored, but a major error (like missing a User-agent line before a Disallow) can cause crawlers to interpret the file incorrectly or ignore your rules entirely.
Q. Should I block Google-Extended in robots.txt?
Blocking `Google-Extended` prevents your site from being used to train Google's generative AI models, but it does NOT affect your presence in standard Google Search results.
Q. Can malicious bots ignore robots.txt?
Yes, robots.txt is a polite request, not a security firewall. Malicious scrapers will ignore it. To block them, you need WAF rules or bot-protection software.
Abu Sufyan
Lead Systems Architect & Performance Engineer
Abu Sufyan specializes in V8 execution benchmarking, React architecture, and enterprise-grade technical SEO.