Robots.txt Complete Guide — Block AI Crawlers in 2026

Complete robots.txt guide for 2026. Learn how to block GPTBot, ClaudeBot, CCBot, and other AI crawlers. Includes syntax rules, templates, and validation tips.

Executive Summary

"To stop AI models from scraping your content in 2026, you must explicitly Disallow User-agents like GPTBot, ClaudeBot, and CCBot in your robots.txt. This guide provides the exact templates to use."

✓ Last tested: June 2026 · Verified against RFC 9309 and current crawler documentation

1. Field Notes: The Scraping Spike

In early 2024, a media client's server started throwing 502 Bad Gateway errors every weekend. We checked the access logs expecting a DDoS attack or a traffic surge from a viral article.

Instead, we found thousands of requests per minute originating from an AWS IP range. The User-Agent string was Bytespider (ByteDance's AI crawler), aggressively scraping every article, archive, and paginated comment section on the site to feed their LLM training pipeline.

# Excerpt from the nginx access log
192.168.1.100 - - [14/May/2024:02:14:15 +0000] "GET /archive/2018/page/42 HTTP/1.1" 200 4521 "-" "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"

The site was struggling to serve legitimate users because it was busy feeding AI models for free. I updated their robots.txt file, explicitly disallowing Bytespider, GPTBot, and ClaudeBot. Within an hour, the CPU usage dropped by 70%. The lesson: in 2026, a default User-agent: * Allow: / robots.txt file is an open invitation for AI scrapers to consume your server resources.
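To spot this kind of traffic in your own logs, a short script can tally requests per crawler. The sketch below assumes the nginx combined log format shown above, where the user agent is the last double-quoted field. The token list and the function name `count_ai_bot_hits` are illustrative choices, not part of any tool.

```python
import re
from collections import Counter

# Illustrative list of crawler tokens to look for (extend as needed)
AI_BOT_TOKENS = ["Bytespider", "GPTBot", "ClaudeBot", "CCBot"]

# In the combined log format, the user agent is the last quoted field
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler token found in the user agent."""
    hits = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not end with a quoted field
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits

SAMPLE_LOG = [
    # Log line from the excerpt above
    '192.168.1.100 - - [14/May/2024:02:14:15 +0000] '
    '"GET /archive/2018/page/42 HTTP/1.1" 200 4521 "-" '
    '"Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    # Hypothetical browser request, added for contrast
    '192.168.1.101 - - [14/May/2024:02:14:16 +0000] '
    '"GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(count_ai_bot_hits(SAMPLE_LOG))
```

Matching on the user agent string only finds bots that identify themselves honestly, so treat the counts as a lower bound.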


2. Robots.txt Syntax — Complete Reference

The robots.txt file is a plain text file hosted at the root of your domain (e.g., example.com/robots.txt). It is built from groups: each group names a User-agent (the bot) and then lists the rules (Allow or Disallow) that apply to it.

| Directive | What it Controls | Example Value | Notes |
|---|---|---|---|
| User-agent | The specific bot the rules apply to. | User-agent: Googlebot | Use * to target all bots not specifically named. |
| Disallow | A path the bot should not crawl. | Disallow: /admin/ | Prevents crawling of the /admin/ directory and everything inside. |
| Allow | Overrides a Disallow for a specific sub-path. | Allow: /public/images/ | Only useful alongside a broader Disallow rule. Under RFC 9309, the longest matching path wins. |
| Sitemap | Points to your XML sitemap. | Sitemap: https://site.com/sitemap.xml | Can be placed anywhere in the file; it is not tied to any User-agent group. |
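As a rough illustration of how these four directives interact, the sketch below feeds a small, hypothetical robots.txt to Python's standard `urllib.robotparser`. The paths, the `ExampleBot` name, and the `load_rules` helper are assumptions made for the example. Note that Python's parser applies the first matching rule rather than the longest one, so the Allow line is listed before the broader Disallow to get the same answer either way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using all four directives from the table
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

def load_rules(robots_txt):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)

# Disallowed: falls under /admin/
print(rules.can_fetch("ExampleBot", "https://example.com/admin/settings"))
# Allowed: the more specific Allow rule covers this path
print(rules.can_fetch("ExampleBot", "https://example.com/admin/public/logo.png"))
# Allowed: no rule matches
print(rules.can_fetch("ExampleBot", "https://example.com/blog/"))
# Sitemap URLs declared in the file (requires Python 3.8+)
print(rules.site_maps())
```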

3. Original Findings: Who is Scraping You?

After analyzing access logs across multiple high-traffic domains in 2026, here is what I found regarding AI crawler behavior:

  • OpenAI is Polite: In the logs I reviewed, GPTBot honored robots.txt directives. Once it picked up a Disallow rule, its requests stopped.
  • The Common Crawl Menace: CCBot (Common Crawl) builds the public web archive that many open-source models train on. Blocking CCBot keeps new crawls of your site out of that dataset.
  • Google's Split Personality: Googlebot handles search indexing. Google-Extended is not a separate crawler; it is a robots.txt control token that Google's existing crawlers honor. Disallowing Google-Extended opts your content out of training Gemini, while preserving your rankings in Google Search.

4. How to Block AI Training Crawlers in 2026

To block AI bots without also blocking search engines, you cannot rely on the User-agent: * wildcard, because it applies to every crawler that has no group of its own. You must call the AI bots out by name. Here is a copy-paste blocklist template for 2026.

The Full Block Template (Copy and Paste)

Add this block to the bottom of your existing robots.txt file.

# Block OpenAI: GPTBot collects training data
User-agent: GPTBot
Disallow: /

# ChatGPT-User fetches pages on behalf of ChatGPT users (not a training crawler)
User-agent: ChatGPT-User
Disallow: /

# Block Anthropic (Claude training)
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

# Block Google AI Training (Keeps Google Search intact)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl (Used by many open LLMs)
User-agent: CCBot
Disallow: /

# Block ByteDance/TikTok AI
User-agent: Bytespider
Disallow: /

# Block Perplexity AI
User-agent: PerplexityBot
Disallow: /

# Block Cohere
User-agent: cohere-ai
Disallow: /
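Before deploying a blocklist like this, it helps to confirm that it blocks the named bots and nothing else. The following is a minimal sketch using Python's standard `urllib.robotparser`. It rebuilds the template from the list of tokens above rather than reading your live file, and the helper name `build_blocklist` is an assumption for the example.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens from the template above
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "Google-Extended",
    "CCBot", "Bytespider", "PerplexityBot", "cohere-ai",
]

def build_blocklist(bots):
    """Render one 'User-agent / Disallow: /' group per bot."""
    return "\n\n".join(f"User-agent: {bot}\nDisallow: /" for bot in bots)

parser = RobotFileParser()
parser.parse(build_blocklist(AI_BOTS).splitlines())

# Every named AI bot should be refused
for bot in AI_BOTS:
    print(bot, parser.can_fetch(bot, "https://example.com/article"))

# Search crawlers have no group here, so they stay allowed
for bot in ("Googlebot", "Bingbot"):
    print(bot, parser.can_fetch(bot, "https://example.com/article"))
```

This only checks how a compliant parser reads the rules. It says nothing about whether a given crawler actually obeys them.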

5. Common Robots.txt Mistakes That Hurt SEO

If you make a mistake in this file, you can accidentally de-index your entire website.

Accidentally Blocking Googlebot

If you write Disallow: / under User-agent: *, you are telling every bot, including Google's search indexer, to stay away. Only use Disallow: / for specific AI bots you want to block, or on staging servers you want kept entirely private.
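A quick pre-deploy check can catch this mistake. The sketch below, again using Python's `urllib.robotparser`, reports which search crawlers would be locked out of the homepage. The crawler list and the function name `blocked_search_crawlers` are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Search crawlers you almost never want to block on a production site
SEARCH_CRAWLERS = ("Googlebot", "Bingbot")

def blocked_search_crawlers(robots_txt, crawlers=SEARCH_CRAWLERS):
    """Return the search crawlers that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in crawlers if not parser.can_fetch(bot, "/")]

# The accidental global block: every crawler is refused
print(blocked_search_crawlers("User-agent: *\nDisallow: /"))

# A targeted block: search crawlers are unaffected
print(blocked_search_crawlers("User-agent: GPTBot\nDisallow: /"))
```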

Using robots.txt to Hide Pages (It Doesn't Work)

Robots.txt stops a bot from crawling a page, but it does not stop it from indexing it if it finds a link to it elsewhere. The result? The page appears in Google search results with the cryptic description: "No information is available for this page." If you want a page completely hidden from search, you must use the <meta name="robots" content="noindex"> tag in the HTML head. For that tag to work, the page must not be disallowed in robots.txt, because a crawler has to fetch the page to see the noindex.
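To confirm that a page really carries the tag, you can scan its HTML for a robots meta element. This is a minimal sketch built on Python's standard `html.parser`. The class and function names are assumptions for the example, and it checks only the meta tag, not the equivalent `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # content may hold several comma-separated directives
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())

def has_noindex(html):
    """Return True if the HTML declares a robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```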

Missing Trailing Slash on Disallow Paths

Disallow: /admin will block the /admin/ directory, but it will also block a public page named /admin-team-profiles. To block only the directory, you must include the trailing slash: Disallow: /admin/.
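The difference comes from prefix matching, which is easy to see with Python's `urllib.robotparser`. In this sketch, the helper `is_blocked` and the `ExampleBot` name are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(disallow_path, url_path, user_agent="ExampleBot"):
    """Check whether a single Disallow rule blocks the given path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch(user_agent, url_path)

# Without the trailing slash, any path starting with /admin is caught
print(is_blocked("/admin", "/admin-team-profiles"))

# With the trailing slash, only the directory contents are caught
print(is_blocked("/admin/", "/admin-team-profiles"))
print(is_blocked("/admin/", "/admin/users"))
```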


Frequently Asked Questions

Q: Does blocking AI crawlers impact my SEO rankings? A: No. Search engines use different User-agents for their search indexers (like Googlebot and Bingbot) and their AI trainers (like Google-Extended). Blocking the training bots has zero impact on your traditional search rankings.

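Q: How do I test if my robots.txt is working? A: Google Search Console provides a robots.txt report, which replaced the standalone robots.txt Tester that Google retired in late 2023. It shows which robots.txt files Google found, when they were last crawled, and any warnings or errors. Alternatively, you can use our offline toolkit to validate your syntax before deploying.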


Abu Sufyan · Full-stack developer · Founder of WebToolkit Pro

Last updated: June 2026

Expert Recommendations

Pro Insights

  • 01. Robots.txt only controls crawling, not indexing. To prevent a page from appearing in search results, use the `noindex` meta tag instead of robots.txt, and leave the page crawlable so bots can see the tag.
  • 02. Wildcards (`*`) work for path matching, but `User-agent` accepts only the bare `*` (all bots without their own group), not partial patterns such as `*Bot`. To block AI crawlers while keeping search engines, name each crawler explicitly; use `User-agent: *` only when you want to block everyone.

Frequently Asked Questions

Q. Does blocking GPTBot in robots.txt work?

Yes, OpenAI respects the robots.txt standard. Adding `User-agent: GPTBot` followed by `Disallow: /` will prevent their web crawler from scraping your site for future training data.

Q. What happens if I make a syntax error in my robots.txt?

Minor syntax errors are often ignored, but a major error (like missing a User-agent line before a Disallow) can cause crawlers to interpret the file incorrectly or ignore your rules entirely.
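The missing User-agent case is easy to reproduce. The sketch below shows how Python's `urllib.robotparser` treats it; other parsers may differ in detail, and the helper `can_crawl` and the sample paths are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt, user_agent, path):
    """Return True if the parsed rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Broken: the rule has no User-agent line, so it belongs to no group
BROKEN = "Disallow: /private/\n"

# Fixed: the same rule inside a group
FIXED = "User-agent: *\nDisallow: /private/\n"

# The orphaned rule is silently dropped and the path stays crawlable
print(can_crawl(BROKEN, "ExampleBot", "/private/report"))

# With the User-agent line in place, the rule takes effect
print(can_crawl(FIXED, "ExampleBot", "/private/report"))
```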

Q. Should I block Google-Extended in robots.txt?

Blocking `Google-Extended` prevents your site from being used to train Google's generative AI models, but it does NOT affect your presence in standard Google Search results.

Q. Can malicious bots ignore robots.txt?

Yes, robots.txt is a polite request, not a security firewall. Malicious scrapers will ignore it. To block them, you need WAF rules or bot-protection software.
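Enforcement has to happen on the server. As a rough sketch of the idea, and not a substitute for a real WAF, the WSGI middleware below returns 403 when the user agent contains a blocked token. The token list, `block_ai_crawlers`, and `hello_app` are hypothetical names for the example. A scraper that spoofs a browser user agent will pass straight through, which is why dedicated bot protection also looks at IP ranges and behavior.

```python
# Illustrative tokens, compared case-insensitively against the user agent
BLOCKED_TOKENS = ("bytespider", "gptbot", "claudebot", "ccbot")

def block_ai_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and refuse requests from listed crawlers."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

def hello_app(environ, start_response):
    """Hypothetical application standing in for your real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

# Wrap the app once at startup; any WSGI server can then serve it
application = block_ai_crawlers(hello_app)
```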

Abu Sufyan

Lead Systems Architect & Performance Engineer

Abu Sufyan specializes in V8 execution benchmarking, React architecture, and enterprise-grade technical SEO.
