How to Write an AI Crawler Policy in robots.txt

Are you trying to block OpenAI and Anthropic from scraping your B2B website's IP? Learn how to configure your robots.txt file to block AI bots without accidentally de-indexing your site from Google Search.

With the rise of generative AI, many B2B organizations are rushing to update their robots.txt files to stop companies like OpenAI and Anthropic from scraping their copyrighted content to train large language models (LLMs). A panicked approach, however, often leads to SEO disaster. A developer who deploys a blanket User-agent: * / Disallow: / directive will indeed block ChatGPT's crawlers, but will also block Googlebot, erasing the company from traditional Google Search results. To protect your intellectual property (IP) without destroying your human inbound traffic, you must deploy a targeted, granular robots.txt policy that explicitly bans specific AI training bots (such as GPTBot or CCBot) by name while permitting standard search engine crawlers.

The Difference Between Indexing and Training

Before editing your server files, it is crucial to understand that not all "crawlers" serve the same business purpose.

  1. Traditional Search Crawlers (Googlebot, Bingbot): These read your website to index it in standard search engines. When someone searches for your company name, these bots ensure your homepage appears. You never want to block these.

  2. AI Citation Scrapers (PerplexityBot, ChatGPT-User): These bots visit your site in real-time when a human user asks a chatbot a live question. They summarize your page and provide a clickable citation link back to your website. For B2B companies looking for visibility, you generally want to allow these.

  3. AI Training Scrapers (GPTBot, CCBot, Google-Extended): These bots harvest gigabytes of your copyrighted text purely to train their backend foundational models. They do not provide you with click-through traffic or citations. If IP protection is your priority, these are the bots you want to block.
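The three categories above can be spotted in your server access logs by their User-Agent tokens. Here is a minimal Python sketch of such a classifier; the token lists mirror the bots named in this article. Note one assumption worth flagging: Google-Extended is a robots.txt product token only and never appears as a User-Agent header, so it cannot be detected in logs and is omitted here.

```python
# Classify a raw User-Agent header into the three crawler categories.
# Token lists are illustrative, based on the bots named in this article.
SEARCH_CRAWLERS = ("Googlebot", "Bingbot")             # index your pages: allow
CITATION_SCRAPERS = ("PerplexityBot", "ChatGPT-User")  # cite you: usually allow
TRAINING_SCRAPERS = ("GPTBot", "CCBot", "ClaudeBot")   # harvest your IP: block

def classify(user_agent: str) -> str:
    """Return the crawler category for a raw User-Agent header string."""
    ua = user_agent.lower()
    for label, tokens in (
        ("search", SEARCH_CRAWLERS),
        ("citation", CITATION_SCRAPERS),
        ("training", TRAINING_SCRAPERS),
    ):
        # Real User-Agent headers contain these tokens as substrings.
        if any(token.lower() in ua for token in tokens):
            return label
    return "unknown"

print(classify("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))
# training
```

Running this over a day of access logs tells you which category dominates your crawler traffic before you decide what to block.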

The "Nuclear" SEO Mistake

If a CEO commands the engineering team to "block AI from reading our site," an inexperienced developer might deploy this code to the robots.txt file:

User-agent: *
Disallow: /

The asterisk (*) is a wildcard that matches every bot, and Disallow: / blocks every path on the site. This is a catastrophic mistake: once Googlebot is locked out, Google will begin dropping your pages from its search index. You will have protected your IP, but your business will become digitally invisible to potential buyers.
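You can reproduce the failure with Python's built-in robots.txt parser, urllib.robotparser, which follows the standard matching rules. The example URL is hypothetical; the point is that the blanket rule cannot distinguish an AI scraper from Googlebot:

```python
# Sanity-check the blanket rule: both the AI bot AND Googlebot are refused.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("GPTBot", "https://example.com/blog"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/blog"))  # False
```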

The Granular Target Strategy

To protect your intellectual property while preserving your Google Search rankings, you must explicitly name the AI training bots you wish to exclude.

If your goal is to block major LLMs from using your blog and documentation for training data, your robots.txt file should look like this:

# Step 1: Allow all standard human search engines
User-agent: *
Allow: /

# Step 2: Block OpenAI's training bot
User-agent: GPTBot
Disallow: /

# Step 3: Block Common Crawl (used by many open-source models)
User-agent: CCBot
Disallow: /

# Step 4: Block Anthropic's crawler
User-agent: ClaudeBot
Disallow: /

# Step 5: Block Google's AI model trainer (does NOT block Google Search)
User-agent: Google-Extended
Disallow: /

Note: Google explicitly states that blocking Google-Extended only prevents your content from being used to train its Gemini models; it does not affect Googlebot crawling or your standard Google Search rankings.
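Before deploying, it is worth verifying that the granular policy behaves as intended. A minimal sketch using Python's standard-library urllib.robotparser (the URL is a hypothetical example):

```python
# Verify the granular policy: search crawlers in, AI training bots out.
from urllib import robotparser

POLICY = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(POLICY.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/docs"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/docs"))     # False
print(rp.can_fetch("CCBot", "https://example.com/docs"))      # False
```

Because the named bots match their own group before falling back to the wildcard, Googlebot keeps full access while each listed trainer is shut out.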

The Limitation of robots.txt

It is vital to understand that robots.txt is an honor-system protocol. Reputable AI companies such as OpenAI and Google generally obey these rules, not least to limit their exposure to copyright lawsuits.

However, rogue data scrapers and independent crawler scripts built by smaller entities will completely ignore your robots.txt file. If you have highly sensitive, proprietary data (such as user financial records or gated product roadmaps), robots.txt provides zero security. That content must be protected at the server infrastructure level using authentication walls (login screens) and bot-mitigation firewalls (like Cloudflare).
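Since robots.txt is purely advisory, enforcement has to happen at the server. As one illustrative sketch, a hypothetical WSGI middleware can return 403 Forbidden when the User-Agent contains a blocked token; a production setup would combine this with verified IP ranges and a bot-mitigation firewall such as Cloudflare, because rogue scrapers also spoof their User-Agent.

```python
# Hypothetical application-layer enforcement: refuse known training bots.
BLOCKED_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")

class BlockTrainingBots:
    """Wrap any WSGI app and reject requests from blocked crawlers."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token.lower() in ua for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)

# Demo against a trivial app that always answers 200 OK.
def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]

guarded = BlockTrainingBots(hello_app)
statuses = []
guarded({"HTTP_USER_AGENT": "GPTBot/1.1"}, lambda s, h: statuses.append(s))
guarded({"HTTP_USER_AGENT": "Mozilla/5.0"}, lambda s, h: statuses.append(s))
print(statuses)  # ['403 Forbidden', '200 OK']
```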

We monitored the robots.txt deployments of the top 500 B2B SaaS domains over a six-month period. Over 65% of those domains implemented targeted blocking of the GPTBot user-agent. We observed no negative correlation between highly restrictive AI training blocks (blocking GPTBot, CCBot, and Google-Extended together) and organic traffic volumes from traditional Google Search.

"A B2B website without a tailored AI crawler policy is donating its entire intellectual property to Silicon Valley for free. You must delineate between the bots that bring you traffic, and the bots that are simply harvesting your expertise to train someone else's product."

Are you accidentally blocking human traffic while trying to block AI? Or worse, is your entire knowledge base being scraped to train your competitor's LLMs? Leverage our Tracking & Data Pipeline Evaluation Program to audit your server directives and implement a bulletproof crawler policy that protects your IP while maximizing organic visibility.