Should you block or allow AI crawlers? GPTBot, ClaudeBot and robots.txt guide

Written by Wynand van der Westhuizen | Reviewed February 2026 | 10+ years experience | Meta Business Partner | AI search specialists

TL;DR: Quick Answer

For most South African businesses chasing visibility in AI answers, allow AI crawlers, especially the retrieval crawlers (such as OAI-SearchBot and PerplexityBot) that let ChatGPT, Perplexity, Copilot and Gemini cite your pages. Control them with Allow and Disallow rules per bot in your robots.txt file. The decision is not “AI bots: yes or no”; it is “which bots, for which purpose”. Block only genuinely proprietary, paywalled or original research content you do not want absorbed into model training.

Key takeaways

  • AI crawlers feed AI systems instead of a classic search index, and whether you allow them determines if your brand can appear in AI answers
  • Training crawlers collect text to teach models; retrieval crawlers fetch live pages so an AI can cite you, and the two need separate decisions
  • Blocking a retrieval crawler removes you from AI answers and citations entirely, often by accident
  • You allow or block each bot with named user-agent rules in robots.txt at the root of your domain
  • robots.txt is a polite instruction, not security: confidential content belongs behind authentication or a paywall
  • llms.txt is a guidance file that complements robots.txt; treat it as a bonus, not a substitute

Somewhere in your last analytics review you may have noticed unfamiliar bots hitting your site: GPTBot, ClaudeBot, PerplexityBot. The instinct for many South African business owners is to block them and protect the content. For most companies chasing visibility in AI answers, that instinct is exactly backwards. This guide walks through what these crawlers actually do, how to control them in robots.txt, and how to make the call deliberately rather than copying a blanket “block all AI” snippet.

What are AI crawlers?

AI crawlers are automated bots that visit your website on behalf of AI companies, either to gather data for training large language models or to fetch live pages so an AI assistant can answer a user's question and cite sources. They behave like traditional search crawlers but feed AI systems instead of a classic search index. Whether you allow them determines if your brand can appear in AI answers.

The major ones to know:

  • GPTBot and OAI-SearchBot: OpenAI (ChatGPT and ChatGPT Search).
  • ClaudeBot: Anthropic (Claude).
  • PerplexityBot: Perplexity.
  • Google-Extended: Google's robots.txt control token for Gemini and AI training use. It is not a separate crawler; regular Googlebot still does the fetching.
  • CCBot: Common Crawl, an open dataset many models train on.
  • Bingbot: Microsoft's crawler, which also powers Copilot and feeds Perplexity's web results.

With ChatGPT alone reaching roughly 700 million weekly users (OpenAI, 2025), these bots are no longer a fringe concern. They are how a growing share of your potential customers will first encounter your brand. If you want to understand the wider picture, our guide to Generative Engine Optimisation covers how to earn citations in AI answers, not just allow the crawlers in.

Major AI crawlers and what they do
User-agent | Operator | Primary job | Affects AI citations?
--- | --- | --- | ---
GPTBot | OpenAI | Training | Indirectly
OAI-SearchBot | OpenAI | Retrieval (ChatGPT Search) | Yes, directly
ClaudeBot | Anthropic | Training and retrieval | Yes
PerplexityBot | Perplexity | Retrieval | Yes, directly
Google-Extended | Google | Gemini and AI training | Gemini only, not Search
CCBot | Common Crawl | Training dataset | Indirectly
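
If you want to see which of these bots already visit your site, a rough first pass is to count their user-agent tokens in your server access log. The sketch below is a minimal illustration, not a full log parser: the function name, the sample log lines and the simplified user-agent strings are all made up for the example, and it assumes each request sits on one line with the user agent somewhere in it.

```python
from collections import Counter

# User-agent tokens from the table above. Google-Extended is left out on
# purpose: it is a robots.txt control token, not a string that shows up in
# request logs (Google fetches pages with its usual Googlebot user agents).
AI_CRAWLER_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]


def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler using a case-insensitive substring match."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # a single request belongs to at most one crawler
    return hits


# Hypothetical sample lines (documentation IP range, simplified user agents).
SAMPLE_LOG = [
    '203.0.113.5 - - [24/Feb/2026:10:00:00 +0200] "GET /services/ HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [24/Feb/2026:10:00:02 +0200] "GET /about/ HTTP/1.1" 200 4210 "-" "GPTBot/1.0"',
    '203.0.113.9 - - [24/Feb/2026:10:01:10 +0200] "GET /blog/ HTTP/1.1" 200 9000 "-" "PerplexityBot/1.0"',
    '203.0.113.20 - - [24/Feb/2026:10:02:30 +0200] "GET / HTTP/1.1" 200 7000 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

if __name__ == "__main__":
    for token, count in count_ai_crawler_hits(SAMPLE_LOG).most_common():
        print(f"{token}: {count}")
```

A user-agent string can be spoofed, so treat these counts as a starting point rather than proof of who is crawling.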

Training crawlers vs retrieval crawlers

There are two distinct jobs, and the distinction is the single most important thing to understand. Training crawlers collect text to teach future AI models. Retrieval (or search) crawlers fetch your live pages in real time so an AI can answer a current question and cite you. Blocking a training crawler limits model training data; blocking a retrieval crawler removes you from AI answers and citations entirely.

This is where businesses accidentally shoot themselves in the foot. They read “block AI bots to protect our content”, apply a blanket rule, and quietly disappear from the answers ChatGPT Search and Perplexity give to their own prospective customers. GPTBot is largely a training crawler; OAI-SearchBot is the retrieval crawler that decides whether ChatGPT can cite you. Treating them identically is the mistake.

If your goal is Generative Engine Optimisation (being visible and cited in AI answers), you almost always want the retrieval crawlers in. The decision is not “AI bots: yes or no”. It is “which bots, for which purpose”.

AI crawlers do two different jobs: training crawlers collect text to teach future models, while retrieval crawlers fetch live pages so an AI can answer a question and cite the source. Blocking a training crawler limits the data used to train models. Blocking a retrieval crawler removes a website from AI answers and citations entirely. GPTBot is primarily a training crawler; OAI-SearchBot is the retrieval crawler that controls whether ChatGPT Search can cite a page. For visibility, allow the retrieval crawlers. Source: Juicy Designs, AI search visibility practice, South Africa, February 2026.

How do you allow or block AI crawlers in robots.txt?

You control AI crawlers the same way you control search engines: with directives in your robots.txt file at the root of your domain (yourdomain.co.za/robots.txt). Each crawler has a named user-agent. You add an Allow or Disallow rule per bot. Changes take effect the next time each crawler reads the file, usually within days.

A configuration that welcomes AI visibility while keeping a private directory off-limits looks like this:

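# Allow AI search/retrieval and training crawlers, except the private areas.
# A crawler follows only the group that names it, so the private-area rules
# are repeated here; rules under * do not apply to these bots.
# Disallow lines come first so first-match parsers agree with longest-match ones.
User-agent: OAI-SearchBot
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /client-portal/
Disallow: /internal/
Allow: /

# Keep the same private areas out of every other crawler
User-agent: *
Disallow: /client-portal/
Disallow: /internal/

Sitemap: https://yourdomain.co.za/sitemap.xml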

To block a specific crawler instead, give that user-agent a group of its own containing Disallow: /. Two cautions. First, robots.txt is a polite instruction, not a hard lock: reputable bots honour it, but it is not a security control. Sensitive data belongs behind authentication, not behind a Disallow. Second, a single overly broad Disallow: / under the wrong user-agent can remove you from AI answers without you noticing, so review the file deliberately rather than copying a blanket “block all AI” snippet. If your robots.txt and crawlability need a proper review, our SEO and technical optimisation work covers exactly this.
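
Before you deploy a robots.txt, it is worth testing how each bot will actually read it. One detail that catches people out: under the Robots Exclusion Protocol (RFC 9309), a crawler that finds a group naming it follows only that group, so rules under User-agent: * do not reach it. The sketch below uses Python's standard-library urllib.robotparser to show the difference. The helper name, the two sample files and the portal URL are illustrative assumptions. Python's parser applies the first matching rule within a group, which is why the Disallow line is listed before Allow.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Private rule only under "*": GPTBot has its own group, so it never sees it.
SPLIT_GROUPS = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /client-portal/
"""

# Private rule repeated inside the named group as well.
REPEATED_RULES = """\
User-agent: GPTBot
Disallow: /client-portal/
Allow: /

User-agent: *
Disallow: /client-portal/
"""

# Hypothetical private URL used only for this check.
PORTAL = "https://yourdomain.co.za/client-portal/invoices"

if __name__ == "__main__":
    for label, text in [("split", SPLIT_GROUPS), ("repeated", REPEATED_RULES)]:
        for agent in ["GPTBot", "SomeOtherBot"]:
            print(label, agent, is_allowed(text, agent, PORTAL))
```

In the first file GPTBot may fetch the portal URL even though every unnamed bot is kept out; in the second, both are kept out. Real crawlers may differ from Python's parser in edge cases, so treat this as a sanity check rather than a guarantee.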

Should you block or allow AI crawlers?

For most South African businesses, allow them. If you want your brand cited in ChatGPT, Perplexity, Copilot and Gemini answers, you must let their retrieval crawlers reach your pages. Blocking makes sense only in narrow cases: protecting genuinely proprietary content, paywalled material, or original research you do not want absorbed into model training. The default for visibility-focused businesses is open access.

Think of it commercially. Being absent from AI answers in 2026 is like being absent from Google results in 2010. Your competitors who allow crawlers get named and cited; you do not exist in that conversation. Given that AI Overviews now appear in around a quarter of searches and over half of long-tail queries, and that more than half of all searches are zero-click, the AI answer increasingly is the result page.

A sensible middle path: allow all retrieval crawlers (so you can be cited), and decide separately on training crawlers based on how protective you are of your raw content. A consultancy publishing original frameworks might allow retrieval but block training; a local e-commerce store chasing reach should typically allow both. Pairing open crawling with solid content marketing is what actually earns the citations.
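
The middle path can be expressed as a small generator that writes the robots.txt for you. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the retrieval and training lists follow the table earlier in this guide, and ClaudeBot (listed there as doing both jobs) is deliberately left out of both lists, so it falls under the * group until you decide how to treat it.

```python
# Purpose labels follow this guide's table; review them as operators change their bots.
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(allow_training, private_paths, sitemap_url):
    """Build robots.txt text: retrieval bots always allowed, training bots optional."""
    private = [f"Disallow: {path}" for path in private_paths]

    # Bots that may crawl everything except the private paths.
    allowed = RETRIEVAL_BOTS + (TRAINING_BOTS if allow_training else [])
    lines = [f"User-agent: {bot}" for bot in allowed]
    lines += private + ["Allow: /", ""]

    # Training bots get a full block when training access is declined.
    if not allow_training:
        lines += [f"User-agent: {bot}" for bot in TRAINING_BOTS]
        lines += ["Disallow: /", ""]

    # Every other crawler: only the private paths are off-limits.
    lines += ["User-agent: *"] + private + [""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        allow_training=False,
        private_paths=["/client-portal/", "/internal/"],
        sitemap_url="https://yourdomain.co.za/sitemap.xml",
    ))
```

A consultancy protecting original frameworks would call it with allow_training=False; a store chasing reach would pass True and get a single open group.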

“The most common mistake we see is a business that copied a 'block all AI bots' snippet off a forum, then wondered why ChatGPT never mentions them. They blocked the retrieval crawler by accident. For almost every South African business chasing visibility, the right default is to allow the crawlers and focus your energy on being worth citing.”

Wynand van der Westhuizen, Creative Director, Juicy Designs, reviewed and verified February 2026

How does llms.txt relate to AI crawlers?

llms.txt is a proposed file that offers AI systems a clean, structured guide to your most important content, complementing robots.txt rather than replacing it. Where robots.txt controls access (who may crawl what), llms.txt is about guidance (here is what matters and how it is organised). It is an emerging, not universally adopted, standard, so treat it as a bonus, not a substitute.

The two work together. robots.txt opens the door; llms.txt lays out a tidy map once the crawler is inside. Adoption across AI platforms is still inconsistent, so do not rely on llms.txt alone for visibility. The fundamentals (crawlable pages, clear structure and genuinely useful content) still do the heavy lifting. If you implement llms.txt, treat it as one signal among many.
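
To make the idea concrete, here is a minimal sketch of generating an llms.txt file. It assumes the layout described in the llms.txt proposal (a title line, a short quoted summary, then sections of links with notes). The function name, site name, page URLs and descriptions are hypothetical placeholders, and because the standard is still a proposal the expected layout may change.

```python
def build_llms_txt(site_name, summary, sections):
    """Render llms.txt text from a {section title: [(name, url, note), ...]} mapping."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for title, links in sections.items():
        lines.append(f"## {title}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical content for illustration only.
EXAMPLE = build_llms_txt(
    site_name="Example Business",
    summary="A short plain-language description of what the business does.",
    sections={
        "Services": [
            ("SEO", "https://yourdomain.co.za/services/seo/", "Technical and on-page SEO"),
        ],
        "Guides": [
            ("AI crawlers", "https://yourdomain.co.za/guides/ai-crawlers/", "Allow or block AI bots"),
        ],
    },
)

if __name__ == "__main__":
    print(EXAMPLE)
```

The output would be saved as llms.txt at the root of the domain, next to robots.txt.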

Frequently asked questions

Will blocking GPTBot remove me from ChatGPT entirely?

Not necessarily, because GPTBot is primarily a training crawler. The bot that controls whether ChatGPT Search can cite your live pages is OAI-SearchBot. If you block GPTBot but allow OAI-SearchBot, you can still appear in ChatGPT's cited answers. To disappear from those answers, you would need to block the retrieval crawler too.

Last updated: 2026-02-24

Does blocking AI crawlers affect my normal Google rankings?

No. AI crawlers like GPTBot, ClaudeBot and Google-Extended are separate from Googlebot, which handles classic search indexing. Blocking Google-Extended stops Gemini and AI training use but does not affect your standard Google Search rankings. Just confirm you have not accidentally disallowed Googlebot itself, which would harm normal rankings.


Is robots.txt enough to protect confidential content from AI?

No. robots.txt is a voluntary instruction that well-behaved crawlers respect, not a security control. Malicious or non-compliant bots can ignore it. Truly confidential content (client data, internal documents and paid material) must sit behind authentication or a paywall. Use robots.txt to manage reputable crawlers, never as your only line of defence.


Wynand van der Westhuizen

Creative Director & Co-founder, Juicy Designs, Pretoria

Wynand co-founded Juicy Designs in 2015 and leads creative direction and client strategy. A Meta Business Partner, he owns client relationships across automotive, entertainment, retail and professional services, and works on AI-search visibility and content strategy for Juicy Designs clients.

  • Co-founder & Creative Director, Juicy Designs, established 2015
  • Meta Business Partner
  • 64+ South African clients, 4.9-star Google rating
  • Specialist in brand, creative & AI-search visibility
  • Reviewed and updated February 2026