How to Handle AI Scraping Your Content | The Darkroom

The short answer

Do not treat AI scraping as one yes-or-no switch. It is two different jobs done by two kinds of bots: training crawlers that ingest your text into a model and rarely send anything back, and answer crawlers that fetch your page live to cite it in an AI answer. The move is to allow the bots that cite you, block or limit the pure-training bots you do not want, and never block your own visibility by accident. Blocking everything is the one mistake that quietly erases you from AI answers.

Why "should I block AI scraping?" is the wrong question

The question that gets asked is "should I let AI scrape my content?" The honest answer is that there is no single AI scraper to allow or block. There are two jobs, done by two kinds of bots, with opposite outcomes for you. Lumping them together is how brands end up making a single angry decision that costs them the exact visibility they wanted.

The first job is training. A training crawler reads your pages to build or refresh a model. It takes your words, folds them into weights, and almost never links back. The second job is answering. An answer crawler fetches your page in real time when a user asks a question, then synthesizes a reply that can name your brand and link to you as a source. One job takes. The other cites. Treat them the same and you either give your content away for free or delete yourself from the answers that drive traffic.

So the real question is not "block or allow." It is "which bots, doing which job, do I want on my site?" Once you frame it that way, the policy almost writes itself.

The two kinds of AI crawlers, and how to tell them apart

Most AI companies now run separate crawlers for training and answering, each with its own user-agent name. That separation is what makes a nuanced policy possible. A few of the common ones:

OpenAI: GPTBot gathers content that may be used for training. OAI-SearchBot fetches pages to power ChatGPT search results and citations. ChatGPT-User fetches a page because a user clicked or asked for it live.
Perplexity: PerplexityBot indexes pages so they can be surfaced and cited in answers.
Google: Googlebot crawls for Search and is governed separately from AI training. Google-Extended is a robots.txt control token, not a separate crawler, that governs whether your content trains Gemini, without touching your Search ranking.
Common Crawl: CCBot builds a public dataset that many models train on second-hand.

The pattern is simple once you see it: bots with "Search" or "User" in the name, plus PerplexityBot, are how you get into answers. The pure-training bots (CCBot, and GPTBot if you want to opt out of training) are the ones you can restrict without losing citations. If your pages are not even being reached, that is a different problem; we cover it in why AI crawlers cannot see your website.
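One way to make that sorting concrete is a small lookup keyed on the user-agent token. The sketch below is illustrative only: the `CRAWLER_JOBS` table and the `classify` helper are names invented here, the sample header strings are simplified stand-ins rather than the exact strings these bots send, and the job labels simply follow the descriptions above. Google-Extended is left out because it is a robots.txt token and does not appear in request headers.

```python
# Hypothetical lookup table: user-agent token -> the job that bot does.
# The labels mirror the descriptions in this article, nothing more.
CRAWLER_JOBS = {
    "GPTBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "answering",
    "ChatGPT-User": "answering",
    "PerplexityBot": "answering",
    "Googlebot": "search",
}


def classify(user_agent_header: str) -> str:
    """Return the job for the first known token found in a User-Agent header."""
    ua = user_agent_header.lower()
    for token, job in CRAWLER_JOBS.items():
        if token.lower() in ua:
            return job
    return "unknown"


# Simplified sample headers (assumed for illustration, not copied from real traffic).
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
]

for header in samples:
    print(f"{classify(header):<10} {header}")
```

Anything that comes back as "unknown" is a candidate for a closer look before you decide where it belongs.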

The real tradeoff: free training vs. earned citations

Here is the tension, stated plainly. Blocking training crawlers protects your content from being used, uncompensated, to build a model that may later compete with you. That is a legitimate position, especially for original research, proprietary data, or paywalled work. But every bot you block is also a bot that cannot cite you. And in 2026, getting cited in AI answers is becoming the front door to your brand.

For most of the brands we run, the math favors visibility. An AI answer that names you and links to you is free distribution to a high-intent buyer who is already asking the question your product solves. Cutting that off to prevent training is usually trading a large, measurable upside for a small, abstract one. The exception is content whose value is the content itself: a subscription archive, a licensed dataset, original journalism. There, gating training makes sense.

A useful test: if a page exists to attract and convince buyers, keep it open to citing bots. If a page is the product you sell, gate it. Run that test per content type, not per site.
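The per-content-type test can be sketched as path rules and then checked with Python's standard-library robots.txt parser. Everything specific here is an assumption for illustration: the `/blog/` and `/archive/` paths, the `example.com` domain, and the `can_read` helper. The check only models what a compliant bot would do with the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: /blog/ markets the product, /archive/ IS the product.
# The specific Disallow is listed before the broad Allow so that both
# first-match parsers (like Python's) and longest-match parsers agree.
PER_TYPE_ROBOTS = """\
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /archive/
Allow: /

User-agent: GPTBot
User-agent: CCBot
Disallow: /
"""


def can_read(robots_txt: str, bot: str, url: str) -> bool:
    """Ask the stdlib parser whether `bot` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


for path in ("/blog/pricing-guide", "/archive/2025-report"):
    url = "https://example.com" + path
    verdict = "open" if can_read(PER_TYPE_ROBOTS, "PerplexityBot", url) else "gated"
    print(f"PerplexityBot {path}: {verdict}")
```

The same citing bot gets a different answer depending on the section, which is the point of running the test per content type.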

How to allow the citers and block the takers

The mechanism is your robots.txt file, where you grant or deny access by user-agent. A nuanced policy allows the answer crawlers, opts out of pure-training crawlers, and never blocks Googlebot. In plain terms, your file should explicitly allow OAI-SearchBot, ChatGPT-User, and PerplexityBot (the citers), and disallow CCBot, plus GPTBot if you want to sit out training while staying citable.
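As a rough sketch, the policy described above could look like the robots.txt text embedded below, with a quick check using Python's standard-library parser. The file contents are one possible reading of that policy, and the `check_policy` helper and the `example.com` URL are made up for illustration. Drop GPTBot from the second group if you are comfortable with training.

```python
from urllib.robotparser import RobotFileParser

# One possible robots.txt for "allow the citers, block the takers".
ROBOTS_TXT = """\
# Citers: allowed everywhere
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /

# Trainers: opted out
User-agent: CCBot
User-agent: GPTBot
Disallow: /

# Everyone else, including Googlebot
User-agent: *
Allow: /
"""

BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "CCBot", "GPTBot", "Googlebot"]


def check_policy(robots_txt: str, bots: list[str], url: str = "https://example.com/guide") -> dict[str, bool]:
    """Map each bot name to whether the parsed policy lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


for bot, allowed in check_policy(ROBOTS_TXT, BOTS).items():
    print(f"{bot:<15} {'allowed' if allowed else 'blocked'}")
```

Running a check like this before you publish the file is a cheap way to confirm the citers are still allowed.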

Two cautions that trip people up. First, robots.txt is a published request, not a lock. Well-behaved bots honor it; bad actors ignore it, so it is a policy signal, not a security control. Second, do not confuse opting out of training with opting out of answers. Blocking Google-Extended keeps your content out of Gemini training but leaves Search untouched, which is exactly the granularity you want. We break down where this file fits, and where it does not, in llms.txt vs. robots.txt explained.
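The Google-Extended point is easy to verify on paper. The minimal sketch below, using an assumed `example.com` URL, parses a two-line opt-out and asks the standard-library parser about both tokens. Keep the first caution in mind: this models a bot that honors the file, and says nothing about one that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Opt out of Gemini training only; nothing here names Googlebot.
GEMINI_OPT_OUT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(GEMINI_OPT_OUT.splitlines())

url = "https://example.com/guide"  # placeholder URL for the check
search_ok = parser.can_fetch("Googlebot", url)
training_ok = parser.can_fetch("Google-Extended", url)

print("Googlebot allowed:", search_ok)
print("Google-Extended allowed:", training_ok)
```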

Where llms.txt fits, and where it does not

A common follow-up: does llms.txt handle scraping? No, and conflating the two causes real mistakes. robots.txt is the file that controls access: who is allowed to crawl what. The proposed llms.txt file is a different idea entirely; it is a curated map that guides a cooperating model toward your best content, not a gate that blocks anyone. It has no enforcement and does not stop a single crawler.

So the division of labor is clean. Use robots.txt to decide which AI bots may read you. Use llms.txt, if you adopt it, to help the bots you have already allowed find the right pages faster. They are complementary, not interchangeable. If you are weighing whether the second one is worth the effort at all, start with what is llms.txt and do you need it before you touch your access policy.
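For a sense of what that curated map looks like, here is a small sketch that assembles an llms.txt-style file following the general shape of the public proposal: a title, a one-line summary, then sections of annotated links. The `build_llms_txt` helper, the brand name, and the `example.com` URLs are all placeholders invented for this example. Notice that nothing in the output grants or denies access to anyone.

```python
def build_llms_txt(site_name: str, summary: str, sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Assemble a Markdown llms.txt: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content; swap in your own best pages.
llms_txt = build_llms_txt(
    "Example Brand",
    "Guides on AI visibility for small brands.",
    {
        "Guides": [
            ("Handling AI scraping", "https://example.com/ai-scraping", "Which bots to allow and which to limit"),
            ("llms.txt vs. robots.txt", "https://example.com/llms-vs-robots", "What each file does"),
        ],
    },
)

print(llms_txt)
```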

What we do across a portfolio of brands

Acromatico runs a single AI visibility engine across more than 10 brands, and our default scraping posture is the same on each: open to citers, selective on trainers, never closed to Google. We allow OAI-SearchBot, ChatGPT-User, and PerplexityBot everywhere, because citations are the whole point. We restrict CCBot and, on a per-brand basis, GPTBot, when a client has a genuine reason to opt out of training. We never block a bot that powers answers, because doing so is the fastest way to disappear from the surfaces buyers now search on.

The one move we treat as a mistake almost every time is the panic block: a single Disallow: / aimed at "all the AI bots" that takes out the citing bots along with the training ones. It feels protective. It is self-removal. The whole skill here is sorting bots by what they do, then writing a policy that matches your goals per content type rather than swinging a blanket switch.
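A panic block is easy to catch before it ships. The sketch below is a hypothetical audit helper (`blocked_citers`, with an assumed `example.com` URL) that parses a robots.txt with Python's standard library and lists any citing bots it would turn away. The two sample files are invented to show the contrast.

```python
from urllib.robotparser import RobotFileParser

# The citing bots named in this article.
CITERS = ("OAI-SearchBot", "ChatGPT-User", "PerplexityBot")


def blocked_citers(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the citing bots that this robots.txt would deny at `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in CITERS if not parser.can_fetch(bot, url)]


# The panic block: every AI bot in one group, one blanket Disallow.
PANIC_BLOCK = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /
"""

# The sorted version: only the trainers are restricted.
SORTED_POLICY = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /
"""

print("panic block hides you from:", blocked_citers(PANIC_BLOCK))
print("sorted policy hides you from:", blocked_citers(SORTED_POLICY))
```

An empty list is the result you want; anything else means the file is removing you from answers.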

This is also why scraping policy is not a set-and-forget task. New bots launch, names change, and a crawler that was training-only last quarter may add a citing variant this quarter. We review the bot list on a regular cadence as part of the same visibility engine that handles schema, freshness, and consistency, so the policy stays aligned with what actually drives citations.
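Part of that review can be a simple diff between the bots you are watching and the bots your robots.txt actually names. The sketch below is not our production tooling, just a minimal illustration; the `named_agents` and `unaddressed` helpers, the sample file, and the watch list are all assumptions. A bot that shows up as unaddressed is falling through to your default rules, which may or may not be what you intend.

```python
def named_agents(robots_txt: str) -> set[str]:
    """Collect every token that appears on a User-agent line, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def unaddressed(robots_txt: str, watch_list: list[str]) -> list[str]:
    """Return watched bots that the file never mentions by name."""
    named = named_agents(robots_txt)
    return [bot for bot in watch_list if bot.lower() not in named]


# Sample file written last quarter (invented for illustration).
CURRENT_ROBOTS = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

# Watch list you update as new bots launch or names change.
WATCH_LIST = ["GPTBot", "CCBot", "PerplexityBot", "OAI-SearchBot", "ChatGPT-User"]

print("no explicit rule for:", unaddressed(CURRENT_ROBOTS, WATCH_LIST))
```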

Questions people ask

Should I block AI crawlers from scraping my content?

Not as a blanket decision. AI scraping splits into two jobs done by different bots: training crawlers that ingest your text into a model and rarely send anything back, and answer crawlers that fetch your page in real time to cite it. If your goal is AI visibility, block or limit the training crawlers you do not want and keep the answer crawlers fully allowed, because the answer crawlers are how you get cited and earn referral traffic.

What is the difference between a training crawler and a citing crawler?

A training crawler, such as GPTBot or CCBot, gathers content to build or update a model and typically does not link back. A citing or answer crawler, such as OAI-SearchBot or PerplexityBot, fetches your page when a user asks a question and can surface your brand as a cited source with a clickable link. Blocking the first protects your content from uncompensated training; blocking the second removes you from AI answers entirely.

Will blocking AI crawlers hurt my search rankings or AI citations?

Blocking training crawlers like GPTBot does not affect Google Search rankings, because Googlebot is a separate crawler. Blocking Google-Extended does not affect Search either; it only governs whether your content trains Gemini. Blocking answer crawlers such as OAI-SearchBot or PerplexityBot is what can quietly erase you from AI-generated answers. The safe default is to allow the bots that cite and read your content in real time, and only restrict the pure-training bots.

Italo Campilii

Founder, Acromatico · runs AI visibility & brand systems across 10+ live brands

— Italo & Ale

written from the studio floor · developed in the darkroom

