What this checks

Five things that account for most robots.txt mistakes seen in the wild.

Crawler access. Whether User-agent: * or Googlebot is disallowed from /. This is the most common accidental self-de-indexing mistake, and it can go unnoticed for months.
Sitemap declaration. Whether the file includes a Sitemap: directive. Declaring it here is a reliable way to point any crawler that reads robots.txt to your sitemap, regardless of whether you have submitted it to Search Console.
Crawl-delay. High values (10 s or more) slow re-indexing for Bing and smaller crawlers that honour the directive. Google ignores it entirely, so a high crawl-delay slows the crawlers that send you less traffic without reducing Googlebot's load on your server.
Syntax. Unrecognised directives and lines without a colon. Most parsers tolerate these silently, but some crawlers may interpret a malformed file differently than intended.
AI crawler access. Which of the major AI and LLM crawlers — GPTBot, ClaudeBot, Google-Extended, and others — your file allows or blocks.
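
A rough sketch of the first three checks, using Python's standard-library urllib.robotparser. The sample file, the example.com URL, and the summarise helper are illustrative assumptions, not a description of how this tool is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with the classic staging leftovers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""


def summarise(robots_txt: str) -> dict:
    """Report root access, declared sitemaps, and crawl-delay for one file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # True when Googlebot may not fetch the site root.
        "googlebot_blocked_from_root": not parser.can_fetch("Googlebot", "/"),
        # site_maps() needs Python 3.8+ and returns None when none are declared.
        "sitemaps": parser.site_maps() or [],
        # Seconds, or None when the matching group sets no Crawl-delay.
        "crawl_delay_bingbot": parser.crawl_delay("bingbot"),
    }


report = summarise(ROBOTS_TXT)
print(report)
```

Note that urllib.robotparser applies the first matching rule in a group, whereas Google uses the most specific match, so results can differ on files that mix overlapping Allow and Disallow lines.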

AI crawlers and robots.txt

robots.txt is now the main place a site opts in or out of AI crawling. Different crawlers do different jobs, and the distinction matters when you decide what to allow.

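Training crawlers collect pages to train models. GPTBot and CCBot (Common Crawl, whose corpus is widely used for training) are crawlers in this group; Google-Extended and Applebot-Extended are control tokens read by Google's and Apple's existing crawlers rather than separate bots. Disallowing them asks those operators to keep your content out of future model training, without affecting search.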
Answer crawlers fetch pages so an AI product can cite them in a live answer. OAI-SearchBot and PerplexityBot are examples. Blocking these can cost you referral traffic and visibility in AI answers — usually the opposite of what an SEO wants.

One important caveat: Google-Extended controls whether your content is used for Gemini model training and grounding. It is not a separate crawler and has no effect on how Googlebot crawls or ranks your site for ordinary Search, so blocking it does not hurt your rankings.

robots.txt directives are advisory. Reputable operators document and honour their crawler tokens, but a directive is a request, not an enforced block.
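
The training/answer split above could be expressed as two groups and sanity-checked like this. This is a sketch under assumptions: the policy is one possible choice, not a recommendation, and the token names are the ones listed above, so confirm them against each operator's current documentation.

```python
from urllib.robotparser import RobotFileParser

# One possible policy: opt out of training, stay visible to answer crawlers.
AI_POLICY = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(AI_POLICY.splitlines())

tokens = [
    "GPTBot", "CCBot", "Google-Extended", "Applebot-Extended",
    "OAI-SearchBot", "PerplexityBot", "Googlebot",
]
# Maps each token to whether this file lets it fetch the site root.
access = {token: parser.can_fetch(token, "/") for token in tokens}

for token, allowed in access.items():
    print(f"{token:20} {'allowed' if allowed else 'blocked'}")
```

The check only confirms what the file asks for. Whether a given crawler complies is up to its operator.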

What robots.txt actually controls

Robots.txt controls crawling, not indexing. A page disallowed by robots.txt can still appear in search results if Google finds it via a link; it just will not be crawled, so Google cannot read its content. To prevent indexing you need a noindex meta tag or X-Robots-Tag header on the page itself, and the page must stay crawlable: if robots.txt blocks the URL, Google never fetches it and never sees the noindex.
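
A minimal sketch of how the two noindex signals might be detected in a fetched response. The has_noindex helper is hypothetical and deliberately simplified: it handles only the generic robots meta tag and an unprefixed X-Robots-Tag value, not crawler-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers: dict, html: str) -> bool:
    """True if the header or the meta tag carries a noindex directive."""
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_directives or "noindex" in meta.directives


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))
print(has_noindex({}, '<meta name="robots" content="index, follow">'))
```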

The spec is simpler than most developers expect. Only four directives are widely respected: User-agent, Allow, Disallow, and Sitemap. Everything else is either non-standard or crawler-specific.
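
Because the directive set is this small, a basic syntax check fits in a few lines. The lint function below is an illustrative sketch, not this tool's implementation: it flags lines without a colon and any directive outside the four named above.

```python
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap"}


def lint(robots_txt: str) -> list:
    """Return (line_number, message) pairs for suspicious lines."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        directive = line.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append((number, f"non-standard directive: {directive}"))
    return problems


SAMPLE = """\
User-agent: *
Disallow /private
Noindex: /tmp
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

problems = lint(SAMPLE)
for number, message in problems:
    print(f"line {number}: {message}")
```

Flagging Crawl-delay as non-standard is a design choice here. Some crawlers do act on it, so a real checker would probably report it as a warning rather than an error.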

Common mistakes

Blocking all crawlers after a staging migration

Development and staging environments often have Disallow: / for all bots. When a site is promoted to production and the domain changes, this robots.txt sometimes follows it. Google stops crawling once it re-fetches the file, and organic traffic can fall toward zero over the next few weeks as pages go stale and drop out of the results.
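
One way to keep a staging file from following the site to production is to generate robots.txt from the deployment environment instead of committing a static file. The sketch below assumes a hypothetical environment name of "production" and an example.com sitemap URL; adapt both to your own setup.

```python
def robots_txt_for(environment: str, sitemap_url: str) -> str:
    """Build robots.txt content for the given deployment environment."""
    if environment == "production":
        return (
            "User-agent: *\n"
            "Allow: /\n"
            "\n"
            f"Sitemap: {sitemap_url}\n"
        )
    # Anything that is not explicitly production keeps crawlers out.
    return "User-agent: *\nDisallow: /\n"


print(robots_txt_for("staging", "https://example.com/sitemap.xml"))
print(robots_txt_for("production", "https://example.com/sitemap.xml"))
```

Defaulting to the blocking file protects staging sites, but it also means a mistyped environment name on the live site blocks everything, so a post-deploy check of the live file is still worth having.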

No sitemap in robots.txt

Submitting a sitemap only in Search Console leaves Bing, Yandex, and every other crawler to discover pages on their own. The Sitemap: directive is widely supported, sits outside any User-agent group, and takes ten seconds to add.
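
As a sketch of the fix, the hypothetical ensure_sitemap helper below appends a Sitemap: line only when the file does not already declare one. The URL is a placeholder.

```python
def ensure_sitemap(robots_txt: str, sitemap_url: str) -> str:
    """Return robots_txt with a Sitemap: line added if none is present."""
    for raw in robots_txt.splitlines():
        name, separator, _ = raw.split("#", 1)[0].partition(":")
        if separator and name.strip().lower() == "sitemap":
            return robots_txt  # already declared, leave the file alone
    body = robots_txt.rstrip("\n")
    prefix = body + "\n\n" if body else ""
    return f"{prefix}Sitemap: {sitemap_url}\n"


original = "User-agent: *\nDisallow: /admin/\n"
updated = ensure_sitemap(original, "https://example.com/sitemap.xml")
print(updated)
```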

Disallowing CSS and JS

Older SEO advice suggested blocking stylesheets and scripts to conserve crawl budget. Google now requires access to these resources to render pages correctly. Blocking them may cause your pages to be evaluated based on unrendered HTML, which often looks thin.
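
A quick way to spot this is to test a page's stylesheet and script URLs against the file. The sketch below uses Python's urllib.robotparser with made-up asset URLs. One limitation to keep in mind: that parser does plain prefix matching and does not implement the * and $ wildcards, so patterns such as Disallow: /*.css$ will not be evaluated the way Google evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file that blocks a whole assets directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/
Disallow: /admin/
"""


def blocked_assets(robots_txt: str, asset_urls: list, user_agent: str = "Googlebot") -> list:
    """Return the asset URLs the given user agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(user_agent, url)]


assets = [
    "https://example.com/assets/app.css",
    "https://example.com/assets/app.js",
    "https://example.com/img/logo.png",
]
blocked = blocked_assets(ROBOTS_TXT, assets)
print(blocked)
```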

Related tools

SERP simulator — preview how your title and meta description appear in Google search results.
Keyword density — paste your page copy and see which words and phrases dominate.

Full SEO audit

Robots.txt is one signal. The full CrawlRanker scan checks meta tags, heading structure, mobile readiness, page speed, schema markup, broken links, image alt text, and an AI-content score — all in about 30 seconds, no signup.
