robots.txt: What to Block, What Never to Block

A robots.txt file is one of the most misunderstood files on the web. Get it wrong and you can accidentally hide your entire site from Google — it has happened to large companies with engineering teams. Get it right and you guide crawlers efficiently, protect genuinely private pages, and preserve crawl budget for the URLs that actually matter. This guide covers exactly what to put in, what to keep out, and the one rule that trips up experienced developers.

Syntax Fundamentals

A robots.txt file lives at the root of your domain — https://example.com/robots.txt — and is plain text. The structure is a series of records, each consisting of one or more User-agent lines followed by directives.

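User-agent: * applies to all crawlers.
User-agent: Googlebot applies only to Google's crawler.
Disallow: /path/ blocks that path for the named agent.
Allow: /path/ explicitly permits a path, even inside a disallowed parent. When several rules match a URL, Google applies the most specific (longest) one.
Crawl-delay: 5 asks crawlers to wait 5 seconds between requests. Googlebot ignores this directive. The crawl-rate setting that Search Console used to offer was retired in early 2024; Google's current guidance for an overloaded server is to return 503 or 429 responses temporarily.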

Path values are case-sensitive: /Admin and /admin are different paths. Directive names such as Disallow are not. A blank Disallow: line means allow everything, which is the default behavior anyway. Comments start with #.

Minimal valid file that allows all crawlers everywhere:

User-agent: *
Allow: /
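
To see how these directives are evaluated, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules and the example.com paths are made up for illustration. Note that this parser does plain prefix matching and does not implement the wildcard extensions described later, so treat it as a rough check rather than a replica of Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not taken from a real site.
ROBOTS_TXT = """\
# Comments start with a hash
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` on the example host."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(is_allowed("/blog/first-post"))  # not covered by any rule
print(is_allowed("/admin/users"))      # under the disallowed /admin/ prefix
print(is_allowed("/Admin/users"))      # different case, so a different path
```

The third call is the case-sensitivity point in practice: a rule for /admin/ does nothing for /Admin/.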

What to Block

Block anything that creates duplicate content, exposes private data, or wastes crawl budget on pages with zero search value.

Admin and login pages (Disallow: /admin/, Disallow: /wp-admin/). These have no ranking value. Disallow only stops crawling, so put anything sensitive behind authentication as well.
Internal search results (Disallow: /search?). Search-result pages are near-duplicates, and Google's own guidelines recommend blocking them.
Faceted navigation / filter URLs (e.g., Disallow: /shop/?sort= or Disallow: /products/?color=). Each parameter combination can create thousands of near-identical pages. These prefix rules match only when the parameter comes first in the query string; a wildcard form such as Disallow: /shop/*?*sort= also catches it in later positions.
Cart and checkout paths (Disallow: /cart/, Disallow: /checkout/).
Staging or dev environments. Block the entire site with Disallow: / on staging subdomains, or add HTTP auth so that robots.txt is irrelevant.
Duplicate pagination beyond a threshold (Disallow: /*?page=), if you have thousands of archive pages with thin content.
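
If the block list is maintained in code, a small helper can render it into a single group. This is only a sketch: render_robots_txt and the BLOCKED list are hypothetical names, and the patterns are the examples from the list above, not a recommendation for every site.

```python
def render_robots_txt(disallow_paths, agent="*"):
    """Build a one-group robots.txt from a list of path patterns."""
    lines = [f"User-agent: {agent}"]
    lines.extend(f"Disallow: {path}" for path in disallow_paths)
    return "\n".join(lines) + "\n"


# Example patterns from this section; adjust to the paths your site really uses.
BLOCKED = ["/admin/", "/wp-admin/", "/search?", "/cart/", "/checkout/", "/*?page="]

print(render_robots_txt(BLOCKED))
```

Generating the file from one reviewed list makes it harder for a stray Disallow: / to slip into production.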

What You Must Never Block

These mistakes show up regularly in SEO audits and can tank rankings silently:

CSS, JS, and image files — Blocking /wp-content/ or /assets/ prevents Googlebot from rendering your pages correctly. Google needs to see the same resources a browser sees in order to understand your content.
Your sitemap — Some generators add a Disallow: /sitemap.xml by accident. Never do this.
Canonical pages — If /product-name/ is the canonical URL you want ranked, it must not be disallowed anywhere in the file.
Hreflang pages — If you rely on hreflang tags for international SEO, all language variants must be crawlable; otherwise Google cannot verify and process the hreflang annotations.
Structured data pages — Pages with schema markup (Product, Recipe, Article) need to be crawled for rich results to appear.
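
One way to catch these mistakes before deployment is to test a list of must-crawl URLs against the file. The sketch below assumes a hypothetical robots.txt and URL list, and uses Python's urllib.robotparser, which handles plain path prefixes only (no wildcards), so it suits simple rules like the ones shown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file that blocks theme assets by accident.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
"""

# URLs that must stay crawlable: assets, the sitemap, a canonical page.
MUST_CRAWL = [
    "https://example.com/wp-content/themes/site/style.css",
    "https://example.com/sitemap.xml",
    "https://example.com/product-name/",
]


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the subset of `urls` that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


for url in blocked_urls(ROBOTS_TXT, MUST_CRAWL):
    print("BLOCKED:", url)
```

Running a check like this in CI turns a silent ranking problem into a failed build.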

The robots.txt vs. noindex Mistake

This is the most consequential confusion in technical SEO. Robots.txt and noindex do completely different things, and using the wrong one backfires.

Scenario	Use robots.txt Disallow	Use noindex meta tag
You want the page out of the index and do not care if it is crawled	No	Yes
You want to save crawl budget on low-value pages	Yes	No — Google still crawls it
You want a page de-indexed but still crawlable, so links to and from it keep being followed	No	Yes
You want to remove a page from the index quickly	No: Google cannot read a noindex tag on a blocked page	Yes, then use Search Console URL removal
You want to block a private admin tool entirely	Yes, plus HTTP auth	No — it may still appear via links

The critical trap: if you block a page in robots.txt and add a noindex tag, Google cannot see the noindex because it respects the robots.txt block. The page may stay in the index indefinitely. Choose one mechanism per page based on the goal.
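
The trap can be detected mechanically: a page that is both disallowed and tagged noindex is misconfigured. The following is a rough sketch with made-up inputs. The diagnose function and its messages are illustrative, and it only looks at the robots meta tag, not the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            if "noindex" in (attr.get("content") or "").lower():
                self.noindex = True


def diagnose(robots_txt, url, html, agent="Googlebot"):
    """Classify a page by its crawl and index signals."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = not parser.can_fetch(agent, url)
    finder = RobotsMetaFinder()
    finder.feed(html)
    if blocked and finder.noindex:
        return "conflict: noindex present, but crawlers cannot fetch the page to see it"
    if blocked:
        return "blocked from crawling; may still be indexed via links"
    if finder.noindex:
        return "crawlable with noindex; can be dropped from the index"
    return "crawlable and indexable"


# Hypothetical inputs for illustration.
ROBOTS_TXT = "User-agent: *\nDisallow: /old-page/\n"
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(diagnose(ROBOTS_TXT, "https://example.com/old-page/", PAGE))
```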

The Sitemap Directive

robots.txt is the standard place to declare your sitemap location. Add this line anywhere in the file (it is not part of any User-agent block):

Sitemap: https://example.com/sitemap.xml

You can list multiple sitemaps:

Sitemap: https://example.com/sitemap-blog.xml
Sitemap: https://example.com/sitemap-products.xml

Google, Bing, and most major crawlers read this directive. It is the fastest way to ensure your sitemap is discovered without manually submitting it in every search console. If you use a sitemap index file, point to the index — you do not need to list every individual sitemap.
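
To confirm that the directive is being read as intended, Python's urllib.robotparser exposes the declared sitemaps through site_maps() (available from Python 3.8). A minimal sketch, with a hypothetical file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file; Sitemap lines sit outside any User-agent group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/

Sitemap: https://example.com/sitemap-blog.xml
Sitemap: https://example.com/sitemap-products.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None if no Sitemap line was found.
for sitemap_url in parser.site_maps() or []:
    print(sitemap_url)
```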

Crawl Budget: Why It Matters and How robots.txt Helps

Crawl budget is the number of URLs Googlebot will crawl on your site within a given time window. For most small sites under ~1,000 pages, crawl budget is not a concern. It becomes critical when:

Your site has more than ~10,000 URLs.
You have large e-commerce catalogs with filter and sort parameters generating millions of URL variations.
Your server is slow and Googlebot is backing off due to load.
Important new pages are taking weeks to get indexed.

robots.txt is the right tool to reclaim wasted crawl budget because blocked URLs are not fetched at all. By contrast, a noindex page still gets crawled — it just is not indexed. Blocking parameter-heavy faceted navigation with robots.txt can reduce crawl waste by 60–80% on large retail sites, freeing Googlebot to discover and index product pages faster.

A practical audit workflow: export your server logs, identify which URLs Googlebot is hitting most, check whether those URLs have search value, then disallow the patterns that do not. Do not guess — use data.
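
The first two steps of that workflow can be sketched in a few lines. This example assumes access logs in the common "combined" format and groups Googlebot requests by top-level section and first query parameter. The sample lines are fabricated for illustration, and matching on the user-agent string alone is an approximation because that string can be spoofed.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user-agent fields
# of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits_by_section(lines):
    """Count Googlebot requests per '/section' or '/section?param=' bucket."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        base, _, query = match.group("path").partition("?")
        section = "/" + base.strip("/").split("/")[0]
        if query:
            section += "?" + query.split("=")[0] + "="
        counts[section] += 1
    return counts


# Fabricated sample data.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def sample(path, agent):
    return f'203.0.113.5 - - [10/Mar/2024:10:00:00 +0000] "GET {path} HTTP/1.1" 200 512 "-" "{agent}"'


SAMPLE_LOG = [
    sample("/shop/?sort=price", GOOGLEBOT),
    sample("/shop/?sort=name&color=red", GOOGLEBOT),
    sample("/products/blue-widget", GOOGLEBOT),
    sample("/shop/?sort=price", "Mozilla/5.0 (X11; Linux x86_64)"),
]

for bucket, hits in googlebot_hits_by_section(SAMPLE_LOG).most_common():
    print(hits, bucket)
```

Buckets with many hits and no search value are the candidates for a Disallow rule.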

Common Patterns Reference

Pattern	What it blocks
`Disallow: /`	Entire site (use on staging only)
`Disallow: /admin/`	Exact /admin/ directory and everything under it
`Disallow: /*?`	All URLs with any query string
`Disallow: /*.pdf$`	All PDF files ($ anchors to end of URL)
`Disallow: /search`	/search and any URL starting with /search
`Allow: /admin/public/`	Reopens a subdirectory inside a blocked parent
`Disallow: /tag/`	WordPress tag archive pages
`Disallow: /*?replytocom=`	WordPress comment reply URLs (duplicate content)

Wildcards: the * character matches any sequence of characters. The $ character anchors a match to the end of the URL. These are supported by Google but not all crawlers — check documentation for any third-party bot you are targeting specifically.
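
The matching rules can be made concrete with a short sketch. It translates a pattern into a regular expression and, when several rules match, picks the longest pattern, with Allow winning a tie, which is the precedence Google documents. The function names and the rule list are illustrative, and the sketch ignores details such as percent-encoding normalization.

```python
import re


def pattern_to_regex(pattern):
    """Convert a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any sequence of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Apply longest-match precedence; ties go to Allow."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# Hypothetical rule set built from the patterns in the table above.
RULES = [
    ("disallow", "/admin/"),
    ("allow", "/admin/public/"),
    ("disallow", "/*.pdf$"),
    ("disallow", "/search"),
]

for path in ["/admin/settings", "/admin/public/help", "/files/report.pdf",
             "/files/report.pdf?v=2", "/search-tips", "/blog/"]:
    print(path, is_allowed(RULES, path))
```

Note that /files/report.pdf?v=2 is not blocked, because $ requires the URL to end in .pdf, and that /search-tips is blocked, because Disallow: /search is a prefix match.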

Frequently Asked Questions

Does robots.txt guarantee a page will not appear in Google search results?

No. A disallowed page can still appear in search results if other sites link to it, because Google learns the URL from those links without crawling it. To keep a page out of the index, serve a noindex tag on a crawlable page or put the page behind authentication. A URL removal request in Search Console hides a page quickly, but only temporarily.

Should I block Googlebot from crawling my images?

Only if you have a specific reason, such as preventing image search traffic you do not want. Blocking images with Disallow: /*.jpg$ stops Google Image Search from indexing them and can also prevent Googlebot from fully rendering pages that use those images as content.

What happens if my robots.txt file is unavailable or returns a 500 error?

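It depends on the status code. Google treats server errors (5xx) and 429 responses as a temporary full disallow and pauses crawling of your site until the file becomes accessible again, which can delay indexing of new content. A missing file (404 and most other 4xx responses) is treated as if no robots.txt exists, so nothing is blocked.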
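
As a rough summary of how Google documents its handling of the robots.txt response, here is a sketch that maps a status code to the resulting crawl policy. The function name and the returned labels are made up for illustration; other crawlers may behave differently.

```python
def robots_fetch_policy(status):
    """Map the HTTP status of a robots.txt request to a crawl policy label."""
    if 200 <= status < 300:
        return "parse_rules"        # use the file as served
    if 300 <= status < 400:
        return "follow_redirect"    # a limited number of hops is followed
    if status == 429 or 500 <= status < 600:
        return "disallow_all"       # treated as a temporary full block
    if 400 <= status < 500:
        return "allow_all"          # treated as if no robots.txt exists
    raise ValueError(f"unexpected status code: {status}")


for code in (200, 404, 429, 500, 503):
    print(code, robots_fetch_policy(code))
```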

Can I use robots.txt to block specific bots like AI training scrapers?

Yes, with limited effect. You can add User-agent blocks for known bots such as GPTBot, ClaudeBot, or CCBot and set Disallow: /. Compliant crawlers will respect this, but bad-faith scrapers generally ignore robots.txt entirely, so it is not a security measure.
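
A group can name several user agents at once. The sketch below shows one possible layout and checks it with Python's urllib.robotparser; the file contents are an example, not a complete list of AI crawlers, and the check only reflects what a compliant bot would do.

```python
from urllib.robotparser import RobotFileParser

# Example layout: one group for the named bots, one for everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/article"
for agent in ("GPTBot", "ClaudeBot", "CCBot", "Googlebot"):
    print(agent, parser.can_fetch(agent, URL))
```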

How large can a robots.txt file be?

Google's stated limit is 500 kibibytes (512,000 bytes). Content beyond that limit is ignored. In practice, a well-structured robots.txt should be a few hundred lines at most — if yours is approaching the size limit, you likely have a generation or templating problem to fix.
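
A generated file can be checked against that limit before it ships. A minimal sketch, assuming the file is encoded as UTF-8; bytes_over_limit is a hypothetical helper name.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB = 512,000 bytes


def bytes_over_limit(robots_txt):
    """Return how many bytes of the file fall beyond the 500 KiB limit."""
    size = len(robots_txt.encode("utf-8"))
    return max(0, size - MAX_ROBOTS_BYTES)


print(bytes_over_limit("User-agent: *\nDisallow: /admin/\n"))
```

Any non-zero result means rules at the end of the file would be ignored.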