Robots.txt Best Practices: Control Crawling Without Blocking SEO
Robots.txt controls crawling, not indexing. Learn how to use it without accidentally hiding important pages from search engines.
What robots.txt actually does
Robots.txt is a crawl-control file. It tells well-behaved crawlers which paths they are allowed to request, but it does not remove URLs from the search index by itself.
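That distinction matters. A URL blocked in robots.txt can still be discovered through links and appear in search with limited information, because the crawler knows the URL exists but cannot read the page. Use noindex or redirects when the goal is index removal, and leave those URLs crawlable so the crawler can actually see the directive.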
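The crawl-versus-index distinction can be illustrated with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the rules and the `example.com` URLs are made up for illustration, and Python's parser implements basic prefix matching rather than every search-engine-specific extension.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def crawl_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules let `agent` request `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# The result only says whether a crawler may request the URL.
# It says nothing about whether the URL is, or stays, in the index.
print(crawl_allowed(ROBOTS_TXT, "https://example.com/private/report"))  # False
print(crawl_allowed(ROBOTS_TXT, "https://example.com/pricing"))         # True
```

A `False` here means a noindex tag on that page would never be read, which is why index removal needs a crawlable URL.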
Safe rules to follow
- Keep the file available at /robots.txt and return a 200 response.
- Declare your XML sitemap with a Sitemap: line so crawlers can discover canonical URLs faster.
- Block crawl traps, internal search pages, and endless parameter combinations.
- Do not block CSS, JavaScript, images, or other resources Google needs to render and evaluate the page.
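The rules above can be sketched as a small generator plus a self-check. The blocked paths, asset paths, and sitemap URL below are hypothetical placeholders, and `build_robots_txt` is an illustrative helper, not part of any tool. `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

def build_robots_txt(disallow_paths, sitemap_url):
    """Assemble a robots.txt body with one catch-all group and a Sitemap line."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallow_paths]
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"

# Hypothetical crawl traps: internal search, faceted filters, cart.
robots_txt = build_robots_txt(
    ["/search", "/filter/", "/cart/"],
    "https://example.com/sitemap.xml",
)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Assets needed for rendering should remain fetchable.
for asset in ("/static/app.css", "/static/app.js", "/images/hero.png"):
    assert parser.can_fetch("Googlebot", "https://example.com" + asset)

# Crawl traps should be blocked.
assert not parser.can_fetch("Googlebot", "https://example.com/search?q=shoes")

print(robots_txt)
print(parser.site_maps())
```

Generating the file and checking it in the same script keeps the asset check next to the rules, so a new Disallow line that accidentally covers a stylesheet fails before it ships.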
Common mistakes
- Leaving Disallow: / from staging after launch.
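- Blocking pages that carry canonical or noindex signals. A crawler that cannot fetch the page never sees them.
- Assuming robots.txt is a security feature for private URLs. The file is public and advisory, so it only advertises the paths it lists.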
- Testing only the homepage and missing blocked templates deeper in the site.
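The first mistake in the list, a leftover staging block, is easy to catch automatically. The following is a rough sketch of a pre-launch check; `blanket_disallow_agents` is a hypothetical helper that scans the file line by line and flags any group containing a bare `Disallow: /`, even if an Allow rule later narrows it.

```python
def blanket_disallow_agents(robots_txt: str) -> list[str]:
    """Return the user agents whose group contains a bare `Disallow: /`."""
    flagged = []
    agents = []        # user agents of the group currently being read
    in_rules = False   # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                flagged.extend(a for a in agents if a not in flagged)
    return flagged

staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /search\n"
print(blanket_disallow_agents(staging))     # ['*']
print(blanket_disallow_agents(production))  # []
```

Run against the production file as part of a release checklist, a non-empty result is a prompt to confirm the block is intentional.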
How to validate changes
After editing robots.txt, run a fresh DomainLens audit and inspect important URLs in Google Search Console. Confirm that the rendered page, canonical target, sitemap, and robots rules agree.
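One part of that agreement check, sitemap versus robots rules, can be scripted. This is a minimal sketch that assumes a standard `urlset` sitemap and uses inline sample data; `sitemap_conflicts` and the URLs are hypothetical, and in practice both inputs would be fetched from the live site.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_conflicts(robots_txt: str, sitemap_xml: str,
                      agent: str = "Googlebot") -> list[str]:
    """Return sitemap URLs that the robots rules stop `agent` from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return [url for url in urls if not parser.can_fetch(agent, url)]

# Sample inputs for illustration only.
robots_txt = """\
User-agent: *
Disallow: /account/
"""
sitemap_xml = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/blue-widget</loc></url>
  <url><loc>https://example.com/account/orders</loc></url>
</urlset>
"""

print(sitemap_conflicts(robots_txt, sitemap_xml))
# ['https://example.com/account/orders']
```

Any URL in the output is being advertised in the sitemap and blocked at the same time, so one of the two files needs to change.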
For large sites, review server logs after deployment. A clean robots.txt file should reduce wasted crawling without hiding pages that need to rank.
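A log review can start with a simple count of crawler requests per site section. The sketch below assumes access logs in the common "combined" format and identifies the crawler by a user-agent substring, which can be spoofed; the sample lines, IP addresses, and the `crawler_hits_by_section` helper are illustrative.

```python
import re
from collections import Counter

# Matches the request line and user-agent fields of a combined-format log line.
LOG_LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def crawler_hits_by_section(log_lines, agent_token="Googlebot"):
    """Count requests per top-level path section for one crawler."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or agent_token not in match["agent"]:
            continue
        path = match["path"].split("?", 1)[0]             # ignore query strings
        section = "/" + path.strip("/").split("/", 1)[0]  # first path segment
        hits[section] += 1
    return hits

BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_log = [
    f'192.0.2.1 - - [10/Oct/2024:13:55:36 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "{BOT}"',
    f'192.0.2.1 - - [10/Oct/2024:13:55:40 +0000] "GET /search?q=b HTTP/1.1" 200 498 "-" "{BOT}"',
    f'192.0.2.1 - - [10/Oct/2024:13:56:02 +0000] "GET /products/blue-widget HTTP/1.1" 200 9301 "-" "{BOT}"',
    '198.51.100.7 - - [10/Oct/2024:13:56:10 +0000] "GET /products/red-widget HTTP/1.1" 200 9120 "-" "Mozilla/5.0"',
]

hits = crawler_hits_by_section(sample_log)
print(hits.most_common())
```

Comparing these counts before and after a robots.txt change shows whether requests to the blocked sections dropped while sections that need to rank kept their crawl activity.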