Robots.txt Knowledge Hub

01

Foundation

Robots.txt Basics

A robots.txt file is a plain-text file placed at your website root (always at yourdomain.com/robots.txt) that follows the Robots Exclusion Protocol. Compliant crawlers check this file before crawling your site; it tells them which paths they may fetch and which to skip. A missing robots.txt means full crawl access by default. Getting the basics right is the foundation of any technical SEO setup. Use our free robots.txt generator to create a correctly structured file for your platform in seconds, without writing any raw syntax yourself.

02

Foundation

Understanding Crawl Rules

Crawl rules define what bots are permitted or restricted from accessing on your site. Each rule has a User-agent (which bot it targets) and a directive — Allow or Disallow — followed by a URL path. Rules within one group apply only to that specified bot. When multiple rules match the same URL, the most specific (longest path) wins. An empty Disallow means full access; Disallow: / blocks your entire site. Understanding how rules stack and interact prevents accidental blocks. You can generate and test crawl rules using our free robots.txt file generator.

03

SEO Strategy

Robots.txt and SEO

Robots.txt is a critical but often misunderstood SEO lever. Its primary value lies in crawl budget management — directing Googlebot away from admin areas, internal search results, and paginated duplicates so it focuses on content that actually needs to rank. Incorrectly blocking CSS or JavaScript files harms rendering, which directly impacts rankings. Forgetting to add a Sitemap directive slows content discovery. A well-optimised robots.txt file won't skyrocket rankings overnight, but a broken one can quietly devastate them. Use our robots.txt generator to produce SEO-correct files, then verify them with our validator.

04

Bot Behavior

Googlebot Behavior

Googlebot follows the Robots Exclusion Protocol but has a few behaviors worth knowing. It generally caches your robots.txt for up to 24 hours, so changes aren't instant. It ignores the Crawl-delay directive, and the legacy crawl rate setting in Google Search Console has been retired; to slow Googlebot temporarily, Google's documentation recommends returning 500, 503, or 429 responses. Googlebot follows the most specific matching User-agent group rather than the wildcard (*) group, and among matching rules, longer paths take priority. Critically, Googlebot needs access to CSS, JavaScript, and images to render pages properly; blocking these files can hurt how your pages are understood and ranked. Always validate your robots.txt to confirm Googlebot is seeing exactly what you intend before going live.

05

Bot Behavior

Bingbot Crawl Rules

Bingbot, Microsoft's web crawler for Bing search, follows the same robots.txt standard as Googlebot but with one key difference: Bingbot does honor the Crawl-delay directive, which lets you throttle how frequently it fetches pages — useful if aggressive crawling is straining your server. Bingbot also processes named User-agent: Bingbot rules with priority over wildcard groups. For most sites, a single wildcard (*) group handles both crawlers correctly. If you need bot-specific behavior, the Bot Manager inside our robots.txt generator creates separate rule groups per crawler with ease.

06

Bot Behavior

AI Crawlers & Robots.txt

AI training bots — GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended, PerplexityBot, and others — actively crawl the web to harvest training data. Robots.txt is currently the standard mechanism for opting out. Adding Disallow: / under each AI bot's User-agent block signals that your content shouldn't be used for model training. Compliant AI crawlers respect these directives; not all do. Our robots.txt file generator includes a one-click toggle that automatically adds opt-out rules for all major AI crawlers, so you never need to manually track each bot's current user-agent string.

07

Directives

User-Agent Directives

The User-agent directive is the first line of every rule group. It identifies which crawler the rules beneath it apply to. Use * to target all bots at once, or name a specific crawler like Googlebot or AhrefsBot for targeted rules. A single group can list multiple User-agent lines before adding any Allow or Disallow directives. A crawler that finds a group naming it follows that group and ignores the wildcard group. Separate groups with blank lines: parsers that follow RFC 9309 start a new group at the next User-agent line regardless, but some older parsers rely on the blank line as the group boundary and can produce unexpected behavior without it. Our robots.txt validator highlights User-agent structure errors in real time, so you can catch grouping issues instantly.
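
The way a named group takes over from the wildcard group can be sketched with Python's standard-library parser. The file contents, bot names, and example.com URLs below are made up for illustration, and urllib.robotparser is a simpler matcher than Google's, so treat this as a rough check rather than a model of any specific crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one wildcard group and one named group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""


def can_crawl(user_agent: str, path: str) -> bool:
    """Return True if the sample file lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "https://example.com" + path)


# Googlebot has its own group, so the wildcard rule does not apply to it.
print(can_crawl("Googlebot", "/private/report"))     # True
print(can_crawl("Googlebot", "/drafts/post"))        # False
# A bot with no named group falls back to the wildcard group.
print(can_crawl("SomeOtherBot", "/private/report"))  # False
```

Note that Googlebot is no longer kept out of /private/ here: once a named group exists, any wildcard rules that should also apply to that bot have to be repeated inside its group.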

08

Directives

Allow Rules Explained

The Allow directive overrides a broader Disallow for a specific path. Its most common use: blocking an entire folder while carving out a single file within it. The WordPress example is classic — Disallow: /wp-admin/ blocks the admin area, but Allow: /wp-admin/admin-ajax.php keeps AJAX-powered features accessible to crawlers. When two rules of equal path length match a URL, Allow wins over Disallow. Our robots.txt file generator pre-populates correct Allow exceptions for WordPress, WooCommerce, and Shopify presets automatically, so you never accidentally break your site's crawlable functionality.

09

Directives

Disallow Rules Explained

The Disallow directive instructs a crawler to skip any URL matching the specified path pattern. Disallow: /wp-admin/ blocks everything under that folder. Disallow: / blocks your entire site. An empty Disallow: (nothing after the colon) allows full crawl access. Crucially, Disallow does not remove pages from Google's index — it only stops crawling. If a page is linked from elsewhere, Google can still list it in search results without reading its content. Use our robots.txt generator to build correct Disallow rules, then confirm behavior with our URL tester.

10

Syntax

Wildcards in Robots.txt

The asterisk (*) wildcard matches any sequence of characters within a URL path, enabling flexible pattern-based rules. For example, Disallow: /search/* blocks all URLs starting with /search/ regardless of what follows; because rules are already prefix matches, the trailing * is optional and Disallow: /search/ behaves the same way. Disallow: /*.pdf blocks any URL containing .pdf anywhere in the path. Wildcards are supported by Google, Bing, and most modern crawlers. Misused wildcards can accidentally block large sections of your site. Always run wildcard rules through our robots.txt validator to verify they match exactly what you intend: not more, not less.

11

Syntax

End Anchors ($) in Robots.txt

The dollar sign ($) end anchor forces a path pattern to match only at the end of a URL. Disallow: /*.pdf$ blocks only URLs that end in .pdf; it won't block /report.pdf?download=1 or /guide.pdf.html. Without the end anchor, /*.pdf would match any URL containing .pdf anywhere in the string, including both of those. End anchors are supported by Googlebot and Bingbot. Combine them with wildcards for surgical rule precision that targets exactly the right URL patterns. After writing end-anchor rules, verify each one in our robots.txt validator to confirm match behavior is exactly as expected.
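
One way to experiment with wildcard and end-anchor patterns is to translate them into regular expressions. The helper names below are hypothetical, and this is a sketch of the matching behavior described above, not the code any crawler actually runs.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return robots_pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*.pdf$", "/docs/report.pdf"))             # True
print(matches("/*.pdf$", "/docs/report.pdf?download=1"))  # False
print(matches("/*.pdf", "/docs/report.pdf?download=1"))   # True
```

The sample paths are invented; swap in real URLs from your own site to see which ones a candidate pattern would catch.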

12

Syntax

Rule Priority & Matching

When multiple robots.txt rules match the same URL, the longest (most specific) matching path wins; this is the behavior Google documents and RFC 9309 specifies. Disallow: /admin/ and Allow: /admin/settings/ both match /admin/settings/, but the longer rule takes priority, so that URL stays crawlable. When two rules have identical path lengths, Allow wins over Disallow. For a given crawler, a group that names it takes priority over the wildcard (*) group. Rule priority is one of the most commonly misunderstood areas of robots.txt and the most common source of unexpected bot behavior. Verify your rule priority logic using our URL tester before any production change.
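
Longest-match precedence can be sketched in a few lines. This version assumes plain path prefixes with no wildcards or end anchors, and the function name is hypothetical; it illustrates the rule, not a complete parser.

```python
def is_allowed(rules, path):
    """Apply longest-match precedence to (directive, prefix) pairs.

    The longest matching prefix wins; on a tie in length, Allow wins.
    A path that no rule matches is allowed.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(rules, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(rules, "/wp-admin/options.php"))     # False
print(is_allowed(rules, "/blog/hello-world/"))        # True
```

The order of the rules in the list does not matter, which is the main difference from parsers that simply apply the first matching line.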

13

Foundation

Robots.txt File Structure

A valid robots.txt file follows strict structure: one or more User-agent lines, followed by Allow/Disallow rules, with blank lines separating each group. Comments start with #. The Sitemap directive is global — not group-specific. The file must be placed at your domain root; the filename must be lowercase. Directive keys are case-insensitive, but path values are case-sensitive. Even syntactically correct robots.txt files can behave unexpectedly when structure is off. Our robots.txt generator always produces correctly structured output, and the validator's editor checks structure errors line by line.

14

Directives

Sitemap Directives

The Sitemap directive in robots.txt tells crawlers exactly where to find your XML sitemap file. It must use an absolute URL that includes the scheme and domain, such as https://yourdomain.com/sitemap.xml. You can add multiple Sitemap lines for separate sitemap files, such as main, image, and news sitemaps. The Sitemap directive is global and applies to all crawlers regardless of its position in the file. Google, Bing, and most major crawlers respect it and use it to speed up discovery of new and updated content. Our robots.txt file generator automatically formats Sitemap directives as correct absolute URLs when you enter your domain, eliminating a very common mistake.

15

Directives

Crawl-delay Directives

The Crawl-delay directive requests that a crawler wait a specified number of seconds between fetching pages. It's useful for servers under load from aggressive bots. Importantly, Googlebot ignores it, and the legacy crawl rate setting in Google Search Console has been retired; to slow Googlebot temporarily, Google's documentation recommends returning 500, 503, or 429 responses instead. Bingbot and a number of other crawlers do honor Crawl-delay, though support varies, so check each bot's documentation. Set it per User-agent group for targeted throttling of specific bots. Setting too high a value can meaningfully slow down indexing of new content. Our robots.txt generator lets you toggle and configure crawl delay with a clean interface, without manually editing any raw syntax in the file.
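
To see how a per-group Crawl-delay reads from a crawler's side, here is a small sketch using Python's standard-library parser. The file contents and the 5-second value are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: only the Bingbot group sets a delay.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/

User-agent: Bingbot
Crawl-delay: 5
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay for the group that applies to the agent,
# or None when that group does not set one.
bing_delay = parser.crawl_delay("Bingbot")
other_delay = parser.crawl_delay("SomeOtherBot")

print(bing_delay)   # 5
print(other_delay)  # None
```

A polite custom crawler could sleep for the returned number of seconds between requests and fall back to its own default when the value is None.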

16

Technical SEO

Robots.txt Rendering Issues

Googlebot renders your pages like a browser before indexing them — meaning it needs access to CSS stylesheets, JavaScript files, web fonts, and images to accurately understand your content. If robots.txt accidentally blocks these resources, Google sees a broken, unstyled version of your page. This directly impacts how it interprets your content, layout signals, and structured data — all of which influence rankings. Blocking /wp-includes/ or /assets/ is a classic culprit. Check that your resource files are all accessible using the Resource Checker inside our robots.txt validator.

17

Technical SEO

CSS & JS Crawl Access

Allowing Googlebot to access your CSS and JavaScript files is not optional: Google explicitly asks site owners not to block them, because it renders pages before indexing. When these files are blocked, Google's rendering pipeline sees an incomplete version of your page and can miss layout signals, JavaScript-rendered content, and script-injected structured data. WordPress sites commonly block /wp-includes/ in older robots.txt configurations, cutting off jQuery and core CSS. Modern SEO best practice is to allow all theme and plugin asset directories. Our robots.txt file generator pre-includes the correct Allow rules for asset folders across all platform presets, so rendering access is never accidentally blocked.

18

Technical SEO

Resource Accessibility

Resource accessibility refers to whether crawlers can reach all the files needed to fully render your web pages — CSS, JavaScript, fonts, and images. A page may be crawlable at the HTML level, but if the assets that build its visual layout are blocked, Googlebot produces an incomplete rendering. This matters because Google uses rendering signals — not just raw HTML text — to understand and rank pages. The Resource Checker tab in our robots.txt validator lets you paste a list of resource URLs and immediately see which ones are blocked, so you can fix access issues before they show up as ranking problems in Search Console.

19

Advanced SEO

Crawl Budget Optimization

Crawl budget is the number of pages Googlebot crawls on your site within a given timeframe; Google describes it as a combination of how much crawling your server can handle and how much demand there is for your content. Wasting it on admin pages, internal search results, URL parameters, and infinite-scroll pagination leaves fewer crawls for content that actually needs to rank. Robots.txt is your primary crawl budget tool. Block patterns like /search/, /?s=, /tag/, and parameter-based duplicates. Crawl budget is not itself a ranking factor, but for large e-commerce or content-heavy sites it tangibly affects how quickly new and updated pages get crawled and indexed. Use our robots.txt generator to apply crawl-budget-aware Disallow rules for your specific platform and configuration.

20

Advanced SEO

Crawl Efficiency Best Practices

Efficient crawling means Googlebot finds your most important content quickly without wasting allocations on low-value pages. Practical steps: add a Sitemap directive so crawlers know where to find your XML sitemap. Block parameter-based URLs that generate near-duplicate content. Allow crawl access to all CSS, JavaScript, and font resources. Avoid setting Crawl-delay for Googlebot since it ignores it. Keep your robots.txt file clean — large files with redundant rules add unnecessary processing overhead. After any change, validate your file and check Google Search Console for crawl errors to confirm changes are working as intended across your full URL set.

21

Indexing

Robots.txt & Indexing

Robots.txt controls crawling — not indexing. These are two entirely separate processes in Google's pipeline. Disallowing a URL stops Googlebot from visiting it, but Google can still add that URL to its index if it discovers the page through external or internal links. When a page is blocked from crawling but appears in the index, Google shows it with a "no information is available" message. This is a common source of confusion. If your goal is preventing indexing, use a noindex meta tag and keep the page crawlable. Use our robots.txt validator to audit unintended crawl blocks affecting your indexing behavior.

22

Indexing

Robots.txt vs Noindex

These two tools serve fundamentally different purposes and are commonly confused. Robots.txt Disallow stops crawling. A noindex meta tag stops indexing. The critical rule: never combine both on the same page. If you Disallow a URL, Googlebot can't crawl it — which means it can't read the noindex tag either. The page may still get indexed from external links, just without its content being understood. To correctly remove a page from Google's index: make it crawlable (no Disallow) and add a noindex meta tag to the HTML. Use our robots.txt validator to spot pages that may be accidentally Disallowed when they should only be noindexed.
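
The "never combine both" rule can be checked mechanically. The sketch below assumes you already have a list of paths that carry a noindex meta tag; the file contents, paths, and function name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /old-campaign/
"""

# Hypothetical pages that carry <meta name="robots" content="noindex">.
NOINDEX_PATHS = ["/old-campaign/landing", "/thank-you"]


def unreadable_noindex(robots_txt, noindex_paths, user_agent="Googlebot"):
    """Return noindexed paths the crawler cannot fetch, so the tag is never read."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path in noindex_paths
        if not parser.can_fetch(user_agent, "https://example.com" + path)
    ]


print(unreadable_noindex(ROBOTS_TXT, NOINDEX_PATHS))  # ['/old-campaign/landing']
```

Any path this returns needs its Disallow rule lifted before the noindex tag can take effect.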

23

Indexing

Robots.txt vs Canonical

The canonical tag (rel="canonical") tells Google which version of a duplicate page is preferred for ranking. It consolidates link signals from multiple URL variants into one canonical URL. Robots.txt Disallow, by contrast, blocks crawl access entirely — it doesn't solve duplication, it hides it. For URL parameter duplicates like ?sort= or ?color=, canonical is usually the right tool — it preserves the page while signalling which URL gets ranking credit. Use Disallow only for content with zero SEO value. Our robots.txt generator helps you target the right pages for Disallow versus leaving them for canonical treatment.

24

Indexing

Robots.txt vs X-Robots-Tag

The X-Robots-Tag is an HTTP response header that carries directives such as noindex or nofollow for non-HTML file types like PDFs, images, and Excel files that can't carry meta tags. Meta noindex only works inside HTML. X-Robots-Tag works on any file type served by your server. Robots.txt controls crawl access for any URL. These tools target different layers: robots.txt governs crawler access, while X-Robots-Tag and meta noindex are indexing signals. For keeping PDFs out of the index, X-Robots-Tag is the correct approach, and the PDFs must stay crawlable so the header can be read. For blocking all PDF URLs from crawling, Disallow: /*.pdf$ in robots.txt is appropriate; to limit the block to one directory, prefix the pattern with that directory's path. Generate correct file-type Disallow patterns for your site with our robots.txt generator.
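
As a rough illustration of the header approach, the function below decides which extra response headers to attach for a requested path. The function name is hypothetical, and wiring it into a real web server or framework is left out on purpose.

```python
def indexing_headers(path: str) -> dict:
    """Return extra response headers for a requested path."""
    if path.lower().endswith(".pdf"):
        # noindex asks search engines to keep this file out of their index.
        return {"X-Robots-Tag": "noindex"}
    return {}


print(indexing_headers("/docs/report.pdf"))  # {'X-Robots-Tag': 'noindex'}
print(indexing_headers("/index.html"))       # {}
```

In practice the same header is often set in the web server configuration rather than in application code; the logic is the same either way.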

25

Indexing

Blocked Pages Still Indexed

A common technical SEO problem: a URL is Disallowed in robots.txt, but it still appears in Google search results. This happens because robots.txt stops crawling, not indexing. If Google discovers the URL through external links or crawlable internal links, it can index the URL without ever reading its content. The result typically shows no snippet or description. To actually remove such pages from Google's index, either: (1) make the page crawlable and add a noindex tag, or (2) submit a URL removal request through Google Search Console, which hides the URL temporarily while you put a permanent fix in place. Use our robots.txt validator to audit which blocked pages might be falling into this exact scenario on your site.

26

Security

Robots.txt Security Myths

One of the most dangerous misconceptions in web management: robots.txt is not a security tool. It does not restrict access to any URL — it only communicates a preference to cooperative crawlers. Listing a sensitive directory in robots.txt can actually advertise its location to malicious bots that do the opposite and crawl everything disallowed. Any human or script can read your robots.txt file directly at /robots.txt. Truly sensitive content — admin panels, private APIs, user data — must be protected with authentication, server-level access control, and firewall rules. Use robots.txt for crawl management only. Validate your current file using our robots.txt validator to audit what you're inadvertently exposing.

27

Syntax

Robots.txt Syntax Rules

Robots.txt syntax is simple but easy to get wrong. Each directive must appear on its own line in the form Field: value; whitespace around the colon is optional, though a single space after it is conventional. Directive names (User-agent, Allow, Disallow, Sitemap) are case-insensitive but conventionally title-cased. Path values are case-sensitive. Separate groups with a blank line for readability and for compatibility with older parsers. Comments begin with #; RFC 9309 permits a comment at the end of a directive line, but some older parsers misread inline comments, so giving each comment its own line is the safer habit. The file must use UTF-8 encoding. Crawlers that encounter malformed lines typically skip them, and some may parse the rest of the file unpredictably. Our robots.txt generator always produces syntactically correct output, eliminating these risks.

28

Syntax

Common Syntax Errors

The most frequent robots.txt mistakes that silently break crawl behavior: missing blank lines between rule groups (some older parsers then let directives bleed across agents); using Disallow: /folder without a trailing slash (rules are prefix matches, so this also blocks /folder.html and /folder-archive/, not just the folder's contents); using a relative URL in the Sitemap directive instead of an absolute URL; incorrect capitalization of directive keys in parsers that are case-sensitive; and adding inline comments after directives on the same line, which some older parsers misread. In parsers that treat blank lines as group boundaries, even a single misplaced blank line can cause an entire rule group to be ignored. Catch every one of these instantly by running your file through our robots.txt validator.

29

Syntax

Duplicate User-Agent Issues

Having two separate rule groups with the same User-agent (e.g., two blocks for Googlebot) is legal but risky. RFC 9309 says such groups must be combined into one, and Google merges them, but other crawlers handle this differently: some use only the first match, some use only the last. The safest approach is to combine all rules for a given bot into a single group. By contrast, listing several different User-agent lines within one group before the directives is valid and intentional: it targets multiple named bots under shared rules. If you're managing rules for many bots, use our robots.txt generator's Bot Manager to handle multi-agent groups cleanly without duplication errors.

30

Syntax

Case Sensitivity in Robots.txt

Understanding what is and isn't case-sensitive in robots.txt prevents subtle bugs. Directive field names (User-agent, Disallow, Allow, Sitemap, Crawl-delay) are case-insensitive — disallow: and DISALLOW: are equivalent. However, URL path values are case-sensitive, following standard URI rules. Disallow: /Admin/ does not block /admin/. User-agent values are also case-insensitive per the spec — Googlebot and googlebot both match the same bot. Most modern crawlers normalize agent matching, but path matching always respects case exactly. When generating rules, our robots.txt generator uses the correct lowercase paths for all platform presets automatically.
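
The split between case-insensitive field names and case-sensitive paths can be seen with Python's standard-library parser. The file below is deliberately written with odd capitalization, and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Field names in mixed case still parse; the path keeps its exact case.
ROBOTS_TXT = """\
user-agent: *
DISALLOW: /Admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

upper_blocked = not parser.can_fetch("AnyBot", "https://example.com/Admin/users")
lower_blocked = not parser.can_fetch("AnyBot", "https://example.com/admin/users")

print(upper_blocked)  # True: /Admin/ matches the rule exactly
print(lower_blocked)  # False: /admin/ is a different path
```

If both spellings of a path exist on your site, each one needs its own rule.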

31

Directives

Relative vs Absolute Sitemap URLs

The Sitemap directive in robots.txt requires an absolute URL — the full https:// path including your domain. Using a relative path like Sitemap: /sitemap.xml is technically incorrect and may be ignored by some crawlers. The correct format is Sitemap: https://yourdomain.com/sitemap.xml. This is one of the most common robots.txt mistakes and costs sites easy crawl discovery wins. Absolute URLs also allow you to reference sitemaps hosted on a CDN or subdomain if needed. Both Google and Bing require absolute URLs to process the directive correctly. Our robots.txt generator automatically formats your Sitemap directive as a correct absolute URL when you enter your domain — this error is physically impossible to make with the tool.
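
A quick way to catch relative Sitemap values is to check each one for a scheme and a host. This is a minimal sketch with a hypothetical function name; it checks URL shape only and does not fetch anything.

```python
from urllib.parse import urlparse


def relative_sitemaps(robots_txt: str) -> list:
    """Return Sitemap values that are not absolute http(s) URLs."""
    problems = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            parts = urlparse(url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                problems.append(url)
    return problems


sample = """\
User-agent: *
Disallow:
Sitemap: /sitemap.xml
Sitemap: https://yourdomain.com/sitemap.xml
"""

print(relative_sitemaps(sample))  # ['/sitemap.xml']
```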

32

Directives

Empty Disallow Rules

An empty Disallow directive — Disallow: with nothing after the colon — is actually a valid and useful syntax. It signals full crawl permission for that User-agent group, effectively saying "crawl everything." This is the canonical way to explicitly grant full access to a specific bot while still having other rules in your file. It's also the default behavior when no robots.txt exists at all. This is distinct from Disallow: / — which blocks everything — and from omitting the directive entirely. When building bot-specific rule groups, use the empty Disallow as an explicit permit. Our robots.txt generator handles this pattern correctly in its allow-all configurations.
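
The explicit-permit pattern can be tried out with Python's standard-library parser. In this made-up file, every bot is blocked except Googlebot, which gets an empty Disallow; the bot names and URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: block everyone, then explicitly permit one bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/any/page"
googlebot_ok = parser.can_fetch("Googlebot", url)
other_ok = parser.can_fetch("SomeOtherBot", url)

print(googlebot_ok)  # True: the empty Disallow grants full access
print(other_ok)      # False: the wildcard group blocks everything
```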

33

Use Cases

Blocking Entire Websites

To block all crawlers from your entire site, use User-agent: * followed by Disallow: /. This is the standard pattern for staging environments and development sites. However, understand the critical limitation: this does not remove existing indexed pages from Google's index, and Google can still list URLs discovered from external links. A full site block via robots.txt does not equal removal from search — it only stops future crawling. For staging sites, password protection or server-level IP restrictions are far stronger. If you need to block only specific bots, use named User-agent groups instead. Use our robots.txt generator's staging preset to generate the correct full-block configuration instantly.

34

Use Cases

Blocking Search Result Pages

Internal site search result pages are a major crawl budget drain and a duplicate content risk. Google explicitly recommends blocking them. For WordPress, the pattern is Disallow: /?s=. For custom search implementations, block the search URL path — e.g. Disallow: /search/. These pages have no unique value to rank, vary infinitely based on query strings, and consume crawl quota that should be spent on product pages, posts, and category pages that actually drive traffic. For e-commerce sites especially, unblocked search result pages can generate thousands of near-duplicate indexed URLs. Use our robots.txt generator — it includes pre-built search-blocking rules for WordPress, WooCommerce, and Shopify platform presets.

35

Use Cases

Blocking Admin Pages

Admin areas should always be blocked from crawlers: they have zero indexing value, expose sensitive functionality paths, and waste crawl budget. For WordPress, the standard rule is Disallow: /wp-admin/ with a critical carve-out: Allow: /wp-admin/admin-ajax.php must be added back to preserve AJAX functionality relied on by themes and plugins. Without the Allow exception, front-end features powered by admin-ajax break for Googlebot's rendering pass. Other platforms have equivalent paths: Magento uses /admin/, while Shopify handles this automatically. Our robots.txt generator pre-configures correct admin blocks with the necessary Allow exceptions per platform.

36

Use Cases

Blocking Login Pages

Login and registration pages offer no SEO value and should be blocked from crawlers to conserve crawl budget. Common patterns: Disallow: /wp-login.php for WordPress, Disallow: /login/ and Disallow: /register/ for custom platforms. For SaaS products with extensive auth flows — password reset, email verification, account creation — blocking all auth-related paths is best practice. Remember again: blocking login pages from crawling does not protect them from unauthorized access — that requires proper authentication logic. It only prevents crawlers from consuming budget on URLs they can never read or rank. Verify your login block rules are working with our URL tester.

37

Use Cases

Blocking Parameter URLs

URL parameters like ?sort=, ?color=, ?ref=, and session IDs create near-duplicate pages that fragment crawl budget and dilute link equity. When these are crawlable, Googlebot may discover thousands of parameter combinations as unique URLs. Blocking parameter patterns in robots.txt is one option: Disallow: /*?* blocks all parameterized URLs, but this is a blunt instrument that also blocks URLs with legitimate query strings you want indexed. A more surgical approach is to block only the specific parameters that create duplicates, or to use canonical tags on parameter pages; Google Search Console's URL Parameters tool, once the usual recommendation, has been retired. Use our robots.txt generator to configure targeted parameter-blocking rules without over-blocking.

38

Use Cases

Blocking Faceted Navigation

Faceted navigation (filter combinations on e-commerce category pages) is one of the biggest crawl budget killers for online stores. A product category with 10 filters of a few values each can generate millions of unique URLs (color + size + brand + price combinations). Most of these have no independent ranking value and represent duplicate or near-duplicate content. Blocking faceted URL patterns in robots.txt conserves crawl budget for product detail pages that actually convert. Typical patterns: Disallow: /category/*?* or per-parameter rules. Canonical tags on category pages with filters are an additional layer. For WooCommerce and Magento, our robots.txt generator includes faceted navigation blocking presets that target the right URL patterns.
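
The scale of the problem is easy to estimate. The sketch below assumes each filter is either unset or set to exactly one of its values, and that every combination yields a distinct URL; both are simplifying assumptions, and the function name is hypothetical.

```python
def filter_url_count(filters: int, values_per_filter: int) -> int:
    """Count distinct filter states for one category page.

    Each filter contributes (values + 1) choices: unset, or one value.
    """
    return (values_per_filter + 1) ** filters


# 10 filters with 4 values each: 5 ** 10 possible combinations.
print(filter_url_count(10, 4))  # 9765625
# Even 3 filters with 4 values each give over a hundred URLs.
print(filter_url_count(3, 4))   # 125
```

Multi-select filters and parameter ordering would push the real number higher still.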

39

Use Cases

Blocking Pagination URLs

Pagination URLs like /page/2/, ?page=3, or /blog/page/4/ are a nuanced crawl budget decision. Deep pagination pages rarely rank and pull crawl quota away from fresh content. However, completely blocking all pagination can prevent Googlebot from discovering articles or products that only appear on page 3 or beyond. The recommended approach: block deep pagination (pages 5+) while allowing pages 2–4, or ensure all important content is linked from crawlable hub pages. For WordPress, Disallow: /page/ together with Disallow: /*/page/ blocks all paginated archives; note that every pattern must begin with a slash. Test the impact before deploying using our robots.txt validator.

40

Use Cases

Blocking PDF Files

PDFs can be crawled and indexed by Google, which is often desirable for whitepapers, guides, and downloadable resources that can rank for informational queries. However, private documents, internal reports, or duplicate-content PDFs should be blocked. The robots.txt pattern is Disallow: /*.pdf$ — the end anchor $ ensures only URLs ending in .pdf are matched. To block PDFs in a specific directory only: Disallow: /private/*.pdf$. If your goal is preventing PDF indexing (not just crawling), use the X-Robots-Tag HTTP header with noindex — blocking crawling alone won't delist PDFs already indexed. Generate precise PDF-blocking patterns with our robots.txt generator.

41

Use Cases

Blocking Image Files

Blocking images from crawlers should almost never be done on a live site: it removes your images from Google Image Search entirely and prevents Googlebot from using visual signals during page rendering. To specifically block image crawling, use User-agent: Googlebot-Image with Disallow: /, which targets only Google's dedicated image crawler, not the main web crawler. Adding Disallow: /*.jpg$ to the wildcard group would block JPEG images for every bot that follows that group, including during rendering. Only block image directories from the main crawler if they contain purely internal, non-indexable assets. To keep AI training crawlers from scraping your image library, use our generator's AI bot blocking toggles instead.

42

Use Cases

Blocking CSS Files

Blocking CSS files from Googlebot is one of the most impactful technical SEO mistakes you can make, and it still appears in live robots.txt files regularly. When CSS is blocked, Googlebot renders your pages without any styling: it sees raw HTML with no layout, no visual hierarchy, and no design signals it uses to interpret page structure. This harms content comprehension and Google's assessment of how the page works on mobile devices. Never add rules like Disallow: /*.css$ or block your theme's CSS folder. If your site has inherited a legacy robots.txt that blocks CSS, fix it immediately. Check whether any CSS paths on your site are currently blocked using the Resource Checker in our robots.txt validator.

43

Use Cases

Blocking JavaScript Files

Blocking JavaScript from Googlebot is as harmful as blocking CSS, and often worse, since most modern sites rely on JavaScript for rendering critical content, navigation, and structured data. If Googlebot can't load your JS files, it sees a broken page shell, missing content blocks, and incomplete schema markup. The impact shows up in indexing and rankings. A legacy WordPress pattern that blocks /wp-includes/js/ is one common culprit. Never add Disallow: /*.js$ to your robots.txt. The only JS files appropriate to block are things like analytics scripts or A/B testing tools that serve no rendering function. Audit your current JS accessibility using our validator.

44

AI Crawlers

AI Bot Blocking Rules

AI training crawlers have proliferated rapidly since 2023. Unlike search bots, they crawl to harvest content for model training — not to help your site get found in search results. Blocking them via robots.txt is the current industry-standard opt-out mechanism. The challenge: the list of active AI crawlers grows constantly, each with its own User-agent string. Maintaining this list manually is tedious and error-prone. The standard block pattern per bot is a named User-agent group with Disallow: /. Our robots.txt generator maintains an up-to-date list of all major AI crawler user-agents and adds all block rules with a single toggle — no manual tracking required on your end.
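
The per-bot block pattern is repetitive enough to generate. The sketch below builds one named group per crawler; the bot list is only the handful of user-agent tokens mentioned in this hub, not a complete or current list, and the function name is hypothetical.

```python
# User-agent tokens named elsewhere in this hub; extend as needed.
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]


def ai_opt_out_rules(bots) -> str:
    """Build one 'Disallow: /' group per bot, separated by blank lines."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "\n\n".join(groups) + "\n"


print(ai_opt_out_rules(AI_BOTS))
```

The output can be appended to an existing robots.txt; because each group names a single bot, it leaves the rules for search crawlers untouched.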

45

AI Crawlers

GPTBot Blocking

GPTBot is OpenAI's web crawler, used to collect training data for ChatGPT and other OpenAI models. OpenAI introduced it in August 2023 and explicitly documented how to opt out via robots.txt. The block rule is straightforward: add User-agent: GPTBot followed by Disallow: /. GPTBot is documented as compliant with robots.txt — OpenAI states it respects the directive. You can also block it from specific paths only, allowing it to crawl public content while blocking proprietary or paid content. Note that blocking GPTBot has no effect on ChatGPT's browsing feature (which uses a different user-agent) or on content already in OpenAI's training data. Add GPTBot blocking in one click using our robots.txt generator's AI bot panel.

46

AI Crawlers

ClaudeBot Blocking

ClaudeBot is Anthropic's web crawler, used to gather training and evaluation data for Claude models. Anthropic documents ClaudeBot's user-agent string and states that it respects robots.txt disallow rules. To opt out, add User-agent: ClaudeBot and Disallow: / as a separate rule group. Anthropic also uses additional user-agents for research crawling — check Anthropic's crawler documentation for the current complete list. As with all AI crawler blocking, blocking ClaudeBot only prevents future crawling; it has no effect on data already collected. Our robots.txt generator includes ClaudeBot and all Anthropic crawler variants in its AI blocking panel, keeping your file current as user-agent strings are updated.

47

AI Crawlers

PerplexityBot Blocking

PerplexityBot is used by Perplexity AI, the answer-engine search product. It differs from pure training crawlers — it actively retrieves and summarizes content to answer user queries in real time, similar to how Googlebot enables Google search. Whether to block it is a strategic decision: blocking it removes your content from Perplexity's answers and citations, which some publishers see as a traffic source. To block: User-agent: PerplexityBot with Disallow: /. Perplexity states it respects robots.txt, though this has been disputed in some cases. Our robots.txt generator includes PerplexityBot in the AI panel and lets you toggle it independently from other AI crawlers.

48

AI Crawlers

llms.txt Basics

llms.txt is an emerging proposed standard — not yet widely adopted — that serves as a companion file to robots.txt specifically for large language models. Where robots.txt controls crawler access, llms.txt aims to give AI systems structured, curated information about your site in a format optimized for LLM consumption: a Markdown file placed at /llms.txt that summarizes your content, context, and permissions for AI tools. It's conceptually similar to how sitemaps helped early search crawlers — it guides AI rather than blocking it. The standard was proposed in 2024 and is gaining interest among developer-focused sites. It works alongside robots.txt, not as a replacement. The llms.txt ecosystem is still evolving and does not have official support from major AI labs yet.

49

Platforms

Robots.txt for WordPress

WordPress generates a virtual robots.txt by default — there's no physical file unless you create one. The virtual default is very permissive. A production WordPress robots.txt should: block /wp-admin/ with an Allow exception for admin-ajax.php; block /?s= to prevent search result indexing; allow /wp-includes/ and /wp-content/ for full rendering access; and include a Sitemap directive pointing to your XML sitemap. To create a physical robots.txt in WordPress, either upload one via FTP or use an SEO plugin. Our robots.txt generator's WordPress preset auto-generates this complete configuration including all critical Allow exceptions.
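
The production checklist above might translate into a file along these lines. This is a sketch, not the exact output of any plugin or preset, and the Sitemap URL is a placeholder for your real sitemap location.

```text
User-agent: *
# Keep the admin area out of the crawl, but leave the AJAX endpoint reachable
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Internal search results
Disallow: /?s=
# /wp-includes/ and /wp-content/ are not disallowed, so rendering assets stay crawlable

# Placeholder URL: replace with your actual XML sitemap
Sitemap: https://yourdomain.com/sitemap.xml
```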

50

Platforms

Robots.txt for Shopify

Shopify auto-generates a robots.txt file for all stores. Until 2021, merchants couldn't edit it at all. Shopify now allows customization via the robots.txt.liquid template in the theme editor — giving you full control through Liquid templating. Shopify's default file already blocks checkout, cart, account, and admin paths correctly. Common additions: blocking internal collection filter URLs to save crawl budget (Disallow: /collections/*?*), blocking policy pages with duplicate boilerplate, and adding AI crawler blocks. Custom apps and third-party storefront frameworks may override this behavior. Use our robots.txt generator's Shopify preset to generate the correct Liquid-compatible additions for your theme.

51

Platforms

Robots.txt for Magento

Magento (Adobe Commerce) generates a default robots.txt but allows full customization from the admin panel under Content → Design → Configuration → Edit → Search Engine Robots. Key rules for any Magento robots.txt: block /admin/, /checkout/, /customer/, and /catalog/product_compare/. Layered navigation generates enormous URL sprawl — blocking faceted filter parameters is critical for Magento crawl budget. Magento 2 also creates separate robots.txt per store view for multi-store setups. After generating your Magento configuration, validate that key product and category paths remain crawlable using our robots.txt validator.
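
A sketch of the key rules listed above. The color and price parameters are hypothetical stand-ins for layered navigation filters, so substitute the attribute names your store actually uses. Each filter has two lines because the parameter can appear first (after ?) or later (after &) in the query string.

```text
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /customer/
Disallow: /catalog/product_compare/
# Hypothetical layered-navigation parameters
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?price=
Disallow: /*&price=
```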

52

Platforms

Robots.txt for Wix

Wix automatically generates a robots.txt file for every site. The default handles basic crawl management by blocking internal Wix system pages, editor URLs, and app management paths. Older guidance describes this file as uneditable, but Wix now provides a robots.txt editor in its SEO tools, so you can add custom Disallow rules, AI bot blocking, and Sitemap pointers beyond the defaults; check Wix's current documentation for the exact menu location. Wix also offers a page-level lever: you can hide individual pages from search in page settings, which adds a noindex tag rather than a robots.txt rule. Whichever approach you use, run the file through our validator to check what your Wix site is currently serving.

53

Platforms

Robots.txt for Blogger

Blogger (Google's blogging platform) provides limited robots.txt customization via Settings → Crawlers and Indexing → Custom robots.txt. You can add a custom robots.txt that replaces Blogger's default file. Blogger's default already blocks /search (Blogger's internal search results) — critical for avoiding duplicate content. A clean Blogger robots.txt should block /search, /p/sitemap.html, label archive paths that generate thin content, and include a Sitemap directive pointing to your Atom or sitemap feed. Since Blogger is hosted on Google's infrastructure, Googlebot crawls it by default with good signals — the main focus is keeping search and label pages out of the index. Use our generator to produce a clean custom file for your Blogger settings.
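
A sketch of a custom file to paste into Blogger's settings. The blog address is a placeholder. Because rules match by path prefix, Disallow: /search also covers label archives served under /search/label/.

```text
User-agent: *
# Internal search results and label archives (/search/label/...)
Disallow: /search
Disallow: /p/sitemap.html
Allow: /

# Placeholder address: use your own blog's domain
Sitemap: https://yourblog.blogspot.com/sitemap.xml
```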

54

Platforms

Robots.txt for Webflow

Webflow provides full robots.txt editing within its Project Settings → SEO panel. You can write or paste any valid robots.txt content directly — no FTP or file system access needed. Webflow's default file allows all crawlers, so any blocking rules must be manually added. Key additions for Webflow sites: block Webflow's internal staging/editor URLs if they're accessible, add AI crawler block rules, and include a Sitemap directive pointing to your Webflow-generated sitemap at https://yourdomain.com/sitemap.xml. Webflow CMS sites may also benefit from blocking filtered collection list URLs if they generate parameter-based duplicates. Generate a clean Webflow-ready file using our robots.txt generator, then paste directly into Webflow's SEO settings.

55

Platforms

Robots.txt for Ecommerce Sites

E-commerce sites have uniquely complex robots.txt needs due to their massive URL surface. A well-configured e-commerce robots.txt must: block checkout, cart, wishlist, and account pages; block faceted navigation filter combinations; block internal search results; block sort/filter parameter URLs; allow all CSS, JS, and image assets for full rendering; and include a Sitemap directive. Poorly configured robots.txt on e-commerce sites can waste crawl budget on millions of parameter-generated URLs while leaving new product pages undiscovered for days. Large e-commerce sites (10,000+ products) should treat robots.txt as a performance-critical configuration, not an afterthought. Our robots.txt generator includes dedicated WooCommerce, Shopify, and Magento presets with all these blocks pre-configured.
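
The checklist above could look roughly like this. All paths and parameter names are illustrative placeholders, since real stores differ by platform.

```text
User-agent: *
# Transactional and account pages (placeholder paths)
Disallow: /checkout/
Disallow: /cart/
Disallow: /wishlist/
Disallow: /account/
# Internal search results (placeholder path)
Disallow: /search/
# Sort and filter parameters (placeholder names)
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?filter=
Disallow: /*&filter=
# CSS, JS, and image paths are not disallowed, so rendering is unaffected

Sitemap: https://yourdomain.com/sitemap.xml
```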

56

Platforms

Robots.txt for SaaS Websites

SaaS websites combine a marketing site (fully crawlable) with an application backend (should be fully blocked). A clean SaaS robots.txt strategy separates these two zones. Block all application routes — dashboards, settings, API endpoints, user data paths, billing pages, and auth flows. Explicitly allow the marketing site: homepage, blog, pricing, features, landing pages, and docs if publicly useful. User-generated content routes that create unique valuable pages (like public project pages or shareable links) need careful per-route decisions. For SaaS products on subdomains (app.yourdomain.com), the app subdomain should have its own robots.txt with Disallow: / to block it entirely. Generate SaaS-appropriate configurations with our robots.txt generator.
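
A sketch of the two-zone setup, shown as two separate files. The route names are hypothetical examples of application paths, and the domains are placeholders.

```text
# ---- File 1: https://yourdomain.com/robots.txt (marketing site) ----
User-agent: *
Disallow: /dashboard/
Disallow: /settings/
Disallow: /api/
Disallow: /billing/
Disallow: /login
# Homepage, blog, pricing, and docs are not disallowed, so they stay crawlable

# ---- File 2: https://app.yourdomain.com/robots.txt (application subdomain) ----
User-agent: *
Disallow: /
```

Each host serves only its own file; crawlers never apply the root domain's rules to the app subdomain.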

57

Platforms

Robots.txt for News Websites

News sites depend on fast crawling: an article published at 9:00am that isn't crawled until noon has lost most of its timing advantage. News robots.txt should prioritize crawlability of article paths and avoid anything that might slow crawlers down. Specifically: never use Crawl-delay (Googlebot ignores it, but other crawlers may honor it); allow all content paths with minimal Disallow rules; block only truly valueless paths like tag archives, author archives with thin content, and internal search. Crucially, list your news sitemap in a Sitemap directive alongside your main sitemap. There is no separate news directive; the same Sitemap line is used for both, and Google News uses dedicated news sitemaps for fast article discovery. Block AI training crawlers (GPTBot, ClaudeBot) if you have a paid content model. Generate a news-optimised robots.txt and validate article URL access with our validator.
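
A sketch of a news-oriented file under these guidelines. The paths and sitemap URLs are placeholders, and the AI crawler group is optional.

```text
User-agent: *
# Only low-value paths are blocked; article paths stay open and no Crawl-delay is set
Disallow: /tag/
Disallow: /search/

# Optional: opt out of AI training crawlers for paid content
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

# Both are ordinary Sitemap lines; the second URL points at a news sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/news-sitemap.xml
```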

58

Platforms

Robots.txt for Portfolio Websites

Portfolio websites are typically small in URL count with minimal crawl budget concerns. A clean, minimal robots.txt is the right approach: allow everything, add a Sitemap directive, and optionally block AI crawlers if you don't want your creative work used as training data. There's no need for complex Disallow patterns on a 20-page portfolio. The most impactful addition is the Sitemap directive pointing Googlebot to your sitemap for fast project page discovery. If your portfolio uses a platform like Cargo, Squarespace, or a custom React/Next.js build, the configuration approach varies. Keep it simple, keep it clean. Use our robots.txt generator to produce a minimal, correct file in under 30 seconds — Sitemap included, no unnecessary rules added.

59

Platforms

Robots.txt for Multilingual Sites

Multilingual sites using subdirectories (e.g. /en/, /de/, /fr/) use a single root robots.txt file. Ensure all language subdirectories are crawlable — never accidentally block a language folder. If using subdomains (de.yourdomain.com), each subdomain needs its own robots.txt file. For ccTLDs (yourdomain.de), each domain has a separate robots.txt. A common mistake on multilingual sites is blocking a language subdirectory while testing and forgetting to unblock it. Include multiple Sitemap directives — one per language sitemap file or a sitemap index URL. Run the URL tester on sample URLs from each language to confirm full crawl access across all locales.
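
For the subdirectory case, a single root file might look like the sketch below. The per-language sitemap URLs are placeholders, and a single sitemap index URL would work equally well.

```text
User-agent: *
# No rule touches /en/, /de/, or /fr/, so every language folder stays crawlable
Disallow:

Sitemap: https://yourdomain.com/en/sitemap.xml
Sitemap: https://yourdomain.com/de/sitemap.xml
Sitemap: https://yourdomain.com/fr/sitemap.xml
```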

60

Platforms

Robots.txt for Staging Websites

Staging sites should always serve a full-block robots.txt: User-agent: * on one line, followed by Disallow: / on the next. Without this, staging content can be indexed and appear in Google search results alongside, or instead of, your live site, creating duplicate content problems and confusing users. However, robots.txt alone is not enough. It is an advisory file, not access control: non-compliant crawlers and any user with the URL can still reach staging content. Layer it with HTTP basic authentication, IP allowlisting, or a firewall rule as the primary access control. The most critical deployment rule in any CI/CD pipeline should be: staging environments deploy with the blocking robots.txt, production environments deploy with the correct crawlable robots.txt. Never swap the two. Generate the staging block configuration instantly with our generator.
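
The full-block file is only two directives. The comment is a suggested safeguard, not a requirement, so that anyone who opens the file knows it must not reach production.

```text
# STAGING ONLY: this file must never be deployed to production
User-agent: *
Disallow: /
```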

61

Platforms

Robots.txt for Dynamic Websites

Dynamic websites that generate URLs based on user input, database queries, or session state face unique robots.txt challenges. Session ID URLs, user-specific paths, dynamically created search results, and A/B test variants can all generate infinite URL spaces that waste crawl budget. Robots.txt can target these with wildcard patterns — e.g. Disallow: /*?sessionid= or Disallow: /user/*/settings/. For dynamically generated content that should be indexed, ensure stable canonical URLs are used and that crawl paths lead to those canonical versions. Use our robots.txt validator's URL tester to verify wildcard rules correctly match dynamic URL patterns before deploying them live.
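
A sketch of the wildcard patterns mentioned above. The * matches any sequence of characters. Note that /*?sessionid= only matches when sessionid is the first query parameter, so a second line covers the case where it follows another parameter.

```text
User-agent: *
# Session IDs as the first or a later query parameter
Disallow: /*?sessionid=
Disallow: /*&sessionid=
# Per-user settings pages, e.g. /user/alice/settings/ or /user/42/settings/
Disallow: /user/*/settings/
```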

62

Platforms

Robots.txt for React Websites

React (Create React App) applications are typically single-page applications served from a public/ folder. Place your robots.txt directly inside the public/ directory — it will be served at the root of your domain. The key SEO concern for client-side React apps: JavaScript rendering is essential since all content is generated in the browser. Never block JS or your /static/ bundle directory. For SPAs without server-side rendering, Googlebot's rendering queue adds delay to indexing — consider pre-rendering or SSR instead. If your React app has hash routing (/#/path), those fragments are invisible to Googlebot. Use our robots.txt generator to produce a rendering-safe file for your React deployment.
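
A sketch of a rendering-safe file for public/robots.txt. The /account/ route and the sitemap URL are hypothetical. The important part is what is absent: no rule touches /static/.

```text
# public/robots.txt
User-agent: *
# Hypothetical private route; /static/ (the JS and CSS bundles) is NOT disallowed
Disallow: /account/

Sitemap: https://yourdomain.com/sitemap.xml
```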

63

Platforms

Robots.txt for Next.js Websites

Next.js offers first-class robots.txt support. In the App Router (Next.js 13+), you can add a robots.js or robots.ts file to app/ that exports a default function returning your rules, and Next.js generates the robots.txt from it. Alternatively, place a static robots.txt in the public/ folder, the usual approach for Pages Router projects. Since Next.js supports SSR and SSG, Googlebot receives fully formed HTML without having to execute JavaScript, but you still need to allow access to /_next/static/ so CSS and JS remain fetchable. Block API routes (Disallow: /api/) and any internal paths. Generate the correct Next.js robots.txt content with our robots.txt generator.
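
For the static-file route, the content might look like this sketch. The sitemap URL is a placeholder, and a generated robots file should produce equivalent rules.

```text
# public/robots.txt
User-agent: *
Disallow: /api/
# /_next/static/ is intentionally left unblocked so CSS and JS stay fetchable

Sitemap: https://yourdomain.com/sitemap.xml
```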

64

Testing & Validation

Robots.txt Testing Methods

Testing robots.txt before deployment prevents costly crawl mistakes. There are four primary methods: (1) URL testing: paste a specific URL and your robots.txt into a tester to see which rule applies; (2) syntax validation: check the file for structural errors, missing blank lines, and malformed directives; (3) resource accessibility checking: verify CSS, JS, and image paths are crawlable; (4) Google Search Console: its robots.txt report, which replaced the older robots.txt Tester, shows how Google fetched and parsed your live file. Always test after any change, not just on initial setup. Even small edits, such as adding a single Disallow line, can accidentally block important paths if wildcard patterns are involved. Our robots.txt validator combines all three non-GSC methods in a single tool: paste your file once and test everything in one session.
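
One way a single line can block more than intended is prefix matching, which applies even without wildcards. In this sketch the /p/ short-link section and the other paths are hypothetical.

```text
User-agent: *
# Intended to block only the /p/ short-link pages...
Disallow: /p
# ...but as a prefix match it also blocks /pricing, /products/, and /press/
# Safer form, with the trailing slash:
# Disallow: /p/
```

Running a handful of important URLs through a URL tester after each edit catches this kind of overreach.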

65

Testing & Validation

Robots.txt Validation Methods

Validation goes beyond syntax checking — a syntactically valid robots.txt can still produce incorrect crawl behavior. Complete validation includes: (1) syntax check — is the file correctly structured with proper line breaks and directive formatting? (2) rule logic check — do Allow/Disallow rules interact correctly, do wildcards match intended URLs? (3) URL-level testing — does a specific URL like /product/123/ get blocked or allowed? (4) resource check — are your asset directories accessible? (5) live file check — fetch the live robots.txt from your domain and confirm it matches what you intended to deploy. Google Search Console's robots.txt tool handles live validation for Googlebot specifically. Our robots.txt validator covers steps 1 through 4 with a single paste of your file.

66

Testing & Validation

Live Robots.txt Fetching

Live robots.txt fetching means retrieving the actual robots.txt file currently being served from your domain — rather than testing a local copy — to confirm what crawlers are actually reading. Crawlers see your live file, not the version sitting in your CMS draft or local text editor. A common production issue: a staging robots.txt gets accidentally deployed over a live one, blocking the entire site. Detecting this requires fetching the live file and reading it. You can do this manually by visiting yourdomain.com/robots.txt in a browser, via curl in terminal (curl https://yourdomain.com/robots.txt), or via Google Search Console. Always compare your live file against your intended configuration after any deployment. Our validator lets you paste the fetched content for immediate analysis.

67

Testing & Validation

Robots.txt Debugging

When robots.txt behaves unexpectedly (pages blocked that should be crawlable, or crawlable pages still not indexed), use this debugging sequence: (1) fetch and read the live robots.txt at your domain root; (2) confirm the file is served with HTTP 200, since the failure modes differ: a 404 is treated as if no robots.txt exists, so bots get full access, while a 5xx server error causes Googlebot to temporarily stop crawling the site; (3) test specific problem URLs using the URL tester in our validator; (4) check Google Search Console's Page indexing report (formerly Coverage) for the "Blocked by robots.txt" status; (5) remember the caching lag: Google caches your robots.txt for up to 24 hours, so recent changes may not have propagated. If a page is blocked by robots.txt but still appearing in search results, see Topic 25; this is a crawling vs indexing distinction.

68

Testing & Validation

Google Search Console & Robots.txt

Google Search Console (GSC) provides several robots.txt-related tools and reports. The Page indexing report (formerly Coverage) shows URLs with "Blocked by robots.txt" status, which is useful for catching accidental blocks. The URL Inspection tool shows whether a specific URL is blocked by robots.txt. GSC's Crawl Stats report shows crawl frequency trends that can indicate whether robots.txt changes are having the intended effect on bot behavior. GSC previously offered a dedicated robots.txt Tester; Google has since retired it in favor of a robots.txt report that shows the files it found, when they were last fetched, and any parsing problems. GSC also shows when Googlebot encounters a 500 error fetching your robots.txt, which causes it to temporarily stop crawling your site entirely. Always monitor GSC after any robots.txt change. Validate your file with our validator before deploying to catch issues before GSC flags them.

69

Advanced SEO

Robots.txt Caching Behavior

Crawlers cache your robots.txt file and use the cached version for a period of time before re-fetching. Google generally caches robots.txt for up to 24 hours, meaning changes you make today may not be reflected in Googlebot's behavior until the following day. Bingbot has similar caching behavior. This has a practical implication: if you need to urgently unblock a page, the fix won't be instant, and the wait can be as long as the remaining cache lifetime. You can influence caching behavior by setting a Cache-Control: max-age header on your robots.txt response, which Google may use to adjust how long it keeps the cached copy (in practice, plan around the 24-hour window). For urgent unblocking after a major deployment error, using Google Search Console's URL inspection and request indexing tool provides faster recovery.

70

Advanced SEO

Robots.txt File Limits

Google enforces a 500 kibibytes (512,000 bytes) size limit on robots.txt files. Content beyond this limit is ignored — meaning any rules past the 500 KiB mark simply don't exist from Googlebot's perspective. In practice, a well-written robots.txt is rarely more than a few kilobytes; hitting the limit usually indicates an over-engineered file with excessive redundancy. There is no official limit on the number of rules, groups, or Sitemap directives, but best practice is to keep the file lean: use wildcards rather than listing individual URLs, combine rules into logical groups, and remove redundant or deprecated directives. If your robots.txt is growing large, it's often a sign that your crawl management strategy needs architectural cleanup rather than more rules. Use our generator to produce optimally structured, compact files.

71

Technical SEO

Robots.txt and JavaScript SEO

JavaScript SEO and robots.txt intersect in a critical way: if Googlebot can't access your JavaScript files, it can't execute your JS, which means it sees only the initial HTML skeleton of your page — none of the dynamically rendered content. For JavaScript-heavy sites (React, Vue, Angular, Next.js), this is catastrophic for indexing completeness. The robots.txt rule is simple: never block your JavaScript bundle directories. Specifically allow or leave unblocked: /_next/static/, /static/js/, /assets/. Also remember that Googlebot processes JavaScript in a deferred rendering queue — SSR or pre-rendering removes this delay. Check JS accessibility with our Resource Checker.

72

Technical SEO

Robots.txt and Image SEO

Image SEO is directly tied to robots.txt in two ways. First, blocking image files or image directories from the main Googlebot crawler prevents your images from appearing in Google Images search — a meaningful traffic source for photography, product, food, and visual content sites. Second, Google uses images during page rendering to understand visual context, product details, and layout — blocked images produce an incomplete page understanding. To specifically opt out of Google Image Search without affecting page crawling, use the dedicated User-agent: Googlebot-Image group. Never block image directories from the main User-agent: * group. Audit your image directory accessibility using the Resource Checker in our validator.
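
A sketch of the dedicated-group approach: images are withheld from Google Images while the main crawler keeps full access for rendering.

```text
# Opt out of Google Images only
User-agent: Googlebot-Image
Disallow: /

# All other crawlers, including the main Googlebot, keep full access
User-agent: *
Disallow:
```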

73

Technical SEO

Robots.txt and Technical SEO

Robots.txt is one of the foundational layers of technical SEO — it sits at the entry point of the entire crawl-render-index pipeline. Get it wrong and every other technical SEO effort downstream is compromised. The correct technical SEO approach treats robots.txt as a precision tool: block pages with zero ranking potential (admin, checkout, session URLs, search results), explicitly allow all rendering resources (CSS, JS, fonts, images), include a Sitemap directive, and leave everything else open. Technical SEO audits should include robots.txt as a mandatory checklist item — checking for accidental blocks of important paths, missing Sitemap directives, and rendering resource accessibility. Generate a technically correct file with our robots.txt generator, then verify it fully with our validator.

74

Technical SEO

Robots.txt and Crawlability

Crawlability is a page's ability to be accessed and fetched by search engine crawlers. Robots.txt is the primary signal controlling crawlability — a Disallow rule makes a URL uncrawlable to compliant bots. But crawlability isn't determined by robots.txt alone: server errors (500s, timeouts), redirect loops, nofollow on all internal links, login walls, and JavaScript errors can all make pages effectively uncrawlable even without a robots.txt rule. A complete crawlability audit checks all these layers. Robots.txt auditing specifically means confirming that every URL you want indexed is not accidentally blocked by a rule, wildcard pattern, or parent directory Disallow. Test individual URL crawlability directly in our robots.txt validator's URL Tester — paste any URL and see exactly which rule (if any) is blocking it.

75

Technical SEO

Robots.txt and Renderability

Renderability is a step beyond crawlability — it asks whether a crawler can not only fetch a page, but fully render it as a browser would. A page can be crawlable (robots.txt allows it) yet not fully renderable if the resources it depends on are blocked. Googlebot's modern rendering pipeline fetches the HTML, then processes all linked CSS and JavaScript to build a complete visual model of the page. If robots.txt blocks any of these resources, the render is incomplete — meaning Google's understanding of the page's content, layout, and structured data is degraded. Renderability requires both the page URL and all its dependencies to be unblocked. The Resource Checker inside our robots.txt validator is specifically designed to verify renderability — paste a list of asset URLs and see which ones are blocked, so you can restore full rendering access before it impacts rankings.
