
robots.txt setup guide - control crawler access

위린이 · Updated · 1 min read

How robots.txt works

Search engines use crawlers (Googlebot, Bingbot, etc.) to discover and index web pages. robots.txt tells these crawlers which paths they can and cannot access. The file sits at the site root (https://example.com/robots.txt). robots.txt became an official internet standard via RFC 9309 in 2022.
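Python's standard library ships a robots.txt parser, `urllib.robotparser`, which can answer "may this crawler fetch this URL?" directly. A minimal sketch (the rules and `example.com` URLs here are illustrative):

```python
from urllib import robotparser

# An illustrative robots.txt blocking one directory for all crawlers.
rules = """\
User-agent: *
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Blocked path vs. an unlisted (allowed) path.
print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```

In production you would point the parser at a live file with `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of parsing a string.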

Need to generate one quickly? Use robots.txt Generator.

Syntax

User-agent: [crawler name]
Allow: [path to allow]
Disallow: [path to block]

Allow all pages for Googlebot:

User-agent: Googlebot
Allow: /

Equivalent shorthand (empty Disallow means nothing is blocked):

User-agent: Googlebot
Disallow:
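You can confirm the two forms behave identically with `urllib.robotparser` (the helper function and URL below are illustrative):

```python
from urllib import robotparser

def allowed(rules: str, agent: str, url: str) -> bool:
    """Parse a robots.txt string and check access for one agent/URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(agent, url)

explicit = "User-agent: Googlebot\nAllow: /\n"
shorthand = "User-agent: Googlebot\nDisallow:\n"

# Both grant access to any path.
print(allowed(explicit, "Googlebot", "https://example.com/page"))   # True
print(allowed(shorthand, "Googlebot", "https://example.com/page"))  # True
```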

Multiple crawlers

Allow Googlebot full access, block Yeti (Naver) completely, allow Bingbot only /about:

User-agent: Googlebot
Allow: /

User-agent: Yeti
Disallow: /

User-agent: Bingbot
Allow: /about
Disallow: /
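Feeding the file above to `urllib.robotparser` shows how each crawler's group is matched independently. One caveat: the stdlib parser applies the first matching rule in file order, while Google applies the most specific (longest) match; for this file both interpretations agree because `Allow: /about` precedes `Disallow: /`.

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Allow: /

User-agent: Yeti
Disallow: /

User-agent: Bingbot
Allow: /about
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/anything"))  # True
print(rp.can_fetch("Yeti", "https://example.com/anything"))       # False
print(rp.can_fetch("Bingbot", "https://example.com/about"))       # True
print(rp.can_fetch("Bingbot", "https://example.com/contact"))     # False
```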

Common user-agents

  • Googlebot - Google
  • Bingbot - Bing
  • Yeti - Naver
  • DuckDuckBot - DuckDuckGo
  • Baiduspider - Baidu
  • GPTBot - OpenAI
  • ClaudeBot - Anthropic
  • * - all crawlers

Additional directives

Sitemap

Google and other major crawlers support the Sitemap directive. While not part of the core RFC 9309 spec, it’s widely adopted.

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
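On Python 3.8+, `urllib.robotparser` also exposes any `Sitemap` lines via `site_maps()`, which returns a list of URLs (or `None` if the file declares none):

```python
from urllib import robotparser

rules = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```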

Crawl-delay

Crawl-delay sets the minimum number of seconds a crawler should wait between requests. Bing supports it; Google does not (use Search Console’s crawl rate settings instead).

User-agent: Bingbot
Crawl-delay: 1
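`urllib.robotparser` can read this directive back with `crawl_delay()`, returning the delay for a matching agent and `None` when no group applies:

```python
from urllib import robotparser

rules = """\
User-agent: Bingbot
Crawl-delay: 1
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.crawl_delay("Bingbot"))    # 1
print(rp.crawl_delay("Googlebot")) # None (no group matches Googlebot)
```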

Comments

Lines starting with # are comments.

# Allow all crawlers
User-agent: *
Allow: /

Summary

  1. robots.txt goes in the site root directory
  2. User-agent specifies the target crawler
  3. Allow/Disallow controls path access
  4. Sitemap points crawlers to your sitemap
  5. robots.txt is advisory, not a security mechanism