
robots.txt setup guide - control crawler access

위린이 · Updated · 1 min read

How robots.txt works

Search engines use crawlers (Googlebot, Bingbot, etc.) to discover and index web pages. robots.txt tells these crawlers which paths they can and cannot access. The file sits at the site root (https://example.com/robots.txt). robots.txt became an official internet standard via RFC 9309 in 2022.
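Python's standard library ships a robots.txt parser, `urllib.robotparser`, which can answer "may this crawler fetch this URL?" directly. A minimal sketch (the rules and `example.com` URLs here are illustrative):

```python
from urllib import robotparser

# An illustrative robots.txt blocking one directory for all crawlers.
rules = """\
User-agent: *
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Blocked path vs. an unlisted (allowed) path.
print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```

In production you would point the parser at a live file with `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of parsing a string.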

Need to generate one quickly? Use robots.txt Generator.

Syntax

User-agent: [crawler name]
Allow: [path to allow]
Disallow: [path to block]

Allow all pages for Googlebot:

User-agent: Googlebot
Allow: /

Equivalent shorthand (empty Disallow means nothing is blocked):

User-agent: Googlebot
Disallow:
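You can confirm the two forms behave identically with `urllib.robotparser` (the helper function and URL below are illustrative):

```python
from urllib import robotparser

def allowed(rules: str, agent: str, url: str) -> bool:
    """Parse a robots.txt string and check access for one agent/URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(agent, url)

explicit = "User-agent: Googlebot\nAllow: /\n"
shorthand = "User-agent: Googlebot\nDisallow:\n"

# Both grant access to any path.
print(allowed(explicit, "Googlebot", "https://example.com/page"))   # True
print(allowed(shorthand, "Googlebot", "https://example.com/page"))  # True
```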

Multiple crawlers

Allow Googlebot full access, block Yeti (Naver) completely, allow Bingbot only /about:

User-agent: Googlebot
Allow: /

User-agent: Yeti
Disallow: /

User-agent: Bingbot
Allow: /about
Disallow: /
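Feeding the file above to `urllib.robotparser` shows how each crawler's group is matched independently. One caveat: the stdlib parser applies the first matching rule in file order, while Google applies the most specific (longest) match; for this file both interpretations agree because `Allow: /about` precedes `Disallow: /`.

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Allow: /

User-agent: Yeti
Disallow: /

User-agent: Bingbot
Allow: /about
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/anything"))  # True
print(rp.can_fetch("Yeti", "https://example.com/anything"))       # False
print(rp.can_fetch("Bingbot", "https://example.com/about"))       # True
print(rp.can_fetch("Bingbot", "https://example.com/contact"))     # False
```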

Common user-agents

  • Googlebot - Google
  • Bingbot - Bing
  • Yeti - Naver
  • DuckDuckBot - DuckDuckGo
  • Baiduspider - Baidu
  • GPTBot - OpenAI
  • ClaudeBot - Anthropic
  • * - all crawlers

Additional directives

Sitemap

Google and other major crawlers support the Sitemap directive. While not part of the core RFC 9309 spec, it’s widely adopted.

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
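On Python 3.8+, `urllib.robotparser` also exposes any `Sitemap` lines via `site_maps()`, which returns a list of URLs (or `None` if the file declares none):

```python
from urllib import robotparser

rules = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```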

Crawl-delay

Crawl-delay sets the minimum number of seconds a crawler should wait between requests. Bing supports it; Google does not (use Search Console’s crawl rate settings instead).

User-agent: Bingbot
Crawl-delay: 1
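`urllib.robotparser` can read this directive back with `crawl_delay()`, returning the delay for a matching agent and `None` when no group applies:

```python
from urllib import robotparser

rules = """\
User-agent: Bingbot
Crawl-delay: 1
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.crawl_delay("Bingbot"))    # 1
print(rp.crawl_delay("Googlebot")) # None (no group matches Googlebot)
```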

Comments

Lines starting with # are comments.

# Allow all crawlers
User-agent: *
Allow: /

Summary

  1. robots.txt goes in the site root directory
  2. User-agent specifies the target crawler
  3. Allow/Disallow controls path access
  4. Sitemap points crawlers to your sitemap
  5. robots.txt is advisory, not a security mechanism