The AI Crawler Landscape in 2026
The number of AI crawlers visiting websites has exploded since 2023. What started with a handful of bots from OpenAI and Google has grown into an ecosystem of more than 20 distinct AI crawlers, each serving different companies, models, and use cases.
This is the definitive reference list. For each crawler, we cover the user agent string, the operating company, what data it collects, its typical crawl behavior, and how to control its access via robots.txt.
Tier 1: High-Impact AI Crawlers
These crawlers power the AI products with the largest user bases. Optimizing for these should be your first priority.
GPTBot (OpenAI)
- User Agent: GPTBot/1.0 (+https://openai.com/gptbot)
- Purpose: Training data and knowledge retrieval for GPT models and ChatGPT
- Crawl Frequency: High – 100-500+ requests/day on mid-size sites
- JavaScript: Does not execute JavaScript
- Respects robots.txt: Yes
GPTBot is the single most important AI crawler for most websites. Blocking it means ChatGPT will have limited or outdated information about your brand. See our deep dive on GPTBot for full details.
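If you want to limit GPTBot rather than fully block it, robots.txt gives you per-path control. A minimal sketch (the paths shown are placeholders; substitute your own sensitive directories):

```
# Allow GPTBot on content pages, keep it out of transactional paths
User-agent: GPTBot
Disallow: /checkout/
Disallow: /account/

# To block GPTBot entirely instead, use:
# User-agent: GPTBot
# Disallow: /
```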
ChatGPT-User (OpenAI)
- User Agent: ChatGPT-User/1.0 (+https://openai.com/bot)
- Purpose: Real-time browsing when ChatGPT users request live web content
- Crawl Frequency: Variable – depends on user queries mentioning your site
- Respects robots.txt: Yes
ClaudeBot (Anthropic)
- User Agent: ClaudeBot/1.0 (+https://www.anthropic.com/crawlers)
- Purpose: Training data and retrieval for Claude models
- Crawl Frequency: Moderate – 50-200 requests/day on mid-size sites
- Respects robots.txt: Yes
Read our ClaudeBot vs GPTBot comparison for optimization strategies.
Google-Extended (Google)
- User Agent: Google-Extended
- Purpose: Training data for Gemini (separate from search indexing by Googlebot)
- Crawl Frequency: Moderate
- Respects robots.txt: Yes
Important: blocking Google-Extended does NOT affect your Google Search rankings. It only controls whether your content is used for Gemini training. Google-Extended is a robots.txt control token rather than a separate fetching bot; the crawling itself is done by Googlebot.
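Because search indexing and Gemini training use separate user-agent tokens, you can opt out of training while leaving search untouched. A sketch of that split:

```
# Googlebot (search indexing): unaffected
User-agent: Googlebot
Allow: /

# Google-Extended (Gemini training): opted out
User-agent: Google-Extended
Disallow: /
```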
PerplexityBot (Perplexity AI)
- User Agent: PerplexityBot/1.0 (+https://perplexity.ai/bot)
- Purpose: Real-time search and answer generation for Perplexity AI
- Crawl Frequency: Moderate-High – crawls on demand when users ask questions
- Respects robots.txt: Yes
Perplexity operates more like a search engine than a training pipeline: its crawler is often triggered by real-time user queries rather than scheduled batch crawls.
Tier 2: Significant AI Crawlers
These crawlers have meaningful reach and should be part of your optimization strategy.
Bytespider (ByteDance)
- User Agent: Bytespider
- Purpose: Training data for TikTok and Douyin search and AI features
- Crawl Frequency: Very High – often the most aggressive crawler by request volume
- Respects robots.txt: Partially – known to sometimes ignore directives
Bytespider is notorious for its aggressive crawling. Many sites rate-limit or block it entirely.
CCBot (Common Crawl)
- User Agent: CCBot/2.0 (+https://commoncrawl.org/faq/)
- Purpose: Open dataset used to train many major LLMs
- Crawl Frequency: Periodic – large batch crawls several times per year
- Respects robots.txt: Yes
Common Crawl is the foundation dataset for many AI models. Blocking CCBot may reduce your presence across multiple AI systems simultaneously.
Applebot-Extended (Apple)
- User Agent: Applebot-Extended
- Purpose: Training data for Apple Intelligence features including Siri
- Crawl Frequency: Moderate
- Respects robots.txt: Yes
Meta-ExternalAgent (Meta)
- User Agent: Meta-ExternalAgent/1.0
- Purpose: Training data for Meta AI (WhatsApp, Instagram, Facebook AI features)
- Crawl Frequency: Moderate
- Respects robots.txt: Yes
cohere-ai (Cohere)
- User Agent: cohere-ai
- Purpose: Training data for Cohere's enterprise AI models
- Crawl Frequency: Low-Moderate
- Respects robots.txt: Yes
Tier 3: Emerging and Niche AI Crawlers
These crawlers have smaller reach but are worth monitoring.
YouBot (You.com)
- User Agent: YouBot
- Purpose: Search and AI-powered answer generation for You.com
- Respects robots.txt: Yes
Diffbot (Diffbot)
- User Agent: Diffbot
- Purpose: Web data extraction for AI knowledge graph
- Respects robots.txt: Yes
Amazonbot (Amazon)
- User Agent: Amazonbot
- Purpose: Alexa AI features and Amazon search improvements
- Respects robots.txt: Yes
OAI-SearchBot (OpenAI)
- User Agent: OAI-SearchBot/1.0
- Purpose: OpenAI's dedicated search product crawler (ChatGPT Search)
- Respects robots.txt: Yes
AI2Bot (Allen Institute for AI)
- User Agent: AI2Bot
- Purpose: Research and open-source AI model training
- Respects robots.txt: Yes
Timpibot (Timpi)
- User Agent: Timpibot
- Purpose: Decentralized search index
- Respects robots.txt: Yes
Webzio-Extended (Webz.io)
- User Agent: Webzio-Extended
- Purpose: Web data for AI training datasets
- Respects robots.txt: Yes
iaskspider (iAsk.ai)
- User Agent: iaskspider
- Purpose: AI-powered answer engine
- Respects robots.txt: Yes
A Robots.txt Strategy for AI Crawlers
Here is a starting framework that allows all major AI crawlers access to your content pages while blocking sensitive areas:
- Allow Tier 1 and Tier 2 crawlers by default
- Block checkout, account, cart, and admin paths for all crawlers
- Rate-limit or block Bytespider if its request volume is problematic
- Monitor and evaluate Tier 3 crawlers before making allow/block decisions
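As a sketch, those four rules translate into a robots.txt along these lines (the blocked paths are examples; adjust them to your site's actual URL structure):

```
# Sensitive areas blocked for all crawlers
User-agent: *
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /admin/

# Tier 1 and Tier 2 crawlers fall under the * group above,
# so they can crawl everything except the blocked paths.

# Bytespider: block outright if its request volume is a problem.
# Note: a named group overrides the * group for that crawler.
User-agent: Bytespider
Disallow: /
```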
For a complete robots.txt configuration walkthrough, see our robots.txt strategy guide.
How to Monitor All These Crawlers
Tracking 20+ AI crawlers manually through server logs is not practical. You need automated monitoring that:
- identifies every AI crawler by user agent
- tracks crawl frequency per crawler over time
- shows which pages each crawler accesses most
- alerts you when a new AI crawler starts visiting your site
- reports errors and blocked requests per crawler
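As an illustration, a short script can tally AI-crawler requests by user agent from a standard combined-format access log. This is a hand-rolled sketch, not any product's implementation; the token list is drawn from the crawlers above, and the log format is assumed to quote the user agent as the final field:

```python
import re

# User-agent tokens for the crawlers covered above.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended",
    "PerplexityBot", "Bytespider", "CCBot", "Applebot-Extended",
    "Meta-ExternalAgent", "cohere-ai", "YouBot", "Diffbot",
    "Amazonbot", "OAI-SearchBot", "AI2Bot", "Timpibot",
    "Webzio-Extended", "iaskspider",
]

def classify_user_agent(user_agent):
    """Return the matching AI crawler token, or None for other traffic."""
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in user_agent.lower():
            return token
    return None

def tally_crawlers(log_lines):
    """Count requests per AI crawler from raw access-log lines."""
    counts = {}
    for line in log_lines:
        # Combined log format quotes the user agent as the last field.
        match = re.search(r'"([^"]*)"\s*$', line)
        if not match:
            continue
        crawler = classify_user_agent(match.group(1))
        if crawler:
            counts[crawler] = counts.get(crawler, 0) + 1
    return counts
```

In practice you would run something like this over rotated logs on a schedule and diff the per-crawler counts to spot new arrivals or frequency spikes.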
This is exactly what botjar does. One integration, complete visibility into every AI crawler touching your site.
Track every AI crawler in one dashboard. Botjar identifies, classifies, and monitors 20+ AI crawlers automatically – no log parsing required. Get your free bot audit →