The AI Crawler Landscape in 2026
The number of AI crawlers visiting websites has exploded since 2023. What started with a handful of bots from OpenAI and Google has grown into an ecosystem of more than 20 distinct AI crawlers, each serving different companies, models, and use cases.
This is the definitive reference list. For each crawler, we cover the user agent string, the operating company, what data it collects, its typical crawl behavior, and how to control its access via robots.txt.
Tier 1: High-Impact AI Crawlers
These crawlers power the AI products with the largest user bases. Optimizing for these should be your first priority.
GPTBot (OpenAI)
- User Agent: GPTBot/1.0 (+https://openai.com/gptbot)
- Purpose: Training data and knowledge retrieval for GPT models and ChatGPT
- Crawl Frequency: High – 100-500+ requests/day on mid-size sites
- JavaScript: Does not execute JavaScript
- Respects robots.txt: Yes
GPTBot is the single most important AI crawler for most websites. Blocking it means ChatGPT will have limited or outdated information about your brand. See our deep dive on GPTBot for full details.
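If you want to limit GPTBot rather than fully block it, robots.txt gives you per-path control. A minimal sketch (the paths shown are placeholders; substitute your own sensitive directories):

```
# Allow GPTBot on content pages, keep it out of transactional paths
User-agent: GPTBot
Disallow: /checkout/
Disallow: /account/

# To block GPTBot entirely instead, use:
# User-agent: GPTBot
# Disallow: /
```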
ChatGPT-User (OpenAI)
- User Agent: ChatGPT-User/1.0 (+https://openai.com/bot)
- Purpose: Real-time browsing when ChatGPT users request live web content
- Crawl Frequency: Variable – depends on user queries mentioning your site
- Respects robots.txt: Yes
ClaudeBot (Anthropic)
- User Agent: ClaudeBot/1.0 (+https://www.anthropic.com/crawlers)
- Purpose: Training data and retrieval for Claude models
- Crawl Frequency: Moderate – 50-200 requests/day on mid-size sites
- Respects robots.txt: Yes
Read our ClaudeBot vs GPTBot comparison for optimization strategies.
Google-Extended (Google)
- User Agent: Google-Extended
- Purpose: Training data for Gemini (separate from search indexing by Googlebot)
- Crawl Frequency: Moderate
- Respects robots.txt: Yes
Important: blocking Google-Extended does NOT affect your Google Search rankings. It only controls whether your content is used for Gemini training. Google-Extended is a robots.txt control token rather than a separate fetching bot; the crawling itself is done by Googlebot.
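Because search indexing and Gemini training use separate user-agent tokens, you can opt out of training while leaving search untouched. A sketch of that split:

```
# Googlebot (search indexing): unaffected
User-agent: Googlebot
Allow: /

# Google-Extended (Gemini training): opted out
User-agent: Google-Extended
Disallow: /
```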
PerplexityBot (Perplexity AI)
- User Agent: PerplexityBot/1.0 (+https://perplexity.ai/bot)
- Purpose: Real-time search and answer generation for Perplexity AI
- Crawl Frequency: Moderate-High – crawls on demand when users ask questions
- Respects robots.txt: Yes
Perplexity operates more like a search engine than a training pipeline: its crawler is often triggered by real-time user queries rather than scheduled batch crawls.
Tier 2: Significant AI Crawlers
These crawlers have meaningful reach and should be part of your optimization strategy.
Bytespider (ByteDance)
- User Agent: Bytespider
- Purpose: Training data for TikTok and Douyin search and AI features
- Crawl Frequency: Very High – often the most aggressive crawler by request volume
- Respects robots.txt: Partially – known to sometimes ignore directives
Bytespider is notorious for its aggressive crawling. Many sites rate-limit or block it entirely.
CCBot (Common Crawl)
- User Agent: CCBot/2.0 (+https://commoncrawl.org/faq/)
- Purpose: Open dataset used to train many major LLMs
- Crawl Frequency: Periodic – large batch crawls several times per year
- Respects robots.txt: Yes
Common Crawl is the foundation dataset for many AI models. Blocking CCBot may reduce your presence across multiple AI systems simultaneously.
Applebot-Extended (Apple)
- User Agent: Applebot-Extended
- Purpose: Training data for Apple Intelligence features including Siri
- Crawl Frequency: Moderate
- Respects robots.txt: Yes
Meta-ExternalAgent (Meta)
- User Agent: Meta-ExternalAgent/1.0
- Purpose: Training data for Meta AI (WhatsApp, Instagram, Facebook AI features)
- Crawl Frequency: Moderate
- Respects robots.txt: Yes
cohere-ai (Cohere)
- User Agent: cohere-ai
- Purpose: Training data for Cohere's enterprise AI models
- Crawl Frequency: Low-Moderate
- Respects robots.txt: Yes
Tier 3: Emerging and Niche AI Crawlers
These crawlers have smaller reach but are worth monitoring.
YouBot (You.com)
- User Agent: YouBot
- Purpose: Search and AI-powered answer generation for You.com
- Respects robots.txt: Yes
Diffbot (Diffbot)
- User Agent: Diffbot
- Purpose: Web data extraction for AI knowledge graph
- Respects robots.txt: Yes
Amazonbot (Amazon)
- User Agent: Amazonbot
- Purpose: Alexa AI features and Amazon search improvements
- Respects robots.txt: Yes
OAI-SearchBot (OpenAI)
- User Agent: OAI-SearchBot/1.0
- Purpose: OpenAI's dedicated search product crawler (ChatGPT Search)
- Respects robots.txt: Yes
AI2Bot (Allen Institute for AI)
- User Agent: AI2Bot
- Purpose: Research and open-source AI model training
- Respects robots.txt: Yes
Timpibot (Timpi)
- User Agent: Timpibot
- Purpose: Decentralized search index
- Respects robots.txt: Yes
Webzio-Extended (Webz.io)
- User Agent: Webzio-Extended
- Purpose: Web data for AI training datasets
- Respects robots.txt: Yes
iaskspider (iAsk.ai)
- User Agent: iaskspider
- Purpose: AI-powered answer engine
- Respects robots.txt: Yes
A Robots.txt Strategy for AI Crawlers
Here is a starting framework that allows all major AI crawlers access to your content pages while blocking sensitive areas:
- Allow Tier 1 and Tier 2 crawlers by default
- Block checkout, account, cart, and admin paths for all crawlers
- Rate-limit or block Bytespider if its request volume is problematic
- Monitor and evaluate Tier 3 crawlers before making allow/block decisions
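As a sketch, those four rules translate into a robots.txt along these lines (the blocked paths are examples; adjust them to your site's actual URL structure):

```
# Sensitive areas blocked for all crawlers
User-agent: *
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /admin/

# Tier 1 and Tier 2 crawlers fall under the * group above,
# so they can crawl everything except the blocked paths.

# Bytespider: block outright if its request volume is a problem.
# Note: a named group overrides the * group for that crawler.
User-agent: Bytespider
Disallow: /
```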
For a complete robots.txt configuration walkthrough, see our robots.txt strategy guide.
How to Monitor All These Crawlers
Tracking 20+ AI crawlers manually through server logs is not practical. You need automated monitoring that:
- identifies every AI crawler by user agent
- tracks crawl frequency per crawler over time
- shows which pages each crawler accesses most
- alerts you when a new AI crawler starts visiting your site
- reports errors and blocked requests per crawler
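As an illustration, a short script can tally AI-crawler requests by user agent from a standard combined-format access log. This is a hand-rolled sketch, not any product's implementation; the token list is drawn from the crawlers above, and the log format is assumed to quote the user agent as the final field:

```python
import re

# User-agent tokens for the crawlers covered above.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended",
    "PerplexityBot", "Bytespider", "CCBot", "Applebot-Extended",
    "Meta-ExternalAgent", "cohere-ai", "YouBot", "Diffbot",
    "Amazonbot", "OAI-SearchBot", "AI2Bot", "Timpibot",
    "Webzio-Extended", "iaskspider",
]

def classify_user_agent(user_agent):
    """Return the matching AI crawler token, or None for other traffic."""
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in user_agent.lower():
            return token
    return None

def tally_crawlers(log_lines):
    """Count requests per AI crawler from raw access-log lines."""
    counts = {}
    for line in log_lines:
        # Combined log format quotes the user agent as the last field.
        match = re.search(r'"([^"]*)"\s*$', line)
        if not match:
            continue
        crawler = classify_user_agent(match.group(1))
        if crawler:
            counts[crawler] = counts.get(crawler, 0) + 1
    return counts
```

In practice you would run something like this over rotated logs on a schedule and diff the per-crawler counts to spot new arrivals or frequency spikes.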
This is exactly what botjar does. One integration, complete visibility into every AI crawler touching your site.
Track every AI crawler in one dashboard. Botjar identifies, classifies, and monitors 20+ AI crawlers automatically – no log parsing required. Get your free bot audit →