AnyCrawl is a high-performance web crawler and data scraping application built on Node.js/TypeScript. It is optimized for Large Language Models (LLMs): it converts website content into LLM-ready data formats and extracts structured Search Engine Results Page (SERP) data from search engines such as Google, Bing, and Baidu.
AnyCrawl excels in several areas, including high performance, ease of use, and a rich feature set.

AnyCrawl supports multiple crawling engines: cheerio for fast static HTML parsing, and playwright or puppeteer for pages that require JavaScript rendering.
Quickly start using Docker Compose:

```bash
docker compose up --build
```
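The Compose stack can be sketched roughly as follows. This is an illustrative outline only (service names, image tags, and build context are assumptions); the repository ships its own authoritative `docker-compose.yml`:

```yaml
# Illustrative sketch, not the shipped compose file.
services:
  api:
    build: .
    ports:
      - "8080:8080"                          # matches the ANYCRAWL_API_PORT default
    environment:
      - NODE_ENV=production
      - ANYCRAWL_REDIS_URL=redis://redis:6379 # default Redis URL from the table below
    depends_on:
      - redis
  redis:
    image: redis:alpine
```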
| Variable Name | Description | Default Value | Example |
|---|---|---|---|
| `NODE_ENV` | Runtime environment | `production` | `production`, `development` |
| `ANYCRAWL_API_PORT` | API service port | `8080` | `8080` |
| `ANYCRAWL_HEADLESS` | Whether the browser engines run in headless mode | `true` | `true`, `false` |
| `ANYCRAWL_PROXY_URL` | Proxy server URL (supports HTTP and SOCKS) | (None) | `http://proxy:8080` |
| `ANYCRAWL_IGNORE_SSL_ERROR` | Ignore SSL certificate errors | `true` | `true`, `false` |
| `ANYCRAWL_KEEP_ALIVE` | Keep connections alive between requests | `true` | `true`, `false` |
| `ANYCRAWL_AVAILABLE_ENGINES` | Available crawling engines (comma-separated) | `cheerio,playwright,puppeteer` | `playwright,puppeteer` |
| `ANYCRAWL_API_DB_TYPE` | Database type | `sqlite` | `sqlite`, `postgresql` |
| `ANYCRAWL_API_DB_CONNECTION` | Database connection string/path | `/usr/src/app/db/database.db` | `/path/to/db.sqlite` |
| `ANYCRAWL_REDIS_URL` | Redis connection URL | `redis://redis:6379` | `redis://localhost:6379` |
| `ANYCRAWL_API_AUTH_ENABLED` | Enable API authentication | `false` | `true`, `false` |
| `ANYCRAWL_API_CREDITS_ENABLED` | Enable the credit system | `false` | `true`, `false` |
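For example, a minimal `.env` for a self-hosted instance with authentication enabled might look like this (the values are illustrative, not recommendations):

```env
NODE_ENV=production
ANYCRAWL_API_PORT=8080
ANYCRAWL_AVAILABLE_ENGINES=cheerio,playwright
ANYCRAWL_REDIS_URL=redis://redis:6379
ANYCRAWL_API_AUTH_ENABLED=true
```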
```bash
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "cheerio"
  }'
```
| Parameter | Type | Description | Default Value |
|---|---|---|---|
| `url` | string (required) | The URL to scrape. Must be a valid URL starting with `http://` or `https://` | - |
| `engine` | string | The scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering, modern engine), `puppeteer` (JavaScript rendering, Chrome engine) | `cheerio` |
| `proxy` | string | The proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port` | (None) |
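From Node.js/TypeScript, the same request can be issued with the built-in `fetch`. The helper names below (`buildScrapeRequest`, `scrape`) are our own illustration, not part of AnyCrawl; only the endpoint path and the JSON fields (`url`, `engine`, `proxy`) come from the API description above:

```typescript
// Illustrative client sketch for the /v1/scrape endpoint.
type ScrapeEngine = "cheerio" | "playwright" | "puppeteer";

interface ScrapeRequest {
  url: string;
  engine?: ScrapeEngine;
  proxy?: string;
}

// Validate and complete a request body, mirroring the parameter table:
// `url` is required and must start with http:// or https://;
// `engine` defaults to "cheerio".
function buildScrapeRequest(req: ScrapeRequest): ScrapeRequest {
  if (!/^https?:\/\//.test(req.url)) {
    throw new Error("url must start with http:// or https://");
  }
  return { engine: "cheerio", ...req };
}

// POST the request to a (self-hosted) AnyCrawl instance.
async function scrape(baseUrl: string, apiKey: string, req: ScrapeRequest) {
  const res = await fetch(`${baseUrl}/v1/scrape`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildScrapeRequest(req)),
  });
  if (!res.ok) throw new Error(`scrape failed: ${res.status}`);
  return res.json();
}
```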
```bash
curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "query": "AnyCrawl",
    "limit": 10,
    "engine": "google",
    "lang": "all"
  }'
```
| Parameter | Type | Description | Default Value |
|---|---|---|---|
| `query` | string (required) | The search query to execute | - |
| `engine` | string | The search engine to use. Options: `google` | - |
| `pages` | integer | The number of search result pages to retrieve | `1` |
| `lang` | string | The language code for the search results (e.g., `en`, `zh`, `all`) | `en-US` |
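In TypeScript, the documented defaults can be applied before sending the request. As above, the helper names are hypothetical; the fields (`query`, `engine`, `pages`, `lang`, plus the optional `limit` shown in the curl example) come from this section:

```typescript
// Illustrative client sketch for the /v1/search endpoint.
interface SearchRequest {
  query: string;
  engine?: "google";
  pages?: number;
  lang?: string;
  limit?: number; // appears in the curl example above
}

// Apply the documented defaults: engine "google" (the only listed option),
// one result page, language "en-US".
function buildSearchRequest(req: SearchRequest): SearchRequest {
  if (!req.query.trim()) throw new Error("query is required");
  return {
    engine: req.engine ?? "google",
    pages: req.pages ?? 1,
    lang: req.lang ?? "en-US",
    ...req,
  };
}

// POST the request to a (self-hosted) AnyCrawl instance.
async function search(baseUrl: string, apiKey: string, req: SearchRequest) {
  const res = await fetch(`${baseUrl}/v1/search`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildSearchRequest(req)),
  });
  if (!res.ok) throw new Error(`search failed: ${res.status}`);
  return res.json();
}
```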
You can use the Playground to test the API and generate code examples for your favorite programming language.

💡 Note: If you are self-hosting AnyCrawl, make sure to replace `https://api.anycrawl.dev` with your own server URL.
Q: Does AnyCrawl support proxies?

A: Yes, AnyCrawl supports HTTP and SOCKS proxies. Configure them through the `ANYCRAWL_PROXY_URL` environment variable.

Q: Can AnyCrawl handle JavaScript-rendered pages?

A: Yes. AnyCrawl supports the Puppeteer and Playwright engines to handle JavaScript rendering needs.
AnyCrawl represents the cutting edge of modern web crawling technology, especially for AI and machine learning applications. Its high performance, ease of use, and rich feature set make it an ideal choice for developers and enterprises handling large-scale data scraping tasks.