
High-performance Node.js/TypeScript web crawler optimized for LLMs, supporting multi-engine crawling and structured data extraction.

License: MIT · Language: TypeScript · any4ai/AnyCrawl · Last Updated: 2025-07-03

AnyCrawl Project Detailed Introduction

🚀 Project Overview

AnyCrawl is a high-performance web crawler and data scraping application built on Node.js/TypeScript. This project is specifically optimized for Large Language Models (LLMs), capable of converting website content into LLM-usable data formats and extracting structured Search Engine Results Page (SERP) data from search engines like Google, Bing, and Baidu.

🎯 Core Features

AnyCrawl excels in several areas:

  • SERP Crawling: Supports multiple search engines with batch processing capabilities.
  • Webpage Crawling: Efficient single-page content extraction.
  • Site Crawling: Intelligent traversal for full-site crawling.
  • High-Performance Architecture: Multi-threaded and multi-process architecture design.
  • Batch Processing: Efficiently handles batch crawling tasks.

🏗️ Technical Architecture

Modern Design

  • Built on Node.js/TypeScript
  • Optimized for Large Language Models (LLMs)
  • Supports native multi-threaded batch processing
  • Modern architecture design

Supported Crawling Engines

AnyCrawl supports various crawling engines:

  1. Cheerio: Static HTML parsing; the fastest option, since no browser is started.
  2. Playwright: JavaScript rendering via modern browser engines (Chromium, Firefox, WebKit).
  3. Puppeteer: JavaScript rendering via the Chrome/Chromium engine.
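The trade-off among the three can be sketched as a small selection helper. This is illustrative only: `pickEngine` and its heuristic are not part of AnyCrawl's API.

```typescript
type Engine = "cheerio" | "playwright" | "puppeteer";

// Hypothetical helper: prefer the fastest engine that still meets the page's needs.
function pickEngine(needsJavaScript: boolean, preferChrome = false): Engine {
  if (!needsJavaScript) return "cheerio"; // static HTML: fastest, no browser started
  return preferChrome ? "puppeteer" : "playwright"; // JS rendering required
}
```

For a static documentation page `pickEngine(false)` yields `cheerio`; for a single-page app it falls back to a real browser engine.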

🚀 Quick Start

Docker Deployment

Start the stack quickly with Docker Compose:

docker compose up --build

Environment Variable Configuration

| Variable Name | Description | Default Value | Example |
| --- | --- | --- | --- |
| `NODE_ENV` | Runtime environment | `production` | `production`, `development` |
| `ANYCRAWL_API_PORT` | API service port | `8080` | `8080` |
| `ANYCRAWL_HEADLESS` | Whether the browser engine uses headless mode | `true` | `true`, `false` |
| `ANYCRAWL_PROXY_URL` | Proxy server URL (supports HTTP and SOCKS) | (none) | `http://proxy:8080` |
| `ANYCRAWL_IGNORE_SSL_ERROR` | Ignore SSL certificate errors | `true` | `true`, `false` |
| `ANYCRAWL_KEEP_ALIVE` | Keep connections alive between requests | `true` | `true`, `false` |
| `ANYCRAWL_AVAILABLE_ENGINES` | Available crawling engines (comma-separated) | `cheerio,playwright,puppeteer` | `playwright,puppeteer` |
| `ANYCRAWL_API_DB_TYPE` | Database type | `sqlite` | `sqlite`, `postgresql` |
| `ANYCRAWL_API_DB_CONNECTION` | Database connection string/path | `/usr/src/app/db/database.db` | `/path/to/db.sqlite` |
| `ANYCRAWL_REDIS_URL` | Redis connection URL | `redis://redis:6379` | `redis://localhost:6379` |
| `ANYCRAWL_API_AUTH_ENABLED` | Enable API authentication | `false` | `true`, `false` |
| `ANYCRAWL_API_CREDITS_ENABLED` | Enable the credit system | `false` | `true`, `false` |
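In application code, these variables can be read with the table's defaults applied. The loader below is a sketch, not AnyCrawl's actual configuration module; in a real app you would call it as `loadConfig(process.env)`.

```typescript
interface AnyCrawlConfig {
  port: number;
  headless: boolean;
  availableEngines: string[];
  redisUrl: string;
}

// Hypothetical loader mirroring the defaults in the table above.
function loadConfig(env: Record<string, string | undefined>): AnyCrawlConfig {
  return {
    port: Number(env.ANYCRAWL_API_PORT ?? "8080"),
    headless: (env.ANYCRAWL_HEADLESS ?? "true") === "true",
    availableEngines: (env.ANYCRAWL_AVAILABLE_ENGINES ?? "cheerio,playwright,puppeteer").split(","),
    redisUrl: env.ANYCRAWL_REDIS_URL ?? "redis://redis:6379",
  };
}
```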

📝 API Usage Guide

Webpage Scraping API

Basic Usage

curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
-d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'
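The same request can be issued from TypeScript. `buildScrapePayload` and `scrape` are hypothetical helper names, sketched on the assumption that the endpoint accepts the JSON body shown in the curl example above:

```typescript
type ScrapeEngine = "cheerio" | "playwright" | "puppeteer";

interface ScrapePayload {
  url: string;
  engine: ScrapeEngine;
  proxy?: string;
}

// Hypothetical helper: validate the URL and assemble the request body.
function buildScrapePayload(url: string, engine: ScrapeEngine = "cheerio", proxy?: string): ScrapePayload {
  if (!/^https?:\/\//.test(url)) {
    throw new Error("url must start with http:// or https://");
  }
  return proxy ? { url, engine, proxy } : { url, engine };
}

// Hypothetical wrapper around the endpoint shown above (Node 18+ global fetch).
async function scrape(baseUrl: string, apiKey: string, payload: ScrapePayload): Promise<unknown> {
  const res = await fetch(`${baseUrl}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`scrape failed: ${res.status}`);
  return res.json();
}
```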

Parameter Description

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| `url` | string (required) | The URL to scrape. Must be a valid URL starting with `http://` or `https://` | - |
| `engine` | string | The scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering, modern engine), `puppeteer` (JavaScript rendering, Chrome engine) | `cheerio` |
| `proxy` | string | The proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port` | (none) |

Search Engine Scraping API

Basic Usage

curl -X POST http://localhost:8080/v1/search \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
-d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'
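A corresponding TypeScript sketch for the search endpoint; `buildSearchPayload` is a hypothetical helper whose fields mirror the documented parameters and defaults:

```typescript
interface SearchPayload {
  query: string;
  engine: "google";
  pages: number;
  lang: string;
}

// Hypothetical helper: validate inputs and apply the documented defaults.
function buildSearchPayload(query: string, pages = 1, lang = "en-US"): SearchPayload {
  if (query.trim().length === 0) throw new Error("query is required");
  if (!Number.isInteger(pages) || pages < 1) throw new Error("pages must be a positive integer");
  return { query, engine: "google", pages, lang };
}
```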

Parameter Description

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| `query` | string (required) | The search query to execute | - |
| `engine` | string | The search engine to use. Options: `google` | `google` |
| `pages` | integer | The number of search result pages to retrieve | `1` |
| `lang` | string | The language code for the search results (e.g. `en`, `zh`, `all`) | `en-US` |

🧪 Testing and Development

Playground

You can use the Playground to test the API and generate code examples for your favorite programming language.

💡 Note: If you are self-hosting AnyCrawl, make sure to replace https://api.anycrawl.dev with your own server URL.

❓ Frequently Asked Questions

Q: Can I use a proxy?

A: Yes, AnyCrawl supports HTTP and SOCKS proxies. Configure it through the ANYCRAWL_PROXY_URL environment variable.
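Assembling a proxy URL in the expected `http://[username]:[password]@proxy:port` shape can be done with a small helper; `buildProxyUrl` is illustrative, not part of AnyCrawl:

```typescript
// Hypothetical helper: build a proxy URL, URL-encoding credentials when present.
function buildProxyUrl(host: string, port: number, username?: string, password?: string): string {
  const auth = username !== undefined
    ? `${encodeURIComponent(username)}:${encodeURIComponent(password ?? "")}@`
    : "";
  return `http://${auth}${host}:${port}`;
}
```

The result can be passed via `ANYCRAWL_PROXY_URL` or the per-request `proxy` parameter.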

Q: How do I handle JavaScript-rendered content?

A: Set the engine parameter to playwright or puppeteer; both render JavaScript before content is extracted.

Summary

AnyCrawl is a modern, high-performance web crawler geared toward AI and machine-learning workloads. Its combination of speed, ease of use, and flexible engine support makes it a practical choice for developers and enterprises handling large-scale data scraping tasks.
