
High-performance Node.js/TypeScript web crawler optimized for LLMs, supporting multi-engine crawling and structured data extraction.

License: MIT · Language: TypeScript · any4ai/AnyCrawl · Last Updated: 2025-07-03

AnyCrawl Project Detailed Introduction

🚀 Project Overview

AnyCrawl is a high-performance web crawler and data scraping application built on Node.js/TypeScript. This project is specifically optimized for Large Language Models (LLMs), capable of converting website content into LLM-usable data formats and extracting structured Search Engine Results Page (SERP) data from search engines like Google, Bing, and Baidu.

🎯 Core Features

AnyCrawl excels in several areas:

  • SERP Crawling: Supports multiple search engines with batch processing capabilities.
  • Webpage Crawling: Efficient single-page content extraction.
  • Site Crawling: Intelligent traversal for full-site crawling.
  • High-Performance Architecture: Multi-threaded and multi-process architecture design.
  • Batch Processing: Efficiently handles batch crawling tasks.

🏗️ Technical Architecture

Modern Design

  • Built on Node.js/TypeScript
  • Optimized for Large Language Models (LLMs)
  • Supports native multi-threaded batch processing
  • Modern architecture design

Supported Crawling Engines

AnyCrawl supports various crawling engines:

  1. Cheerio: Static HTML parsing; the fastest option, since no browser is started.
  2. Playwright: JavaScript rendering via modern browser engines (Chromium, Firefox, WebKit).
  3. Puppeteer: JavaScript rendering via the Chrome/Chromium engine.
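The trade-off among the three can be sketched as a small selection helper. This is illustrative only: `pickEngine` and its heuristic are not part of AnyCrawl's API.

```typescript
type Engine = "cheerio" | "playwright" | "puppeteer";

// Hypothetical helper: prefer the fastest engine that still meets the page's needs.
function pickEngine(needsJavaScript: boolean, preferChrome = false): Engine {
  if (!needsJavaScript) return "cheerio"; // static HTML: fastest, no browser started
  return preferChrome ? "puppeteer" : "playwright"; // JS rendering required
}
```

For a static documentation page `pickEngine(false)` yields `cheerio`; for a single-page app it falls back to a real browser engine.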

🚀 Quick Start

Docker Deployment

Start the stack quickly with Docker Compose:

docker compose up --build

Environment Variable Configuration

| Variable Name | Description | Default Value | Example |
| --- | --- | --- | --- |
| `NODE_ENV` | Runtime environment | `production` | `production`, `development` |
| `ANYCRAWL_API_PORT` | API service port | `8080` | `8080` |
| `ANYCRAWL_HEADLESS` | Whether the browser engine uses headless mode | `true` | `true`, `false` |
| `ANYCRAWL_PROXY_URL` | Proxy server URL (supports HTTP and SOCKS) | (none) | `http://proxy:8080` |
| `ANYCRAWL_IGNORE_SSL_ERROR` | Ignore SSL certificate errors | `true` | `true`, `false` |
| `ANYCRAWL_KEEP_ALIVE` | Keep connections alive between requests | `true` | `true`, `false` |
| `ANYCRAWL_AVAILABLE_ENGINES` | Available crawling engines (comma-separated) | `cheerio,playwright,puppeteer` | `playwright,puppeteer` |
| `ANYCRAWL_API_DB_TYPE` | Database type | `sqlite` | `sqlite`, `postgresql` |
| `ANYCRAWL_API_DB_CONNECTION` | Database connection string/path | `/usr/src/app/db/database.db` | `/path/to/db.sqlite` |
| `ANYCRAWL_REDIS_URL` | Redis connection URL | `redis://redis:6379` | `redis://localhost:6379` |
| `ANYCRAWL_API_AUTH_ENABLED` | Enable API authentication | `false` | `true`, `false` |
| `ANYCRAWL_API_CREDITS_ENABLED` | Enable the credit system | `false` | `true`, `false` |
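In application code, these variables can be read with the table's defaults applied. The loader below is a sketch, not AnyCrawl's actual configuration module; in a real app you would call it as `loadConfig(process.env)`.

```typescript
interface AnyCrawlConfig {
  port: number;
  headless: boolean;
  availableEngines: string[];
  redisUrl: string;
}

// Hypothetical loader mirroring the defaults in the table above.
function loadConfig(env: Record<string, string | undefined>): AnyCrawlConfig {
  return {
    port: Number(env.ANYCRAWL_API_PORT ?? "8080"),
    headless: (env.ANYCRAWL_HEADLESS ?? "true") === "true",
    availableEngines: (env.ANYCRAWL_AVAILABLE_ENGINES ?? "cheerio,playwright,puppeteer").split(","),
    redisUrl: env.ANYCRAWL_REDIS_URL ?? "redis://redis:6379",
  };
}
```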

📝 API Usage Guide

Webpage Scraping API

Basic Usage

curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
-d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'
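The same request can be issued from TypeScript. `buildScrapePayload` and `scrape` are hypothetical helper names, sketched on the assumption that the endpoint accepts the JSON body shown in the curl example above:

```typescript
type ScrapeEngine = "cheerio" | "playwright" | "puppeteer";

interface ScrapePayload {
  url: string;
  engine: ScrapeEngine;
  proxy?: string;
}

// Hypothetical helper: validate the URL and assemble the request body.
function buildScrapePayload(url: string, engine: ScrapeEngine = "cheerio", proxy?: string): ScrapePayload {
  if (!/^https?:\/\//.test(url)) {
    throw new Error("url must start with http:// or https://");
  }
  return proxy ? { url, engine, proxy } : { url, engine };
}

// Hypothetical wrapper around the endpoint shown above (Node 18+ global fetch).
async function scrape(baseUrl: string, apiKey: string, payload: ScrapePayload): Promise<unknown> {
  const res = await fetch(`${baseUrl}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`scrape failed: ${res.status}`);
  return res.json();
}
```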

Parameter Description

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| `url` | string (required) | The URL to scrape. Must be a valid URL starting with `http://` or `https://` | - |
| `engine` | string | The scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering, modern engine), `puppeteer` (JavaScript rendering, Chrome engine) | `cheerio` |
| `proxy` | string | The proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port` | (none) |

Search Engine Scraping API

Basic Usage

curl -X POST http://localhost:8080/v1/search \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
-d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'
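A corresponding TypeScript sketch for the search endpoint; `buildSearchPayload` is a hypothetical helper whose fields mirror the documented parameters and defaults:

```typescript
interface SearchPayload {
  query: string;
  engine: "google";
  pages: number;
  lang: string;
}

// Hypothetical helper: validate inputs and apply the documented defaults.
function buildSearchPayload(query: string, pages = 1, lang = "en-US"): SearchPayload {
  if (query.trim().length === 0) throw new Error("query is required");
  if (!Number.isInteger(pages) || pages < 1) throw new Error("pages must be a positive integer");
  return { query, engine: "google", pages, lang };
}
```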

Parameter Description

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| `query` | string (required) | The search query to execute | - |
| `engine` | string | The search engine to use. Options: `google` | `google` |
| `pages` | integer | The number of search result pages to retrieve | `1` |
| `lang` | string | The language code for the search results (e.g. `en`, `zh`, `all`) | `en-US` |

🧪 Testing and Development

Playground

You can use the Playground to test the API and generate code examples for your favorite programming language.

💡 Note: If you are self-hosting AnyCrawl, make sure to replace https://api.anycrawl.dev with your own server URL.

❓ Frequently Asked Questions

Q: Can I use a proxy?

A: Yes, AnyCrawl supports HTTP and SOCKS proxies. Configure it through the ANYCRAWL_PROXY_URL environment variable.
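Assembling a proxy URL in the expected `http://[username]:[password]@proxy:port` shape can be done with a small helper; `buildProxyUrl` is illustrative, not part of AnyCrawl:

```typescript
// Hypothetical helper: build a proxy URL, URL-encoding credentials when present.
function buildProxyUrl(host: string, port: number, username?: string, password?: string): string {
  const auth = username !== undefined
    ? `${encodeURIComponent(username)}:${encodeURIComponent(password ?? "")}@`
    : "";
  return `http://${auth}${host}:${port}`;
}
```

The result can be passed via `ANYCRAWL_PROXY_URL` or the per-request `proxy` parameter.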

Q: How do I handle JavaScript-rendered content?

A: Set the engine parameter to playwright or puppeteer; both render JavaScript before content is extracted.

Summary

AnyCrawl is a modern, high-performance web crawler geared toward AI and machine-learning workloads. Its combination of speed, ease of use, and flexible engine support makes it a practical choice for developers and enterprises handling large-scale data scraping tasks.
