Open-source, high-performance web crawler and data extraction tool optimized for LLMs and AI agents.

License: Apache-2.0 · Language: Python · Stars: 46.0k · Author: unclecode · Last updated: 2025-06-18

Crawl4AI - Open-Source Intelligent Web Crawler Optimized for LLMs

Project Overview

Crawl4AI is a high-speed, AI-ready web crawler tailored for LLMs, AI agents, and data pipelines. The project is fully open-source, flexible, and built for real-time performance, giving developers high speed, accurate extraction, and straightforward deployment.

Core Features

🤖 Built for LLMs

  • Generates intelligent, clean Markdown optimized for RAG and fine-tuning applications
  • Provides clean, structured content suitable for AI model processing
  • Supports all LLMs (open-source and proprietary) for structured data extraction

⚡ Lightning Speed

  • Delivers 6x faster results with real-time, cost-effective performance
  • Based on asynchronous architecture, supporting massive concurrent processing
  • Memory-adaptive scheduler dynamically adjusts concurrency based on available system memory (see the dispatcher sketch below)
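
For example, concurrent runs can be throttled with the memory-adaptive dispatcher. A minimal sketch follows; the URLs are placeholders, and the threshold values are illustrative assumptions rather than recommended settings:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

async def crawl_many():
    # Pauses new tasks when system memory usage exceeds the threshold
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # illustrative value
        max_session_permit=10,          # hard cap on concurrent sessions
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/a", "https://example.com/b"],  # placeholders
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
            dispatcher=dispatcher,
        )
        for result in results:
            print(result.url, result.success)

asyncio.run(crawl_many())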

🌐 Flexible Browser Control

  • Session management, proxy support, and custom hooks
  • Supports user-owned browsers, providing complete control and avoiding bot detection
  • Browser profile management that persists authentication state, cookies, and settings
  • Supports multiple browser engines: Chromium, Firefox, and WebKit (see the configuration sketch below)
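
A minimal browser-configuration sketch; the proxy address and profile directory are placeholders, and the parameter set is based on the documented BrowserConfig options:

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    browser_type="firefox",       # "chromium" (default), "firefox", or "webkit"
    headless=True,
    proxy="http://user:pass@proxy.example.com:8080",  # placeholder proxy
    user_data_dir="/path/to/profile",                 # persistent profile directory
    use_persistent_context=True,  # keep cookies and auth state across runs
)

# Then: async with AsyncWebCrawler(config=browser_config) as crawler: ...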

🧠 Heuristic Intelligence

  • Uses advanced algorithms for efficient extraction, reducing reliance on expensive models
  • BM25-based filtering that extracts core information and removes irrelevant content (see the sketch below)
  • Intelligent content cleaning and noise reduction
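
A minimal sketch of query-focused BM25 filtering; the query and threshold are illustrative values, not tuned recommendations:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def bm25_crawl():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=BM25ContentFilter(
                user_query="electric vehicle battery range",  # illustrative query
                bm25_threshold=1.0,  # higher = stricter relevance cut-off
            )
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/article", config=config)
        print(result.markdown.fit_markdown)  # query-relevant content only

asyncio.run(bm25_crawl())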

Main Functional Modules

📝 Markdown Generation

  • Clean Markdown: Generates accurately formatted, structured Markdown
  • Fit Markdown: Heuristic-based filtering removes noise and irrelevant parts (exposed as fit_markdown in results)
  • Citations and References: Converts page links into numbered reference lists with clean citations
  • Custom Strategies: Users can create Markdown generation strategies tailored to specific needs (see the pipeline sketch below)
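
A sketch of a custom Markdown pipeline combining a pruning filter with citation output; the threshold values are illustrative assumptions:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def markdown_demo():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed")
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown.raw_markdown)             # full page as Markdown
        print(result.markdown.fit_markdown)             # noise-filtered version
        print(result.markdown.markdown_with_citations)  # links as numbered references

asyncio.run(markdown_demo())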

📊 Structured Data Extraction

  • LLM-Driven Extraction: Supports all LLMs for structured data extraction
  • Chunking Strategies: Implements chunking (based on topic, regular expressions, sentence level) for targeted content processing
  • Cosine Similarity: Finds relevant content chunks based on user queries for semantic extraction
  • CSS and XPath Extraction: Uses CSS selectors and XPath for fast, schema-based pattern extraction
  • Schema Definition: Defines custom schemas to extract structured JSON from repeating page patterns (as sketched below)
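
A minimal schema-based extraction sketch; the selectors are hypothetical and must be adapted to the target page's actual structure:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": "div.product-card",  # one match per repeating item
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def extract_products():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=config)
        print(json.loads(result.extracted_content))

asyncio.run(extract_products())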

🔎 Crawling and Scraping Features

  • Media Support: Extracts images, audio, video, and responsive image formats
  • Dynamic Crawling: Executes JavaScript and waits for asynchronously loaded content before extracting
  • Screenshot Functionality: Captures page screenshots during crawling for debugging or analysis
  • Raw Data Crawling: Directly processes raw HTML or local files
  • Comprehensive Link Extraction: Extracts internal and external links as well as embedded iframe content
  • Customizable Hooks: Defines hooks at each step to customize crawler behavior
  • Caching Mechanism: Caches data to improve speed and avoid redundant retrieval
  • Lazy Loading Handling: Waits for images to fully load so no content is missed due to lazy loading (a combined configuration sketch follows)
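
A sketch combining several of these options in one run configuration; the selector and script are placeholders for whatever the target page requires:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def dynamic_crawl():
    config = CrawlerRunConfig(
        js_code=["document.querySelector('button.load-more')?.click();"],  # placeholder script
        wait_for="css:div.results",    # placeholder: wait until results render
        screenshot=True,               # capture a screenshot for debugging
        cache_mode=CacheMode.ENABLED,  # reuse cached responses where possible
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(len(result.screenshot or ""))  # base64-encoded screenshot data
        print(result.links["internal"][:5])  # sample of extracted internal links

asyncio.run(dynamic_crawl())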

🚀 Deployment and Integration

  • Dockerized Setup: Optimized Docker image with FastAPI server for easy deployment
  • Secure Authentication: Built-in JWT token authentication to ensure API security
  • API Gateway: One-click deployment with secure token-authenticated API workflows
  • Scalable Architecture: Designed for large-scale production, optimizing server performance
  • Cloud Deployment: Provides ready-to-use deployment configurations for major cloud platforms

Installation

Python Package Installation

pip install -U crawl4ai

# One-time setup: installs the required browser binaries via Playwright
crawl4ai-setup

# Diagnose installation and environment issues
crawl4ai-doctor

Docker Deployment

docker pull unclecode/crawl4ai:0.6.0-rN
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN

# Web UI (playground): http://localhost:11235/playground
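
Once the container is running, crawls can be requested over HTTP. The payload below is a hedged assumption about the minimal request format; the playground at /playground documents the current schema:

import requests

# Assumed minimal payload; verify the exact request schema in the playground UI
resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
)
print(resp.json())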

Basic Usage Examples

Simple Webpage Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Command-Line Interface

# Crawl a page and output Markdown
crwl https://www.nbcnews.com/business -o markdown

# Breadth-first deep crawl, capped at 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Ask an LLM-backed question about the page content
crwl https://www.example.com/products -q "Extract all product prices"

LLM Structured Data Extraction

import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),
        cache_mode=CacheMode.BYPASS,
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

Latest Version Features (v0.6.0)

🌍 World-Aware Crawling

Set geolocation, language, and timezone to get authentic region-specific content:

from crawl4ai import CrawlerRunConfig, GeolocationConfig

run_config = CrawlerRunConfig(
    url="https://browserleaks.com/geo",
    locale="en-US",
    timezone_id="America/Los_Angeles",
    geolocation=GeolocationConfig(
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    )
)

📊 Table to DataFrame Extraction

Extract HTML tables directly to CSV or pandas DataFrame:

import pandas as pd

# `crawl_config` is assumed to be a CrawlerRunConfig defined earlier
results = await crawler.arun(
    url="https://coinmarketcap.com/?page=1",
    config=crawl_config
)

raw_df = pd.DataFrame()
for result in results:
    if result.success and result.media["tables"]:
        raw_df = pd.DataFrame(
            result.media["tables"][0]["rows"],
            columns=result.media["tables"][0]["headers"],
        )
        break

🚀 Browser Pooling

Pages launch on pre-warmed browser instances, reducing both latency and memory usage.

🔌 MCP Integration

Connect to AI tools like Claude Code via the Model Context Protocol:

claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

Technical Architecture

Core Components

  • Asynchronous Crawler Engine: High-performance asynchronous architecture based on Playwright
  • Content Filtering Strategies: Multiple filtering algorithms, including pruning filters and BM25 filters
  • Extraction Strategies: Supports CSS selectors, LLMs, and custom extraction strategies
  • Markdown Generator: Intelligent content conversion to AI-friendly Markdown format
  • Browser Management: Complete browser lifecycle management and session control

Supported Extraction Methods

  1. CSS Selector Extraction: Fast, precise structured data extraction
  2. LLM Extraction: Uses large language models for intelligent content understanding
  3. JavaScript Execution: Dynamic content processing and interaction
  4. Regular Expressions: Pattern matching and text processing (illustrated below)
  5. XPath Selectors: Advanced DOM element location
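
As a plain-Python illustration of regex post-processing (not a dedicated Crawl4AI API), matched patterns can be pulled from the generated Markdown; the price pattern and the `result` object from a prior arun() call are assumptions:

import re

markdown = result.markdown.raw_markdown  # `result` from a prior arun() call
prices = re.findall(r"\$\d+(?:\.\d{2})?", markdown)  # generic price matcher
print(prices)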

Performance Advantages

  • 6x Speed Increase: Compared to traditional crawler tools
  • Memory Optimization: Intelligent memory management and garbage collection
  • Concurrent Processing: Supports concurrent crawling of thousands of URLs
  • Caching Mechanism: Intelligent caching reduces redundant requests
  • Resource Management: Adaptive resource allocation and limitation

Application Scenarios

Data Science and Research

  • Academic paper and research data collection
  • Market research and competitive analysis
  • Social media data mining

AI and Machine Learning

  • Training data collection and preprocessing
  • RAG system content acquisition
  • Knowledge graph construction

Business Intelligence

  • Price monitoring and comparison
  • News and sentiment monitoring
  • Enterprise data aggregation

Content Management

  • Website migration and backup
  • Content aggregation and distribution
  • SEO analysis and optimization

Development Roadmap

  • Graph Crawler: Uses graph search algorithms for intelligent website traversal
  • Question-Driven Crawler: Natural language-driven webpage discovery and content extraction
  • Knowledge-Optimal Crawler: Maximizes knowledge acquisition while minimizing data extraction
  • Agent Crawler: Autonomous system for complex multi-step crawling operations
  • Automated Pattern Generator: Converts natural language into extraction patterns
  • Domain-Specific Crawlers: Pre-configured extractors for common platforms

Community and Support

Crawl4AI has an active open-source community; contributions of code, bug reports, and feature suggestions are welcome. The project is licensed under Apache 2.0 and is fully open-source and free to use.

Summary

Crawl4AI represents the current state of the art in web crawling for the AI era. It provides the full feature set of a traditional crawler while being specifically optimized for modern AI applications, making it a strong choice for data scientists, AI researchers, and developers. Through its open-source nature and active community, Crawl4AI is helping to democratize and standardize web data extraction.