Open-source, high-performance web crawler and data extraction tool optimized for LLMs and AI agents.

License: Apache-2.0 · Language: Python · Stars: 46.0k · Author: unclecode · Last updated: 2025-06-18

Crawl4AI - Open-Source Intelligent Web Crawler Optimized for LLMs

Project Overview

Crawl4AI is a high-speed, AI-ready web crawler tailored for LLMs, AI agents, and data pipelines. The project is fully open-source, flexible, and built for real-time performance, giving developers high speed, accurate extraction, and straightforward deployment.

Core Features

🤖 Built for LLMs

  • Generates intelligent, clean Markdown optimized for RAG and fine-tuning applications
  • Provides clean, structured content suitable for AI model processing
  • Supports all LLMs (open-source and proprietary) for structured data extraction

⚡ Lightning Speed

  • Delivers 6x faster results with real-time, cost-effective performance
  • Based on asynchronous architecture, supporting massive concurrent processing
  • Memory-adaptive scheduler dynamically adjusts concurrency based on available system memory (see the dispatcher sketch below)
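
For example, concurrent runs can be throttled with the memory-adaptive dispatcher. A minimal sketch follows; the URLs are placeholders, and the threshold values are illustrative assumptions rather than recommended settings:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

async def crawl_many():
    # Pauses new tasks when system memory usage exceeds the threshold
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # illustrative value
        max_session_permit=10,          # hard cap on concurrent sessions
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/a", "https://example.com/b"],  # placeholders
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
            dispatcher=dispatcher,
        )
        for result in results:
            print(result.url, result.success)

asyncio.run(crawl_many())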

🌐 Flexible Browser Control

  • Session management, proxy support, and custom hooks
  • Supports user-owned browsers, providing complete control and avoiding bot detection
  • Browser profile management that persists authentication state, cookies, and settings
  • Supports multiple browser engines: Chromium, Firefox, and WebKit (see the configuration sketch below)
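
A minimal browser-configuration sketch; the proxy address and profile directory are placeholders, and the parameter set is based on the documented BrowserConfig options:

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    browser_type="firefox",       # "chromium" (default), "firefox", or "webkit"
    headless=True,
    proxy="http://user:pass@proxy.example.com:8080",  # placeholder proxy
    user_data_dir="/path/to/profile",                 # persistent profile directory
    use_persistent_context=True,  # keep cookies and auth state across runs
)

# Then: async with AsyncWebCrawler(config=browser_config) as crawler: ...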

🧠 Heuristic Intelligence

  • Uses advanced algorithms for efficient extraction, reducing reliance on expensive models
  • BM25-based filtering that extracts core information and removes irrelevant content (see the sketch below)
  • Intelligent content cleaning and noise reduction
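
A minimal sketch of query-focused BM25 filtering; the query and threshold are illustrative values, not tuned recommendations:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def bm25_crawl():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=BM25ContentFilter(
                user_query="electric vehicle battery range",  # illustrative query
                bm25_threshold=1.0,  # higher = stricter relevance cut-off
            )
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/article", config=config)
        print(result.markdown.fit_markdown)  # query-relevant content only

asyncio.run(bm25_crawl())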

Main Functional Modules

📝 Markdown Generation

  • Clean Markdown: Generates accurately formatted, structured Markdown
  • Fit Markdown: Heuristic-based filtering removes noise and irrelevant parts (exposed as fit_markdown in results)
  • Citations and References: Converts page links into numbered reference lists with clean citations
  • Custom Strategies: Users can create Markdown generation strategies tailored to specific needs (see the pipeline sketch below)
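
A sketch of a custom Markdown pipeline combining a pruning filter with citation output; the threshold values are illustrative assumptions:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def markdown_demo():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed")
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown.raw_markdown)             # full page as Markdown
        print(result.markdown.fit_markdown)             # noise-filtered version
        print(result.markdown.markdown_with_citations)  # links as numbered references

asyncio.run(markdown_demo())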

📊 Structured Data Extraction

  • LLM-Driven Extraction: Supports all LLMs for structured data extraction
  • Chunking Strategies: Implements chunking (based on topic, regular expressions, sentence level) for targeted content processing
  • Cosine Similarity: Finds relevant content chunks based on user queries for semantic extraction
  • CSS and XPath Extraction: Uses CSS selectors and XPath for fast, schema-based pattern extraction
  • Schema Definition: Defines custom schemas to extract structured JSON from repeating page patterns (as sketched below)
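
A minimal schema-based extraction sketch; the selectors are hypothetical and must be adapted to the target page's actual structure:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": "div.product-card",  # one match per repeating item
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def extract_products():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=config)
        print(json.loads(result.extracted_content))

asyncio.run(extract_products())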

🔎 Crawling and Scraping Features

  • Media Support: Extracts images, audio, video, and responsive image formats
  • Dynamic Crawling: Executes JavaScript and waits for asynchronously loaded content before extracting
  • Screenshot Functionality: Captures page screenshots during crawling for debugging or analysis
  • Raw Data Crawling: Directly processes raw HTML or local files
  • Comprehensive Link Extraction: Extracts internal and external links as well as embedded iframe content
  • Customizable Hooks: Defines hooks at each step to customize crawler behavior
  • Caching Mechanism: Caches data to improve speed and avoid redundant retrieval
  • Lazy Loading Handling: Waits for images to fully load so no content is missed due to lazy loading (a combined configuration sketch follows)
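
A sketch combining several of these options in one run configuration; the selector and script are placeholders for whatever the target page requires:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def dynamic_crawl():
    config = CrawlerRunConfig(
        js_code=["document.querySelector('button.load-more')?.click();"],  # placeholder script
        wait_for="css:div.results",    # placeholder: wait until results render
        screenshot=True,               # capture a screenshot for debugging
        cache_mode=CacheMode.ENABLED,  # reuse cached responses where possible
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(len(result.screenshot or ""))  # base64-encoded screenshot data
        print(result.links["internal"][:5])  # sample of extracted internal links

asyncio.run(dynamic_crawl())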

🚀 Deployment and Integration

  • Dockerized Setup: Optimized Docker image with FastAPI server for easy deployment
  • Secure Authentication: Built-in JWT token authentication to ensure API security
  • API Gateway: One-click deployment with secure token-authenticated API workflows
  • Scalable Architecture: Designed for large-scale production, optimizing server performance
  • Cloud Deployment: Provides ready-to-use deployment configurations for major cloud platforms

Installation

Python Package Installation

pip install -U crawl4ai

# One-time setup: installs the required browser binaries via Playwright
crawl4ai-setup

# Diagnose installation and environment issues
crawl4ai-doctor

Docker Deployment

docker pull unclecode/crawl4ai:0.6.0-rN
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN

# Web UI (playground): http://localhost:11235/playground
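
Once the container is running, crawls can be requested over HTTP. The payload below is a hedged assumption about the minimal request format; the playground at /playground documents the current schema:

import requests

# Assumed minimal payload; verify the exact request schema in the playground UI
resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
)
print(resp.json())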

Basic Usage Examples

Simple Webpage Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Command-Line Interface

# Crawl a page and output Markdown
crwl https://www.nbcnews.com/business -o markdown

# Breadth-first deep crawl, capped at 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Ask an LLM-backed question about the page content
crwl https://www.example.com/products -q "Extract all product prices"

LLM Structured Data Extraction

import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),
        cache_mode=CacheMode.BYPASS,
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

Latest Version Features (v0.6.0)

🌍 World-Aware Crawling

Set geolocation, language, and timezone to get authentic region-specific content:

from crawl4ai import CrawlerRunConfig, GeolocationConfig

run_config = CrawlerRunConfig(
    url="https://browserleaks.com/geo",
    locale="en-US",
    timezone_id="America/Los_Angeles",
    geolocation=GeolocationConfig(
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    )
)

📊 Table to DataFrame Extraction

Extract HTML tables directly to CSV or pandas DataFrame:

import pandas as pd

# `crawl_config` is assumed to be a CrawlerRunConfig defined earlier
results = await crawler.arun(
    url="https://coinmarketcap.com/?page=1",
    config=crawl_config
)

raw_df = pd.DataFrame()
for result in results:
    if result.success and result.media["tables"]:
        raw_df = pd.DataFrame(
            result.media["tables"][0]["rows"],
            columns=result.media["tables"][0]["headers"],
        )
        break

🚀 Browser Pooling

Pages launch on pre-warmed browser instances, reducing both latency and memory usage.

🔌 MCP Integration

Connect to AI tools like Claude Code via the Model Context Protocol:

claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

Technical Architecture

Core Components

  • Asynchronous Crawler Engine: High-performance asynchronous architecture based on Playwright
  • Content Filtering Strategies: Multiple filtering algorithms, including pruning filters and BM25 filters
  • Extraction Strategies: Supports CSS selectors, LLMs, and custom extraction strategies
  • Markdown Generator: Intelligent content conversion to AI-friendly Markdown format
  • Browser Management: Complete browser lifecycle management and session control

Supported Extraction Methods

  1. CSS Selector Extraction: Fast, precise structured data extraction
  2. LLM Extraction: Uses large language models for intelligent content understanding
  3. JavaScript Execution: Dynamic content processing and interaction
  4. Regular Expressions: Pattern matching and text processing (illustrated below)
  5. XPath Selectors: Advanced DOM element location
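
As a plain-Python illustration of regex post-processing (not a dedicated Crawl4AI API), matched patterns can be pulled from the generated Markdown; the price pattern and the `result` object from a prior arun() call are assumptions:

import re

markdown = result.markdown.raw_markdown  # `result` from a prior arun() call
prices = re.findall(r"\$\d+(?:\.\d{2})?", markdown)  # generic price matcher
print(prices)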

Performance Advantages

  • 6x Speed Increase: Compared to traditional crawler tools
  • Memory Optimization: Intelligent memory management and garbage collection
  • Concurrent Processing: Supports concurrent crawling of thousands of URLs
  • Caching Mechanism: Intelligent caching reduces redundant requests
  • Resource Management: Adaptive resource allocation and limitation

Application Scenarios

Data Science and Research

  • Academic paper and research data collection
  • Market research and competitive analysis
  • Social media data mining

AI and Machine Learning

  • Training data collection and preprocessing
  • RAG system content acquisition
  • Knowledge graph construction

Business Intelligence

  • Price monitoring and comparison
  • News and sentiment monitoring
  • Enterprise data aggregation

Content Management

  • Website migration and backup
  • Content aggregation and distribution
  • SEO analysis and optimization

Development Roadmap

  • Graph Crawler: Uses graph search algorithms for intelligent website traversal
  • Question-Driven Crawler: Natural language-driven webpage discovery and content extraction
  • Knowledge-Optimal Crawler: Maximizes knowledge acquisition while minimizing data extraction
  • Agent Crawler: Autonomous system for complex multi-step crawling operations
  • Automated Pattern Generator: Converts natural language into extraction patterns
  • Domain-Specific Crawlers: Pre-configured extractors for common platforms

Community and Support

Crawl4AI has an active open-source community; contributions of code, bug reports, and feature suggestions are welcome. The project is licensed under Apache 2.0 and is fully open-source and free to use.

Summary

Crawl4AI represents the current state of the art in web crawling for the AI era. It provides the full feature set of a traditional crawler while being specifically optimized for modern AI applications, making it a strong choice for data scientists, AI researchers, and developers. Through its open-source nature and active community, Crawl4AI is helping to democratize and standardize web data extraction.