A web scraping and browser automation library built specifically for Python, used to build reliable crawlers, supporting data extraction for AI, LLM, RAG, or GPT applications.

License: Apache-2.0 | Language: Python | Stars: 5.7k | Repository: apify/crawlee-python | Last Updated: 2025-06-23

Crawlee Python - Web Scraping and Browser Automation Library

Project Overview

Crawlee is a web scraping and browser automation library for Python, designed for building reliable crawlers. It can extract data for AI, LLM, RAG, or GPT applications and download HTML, PDF, JPG, PNG, and other files from websites. Developed by Apify as open source, it builds on BeautifulSoup and Playwright and takes an all-in-one approach to web scraping.

Key Features

Core Functionality

  • Multi-Engine Support: Works with BeautifulSoup, Playwright, and native HTTP
  • Flexible Modes: Supports both headed and headless modes
  • Proxy Rotation: Built-in proxy rotation functionality
  • File Download: Supports downloading various file formats like HTML, PDF, JPG, PNG, etc.
  • AI Integration Optimization: Specifically optimized for data extraction for AI, LLM, RAG, and GPT applications

Technical Advantages

  • Type Hints: Modern design with Python type hints to help catch errors early
  • Stable and Reliable: Built by professional developers who scrape millions of pages daily
  • Easy to Use: Allows easy switching between different scraping libraries based on needs
  • Error Handling: Built-in robust error handling and retry mechanisms
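
To illustrate the type-hints point above, here is a minimal sketch that types a scraped record so a static checker such as mypy can flag a missing field or a wrong value type before the crawler runs. `PageRecord` and `build_record` are hypothetical names used only for this example; they are not part of Crawlee.

```python
from typing import Optional, TypedDict


class PageRecord(TypedDict):
    """Shape of one scraped item (hypothetical, for illustration only)."""

    title: str
    url: str


def build_record(title: Optional[str], url: str) -> PageRecord:
    # Normalise a missing title to an empty string so the record always
    # matches the declared shape.
    return {'title': (title or '').strip(), 'url': url}


record = build_record('  Example Domain ', 'https://example.com')
print(record)
```

With this in place, a call such as `build_record(123, 'https://example.com')` would be reported by a type checker instead of failing at run time.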

Technical Architecture

Underlying Technology Stack

# Main Dependencies
- BeautifulSoup: Static HTML parsing
- Playwright: Rendering of dynamic, JavaScript-driven pages
- HTTP Client: Native HTTP request support

Integration Capabilities

  • Apify Platform Integration: Seamless integration with the Apify platform
  • Multiple Scraping Techniques: Supports various scraping techniques from static HTML parsing to dynamic JavaScript rendering
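
To make the static end of that range concrete, the sketch below extracts a page title from raw HTML without running any JavaScript. It uses only Python's standard `html.parser` so that it runs without extra packages; Crawlee itself delegates static parsing to BeautifulSoup, and `TitleParser` and `extract_title` are hypothetical names for this example.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        text = ''.join(self._parts).strip()
        return text or None


def extract_title(html: str) -> Optional[str]:
    parser = TitleParser()
    parser.feed(html)
    return parser.title


page = '<html><head><title> Example Domain </title></head><body><p>hi</p></body></html>'
print(extract_title(page))
```

A page whose title is set by JavaScript after load would return nothing here, which is the case a browser-based engine such as Playwright is meant to cover.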

Use Cases

Main Application Areas

  1. AI Data Collection: Collecting training data for machine learning and AI applications
  2. RAG Systems: Providing data sources for Retrieval-Augmented Generation systems
  3. GPT Applications: Providing real-time data for various GPT applications
  4. Content Monitoring: Monitoring website content changes
  5. Data Analysis: Collecting data for business analysis

Comparison with Competitors

Of the two main open-source options for Python, Scrapy and Crawlee, Apify (the developer of Crawlee) recommends the latter, arguing that beginners will prefer it because it lets them create crawlers with less code and less time spent reading documentation.

Project Status

Release Information

  • Open Source License: Open source and free to use under the Apache-2.0 license
  • Language Support: Python (a Node.js version also exists)
  • Release Time: The Python version gained significant attention within weeks of its release
  • Maintenance Status: Actively maintained

Community Response

  • Gained widespread attention on GitHub (about 5.7k stars as of the last update)
  • The Python version was launched due to the success of the JavaScript version and the demand from the Python community
  • Received positive feedback in technical communities like Hacker News

Installation and Quick Start

Installation Method

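pip install 'crawlee[all]'  # includes the BeautifulSoup and Playwright extras
playwright install          # only needed for Playwright-based crawlers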

Basic Usage Example

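import asyncio

# Import path for Crawlee v0.5+; older releases used crawlee.beautifulsoup_crawler
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = BeautifulSoupCrawler()

    # Define the default request handler
    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract data; a page may lack a <title> element
        title = context.soup.title
        data = {
            'title': title.get_text() if title else None,
            'url': context.request.url,
        }

        # Save data to the default dataset
        await context.push_data(data)

    # Run the crawler
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    # A bare top-level `await` only works in a REPL or notebook
    asyncio.run(main())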

Advanced Features

Proxy Support

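from crawlee.crawlers import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Configure proxy rotation: the crawler rotates through the listed proxies
crawler = BeautifulSoupCrawler(
    proxy_configuration=ProxyConfiguration(
        proxy_urls=['http://proxy1:8000', 'http://proxy2:8000'],
    ),
)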
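
The rotation idea behind the configuration above can be sketched in a few lines of plain Python. This is a conceptual illustration only, assuming a simple round-robin strategy; Crawlee's actual rotation logic may differ, and `round_robin` is a hypothetical helper, not a Crawlee function.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(proxy_urls: List[str]) -> Iterator[str]:
    """Return an endless iterator that walks the proxy list in order."""
    if not proxy_urls:
        raise ValueError('at least one proxy URL is required')
    return cycle(proxy_urls)


proxies = round_robin(['http://proxy1:8000', 'http://proxy2:8000'])

# Each outgoing request takes the next proxy; the list wraps around.
assigned = [next(proxies) for _ in range(3)]
print(assigned)
```

Spreading requests across proxies this way keeps any single IP address from carrying the full request volume.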

Error Handling and Retries

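from datetime import timedelta

from crawlee.crawlers import BeautifulSoupCrawler

# Request limit, retry, and timeout configuration
crawler = BeautifulSoupCrawler(
    max_requests_per_crawl=1000,  # stop after 1000 requests
    max_request_retries=3,  # retry a failed request up to 3 times
    request_handler_timeout=timedelta(seconds=30),  # time limit per handler call
    retry_on_blocked=True,  # retry when the response looks like a block page
)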
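
As a rough picture of what a retry limit means in practice, the sketch below retries a failing coroutine a bounded number of times before giving up. It is a simplified stand-in using only the standard library, not Crawlee's implementation; `run_with_retries` and `flaky` are hypothetical names, and the zero-length sleep marks where a real back-off delay would go.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar('T')


async def run_with_retries(task: Callable[[], Awaitable[T]], max_retries: int = 3) -> T:
    """Await task(); on failure, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return await task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            await asyncio.sleep(0)  # placeholder for a real back-off delay
    raise AssertionError('unreachable')


calls = 0


async def flaky() -> str:
    """Fails on the first two calls, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError('temporary failure')
    return 'ok'


result = asyncio.run(run_with_retries(flaky, max_retries=3))
print(result, calls)
```

Here the third attempt succeeds, so the caller never sees the two transient failures.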

Summary

Crawlee Python is a modern web scraping library that is particularly well suited to collecting data for AI applications. It combines several mature scraping technologies behind a clean API, which makes it a strong choice for Python developers. For tasks ranging from simple data extraction to complex browser automation, Crawlee provides a reliable solution.
