apify/crawlee-python
Please refer to the latest official releases for current information.

A web scraping and browser automation library built specifically for Python, used to build reliable crawlers, supporting data extraction for AI, LLM, RAG, or GPT applications.

Apache-2.0 | Python | 5.7k | apify/crawlee-python | Last Updated: 2025-06-23

Crawlee Python - Web Scraping and Browser Automation Library

Project Overview

Crawlee is a web scraping and browser automation library for Python, designed for building reliable crawlers. It can extract data for AI, LLM, RAG, or GPT applications and download HTML, PDF, JPG, PNG, and other files from websites. It is Apify's open-source scraping library, built on BeautifulSoup and Playwright, and it takes an all-in-one approach to web scraping.

Key Features

Core Functionality

Multi-Engine Support: Works with BeautifulSoup, Playwright, and native HTTP
Flexible Modes: Supports both headed and headless modes
Proxy Rotation: Built-in proxy rotation functionality
File Download: Supports downloading various file formats like HTML, PDF, JPG, PNG, etc.
AI Integration Optimization: Specifically optimized for data extraction for AI, LLM, RAG, and GPT applications

Technical Advantages

Type Hints: Modern design with Python type hints to help catch errors early
Stable and Reliable: Built by professional developers who scrape millions of pages daily
Easy to Use: Allows easy switching between different scraping libraries based on needs
Error Handling: Built-in robust error handling and retry mechanisms
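
As a rough illustration of how type hints help catch errors early, the sketch below defines a typed record for one scraped item. The names `PageRecord` and `make_record` are hypothetical and are not part of Crawlee; the point is only that a static checker such as mypy can flag, for example, a call that passes an integer where a URL string is expected.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical shape of one scraped item (not a Crawlee type)."""

    url: str
    title: Optional[str]


def make_record(url: str, title: Optional[str]) -> dict:
    # Collapse runs of whitespace in the title; keep None when no title exists.
    cleaned = " ".join(title.split()) if title is not None else None
    return asdict(PageRecord(url=url, title=cleaned))


record = make_record("https://example.com", "  Example\n Domain ")
```

A plain dictionary like `record` is the kind of value a request handler could then store.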

Technical Architecture

Underlying Technology Stack

# Main Dependencies
- BeautifulSoup: Static HTML parsing
- Playwright: Dynamic JavaScript rendering page processing
- HTTP Client: Native HTTP request support
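
To make the static HTML parsing layer concrete, here is a minimal sketch that pulls the page title out of an HTML string. It uses only the standard library's `html.parser` as a stand-in so that it runs without extra dependencies; Crawlee itself relies on BeautifulSoup for this step, and `TitleParser` and `extract_title` are illustrative names, not library APIs.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(html):
    """Return the stripped title text, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Pages that build their content with JavaScript would not expose that content to a parser like this, which is where a browser engine such as Playwright comes in.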

Integration Capabilities

Apify Platform Integration: Seamless integration with the Apify platform
Multiple Scraping Techniques: Supports various scraping techniques from static HTML parsing to dynamic JavaScript rendering

Use Cases

Main Application Areas

AI Data Collection: Collecting training data for machine learning and AI applications
RAG Systems: Providing data sources for Retrieval-Augmented Generation systems
GPT Applications: Providing real-time data for various GPT applications
Content Monitoring: Monitoring website content changes
Data Analysis: Collecting data for business analysis

Comparison with Competitors

Python's two main open-source crawling options are Scrapy and Crawlee. Apify favors Crawlee, believing that beginners will prefer it because it lets them create crawlers with less code and less reading time.

Project Status

Release Information

Open Source License: Apache-2.0; completely open source and free to use
Language Support: Python (a Node.js version is also available)
Release Time: The Python version gained significant attention within weeks of its release
Maintenance Status: Actively maintained

Community Response

Gained widespread attention on GitHub
The Python version was launched due to the success of the JavaScript version and the demand from the Python community
Received positive feedback in technical communities like Hacker News

Installation and Quick Start

Installation Method

pip install 'crawlee[all]'  # the "all" extra pulls in optional BeautifulSoup and Playwright support

Basic Usage Example

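import asyncio

# Import path used by recent Crawlee releases; older versions exposed
# the class from crawlee.beautifulsoup_crawler instead.
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = BeautifulSoupCrawler()

    # Define a request handler
    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract data; guard against pages that have no <title> tag
        title_tag = context.soup.find('title')
        data = {
            'title': title_tag.get_text(strip=True) if title_tag else None,
            'url': context.request.url,
        }

        # Save data to the default dataset
        await context.push_data(data)

    # Run the crawler
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    # Top-level await is not valid in a plain script, so use asyncio.run
    asyncio.run(main())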

Advanced Features

Proxy Support

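from crawlee.crawlers import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Configure proxy rotation: the crawler expects a ProxyConfiguration object
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://proxy1:8000', 'http://proxy2:8000'],
)
crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)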
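
The idea behind rotating through a list of proxy URLs can be sketched in a few lines of standard-library Python. This is a simple round-robin illustration, not Crawlee's implementation; `RoundRobinProxies` and `next_url` are hypothetical names, and the library's actual selection logic may differ.

```python
from itertools import cycle


class RoundRobinProxies:
    """Hands out proxy URLs in a fixed, repeating order."""

    def __init__(self, proxy_urls):
        urls = list(proxy_urls)
        if not urls:
            raise ValueError("at least one proxy URL is required")
        self._pool = cycle(urls)

    def next_url(self):
        # Each call advances to the next proxy and wraps around at the end.
        return next(self._pool)


rotation = RoundRobinProxies(["http://proxy1:8000", "http://proxy2:8000"])
picked = [rotation.next_url() for _ in range(3)]
```

With two proxies, the third request wraps around to the first proxy again, which spreads traffic evenly across the pool.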

Error Handling and Retries

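from datetime import timedelta

from crawlee.crawlers import BeautifulSoupCrawler

# Automatic retry configuration (parameter names may vary between Crawlee versions)
crawler = BeautifulSoupCrawler(
    max_requests_per_crawl=1000,  # stop after this many requests
    max_request_retries=3,  # retries allowed per failed request
    request_handler_timeout=timedelta(seconds=30),  # timeouts are given as timedelta
    retry_on_blocked=True,
)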
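
To show what a retry mechanism does conceptually, here is a minimal standalone sketch using only `asyncio`. It is not how Crawlee implements retries internally; `run_with_retries` and `flaky_handler` are hypothetical helpers, and the exponential delay is an assumption added for illustration.

```python
import asyncio


async def run_with_retries(handler, url, max_retries=3, base_delay=0.0):
    """Await handler(url); on failure, retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return await handler(url)
        except Exception:
            if attempt >= max_retries:
                # Retry budget exhausted: surface the last error to the caller.
                raise
            attempt += 1
            # Wait longer after each failure (no wait when base_delay is 0).
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


calls = []


async def flaky_handler(url):
    # Simulates a handler that fails twice before succeeding.
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated transient failure")
    return f"ok:{url}"


result = asyncio.run(run_with_retries(flaky_handler, "https://example.com"))
```

Here the handler succeeds on its third call, so two of the three allowed retries are used before the result is returned.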

Summary

Crawlee for Python is a modern web scraping library that is particularly well suited to collecting data for AI applications. It combines the strengths of several mature scraping technologies behind a clean API, making it a strong choice for Python developers. Whether the task is simple data extraction or complex browser automation, Crawlee provides a reliable solution.