Crawlee Python - Web Scraping and Browser Automation Library
Project Overview
Crawlee for Python is a web scraping and browser automation library for building reliable crawlers. It can extract data for AI, LLM, RAG, or GPT applications, and download HTML, PDF, JPG, PNG, and other files from websites. Developed by Apify, it is an open-source library that builds on BeautifulSoup and Playwright and takes an all-in-one approach to web scraping.
Key Features
Core Functionality
- Multi-Engine Support: Works with BeautifulSoup, Playwright, and native HTTP
- Flexible Modes: Supports both headed and headless modes
- Proxy Rotation: Built-in proxy rotation functionality
- File Download: Downloads files in formats such as HTML, PDF, JPG, and PNG
- AI-Oriented Data Extraction: Suited to extracting data for AI, LLM, RAG, and GPT applications
Technical Advantages
- Type Hints: Modern design with Python type hints to help catch errors early
- Stable and Reliable: Built by professional developers who scrape millions of pages daily
- Easy to Use: Lets you switch between scraping libraries as your needs change
- Error Handling: Built-in robust error handling and retry mechanisms
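To make the type-hint point concrete, here is a minimal sketch of a typed record for scraped data. The `PageRecord` class and `make_record` helper are hypothetical names used for illustration only and are not part of Crawlee. The idea is that a static checker such as mypy can flag a wrongly typed field before the crawler ever runs.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical shape of one scraped item (not a Crawlee type)."""

    url: str
    title: Optional[str]
    status_code: int


def make_record(url: str, title: Optional[str], status_code: int) -> dict:
    """Build a plain dict that could be handed to a dataset."""
    record = PageRecord(url=url, title=title, status_code=status_code)
    return asdict(record)


item = make_record('https://example.com', 'Example Domain', 200)

# A type checker would reject the call below, because '200' is a str, not an int:
# make_record('https://example.com', None, '200')
```

Dataclasses do not validate types at runtime, so the benefit here comes from running a type checker as part of development.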
Technical Architecture
Underlying Technology Stack
Main Dependencies
- BeautifulSoup: Static HTML parsing
- Playwright: Rendering of dynamic, JavaScript-driven pages
- HTTP Client: Native HTTP request support
Integration Capabilities
- Apify Platform Integration: Seamless integration with the Apify platform
- Multiple Scraping Techniques: Supports various scraping techniques from static HTML parsing to dynamic JavaScript rendering
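As a rough illustration of the static HTML parsing end of that range, the sketch below extracts a page title and its links from an HTML string. It deliberately uses only the standard library's `html.parser` as a stand-in, so it is not BeautifulSoup and not Crawlee's API. The `TitleAndLinkParser` class, `parse_static_html` function, and `SAMPLE_HTML` string are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import List


class TitleAndLinkParser(HTMLParser):
    """Collects the <title> text and every <a href="..."> value."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ''
        self.links: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is accumulated.
        if self._in_title:
            self.title += data


def parse_static_html(html: str) -> dict:
    """Parse already-downloaded HTML; no JavaScript is executed."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return {'title': parser.title.strip(), 'links': parser.links}


SAMPLE_HTML = (
    '<html><head><title>Demo</title></head>'
    '<body><a href="/a">A</a><a href="/b">B</a></body></html>'
)
parsed = parse_static_html(SAMPLE_HTML)
```

Content that a page builds with JavaScript after loading would not appear in the HTML string, which is the case where a browser-based engine such as Playwright is needed instead.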
Use Cases
Main Application Areas
- AI Data Collection: Collecting training data for machine learning and AI applications
- RAG Systems: Providing data sources for Retrieval-Augmented Generation systems
- GPT Applications: Providing real-time data for various GPT applications
- Content Monitoring: Monitoring website content changes
- Data Analysis: Collecting data for business analysis
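For the content monitoring use case, one common approach is to fingerprint each page's text and compare it with the fingerprint from the previous crawl. The sketch below shows that idea using only the standard library. The `ChangeTracker` class and its in-memory storage are hypothetical and not something Crawlee provides; a real monitor would persist the fingerprints between runs.

```python
import hashlib
from typing import Dict


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace, so formatting noise is ignored."""
    normalized = ' '.join(text.split())
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()


class ChangeTracker:
    """Hypothetical in-memory tracker keyed by URL."""

    def __init__(self) -> None:
        self._fingerprints: Dict[str, str] = {}

    def check(self, url: str, text: str) -> bool:
        """Return True if the page is new or its content changed since the last check."""
        current = content_fingerprint(text)
        changed = self._fingerprints.get(url) != current
        self._fingerprints[url] = current
        return changed


tracker = ChangeTracker()
first_seen = tracker.check('https://example.com', 'Hello   world')
same_content = tracker.check('https://example.com', 'Hello world')
new_content = tracker.check('https://example.com', 'Hello there')
```

A check like this could be called from a request handler with the extracted page text, with only changed pages being stored or reported.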
Comparison with Competitors
Scrapy and Crawlee are the two main open-source scraping options for Python. Apify favors Crawlee, on the view that beginners will prefer it because it lets them build crawlers with less code and less time spent reading documentation.
Project Status
Release Information
- Open Source License: Open source and free to use (Apache 2.0 license)
- Language Support: Python (a Node.js version is also available)
- Release Time: The Python version gained significant attention within weeks of its release
- Maintenance Status: Actively maintained
Community Response
- Gained widespread attention on GitHub
- The Python version was launched due to the success of the JavaScript version and the demand from the Python community
- Received positive feedback in technical communities like Hacker News
Installation and Quick Start
Installation Method
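# The [all] extra pulls in the BeautifulSoup and Playwright integrations; plain "pip install crawlee" installs only the core package
pip install 'crawlee[all]'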
Basic Usage Example
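import asyncio

# Import path used by recent Crawlee releases; older versions used crawlee.beautifulsoup_crawler
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = BeautifulSoupCrawler()

    # Define the default request handler
    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract data (a page may have no <title> element)
        title = context.soup.title
        data = {
            'title': title.get_text() if title else None,
            'url': context.request.url,
        }
        # Save data to the default dataset
        await context.push_data(data)

    # Run the crawler
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())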
Advanced Features
Proxy Support
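from crawlee.crawlers import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Configure proxy rotation across a list of proxy URLs
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://proxy1:8000', 'http://proxy2:8000'],
)
crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)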
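Conceptually, rotating over a list of proxy URLs like the one above can be as simple as cycling through them in order. The sketch below shows plain round-robin selection with the standard library. It is an illustration of the idea, not Crawlee's internal implementation, which may choose proxies differently (for example per session); `RoundRobinProxies` is a hypothetical name.

```python
from itertools import cycle
from typing import Iterator, List


class RoundRobinProxies:
    """Hypothetical helper that hands out proxy URLs in a repeating order."""

    def __init__(self, proxy_urls: List[str]) -> None:
        if not proxy_urls:
            raise ValueError('at least one proxy URL is required')
        self._cycle: Iterator[str] = cycle(proxy_urls)

    def next_proxy(self) -> str:
        """Return the next proxy URL, wrapping around at the end of the list."""
        return next(self._cycle)


rotation = RoundRobinProxies(['http://proxy1:8000', 'http://proxy2:8000'])
picked = [rotation.next_proxy() for _ in range(3)]
```

Round-robin spreads requests evenly but ignores proxy health, so a production setup would usually also drop or deprioritize proxies that keep getting blocked.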
Error Handling and Retries
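from datetime import timedelta

from crawlee.crawlers import BeautifulSoupCrawler

# Retry and limit configuration (parameter names may differ between Crawlee versions)
crawler = BeautifulSoupCrawler(
    max_requests_per_crawl=1000,  # stop the crawl after 1000 requests
    max_request_retries=3,  # retry each failed request up to 3 times
    request_handler_timeout=timedelta(seconds=30),  # time limit per handler call
    retry_on_blocked=True,  # retry when the response looks like a block page
)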
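To show what an automatic retry mechanism does in general terms, here is a small standalone sketch of retrying an async operation with exponential backoff. It is not how Crawlee implements retries internally. The `retry_async` helper, its parameters, and the `flaky_fetch` function that simulates transient failures are all assumptions made for this example.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar('T')


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 0.01,
) -> T:
    """Run operation, retrying on any exception with exponentially growing delays."""
    attempt = 0
    while True:
        try:
            return await operation()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))
            attempt += 1


calls = {'count': 0}


async def flaky_fetch() -> str:
    """Simulated request that fails twice before succeeding."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated transient failure')
    return 'ok'


outcome = asyncio.run(retry_async(flaky_fetch))
```

The short `base_delay` keeps the example fast; real crawlers typically wait longer between attempts and often retry only on errors that are likely to be transient.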
Summary
Crawlee for Python is a modern web scraping library that is particularly well suited to collecting data for AI applications. It combines mature scraping technologies behind a clean API, which makes it a strong option for Python developers. It covers both simple data extraction and more complex browser automation tasks.
