Crawlee - Modern Web Scraping and Browser Automation Framework
Project Overview
Crawlee, developed by Apify, is a powerful Node.js web scraping and browser automation library designed for building reliable web crawlers. It supports both JavaScript and TypeScript and is well suited to extracting high-quality data for applications such as AI, large language models (LLMs), and retrieval-augmented generation (RAG).
GitHub: https://github.com/apify/crawlee
Core Features
🚀 Unified Crawling Interface
- Multi-Engine Support: Unified interface supports HTTP requests and headless browser crawling.
- Flexible Choice: Choose the appropriate crawling method based on your needs.
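The unified interface means the HTTP-based and browser-based crawler classes accept the same option shape, so switching engines is mostly a one-line change. A minimal sketch (assuming `crawlee` is installed; the URL is Crawlee's own site):

```javascript
// CheerioCrawler (plain HTTP) and PlaywrightCrawler (headless browser)
// share the same options and handler context pattern.
import { CheerioCrawler } from 'crawlee';
// import { PlaywrightCrawler } from 'crawlee'; // browser-based alternative

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        // `$` is the Cheerio handle for the parsed HTML.
        log.info(`Visiting ${request.url}: ${$('title').text()}`);
        await enqueueLinks(); // same call exists on the browser crawlers
    },
});

await crawler.run(['https://crawlee.dev']);
```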
🔄 Intelligent Queue Management
- Persistent Queue: Supports breadth-first and depth-first URL crawling queues.
- Auto-Scaling: Automatically adjusts the crawling scale based on system resources.
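The auto-scaling behavior can be bounded through crawler options; Crawlee's autoscaled pool then varies the number of parallel tasks between those limits based on available CPU and memory. A sketch of the relevant options (values are illustrative):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    minConcurrency: 2,         // never drop below 2 parallel requests
    maxConcurrency: 20,        // hard upper bound for the autoscaled pool
    maxRequestsPerMinute: 120, // optional rate limit on top of scaling
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```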
💾 Flexible Storage System
- Multi-Format Support: Pluggable storage for tabular data and files.
- Local/Cloud: Defaults to the local ./storage directory, with support for cloud storage.
🔒 Enterprise-Grade Anti-Detection
- Proxy Rotation: Integrated proxy rotation and session management.
- Human-Like Simulation: Simulates human behavior by default to bypass modern bot detection.
- Fingerprint Spoofing: Automatically generates realistic browser TLS fingerprints and request headers.
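Proxy rotation and session management plug directly into a crawler. A sketch using Crawlee's ProxyConfiguration (the proxy URLs below are placeholders, not real endpoints):

```javascript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Crawlee rotates through the listed proxies automatically.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,           // tie proxies to rotating sessions
    persistCookiesPerSession: true, // keep cookies consistent per session
    async requestHandler({ request, proxyInfo, log }) {
        log.info(`Fetched ${request.url} via ${proxyInfo?.url}`);
    },
});
```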
🛠 Developer-Friendly
- Native TypeScript Support: Complete type definitions and generics support.
- CLI Tool: Provides scaffolding for quick project creation.
- Lifecycle Hooks: Customizable lifecycle event handling.
- Docker Ready: Built-in Dockerfile for easy deployment.
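The lifecycle hooks mentioned above let you run code around each navigation. A sketch with a browser crawler (the asset-blocking pattern is illustrative, not required):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Runs before each page load, e.g. to block heavy assets.
            await page.route('**/*.{png,jpg,jpeg}', (route) => route.abort());
        },
    ],
    postNavigationHooks: [
        async ({ page }) => {
            // Runs after navigation, e.g. to wait for dynamic content.
            await page.waitForLoadState('networkidle');
        },
    ],
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.url}`);
    },
});
```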
Supported Crawling Methods
HTTP Crawling
- High Performance: Zero-configuration HTTP/2 support, including over proxies.
- Intelligent Parsing: Integrated Cheerio and JSDOM for fast HTML parsing.
- API Friendly: Also supports crawling JSON APIs.
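Crawling a JSON API works without a browser at all. A sketch using the HTTP-level crawler (the endpoint URL is a placeholder; in my understanding, the handler context exposes the parsed body as `json` when the response has a JSON content type):

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, json, log }) {
        // `json` holds the parsed response body for JSON responses.
        log.info(`Got ${request.url}`);
        console.log(json);
    },
});

await crawler.run(['https://api.example.com/items']);
```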
Browser Automation
- Multi-Browser: Supports Chromium/Chrome, Firefox, and WebKit.
- JavaScript Rendering: Handles dynamic content and single-page applications.
- Screenshot Functionality: Supports page screenshots.
- Headless/Headful Mode: Flexible runtime mode selection.
- Unified Interface: Playwright- and Puppeteer-based crawlers expose the same API.
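Switching browsers is a matter of passing a different Playwright launcher. A sketch running the crawler under Firefox instead of the default Chromium (assuming `playwright` is installed alongside `crawlee`):

```javascript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright'; // `webkit` works the same way

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: true },
    },
    async requestHandler({ page, request, log }) {
        log.info(`${request.url}: ${await page.title()}`);
    },
});
```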
Quick Start
Create a Project Using CLI
npx crawlee create my-crawler
cd my-crawler
npm start
Basic Example Code
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        // Save the result to the default dataset.
        await Dataset.pushData({ title, url: request.loadedUrl });
        // Enqueue links found on the page for further crawling.
        await enqueueLinks();
    },
    // Uncomment to watch the browser in a headful window:
    // headless: false,
});

await crawler.run(['https://crawlee.dev']);
Install Dependencies
npm install crawlee playwright
Technical Architecture
Core Modules
- @crawlee/core: Core functionality module.
- @crawlee/types: TypeScript type definitions.
- @crawlee/utils: Utility functions.
Supported Libraries and Tools
- Playwright: Modern browser automation.
- Puppeteer: Chrome/Chromium automation.
- Cheerio: Fast HTML parsing.
- JSDOM: DOM manipulation and parsing.
Deployment and Integration
Local Development
- Data is stored in the ./storage directory by default.
- Supports customizing the storage location via configuration files.
- Configuration options are fully covered in the official documentation.
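One common way to point storage elsewhere is the CRAWLEE_STORAGE_DIR environment variable; the path below is only an example:

```shell
# Override the default ./storage location for a single run.
CRAWLEE_STORAGE_DIR=/data/my-crawler-storage npm start
```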
Cloud Deployment
- Apify Platform: Since Crawlee is developed by Apify, it can be easily deployed to the Apify cloud platform.
- Docker Support: Built-in Docker configuration for containerized deployment.
- Cross-Platform: Can run in any environment that supports Node.js.
Version Management
- Stable Version: Install stable releases via npm.
- Beta Version: Supports installing beta versions to test new features.
npm install crawlee@3.12.3-beta.13
Applicable Scenarios
Data Science and AI
- Machine Learning Datasets: Collect training data for AI models.
- RAG Systems: Provide knowledge bases for Retrieval-Augmented Generation systems.
- LLM Training: Pre-training data collection for large language models.
Business Applications
- Competitive Analysis: Monitor competitors' products and pricing information.
- Market Research: Collect industry trends and market data.
- Content Aggregation: Automated collection of news, articles, and other content.
Technical Monitoring
- Website Monitoring: Regularly check for website changes.
- Price Tracking: Monitor e-commerce product prices.
- Data Backup: Regularly back up important web page content.
Summary
Crawlee is a comprehensive and modern web scraping framework, particularly suitable for enterprise-grade applications that require high reliability and anti-detection capabilities. Its unified API design, powerful anti-detection features, and complete ecosystem make it an ideal choice for modern data collection projects. Whether collecting data for AI projects or conducting business intelligence analysis, Crawlee provides a stable and reliable solution.