Crawlee - Modern Web Scraping and Browser Automation Framework
Project Overview
Crawlee, developed by Apify, is a powerful Node.js web scraping and browser automation library designed for building reliable web crawlers. It supports both JavaScript and TypeScript and is well suited to extracting high-quality data for applications such as AI, large language models (LLMs), and retrieval-augmented generation (RAG).
GitHub: https://github.com/apify/crawlee
Core Features
🚀 Unified Crawling Interface
- Multi-Engine Support: Unified interface supports HTTP requests and headless browser crawling.
- Flexible Choice: Choose the appropriate crawling method based on your needs.
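The unified interface means the HTTP-based and browser-based crawler classes accept the same option shape, so switching engines is mostly a one-line change. A minimal sketch (assuming `crawlee` is installed; the URL is Crawlee's own site):

```javascript
// CheerioCrawler (plain HTTP) and PlaywrightCrawler (headless browser)
// share the same options and handler context pattern.
import { CheerioCrawler } from 'crawlee';
// import { PlaywrightCrawler } from 'crawlee'; // browser-based alternative

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        // `$` is the Cheerio handle for the parsed HTML.
        log.info(`Visiting ${request.url}: ${$('title').text()}`);
        await enqueueLinks(); // same call exists on the browser crawlers
    },
});

await crawler.run(['https://crawlee.dev']);
```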
🔄 Intelligent Queue Management
- Persistent Queue: Supports breadth-first and depth-first URL crawling queues.
- Auto-Scaling: Automatically adjusts the crawling scale based on system resources.
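The auto-scaling behavior can be bounded through crawler options; Crawlee's autoscaled pool then varies the number of parallel tasks between those limits based on available CPU and memory. A sketch of the relevant options (values are illustrative):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    minConcurrency: 2,         // never drop below 2 parallel requests
    maxConcurrency: 20,        // hard upper bound for the autoscaled pool
    maxRequestsPerMinute: 120, // optional rate limit on top of scaling
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```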
💾 Flexible Storage System
- Multi-Format Support: Pluggable storage for tabular data and files.
- Local/Cloud: Defaults to the local ./storage directory, with support for cloud storage.
🔒 Enterprise-Grade Anti-Detection
- Proxy Rotation: Integrated proxy rotation and session management.
- Human-Like Simulation: Simulates human behavior by default to bypass modern bot detection.
- Fingerprint Spoofing: Automatically generates realistic browser TLS fingerprints and request headers.
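Proxy rotation and session management plug directly into a crawler. A sketch using Crawlee's ProxyConfiguration (the proxy URLs below are placeholders, not real endpoints):

```javascript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Crawlee rotates through the listed proxies automatically.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,           // tie proxies to rotating sessions
    persistCookiesPerSession: true, // keep cookies consistent per session
    async requestHandler({ request, proxyInfo, log }) {
        log.info(`Fetched ${request.url} via ${proxyInfo?.url}`);
    },
});
```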
🛠 Developer-Friendly
- Native TypeScript Support: Complete type definitions and generics support.
- CLI Tool: Provides scaffolding for quick project creation.
- Lifecycle Hooks: Customizable lifecycle event handling.
- Docker Ready: Built-in Dockerfile for easy deployment.
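The lifecycle hooks mentioned above let you run code around each navigation. A sketch with a browser crawler (the asset-blocking pattern is illustrative, not required):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Runs before each page load, e.g. to block heavy assets.
            await page.route('**/*.{png,jpg,jpeg}', (route) => route.abort());
        },
    ],
    postNavigationHooks: [
        async ({ page }) => {
            // Runs after navigation, e.g. to wait for dynamic content.
            await page.waitForLoadState('networkidle');
        },
    ],
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.url}`);
    },
});
```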
Supported Crawling Methods
HTTP Crawling
- High Performance: Zero-configuration HTTP/2 support, including over proxies.
- Intelligent Parsing: Integrated Cheerio and JSDOM for fast HTML parsing.
- API Friendly: Also supports crawling JSON APIs.
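Crawling a JSON API works without a browser at all. A sketch using the HTTP-level crawler (the endpoint URL is a placeholder; in my understanding, the handler context exposes the parsed body as `json` when the response has a JSON content type):

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, json, log }) {
        // `json` holds the parsed response body for JSON responses.
        log.info(`Got ${request.url}`);
        console.log(json);
    },
});

await crawler.run(['https://api.example.com/items']);
```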
Browser Automation
- Multi-Browser: Supports Chromium/Chrome, Firefox, and WebKit.
- JavaScript Rendering: Handles dynamic content and single-page applications.
- Screenshot Functionality: Supports page screenshots.
- Headless/Headful Mode: Flexible runtime mode selection.
- Unified Interface: Playwright- and Puppeteer-based crawlers expose the same API.
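Switching browsers is a matter of passing a different Playwright launcher. A sketch running the crawler under Firefox instead of the default Chromium (assuming `playwright` is installed alongside `crawlee`):

```javascript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright'; // `webkit` works the same way

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: true },
    },
    async requestHandler({ page, request, log }) {
        log.info(`${request.url}: ${await page.title()}`);
    },
});
```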
Quick Start
Create a Project Using CLI
npx crawlee create my-crawler
cd my-crawler
npm start
Basic Example Code
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        // Save the result to the default dataset.
        await Dataset.pushData({ title, url: request.loadedUrl });
        // Enqueue links found on the page for further crawling.
        await enqueueLinks();
    },
    // Uncomment to watch the browser in a headful window:
    // headless: false,
});

await crawler.run(['https://crawlee.dev']);
Install Dependencies
npm install crawlee playwright
Technical Architecture
Core Modules
- @crawlee/core: Core functionality module.
- @crawlee/types: TypeScript type definitions.
- @crawlee/utils: Utility functions.
Supported Libraries and Tools
- Playwright: Modern browser automation.
- Puppeteer: Chrome/Chromium automation.
- Cheerio: Fast HTML parsing.
- JSDOM: DOM manipulation and parsing.
Deployment and Integration
Local Development
- Data is stored in the ./storage directory by default.
- Supports customizing the storage location via configuration files.
- Configuration options are fully covered in the official documentation.
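One common way to point storage elsewhere is the CRAWLEE_STORAGE_DIR environment variable; the path below is only an example:

```shell
# Override the default ./storage location for a single run.
CRAWLEE_STORAGE_DIR=/data/my-crawler-storage npm start
```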
Cloud Deployment
- Apify Platform: Since Crawlee is developed by Apify, it can be easily deployed to the Apify cloud platform.
- Docker Support: Built-in Docker configuration for containerized deployment.
- Cross-Platform: Can run in any environment that supports Node.js.
Version Management
- Stable Version: Install stable releases via npm.
- Beta Version: Supports installing beta versions to test new features.
npm install crawlee@3.12.3-beta.13
Applicable Scenarios
Data Science and AI
- Machine Learning Datasets: Collect training data for AI models.
- RAG Systems: Provide knowledge bases for Retrieval-Augmented Generation systems.
- LLM Training: Pre-training data collection for large language models.
Business Applications
- Competitive Analysis: Monitor competitors' products and pricing information.
- Market Research: Collect industry trends and market data.
- Content Aggregation: Automated collection of news, articles, and other content.
Technical Monitoring
- Website Monitoring: Regularly check for website changes.
- Price Tracking: Monitor e-commerce product prices.
- Data Backup: Regularly back up important web page content.
Summary
Crawlee is a comprehensive and modern web scraping framework, particularly suitable for enterprise-grade applications that require high reliability and anti-detection capabilities. Its unified API design, powerful anti-detection features, and complete ecosystem make it an ideal choice for modern data collection projects. Whether collecting data for AI projects or conducting business intelligence analysis, Crawlee provides a stable and reliable solution.