A Node.js web scraping and browser automation library with JavaScript and TypeScript support, used to extract data for AI, LLM, and RAG applications.

Apache-2.0 | TypeScript | 18.0k | apify | Last Updated: 2025-06-19

Crawlee - Modern Web Scraping and Browser Automation Framework

Project Overview

Crawlee, developed by Apify, is a Node.js web scraping and browser automation library designed for building reliable web crawlers. It supports both JavaScript and TypeScript and extracts data for applications such as AI, large language models (LLMs), and retrieval-augmented generation (RAG).

GitHub: https://github.com/apify/crawlee

Core Features

🚀 Unified Crawling Interface

  • Multi-Engine Support: A single interface covers both plain HTTP crawling and headless-browser crawling.
  • Flexible Choice: Use HTTP crawling for speed, or a browser when pages need JavaScript rendering.

🔄 Intelligent Queue Management

  • Persistent Queue: A persistent queue of URLs to crawl, processed in breadth-first or depth-first order.
  • Auto-Scaling: Automatically adjusts the crawling scale based on system resources.
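
The difference between breadth-first and depth-first ordering comes down to where newly discovered URLs enter the queue. The sketch below illustrates this with a hypothetical in-memory `UrlQueue` class and a toy link graph. Crawlee's own request queue accepts a `forefront` option when adding requests, which is the idea modeled here; the class, the `crawlOrder` helper, and the paths are assumptions for illustration, not Crawlee's implementation.

```typescript
// Hypothetical in-memory model of a crawl queue; not Crawlee's RequestQueue.
class UrlQueue {
    private items: string[] = [];
    private seen = new Set<string>();

    // Adds a URL once; `forefront` places it at the head instead of the tail.
    add(url: string, options: { forefront?: boolean } = {}): boolean {
        if (this.seen.has(url)) return false;
        this.seen.add(url);
        if (options.forefront) this.items.unshift(url);
        else this.items.push(url);
        return true;
    }

    // Removes and returns the URL at the head of the queue.
    next(): string | undefined {
        return this.items.shift();
    }
}

// Toy link graph standing in for real pages.
const links: Record<string, string[]> = {
    '/': ['/a', '/b'],
    '/a': ['/a/1'],
    '/b': ['/b/1'],
};

// Crawls the toy graph and records the visit order.
function crawlOrder(forefront: boolean): string[] {
    const queue = new UrlQueue();
    queue.add('/');
    const visited: string[] = [];
    let url = queue.next();
    while (url !== undefined) {
        visited.push(url);
        for (const link of links[url] ?? []) queue.add(link, { forefront });
        url = queue.next();
    }
    return visited;
}

const breadthFirst = crawlOrder(false); // '/', '/a', '/b', '/a/1', '/b/1'
const depthFirst = crawlOrder(true);    // '/', '/b', '/b/1', '/a', '/a/1'
```

Appending to the tail visits every page at one depth before going deeper, while inserting at the head follows one branch to its end before returning to its siblings.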

💾 Flexible Storage System

  • Multi-Format Support: Pluggable storage for tabular data and files.
  • Local/Cloud: Defaults to local ./storage directory, with support for cloud storage.

🔒 Enterprise-Grade Anti-Detection

  • Proxy Rotation: Integrated proxy rotation and session management.
  • Human-Like Behavior: The default configuration aims to make crawlers appear human-like to modern bot protections.
  • Fingerprint Spoofing: Replicates browser TLS fingerprints and automatically generates browser-like request headers.
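
Proxy rotation combined with session management means that one logical session keeps using the same proxy until it is retired, and a new session then gets a different one. The sketch below shows that idea only. `ProxyRotator`, the session IDs, and the proxy URLs are hypothetical names for illustration; Crawlee ships its own `ProxyConfiguration` and `SessionPool` classes, and this is not their implementation.

```typescript
// Hypothetical round-robin rotator that keeps a proxy sticky per session.
class ProxyRotator {
    private readonly proxyUrls: string[];
    private readonly bySession = new Map<string, string>();
    private nextIndex = 0;

    constructor(proxyUrls: string[]) {
        if (proxyUrls.length === 0) throw new Error('At least one proxy URL is required');
        this.proxyUrls = [...proxyUrls];
    }

    // Returns the proxy bound to a session, assigning the next one on first use.
    proxyFor(sessionId: string): string {
        const existing = this.bySession.get(sessionId);
        if (existing !== undefined) return existing;
        const proxy = this.proxyUrls[this.nextIndex % this.proxyUrls.length]!;
        this.nextIndex += 1;
        this.bySession.set(sessionId, proxy);
        return proxy;
    }

    // Forgets a session (for example after a block) so it gets a fresh proxy next time.
    retire(sessionId: string): void {
        this.bySession.delete(sessionId);
    }
}

const rotator = new ProxyRotator([
    'http://proxy-1.example:8000',
    'http://proxy-2.example:8000',
    'http://proxy-3.example:8000',
]);

const first = rotator.proxyFor('session-a');   // first proxy in the list
const second = rotator.proxyFor('session-b');  // second proxy in the list
const sticky = rotator.proxyFor('session-a');  // same proxy as `first`
rotator.retire('session-a');
const rotated = rotator.proxyFor('session-a'); // third proxy in the list
```

Binding a proxy to a session keeps cookies, headers, and IP address consistent with each other, which is the reason the two features are usually described together.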

🛠 Developer-Friendly

  • Native TypeScript Support: Complete type definitions and generics support.
  • CLI Tool: Provides scaffolding for quick project creation.
  • Lifecycle Hooks: Customizable lifecycle event handling.
  • Docker Ready: Built-in Dockerfile for easy deployment.

Supported Crawling Methods

HTTP Crawling

  • High Performance: HTTP/2 support with zero configuration, even through proxies.
  • Intelligent Parsing: Integrated Cheerio and JSDOM for fast HTML parsing.
  • API Friendly: Also supports crawling JSON APIs.

Browser Automation

  • Multi-Browser: Supports Chrome, Firefox, WebKit, and other browsers.
  • JavaScript Rendering: Handles dynamic content and single-page applications.
  • Screenshot Functionality: Supports page screenshots.
  • Headless/Headful Mode: Flexible runtime mode selection.
  • Unified Interface: Playwright and Puppeteer are used through the same interface.

Quick Start

Create a Project Using CLI

npx crawlee create my-crawler
cd my-crawler
npm start

Basic Example Code

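import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler loads each URL in a browser managed by Playwright.
const crawler = new PlaywrightCrawler({
    // Called once for every page the crawler loads.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        // Save the result to the default dataset.
        await Dataset.pushData({ title, url: request.loadedUrl });
        // Extract links from the page and add them to the request queue.
        await enqueueLinks();
    },
    // Uncomment to watch the browser window while crawling.
    // headless: false,
});

// Start from this URL; top-level await requires an ES module.
await crawler.run(['https://crawlee.dev']);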

Install Dependencies

npm install crawlee playwright

Technical Architecture

Core Modules

  • @crawlee/core: Core functionality module.
  • @crawlee/types: TypeScript type definitions.
  • @crawlee/utils: Utility functions.

Supported Libraries and Tools

  • Playwright: Modern browser automation.
  • Puppeteer: Chrome/Chromium automation.
  • Cheerio: Fast HTML parsing.
  • JSDOM: DOM manipulation and parsing.

Deployment and Integration

Local Development

  • By default, data is stored in the ./storage directory.
  • The storage location can be customized through configuration.
  • A configuration guide is available in the documentation.
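
As a rough picture of how the storage location resolves, the sketch below assumes the documented `CRAWLEE_STORAGE_DIR` environment variable and the default on-disk layout in which datasets live under `datasets/default` inside the storage directory. The `defaultDatasetDir` helper and the `/data/crawl` path are hypothetical; the function mirrors the layout for illustration and is not Crawlee's own path logic.

```typescript
// Sketch only: mirrors the default layout, not Crawlee's own path resolution.
type Env = Record<string, string | undefined>;

// Returns the directory that would hold the default dataset for a given environment.
function defaultDatasetDir(env: Env): string {
    // CRAWLEE_STORAGE_DIR overrides the default ./storage location when set.
    const storageDir = env['CRAWLEE_STORAGE_DIR'] ?? './storage';
    return `${storageDir}/datasets/default`;
}

const localDir = defaultDatasetDir({});
const customDir = defaultDatasetDir({ CRAWLEE_STORAGE_DIR: '/data/crawl' });
```

Here `localDir` resolves to `./storage/datasets/default` and `customDir` to `/data/crawl/datasets/default`. Key-value stores and request queues sit in sibling directories (`key_value_stores` and `request_queues`) under the same root.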

Cloud Deployment

  • Apify Platform: Since Crawlee is developed by Apify, it can be easily deployed to the Apify cloud platform.
  • Docker Support: Built-in Docker configuration for containerized deployment.
  • Cross-Platform: Can run in any environment that supports Node.js.

Version Management

  • Stable Version: Install stable releases via npm.
  • Beta Version: Install a beta release to test new features, for example:

npm install crawlee@3.12.3-beta.13

Applicable Scenarios

Data Science and AI

  • Machine Learning Datasets: Collect training data for AI models.
  • RAG Systems: Provide knowledge bases for Retrieval-Augmented Generation systems.
  • LLM Training: Pre-training data collection for large language models.

Business Applications

  • Competitive Analysis: Monitor competitors' products and pricing information.
  • Market Research: Collect industry trends and market data.
  • Content Aggregation: Automated collection of news, articles, and other content.

Technical Monitoring

  • Website Monitoring: Regularly check for website changes.
  • Price Tracking: Monitor e-commerce product prices.
  • Data Backup: Regularly back up important web page content.

Summary

Crawlee is a comprehensive and modern web scraping framework, particularly suitable for enterprise-grade applications that require high reliability and anti-detection capabilities. Its unified API design, powerful anti-detection features, and complete ecosystem make it an ideal choice for modern data collection projects. Whether collecting data for AI projects or conducting business intelligence analysis, Crawlee provides a stable and reliable solution.