Firecrawl Project Detailed Introduction
Project Overview
Firecrawl is an API service that receives a URL, crawls it, and converts it into clean markdown or structured data. It crawls all accessible subpages, providing clean data for each page. No sitemap is required.
Core Features
1. Web Scraping
- Scrapes a single URL and retrieves content in an LLM-ready format.
- Supports multiple output formats: markdown, structured data, screenshots, HTML.
- Extracts structured data via LLM.
2. Website Crawling
- Crawls all URLs of a website and returns content in an LLM-ready format.
- Discovers all accessible subpages without a sitemap.
- Supports custom crawl depth and exclusion rules.
3. Website Mapping
- Takes a website as input and retrieves all of its URLs, extremely fast.
- Supports searching for specific URL patterns.
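A map request can be illustrated with a minimal body builder. Note that the endpoint name (`/v1/map`) and the `search` filter field are assumptions inferred from the feature description above, not quoted from the API reference:

```python
import json

# Sketch of a map request body. The "/v1/map" endpoint name and the
# "search" field are assumptions; verify them against the API reference.
def build_map_request(url, search=None):
    """JSON body for a site-mapping request."""
    body = {"url": url}
    if search is not None:
        body["search"] = search  # only return URLs matching this pattern
    return body

print(json.dumps(build_map_request("https://firecrawl.dev", search="docs")))
```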
4. Web Search
- Searches the web and retrieves full content from the results.
- Customizable search parameters (language, country, etc.).
- Option to retrieve various content formats from search results.
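A search request might combine the query with locale parameters and optional scrape formats. This is a sketch only; the `/v1/search` endpoint name and the field names (`query`, `lang`, `country`, `scrapeOptions`) are assumptions to verify against the API reference:

```python
import json

# Sketch of a search request body; endpoint and field names are assumptions.
def build_search_request(query, limit=5, lang="en", country="us", formats=None):
    """JSON body for a web search, optionally scraping each result's content."""
    body = {"query": query, "limit": limit, "lang": lang, "country": country}
    if formats:
        # also retrieve each result's content in these formats
        body["scrapeOptions"] = {"formats": list(formats)}
    return body

print(json.dumps(build_search_request("firecrawl", formats=["markdown"])))
```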
5. Data Extraction
- Uses AI to extract structured data from single pages, multiple pages, or entire websites.
- Supports defining extraction rules via prompts and JSON schemas.
- Supports wildcard URL patterns.
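Putting these three ingredients together, an extraction request could pair a wildcard URL pattern with a prompt and a JSON schema. The field names below are assumptions modeled on the structured data extraction example later in this document:

```python
import json

# One extraction request combining a wildcard URL pattern, a prompt, and a
# JSON schema. Field names are assumptions modeled on this document's
# structured-extraction example, not quoted from the API reference.
extract_body = {
    "urls": ["https://docs.firecrawl.dev/*"],  # wildcard: every page under the site
    "prompt": "Extract the page title and a one-sentence summary.",
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "summary": {"type": "string"},
        },
        "required": ["title", "summary"],
    },
}
print(json.dumps(extract_body, indent=2))
```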
6. Batching
- New asynchronous endpoint for scraping thousands of URLs simultaneously.
- Submitting a batch scraping job returns a job ID that can be used to check its status.
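The batch flow can be sketched end to end: submit a list of URLs, receive a job ID, then poll for status. The `/v1/batch/scrape` path and the `id` response field are assumptions to confirm against the API reference:

```python
import json
import os
import urllib.request

# End-to-end sketch of the batch flow. The "/v1/batch/scrape" path and the
# "id" response field are assumptions; check the API reference before use.
API = "https://api.firecrawl.dev/v1/batch/scrape"

def build_batch_request(urls, formats=("markdown",)):
    """JSON body for a batch scrape submission."""
    return {"urls": list(urls), "formats": list(formats)}

def submit_batch(urls):
    """POST the batch job and return its job ID for later status checks."""
    req = urllib.request.Request(
        API,
        data=json.dumps(build_batch_request(urls)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

# Example (needs FIRECRAWL_API_KEY set and network access):
# job_id = submit_batch(["https://firecrawl.dev", "https://docs.firecrawl.dev"])
# The status would then be checked at f"{API}/{job_id}".
```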
Technical Features
LLM-Ready Formats
- Markdown: Clean document format.
- Structured Data: Extracted data in JSON format.
- Screenshots: Visual capture of the page.
- HTML: Raw HTML content.
- Links and Metadata: Page information extraction.
Handling Complex Situations
- Proxies and Anti-Bot Mechanisms: Bypasses access restrictions.
- Dynamic Content: Handles JavaScript-rendered content.
- Output Parsing: Intelligent content parsing.
- Orchestration: Complex process management.
Customization Capabilities
- Exclude Tags: Filters unwanted content.
- Authenticated Crawling: Crawls content requiring authentication using custom headers.
- Maximum Crawl Depth: Controls the crawl scope.
- Media Parsing: Supports PDF, DOCX, images.
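The customization options above can be combined in a single crawl request body. The field names used below (`maxDepth`, `excludeTags`, `headers`) are assumptions modeled on the crawl example later in this document; verify them against the API reference:

```python
import json

# One crawl body combining the customization knobs listed above.
# Field names (maxDepth, excludeTags, headers) are assumptions.
crawl_body = {
    "url": "https://docs.firecrawl.dev",
    "maxDepth": 2,  # limit the crawl to two link hops from the start URL
    "scrapeOptions": {
        "formats": ["markdown"],
        "excludeTags": ["nav", "footer"],  # filter out boilerplate elements
        "headers": {"Cookie": "session=YOUR_SESSION"},  # authenticated crawling
    },
}
print(json.dumps(crawl_body, indent=2))
```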
Interactive Features (Actions)
Firecrawl can perform actions on a page before scraping its content:
- Click: Clicks on page elements.
- Scroll: Page scrolling operations.
- Input: Text input.
- Wait: Waits for page loading.
- Press Key: Keyboard operations.
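A scrape request with a pre-scrape action sequence might look like the sketch below; the exact shape of each action object is an assumption based on the feature list above, not quoted from the API reference:

```python
import json

# A scrape request that performs browser actions before capturing content.
# The shape of each action object is an assumption based on the feature list.
scrape_body = {
    "url": "https://example.com/login",
    "formats": ["markdown"],
    "actions": [
        {"type": "write", "text": "user@example.com"},  # input text
        {"type": "press", "key": "Tab"},                # keyboard operation
        {"type": "click", "selector": "#submit"},       # click a page element
        {"type": "wait", "milliseconds": 2000},         # wait for loading
        {"type": "scroll", "direction": "down"},        # scroll the page
    ],
}
print(json.dumps(scrape_body))
```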
API Usage Examples
Crawl Website
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "limit": 10,
    "scrapeOptions": {
      "formats": ["markdown", "html"]
    }
  }'
Scrape Single Page
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "formats": ["markdown", "html"]
  }'
Structured Data Extraction
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -d '{
    "url": "https://www.mendable.ai/",
    "formats": ["json"],
    "jsonOptions": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": {"type": "string"},
          "supports_sso": {"type": "boolean"},
          "is_open_source": {"type": "boolean"},
          "is_in_yc": {"type": "boolean"}
        },
        "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
      }
    }
  }'
SDK Support
Python SDK
pip install firecrawl-py
from firecrawl import FirecrawlApp, ScrapeOptions

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a single URL
scrape_status = app.scrape_url(
    'https://firecrawl.dev',
    formats=["markdown", "html"]
)
print(scrape_status)

# Crawl a website
crawl_status = app.crawl_url(
    'https://firecrawl.dev',
    limit=100,
    scrape_options=ScrapeOptions(formats=["markdown", "html"]),
    poll_interval=30
)
print(crawl_status)
Node.js SDK
npm install @mendable/firecrawl-js
import FirecrawlApp, { CrawlParams, CrawlStatusResponse } from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: "fc-YOUR_API_KEY" });

// Scrape a single URL
const scrapeResponse = await app.scrapeUrl('https://firecrawl.dev', {
  formats: ['markdown', 'html'],
});
if (scrapeResponse) {
  console.log(scrapeResponse);
}

// Crawl a website
const crawlResponse = await app.crawlUrl('https://firecrawl.dev', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown', 'html'],
  },
} as CrawlParams, true, 30) as CrawlStatusResponse;
Integration Support
LLM Framework Integration
- Langchain: Python and JavaScript versions.
- Llama Index: Data connector.
- Crew.ai: AI agent framework.
- Composio: Tool integration.
- PraisonAI: AI orchestration.
- Superinterface: Assistant features.
- Vectorize: Vectorization integration.
Low-Code Frameworks
- Dify: AI application building platform.
- Langflow: Visual AI flows.
- Flowise AI: No-code AI building.
- Cargo: Data integration.
- Pipedream: Workflow automation.
Other Integrations
- Zapier: Automated workflows.
- Pabbly Connect: Application integration.
License and Deployment
Open Source License
- Primarily uses GNU Affero General Public License v3.0 (AGPL-3.0).
- SDK and some UI components use the MIT license.
Hosted Service
- A hosted version is available at firecrawl.dev.
- Cloud solutions provide additional features and enterprise-level support.
Self-Hosting
- Supports local deployment.
- Still under development; the team is integrating custom modules into a monolithic repository.
- Can be run locally, but is not yet fully ready for self-hosted production deployment.
Use Cases
- AI Data Preparation: Provides clean training data for LLMs.
- Content Aggregation: Collects and organizes content from multiple websites.
- Competitive Analysis: Monitors competitor website changes.
- SEO Research: Analyzes website structure and content.
- Data Mining: Extracts structured information from websites.
- Document Generation: Converts website content into document formats.
Usage Notes
Users are responsible for complying with website policies when using Firecrawl for scraping, searching, and crawling. Review the target website's privacy policy and terms of use before initiating any scraping activity. By default, Firecrawl respects the directives in a website's robots.txt file when crawling.
Project Status
The project is currently under active development, with the team integrating custom modules into a monolithic repository. While not yet fully ready for self-hosted deployment, it can be run locally for development and testing. The project has an active community and continuous updates, making it a leading solution in the field of web data extraction.
