Firecrawl 是一个API服务,接收URL,爬取它,并将其转换为干净的markdown或结构化数据。它爬取所有可访问的子页面,为每个页面提供干净的数据。无需站点地图。
在抓取内容之前可以执行各种操作:
curl -X POST https://api.firecrawl.dev/v1/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 10,
"scrapeOptions": {
"formats": ["markdown", "html"]
}
}'
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"formats" : ["markdown", "html"]
}'
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://www.mendable.ai/",
"formats": ["json"],
"jsonOptions": {
"schema": {
"type": "object",
"properties": {
"company_mission": {"type": "string"},
"supports_sso": {"type": "boolean"},
"is_open_source": {"type": "boolean"},
"is_in_yc": {"type": "boolean"}
},
"required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
}
}
}'
pip install firecrawl-py
from firecrawl.firecrawl import FirecrawlApp
from firecrawl.firecrawl import ScrapeOptions
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# 抓取网站
scrape_status = app.scrape_url(
'https://firecrawl.dev',
formats=["markdown", "html"]
)
print(scrape_status)
# 爬取网站
crawl_status = app.crawl_url(
'https://firecrawl.dev',
limit=100,
scrape_options=ScrapeOptions(formats=["markdown", "html"]),
poll_interval=30
)
print(crawl_status)
npm install @mendable/firecrawl-js
import FirecrawlApp, { CrawlParams, CrawlStatusResponse } from '@mendable/firecrawl-js';
const app = new FirecrawlApp({apiKey: "fc-YOUR_API_KEY"});
// 抓取网站
const scrapeResponse = await app.scrapeUrl('https://firecrawl.dev', {
formats: ['markdown', 'html'],
});
if (scrapeResponse) {
console.log(scrapeResponse)
}
// 爬取网站
const crawlResponse = await app.crawlUrl('https://firecrawl.dev', {
limit: 100,
scrapeOptions: {
formats: ['markdown', 'html'],
}
} satisfies CrawlParams, true, 30) satisfies CrawlStatusResponse;
用户在使用Firecrawl进行抓取、搜索和爬取时,有责任遵守网站的政策。建议用户在启动任何抓取活动之前,遵守适用网站的隐私政策和使用条款。默认情况下,Firecrawl在爬取时会遵守网站robots.txt文件中指定的指令。
该项目目前处于活跃开发状态,团队正在将自定义模块集成到单体仓库中。虽然尚未完全准备好进行自托管部署,但可以本地运行进行开发和测试。项目拥有活跃的社区和持续的更新,是网页数据提取领域的领先解决方案。