Crawl4AI is a blazing-fast, AI-ready web crawler tailored for LLMs, AI agents, and data pipelines. The project is fully open source, flexible, and built for real-time performance, giving developers unmatched speed, precision, and ease of deployment.
pip install -U crawl4ai
crawl4ai-setup
crawl4ai-doctor
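If crawl4ai-setup has trouble installing the browser binaries, the project docs suggest installing Chromium manually through Playwright:
python -m playwright install --with-deps chromium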
docker pull unclecode/crawl4ai:0.6.0-rN
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN
# Playground UI: http://localhost:11235/playground
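With the container running, the same server can also be driven over HTTP. Below is a minimal sketch, assuming the /crawl JSON endpoint described in the Docker deployment docs; the exact payload schema may differ across versions, and the target URL is illustrative:
import requests

# Assumption: the Dockerized server accepts a JSON body with a list of
# URLs on POST /crawl (see the Docker deployment docs for the full schema).
resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
)
print(resp.status_code, resp.json())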
import asyncio
from crawl4ai import *
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
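Beyond markdown, the CrawlResult returned by arun() exposes further documented fields; a brief sketch:
# Inside main(), after arun() returns:
print(result.success)            # whether the crawl completed successfully
print(result.links["internal"])  # discovered links ("internal"/"external" groups)
print(result.media.keys())       # captured media such as images and tables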
crwl https://www.nbcnews.com/business -o markdown
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
crwl https://www.example.com/products -q "Extract all product prices"
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            config=run_config,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
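Since LLMExtractionStrategy emits extracted_content as a JSON string, it can be parsed straight back into the schema defined above; a small follow-up sketch:
import json

# Parse the JSON array back into the Pydantic models defined above.
fees = [OpenAIModelFee(**item) for item in json.loads(result.extracted_content)]
for fee in fees:
    print(fee.model_name, fee.input_fee, fee.output_fee)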
Set the geolocation, language, and timezone to get authentic, locale-specific content:
from crawl4ai import CrawlerRunConfig, GeolocationConfig

run_config = CrawlerRunConfig(
    url="https://browserleaks.com/geo",  # test page that reports the detected location
    locale="en-US",                      # Accept-Language and UI locale
    timezone_id="America/Los_Angeles",   # JS Date()/Intl timezone
    geolocation=GeolocationConfig(       # override the reported GPS coordinates
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    )
)
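A minimal sketch of running this config inside an async context (the test URL simply echoes whatever location the browser reports):
async with AsyncWebCrawler() as crawler:  # inside an async function
    result = await crawler.arun(
        url="https://browserleaks.com/geo",
        config=run_config,
    )
    print(result.markdown)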
Extract HTML tables directly into CSV or pandas DataFrames:
import pandas as pd

# Runs inside an async context with an AsyncWebCrawler instance bound to
# `crawler` (see the quick-start example above).
crawl_config = CrawlerRunConfig(table_score_threshold=8)  # strict table detection

results = await crawler.arun(
    url="https://coinmarketcap.com/?page=1",
    config=crawl_config
)
raw_df = pd.DataFrame()
for result in results:
    if result.success and result.media["tables"]:
        raw_df = pd.DataFrame(
            result.media["tables"][0]["rows"],
            columns=result.media["tables"][0]["headers"],
        )
        break
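From here the DataFrame behaves like any other; for instance it can be inspected or written out (the file name below is illustrative):
print(raw_df.head())
raw_df.to_csv("coinmarketcap_page1.csv", index=False)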
Pages launch on pre-warmed browser instances, reducing both latency and memory usage.
Connect to AI tools such as Claude Code via the Model Context Protocol (MCP):
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
Crawl4AI is backed by an active open-source community; code contributions, bug reports, and suggestions are all welcome. The project is released under the Apache 2.0 license and is fully open source and free to use.
Crawl4AI represents the current state of web crawling technology, particularly in the context of the AI era. It offers everything a traditional crawler does while being optimized specifically for modern AI applications, making it an ideal choice for data scientists, AI researchers, and developers. Through its open-source nature and active community, Crawl4AI is helping democratize and standardize web data extraction.