Crawl4AI is a blazing-fast, AI-ready web crawler built for LLMs, AI agents, and data pipelines. The project is fully open source, flexible, and designed for real-time performance, giving developers speed, precision, and easy deployment.
pip install -U crawl4ai
crawl4ai-setup
crawl4ai-doctor
docker pull unclecode/crawl4ai:0.6.0-rN
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN
# Web UI: http://localhost:11235/playground
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
crwl https://www.nbcnews.com/business -o markdown
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
crwl https://www.example.com/products -q "Extract all product prices"
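The `--deep-crawl bfs --max-pages 10` flags above also have a Python-side equivalent. The sketch below is a config fragment based on the library's deep-crawling API (`BFSDeepCrawlStrategy`); parameter names may vary between versions:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Rough Python equivalent of `crwl ... --deep-crawl bfs --max-pages 10`:
# breadth-first traversal of same-site links, bounded by page count.
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=10),
)
# Pass to crawler.arun(url, config=config) inside an AsyncWebCrawler context.
```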
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            config=run_config,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
Set geolocation, language, and timezone to retrieve authentic region-specific content:
from crawl4ai import GeolocationConfig

run_config = CrawlerRunConfig(
    url="https://browserleaks.com/geo",  # test page that reports the detected location
    locale="en-US",
    timezone_id="America/Los_Angeles",
    geolocation=GeolocationConfig(
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    ),
)
Extract HTML tables directly as CSV or pandas DataFrames:
import pandas as pd

results = await crawler.arun(
    url="https://coinmarketcap.com/?page=1",
    config=crawl_config,
)

# Keep the first successfully extracted table as a DataFrame.
raw_df = pd.DataFrame()
for result in results:
    if result.success and result.media["tables"]:
        raw_df = pd.DataFrame(
            result.media["tables"][0]["rows"],
            columns=result.media["tables"][0]["headers"],
        )
        break
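When a page yields several tables, the same `rows`/`headers` shape can be stacked into one DataFrame. A self-contained sketch with mock data standing in for `result.media["tables"]`:

```python
import pandas as pd

# Mock payload mimicking the structure of result.media["tables"].
tables = [
    {"headers": ["Name", "Price"], "rows": [["BTC", "67000"], ["ETH", "3500"]]},
    {"headers": ["Name", "Price"], "rows": [["SOL", "150"]]},
]

# Build one DataFrame per extracted table, then stack them.
frames = [pd.DataFrame(t["rows"], columns=t["headers"]) for t in tables]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

`ignore_index=True` renumbers the rows so the stacked frame has a clean 0..n-1 index.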
Pages launch against pre-warmed browser instances, reducing latency and memory usage.
Connect to AI tools such as Claude Code via the Model Context Protocol (MCP):
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
Crawl4AI is backed by an active open-source community; code contributions, bug reports, and suggestions are all welcome. The project is released under the Apache 2.0 license and is fully open source and free to use.

Crawl4AI represents the current state of web crawling in the AI era. It offers everything a traditional crawler does while being optimized specifically for modern AI applications, making it a strong choice for data scientists, AI researchers, and developers. Through its open-source nature and active community, Crawl4AI is helping to democratize and standardize web data extraction.