A TypeScript library that uses large language models to convert any webpage into structured data.
LLM Scraper Project Details
Project Overview
LLM Scraper is a TypeScript library that extracts structured data from any webpage using Large Language Models. Developed by mishushakov and hosted on GitHub, it offers an innovative approach to web data extraction.
Core Features
Key Functionality
- Multi-LLM Support: Supports local models (Ollama, GGUF), OpenAI, and Vercel AI SDK providers.
- Type Safety: Uses Zod to define schemas, providing complete TypeScript type safety.
- Based on Playwright: Built on the powerful Playwright framework.
- Streaming: Supports streaming partial objects as they are generated.
- Code Generation: Can generate reusable Playwright extraction scripts.
Data Format Support
The project supports four formatting modes:
- html - Loads the raw HTML
- markdown - Loads the page as markdown
- text - Loads the extracted text (using Readability.js)
- image - Loads a screenshot (multimodal models only)
Technical Architecture
Core Principle
Under the hood, it uses function calls to transform pages into structured data. This approach leverages the understanding capabilities of Large Language Models to intelligently parse and extract webpage content.
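Conceptually, the schema is converted into a JSON Schema and handed to the model as a function (tool) definition; the model replies with function-call arguments that are parsed back into a typed object. A minimal, provider-agnostic sketch of that round trip, with the model's reply mocked (the names here are illustrative, not part of the llm-scraper API):

```typescript
// Illustrative only: the kind of tool definition an LLM provider receives.
const extractTool = {
  name: 'extract_content',
  description: 'Extract structured data from the page',
  parameters: {
    type: 'object',
    properties: {
      title: { type: 'string' },
      points: { type: 'number' },
    },
    required: ['title', 'points'],
  },
}

// A mocked function-call response, as a provider would return it:
// the model fills in the arguments as a JSON string.
const mockCallArguments = '{"title": "Show HN: LLM Scraper", "points": 42}'

// Parsing the arguments yields the structured, typed result.
const extracted = JSON.parse(mockCallArguments) as { title: string; points: number }
console.log(extracted.title) // "Show HN: LLM Scraper"
```

The value of this approach is that the page content becomes an argument to a well-defined function, so the model's output is constrained to the requested shape rather than free-form text.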
Technology Stack
- TypeScript - Provides type safety and a great development experience
- Playwright - Web automation and content acquisition
- Zod - Schema validation and type inference
- AI SDK - Integration with various LLM providers
Installation and Usage
Install Dependencies
npm i zod playwright llm-scraper
LLM Initialization Examples
OpenAI
npm i @ai-sdk/openai
import { openai } from '@ai-sdk/openai'
const llm = openai.chat('gpt-4o')
Groq
npm i @ai-sdk/openai
import { createOpenAI } from '@ai-sdk/openai'
const groq = createOpenAI({
baseURL: 'https://api.groq.com/openai/v1',
apiKey: process.env.GROQ_API_KEY,
})
const llm = groq('llama3-8b-8192')
Ollama
npm i ollama-ai-provider
import { ollama } from 'ollama-ai-provider'
const llm = ollama('llama3')
GGUF
import { LlamaModel } from 'node-llama-cpp'
const llm = new LlamaModel({
modelPath: 'model.gguf'
})
Basic Usage Example
Create a Scraper Instance
import LLMScraper from 'llm-scraper'
const scraper = new LLMScraper(llm)
HackerNews Data Extraction Example
import { chromium } from 'playwright'
import { z } from 'zod'
import { openai } from '@ai-sdk/openai'
import LLMScraper from 'llm-scraper'
// Launch a browser instance
const browser = await chromium.launch()
// Initialize the LLM provider
const llm = openai.chat('gpt-4o')
// Create a new LLMScraper
const scraper = new LLMScraper(llm)
// Open a new page
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')
// Define the schema for the extracted content
const schema = z.object({
top: z
.array(
z.object({
title: z.string(),
points: z.number(),
by: z.string(),
commentsURL: z.string(),
})
)
.length(5)
.describe('Top 5 stories on Hacker News'),
})
// Run the scraper
const { data } = await scraper.run(page, schema, {
format: 'html',
})
// Display the LLM results
console.log(data.top)
await page.close()
await browser.close()
Advanced Features
Streaming
Use the stream function instead of the run function to get a stream of partial objects (Vercel AI SDK providers only):
// Run the scraper in streaming mode
const { stream } = await scraper.stream(page, schema)
// Stream the LLM results
for await (const data of stream) {
console.log(data.top)
}
Code Generation
Use the generate function to generate reusable Playwright scripts:
// Generate code and run it on the page
const { code } = await scraper.generate(page, schema)
const result = await page.evaluate(code)
const data = schema.parse(result)
// Display the parsed results
console.log(data.top)
Application Scenarios
Applicable Fields
- Data Mining: Extract structured information from news websites, forums, etc.
- Market Research: Collect competitor product information.
- Content Aggregation: Automate content collection and organization.
- Monitoring Systems: Regularly check for website changes.
- Research Analysis: Academic research data collection.
Advantages and Features
- Intelligent Parsing: Leverages LLMs to understand complex page structures.
- Type Safety: Complete TypeScript support.
- Flexible Configuration: Supports multiple LLM providers.
- Easy Integration: Clean API design.
Summary
LLM Scraper is an innovative web data extraction tool that combines traditional web scraping techniques with modern AI capabilities. By leveraging the understanding capabilities of Large Language Models, it can more intelligently and accurately extract structured data from complex web pages, providing a new solution for data collection and analysis.