LLM Scraper is a TypeScript library that allows you to extract structured data from any webpage using Large Language Models. Developed by mishushakov and hosted on GitHub, this project is an innovative web data extraction solution.
The project supports 4 formatting modes:
html - Loads raw HTML
markdown - Loads markdown format
text - Loads extracted text (using Readability.js)
image - Loads screenshots (multimodal only)

Under the hood, it uses function calls to transform pages into structured data. This approach leverages the understanding capabilities of Large Language Models to intelligently parse and extract webpage content.
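The chosen format is passed as an option when the scraper runs. A minimal sketch (the scraper, page, and schema objects are created in the full example later in this guide):
// Select the page representation the model will see
const { data } = await scraper.run(page, schema, {
  format: 'markdown', // alternatives: 'html', 'text', 'image'
})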
To get started, install the required dependencies:
npm i zod playwright llm-scraper
Next, initialize your LLM provider. For OpenAI:
npm i @ai-sdk/openai
import { openai } from '@ai-sdk/openai'
const llm = openai.chat('gpt-4o')
For Groq (through its OpenAI-compatible API):
npm i @ai-sdk/openai
import { createOpenAI } from '@ai-sdk/openai'
const groq = createOpenAI({
  baseURL: 'https://api.groq.com/openai/v1',
  apiKey: process.env.GROQ_API_KEY,
})
const llm = groq('llama3-8b-8192')
For a local model served with Ollama:
npm i ollama-ai-provider
import { ollama } from 'ollama-ai-provider'
const llm = ollama('llama3')
For a local GGUF model via node-llama-cpp:
import { LlamaModel } from 'node-llama-cpp'
const llm = new LlamaModel({
  modelPath: 'model.gguf'
})
With a model initialized, create a new LLMScraper instance:
import LLMScraper from 'llm-scraper'
const scraper = new LLMScraper(llm)
The following end-to-end example extracts the top five stories from Hacker News:
import { chromium } from 'playwright'
import { z } from 'zod'
import { openai } from '@ai-sdk/openai'
import LLMScraper from 'llm-scraper'
// Launch a browser instance
const browser = await chromium.launch()
// Initialize the LLM provider
const llm = openai.chat('gpt-4o')
// Create a new LLMScraper
const scraper = new LLMScraper(llm)
// Open a new page
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')
// Define the schema for the extracted content
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})
// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'html',
})
// Display the LLM results
console.log(data.top)
await page.close()
await browser.close()
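Because extraction is driven by the Zod schema, the static type of the result can also be derived with Zod's own inference; a small sketch based on the schema above:
// Infer the TypeScript type of the extracted data from the schema
type HackerNewsTop = z.infer<typeof schema>
// => { top: { title: string; points: number; by: string; commentsURL: string }[] }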
Use the stream function instead of the run function to get partial object streams (Vercel AI SDK only):
// Run the scraper in streaming mode
const { stream } = await scraper.stream(page, schema)
// Stream the LLM results
for await (const data of stream) {
  console.log(data.top)
}
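Partial objects may arrive with some fields still missing, so it can help to guard before using them; a rough sketch (the exact shape of intermediate chunks depends on the Vercel AI SDK):
// Handle partially populated results while streaming
for await (const data of stream) {
  if (data.top) {
    console.log(`received ${data.top.length} stories so far`)
  }
}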
Use the generate function to produce reusable Playwright scripts:
// Generate code and run it on the page
const { code } = await scraper.generate(page, schema)
const result = await page.evaluate(code)
const data = schema.parse(result)
// Display the parsed results
console.log(data.top)
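Because the generated code runs in the browser through page.evaluate, it can be saved and replayed on later visits without another LLM call; a rough sketch, assuming the page structure has not changed:
// Reuse the generated script on a fresh page load
const laterPage = await browser.newPage()
await laterPage.goto('https://news.ycombinator.com')
const replayed = schema.parse(await laterPage.evaluate(code))
console.log(replayed.top)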
LLM Scraper combines traditional web scraping techniques with modern AI capabilities. By having a Large Language Model interpret page content against a user-defined schema, it can extract structured data from complex web pages more intelligently and accurately, offering a new option for data collection and analysis.