A TypeScript library that uses large language models to convert any webpage into structured data.
LLM Scraper Project Details
Project Overview
LLM Scraper is a TypeScript library that extracts structured data from any webpage using Large Language Models. Developed by mishushakov and hosted on GitHub, it offers an innovative approach to web data extraction.
Core Features
Key Functionality
- Multi-LLM Support: Supports local models (Ollama, GGUF), OpenAI, and Vercel AI SDK providers.
- Type Safety: Uses Zod to define schemas, providing complete TypeScript type safety.
- Based on Playwright: Built on the powerful Playwright framework.
- Streaming: Supports streaming partial objects as they are generated.
- Code Generation: Can generate reusable Playwright extraction scripts.
Data Format Support
The project supports four formatting modes:
- html - Loads the raw HTML
- markdown - Loads the page as markdown
- text - Loads the extracted text (using Readability.js)
- image - Loads a screenshot (multimodal models only)
Technical Architecture
Core Principle
Under the hood, it uses function calls to transform pages into structured data. This approach leverages the understanding capabilities of Large Language Models to intelligently parse and extract webpage content.
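Conceptually, the schema is converted into a JSON Schema and handed to the model as a function (tool) definition; the model replies with function-call arguments that are parsed back into a typed object. A minimal, provider-agnostic sketch of that round trip, with the model's reply mocked (the names here are illustrative, not part of the llm-scraper API):

```typescript
// Illustrative only: the kind of tool definition an LLM provider receives.
const extractTool = {
  name: 'extract_content',
  description: 'Extract structured data from the page',
  parameters: {
    type: 'object',
    properties: {
      title: { type: 'string' },
      points: { type: 'number' },
    },
    required: ['title', 'points'],
  },
}

// A mocked function-call response, as a provider would return it:
// the model fills in the arguments as a JSON string.
const mockCallArguments = '{"title": "Show HN: LLM Scraper", "points": 42}'

// Parsing the arguments yields the structured, typed result.
const extracted = JSON.parse(mockCallArguments) as { title: string; points: number }
console.log(extracted.title) // "Show HN: LLM Scraper"
```

The value of this approach is that the page content becomes an argument to a well-defined function, so the model's output is constrained to the requested shape rather than free-form text.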
Technology Stack
- TypeScript - Provides type safety and a great development experience
- Playwright - Web automation and content acquisition
- Zod - Schema validation and type inference
- AI SDK - Integration with various LLM providers
Installation and Usage
Install Dependencies
npm i zod playwright llm-scraper
LLM Initialization Examples
OpenAI
npm i @ai-sdk/openai
import { openai } from '@ai-sdk/openai'
const llm = openai.chat('gpt-4o')
Groq
npm i @ai-sdk/openai
import { createOpenAI } from '@ai-sdk/openai'
const groq = createOpenAI({
baseURL: 'https://api.groq.com/openai/v1',
apiKey: process.env.GROQ_API_KEY,
})
const llm = groq('llama3-8b-8192')
Ollama
npm i ollama-ai-provider
import { ollama } from 'ollama-ai-provider'
const llm = ollama('llama3')
GGUF
import { LlamaModel } from 'node-llama-cpp'
const llm = new LlamaModel({
modelPath: 'model.gguf'
})
Basic Usage Example
Create a Scraper Instance
import LLMScraper from 'llm-scraper'
const scraper = new LLMScraper(llm)
HackerNews Data Extraction Example
import { chromium } from 'playwright'
import { z } from 'zod'
import { openai } from '@ai-sdk/openai'
import LLMScraper from 'llm-scraper'
// Launch a browser instance
const browser = await chromium.launch()
// Initialize the LLM provider
const llm = openai.chat('gpt-4o')
// Create a new LLMScraper
const scraper = new LLMScraper(llm)
// Open a new page
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')
// Define the schema for the extracted content
const schema = z.object({
top: z
.array(
z.object({
title: z.string(),
points: z.number(),
by: z.string(),
commentsURL: z.string(),
})
)
.length(5)
.describe('Top 5 stories on Hacker News'),
})
// Run the scraper
const { data } = await scraper.run(page, schema, {
format: 'html',
})
// Display the LLM results
console.log(data.top)
await page.close()
await browser.close()
Advanced Features
Streaming
Use the stream function instead of the run function to get a stream of partial objects (Vercel AI SDK providers only):
// Run the scraper in streaming mode
const { stream } = await scraper.stream(page, schema)
// Stream the LLM results
for await (const data of stream) {
console.log(data.top)
}
Code Generation
Use the generate function to generate reusable Playwright scripts:
// Generate code and run it on the page
const { code } = await scraper.generate(page, schema)
const result = await page.evaluate(code)
const data = schema.parse(result)
// Display the parsed results
console.log(data.top)
Application Scenarios
Applicable Fields
- Data Mining: Extract structured information from news websites, forums, etc.
- Market Research: Collect competitor product information.
- Content Aggregation: Automate content collection and organization.
- Monitoring Systems: Regularly check for website changes.
- Research Analysis: Academic research data collection.
Advantages and Features
- Intelligent Parsing: Leverages LLMs to understand complex page structures.
- Type Safety: Complete TypeScript support.
- Flexible Configuration: Supports multiple LLM providers.
- Easy Integration: Clean API design.
Summary
LLM Scraper is an innovative web data extraction tool that combines traditional web scraping techniques with modern AI capabilities. By leveraging the understanding capabilities of Large Language Models, it can more intelligently and accurately extract structured data from complex web pages, providing a new solution for data collection and analysis.