jina-ai/reader

A tool that converts any URL into an LLM-friendly input format, supporting web content extraction and intelligent search.

Apache-2.0, TypeScript, 8.9k, jina-ai, Last Updated: 2025-05-08

Jina AI Reader: Detailed Project Introduction

Project Overview

Jina AI Reader is an open-source tool designed to convert any URL into an LLM-friendly input format. Developed and maintained by Jina AI, it is licensed under Apache-2.0 and provides high-quality web content extraction services for AI Agents and RAG (Retrieval-Augmented Generation) systems.

Core Features

1. Web Content Conversion (Read Function)

Main Function: Converts any URL into an LLM-friendly input format.
Usage: Add the prefix https://r.jina.ai/ before any URL.

Example:

Original URL: https://en.wikipedia.org/wiki/Artificial_intelligence
Converted URL: https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
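The prefixing rule above can be sketched as a small Python helper. The function name `reader_url` is hypothetical; only the `https://r.jina.ai/` prefix and the example URL come from the text above.

```python
READER_PREFIX = "https://r.jina.ai/"  # Read endpoint named above


def reader_url(target_url: str) -> str:
    """Return the Reader URL for target_url (hypothetical helper).

    The target URL is appended unchanged, as in the example above.
    """
    return READER_PREFIX + target_url


print(reader_url("https://en.wikipedia.org/wiki/Artificial_intelligence"))
```

Fetching the resulting URL with any HTTP client should then return the converted page content.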

2. Smart Web Search (Search Function)

Main Function: Searches the web based on a query and returns results in an LLM-friendly format.
Usage: Add the prefix https://s.jina.ai/ before the query.
How it Works: Automatically searches the web, retrieves the top 5 results, accesses each URL, and applies content conversion.

Example:

Query: Who will win 2024 US presidential election?
Search URL: https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
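The query has to be percent-encoded before it is appended to the prefix, as the search URL above shows. A minimal sketch in Python, where `search_url` is a hypothetical helper name:

```python
from urllib.parse import quote

SEARCH_PREFIX = "https://s.jina.ai/"  # Search endpoint named above


def search_url(query: str) -> str:
    """Return the search URL for a free-text query (hypothetical helper)."""
    # safe="" also escapes "/" and "?", so the whole query stays
    # inside a single path segment.
    return SEARCH_PREFIX + quote(query, safe="")


print(search_url("Who will win 2024 US presidential election?"))
```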

3. Advanced Features

Image Recognition and Description

Function: Automatically generates descriptions for images that lack alt text.
Format: Image [idx]: [caption]
Activation: Use the request header x-with-generated-alt: true
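Once captions are enabled, downstream code may want to pull them back out of the returned text. The sketch below assumes the captions appear inside Markdown image brackets, for example `![Image 1: some caption](url)`; that placement, the sample text, and the helper name `extract_captions` are assumptions, and only the `Image [idx]: [caption]` pattern comes from the text above.

```python
import re
from typing import Dict

# Matches "Image <idx>: <caption>", stopping at a closing bracket or line end.
CAPTION_PATTERN = re.compile(r"Image (\d+): ([^\]\n]+)")


def extract_captions(text: str) -> Dict[int, str]:
    """Map each image index to its caption (hypothetical helper)."""
    return {int(idx): caption.strip() for idx, caption in CAPTION_PATTERN.findall(text)}


# Made-up sample output, for illustration only.
sample = (
    "![Image 1: A red bicycle leaning on a wall](https://example.com/bike.png)\n"
    "Some paragraph text.\n"
    "![Image 2: Chart of monthly sales](https://example.com/chart.png)\n"
)
print(extract_captions(sample))
```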

PDF Document Support

Function: Directly reads and parses PDF documents.
Availability: Added on May 30, 2024.

Site-Specific Search

Function: Restricts search results to a specific domain or website.
Usage: Set site=example.com in the query parameters; repeat the parameter to search several sites, as in the example below.

Example:

curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
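The same URL can be assembled programmatically. This is a minimal sketch in which `site_search_url` is a hypothetical helper; the repeated `site` parameter mirrors the curl example above.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

SEARCH_PREFIX = "https://s.jina.ai/"


def site_search_url(query: str, sites: Sequence[str] = ()) -> str:
    """Build a search URL restricted to the given sites (hypothetical helper)."""
    url = SEARCH_PREFIX + quote(query, safe="")
    if sites:
        # One "site" parameter per domain, in the order given.
        url += "?" + urlencode([("site", site) for site in sites])
    return url


print(site_search_url("When was Jina AI founded?", ["jina.ai", "github.com"]))
```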

Technical Architecture

Supported Webpage Types

Static Webpages: Traditional HTML pages.
Single-Page Applications (SPA): Modern web applications based on JavaScript frameworks.
Dynamic Content: Webpages relying on client-side rendering.

Underlying Technology

Rendering Engine: Based on Puppeteer and headless Chrome browser.
Development Language: TypeScript
License: Apache-2.0

API Configuration Options

Request Header Control

Basic Configuration

# Enable image descriptions
x-with-generated-alt: true

# Forward Cookie settings
x-set-cookie: [cookie_string]

# Bypass cache
x-no-cache: true

# Custom cache tolerance (seconds)
x-cache-tolerance: [seconds]

Proxy and Selector

# Specify proxy server
x-proxy-url: [proxy_url]

# Target element selector
x-target-selector: [css_selector]

# Wait for specific element to appear
x-wait-for-selector: [css_selector]

# Set timeout
x-timeout: [seconds]

Response Format Control

# Return Markdown format (bypass readability filtering)
x-respond-with: markdown

# Return raw HTML
x-respond-with: html

# Return plain text
x-respond-with: text

# Return webpage screenshot URL
x-respond-with: screenshot
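The request headers listed above can be collected in one place before a request is made. The following is a sketch only: the function and parameter names are hypothetical, it covers a subset of the headers, and it prepares a request without sending it.

```python
import urllib.request
from typing import Dict, Optional

READER_PREFIX = "https://r.jina.ai/"
RESPONSE_FORMATS = {"markdown", "html", "text", "screenshot"}  # values listed above


def reader_headers(
    *,
    generated_alt: bool = False,
    no_cache: bool = False,
    target_selector: Optional[str] = None,
    wait_for_selector: Optional[str] = None,
    timeout_seconds: Optional[int] = None,
    respond_with: Optional[str] = None,
) -> Dict[str, str]:
    """Translate keyword options into Reader request headers (hypothetical helper)."""
    headers: Dict[str, str] = {}
    if generated_alt:
        headers["x-with-generated-alt"] = "true"
    if no_cache:
        headers["x-no-cache"] = "true"
    if target_selector is not None:
        headers["x-target-selector"] = target_selector
    if wait_for_selector is not None:
        headers["x-wait-for-selector"] = wait_for_selector
    if timeout_seconds is not None:
        headers["x-timeout"] = str(timeout_seconds)
    if respond_with is not None:
        if respond_with not in RESPONSE_FORMATS:
            raise ValueError(f"unsupported response format: {respond_with!r}")
        headers["x-respond-with"] = respond_with
    return headers


def build_request(target_url: str, headers: Dict[str, str]) -> urllib.request.Request:
    """Prepare (but do not send) a GET request to the Read endpoint."""
    return urllib.request.Request(READER_PREFIX + target_url, headers=headers)


request = build_request(
    "https://example.com/",
    reader_headers(no_cache=True, timeout_seconds=30, respond_with="markdown"),
)
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.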

Output Format

Streaming Output

# Enable streaming mode
curl -H "Accept: text/event-stream" https://r.jina.ai/[URL]
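A client that consumes the stream has to split it into events. The sketch below is a minimal parser for the standard `text/event-stream` framing (events separated by blank lines, payload carried in `data:` fields). It assumes Reader's stream follows that standard framing; what each payload contains is not specified here, and the sample lines are made up.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An unterminated event at end of input is dropped, as the format specifies.


# Made-up sample stream, for illustration only.
sample_lines = [
    "event: data\n",
    "data: first\n",
    "\n",
    ": keep-alive\n",
    "data: second\n",
    "data: line\n",
    "\n",
]
print(list(iter_sse_data(sample_lines)))
```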

JSON Format

# Get JSON format response
curl -H "Accept: application/json" https://r.jina.ai/[URL]

JSON Response Structure:

{
  "url": "Original URL",
  "title": "Page Title",
  "content": "Extracted Content"
}
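A response with the structure above can be decoded into a typed object. This is a sketch under assumptions: `ReaderPage` and `parse_reader_json` are hypothetical names, and the tolerance for an outer `data` wrapper is a guess about how the live service may envelope these fields, not something stated above.

```python
import json
from dataclasses import dataclass


@dataclass
class ReaderPage:
    url: str
    title: str
    content: str


def parse_reader_json(body: str) -> ReaderPage:
    """Decode a JSON response body into a ReaderPage (hypothetical helper)."""
    payload = json.loads(body)
    # Assumption: the three fields may be nested under a "data" key.
    if isinstance(payload.get("data"), dict):
        payload = payload["data"]
    return ReaderPage(
        url=payload["url"],
        title=payload["title"],
        content=payload["content"],
    )


# Made-up response body, for illustration only.
sample_body = '{"url": "https://example.com/", "title": "Example Domain", "content": "Hello"}'
page = parse_reader_json(sample_body)
print(page.title)
```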

Special Scenario Handling

Single-Page Application (SPA) Support

Due to the special nature of SPAs, the following solutions are provided:

Hash Route Handling

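The part of a URL after # is a fragment, which clients do not send to the server. For URLs containing #, pass the URL in the body of a POST request instead: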

curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route'
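The sketch below shows why the prefix form loses a hash route and prepares the equivalent POST request in Python. `hash_route_request` is a hypothetical helper; it form-encodes the `url` field, whereas the curl command above sends it unencoded, and it only prepares the request without sending it.

```python
import urllib.request
from urllib.parse import urlencode, urlsplit

READER_ENDPOINT = "https://r.jina.ai/"

# With the prefix form, "#/route" is parsed as a fragment of the outer URL,
# and fragments are not included in the HTTP request.
prefixed = urlsplit(READER_ENDPOINT + "https://example.com/#/route")
print(prefixed.path, prefixed.fragment)


def hash_route_request(target_url: str) -> urllib.request.Request:
    """Prepare a POST request carrying target_url in a form field named "url"."""
    body = urlencode({"url": target_url}).encode("ascii")
    # urlopen sends application/x-www-form-urlencoded by default when data is set.
    return urllib.request.Request(READER_ENDPOINT, data=body, method="POST")


request = hash_route_request("https://example.com/#/route")
print(request.get_method(), request.data)
```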

Pre-loaded Content Handling

For webpages that show placeholder content before the real content has loaded:

Specify Timeout Waiting:

curl 'https://r.jina.ai/https://example.com/' -H 'x-timeout: 30'

Wait for Specific Element:

curl 'https://r.jina.ai/https://example.com/' -H 'x-wait-for-selector: #content'

Use Streaming Mode:

curl -H "Accept: text/event-stream" https://r.jina.ai/https://example.com/

Deployment and Usage

Production Environment Usage

Service Status: Free, stable, and scalable production-grade service.
Maintenance Status: Actively maintained as one of Jina AI's core products.
Service Addresses: https://r.jina.ai/ and https://s.jina.ai/

Application Scenarios

AI Agent Systems

Provides structured webpage content for AI Agents.
Supports Agents in collecting and analyzing webpage information.
Offers real-time web search capabilities.

RAG Systems

Converts webpage content into a vector database-friendly format.
Supports knowledge acquisition for retrieval-augmented generation.
Provides high-quality external knowledge sources.

Content Analysis

Webpage content extraction and cleaning.
Multimedia content understanding (image descriptions).
Document format standardization.

Performance and Limitations

Response Performance

Processing Time: Typically processes URLs and returns content within 2 seconds.
Complex Pages: Complex or dynamic pages may require more time.

Usage Limitations

Rate limits exist (please refer to the official documentation for specific limits).
Returned content retains the original language; translation services are not provided.

Jina AI Reader is a powerful open-source tool specifically designed for modern AI systems, addressing the format and quality issues faced by LLMs when processing web content. By simply adding a URL prefix, you can obtain high-quality, structured web content, making it an ideal tool for building AI Agents and RAG systems.