A tool that converts any URL into an LLM-friendly input format, supporting web content extraction and intelligent search.

Apache-2.0 | TypeScript | 8.9k | jina-ai | Last Updated: 2025-05-08

Jina AI Reader: Detailed Project Introduction

Project Overview

Jina AI Reader is an open-source tool designed to convert any URL into an LLM-friendly input format. Developed and maintained by Jina AI, it is licensed under Apache-2.0 and provides high-quality web content extraction services for AI Agents and RAG (Retrieval-Augmented Generation) systems.

Core Features

1. Web Content Conversion (Read Function)

  • Main Function: Converts any URL into an LLM-friendly input format.
  • Usage: Add the prefix https://r.jina.ai/ before any URL.
  • Example:
    Original URL: https://en.wikipedia.org/wiki/Artificial_intelligence
    Converted URL: https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
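
The prefixing rule above can be sketched in Python. This is a minimal illustration, not an official client: the function name `to_reader_url` and the input check are assumptions made for this example.

```python
READER_PREFIX = "https://r.jina.ai/"


def to_reader_url(target_url: str) -> str:
    """Return the Reader URL for target_url by prepending the Reader prefix."""
    # Assumption for this sketch: only absolute http(s) URLs are accepted here.
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return READER_PREFIX + target_url


print(to_reader_url("https://en.wikipedia.org/wiki/Artificial_intelligence"))
# prints https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
```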
    

2. Smart Web Search (Search Function)

  • Main Function: Searches the web based on a query and returns results in an LLM-friendly format.
  • Usage: Add the prefix https://s.jina.ai/ before the query.
  • How it Works: Automatically searches the web, retrieves the top 5 results, accesses each URL, and applies content conversion.
  • Example:
    Query: Who will win 2024 US presidential election?
    Search URL: https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
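
The query in the search URL has to be percent-encoded, as the example above shows. A minimal Python sketch of that step follows; the function name `to_search_url` is hypothetical and only the standard library is used.

```python
from urllib.parse import quote

SEARCH_PREFIX = "https://s.jina.ai/"


def to_search_url(query: str) -> str:
    """Percent-encode the query and append it to the Search prefix."""
    # safe="" also encodes "/" and "?", so the query stays a single path segment.
    return SEARCH_PREFIX + quote(query, safe="")


print(to_search_url("Who will win 2024 US presidential election?"))
# prints https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
```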
    

3. Advanced Features

Image Recognition and Description

  • Function: Automatically generates descriptions for images that lack alt text.
  • Format: Image [idx]: [caption]
  • Activation: Use the request header x-with-generated-alt: true

PDF Document Support

  • Function: Directly reads and parses PDF documents.
  • Availability: Added on May 30, 2024.

Site-Specific Search

  • Function: Restricts search results to one or more specific domains or websites.
  • Usage: Set site=example.com in the query parameters; repeat the parameter to include several sites.
  • Example:
    curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
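
A site-restricted search URL like the one in the curl example can be assembled as follows. This is a sketch using Python's standard library; `to_site_search_url` is a hypothetical helper name, not part of any Jina AI package.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

SEARCH_PREFIX = "https://s.jina.ai/"


def to_site_search_url(query: str, sites: Sequence[str] = ()) -> str:
    """Build a Search URL, adding one site= parameter per requested domain."""
    url = SEARCH_PREFIX + quote(query, safe="")
    if sites:
        # A list of pairs lets the same key ("site") appear more than once.
        url += "?" + urlencode([("site", site) for site in sites])
    return url


print(to_site_search_url("When was Jina AI founded?", ["jina.ai", "github.com"]))
# prints https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com
```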
    

Technical Architecture

Supported Webpage Types

  • Static Webpages: Traditional HTML pages.
  • Single-Page Applications (SPA): Modern web applications based on JavaScript frameworks.
  • Dynamic Content: Webpages relying on client-side rendering.

Underlying Technology

  • Rendering Engine: Based on Puppeteer and headless Chrome browser.
  • Development Language: TypeScript
  • License: Apache-2.0

API Configuration Options

Request Header Control

Basic Configuration

# Enable image descriptions
x-with-generated-alt: true

# Forward Cookie settings
x-set-cookie: [cookie_string]

# Bypass cache
x-no-cache: true

# Custom cache tolerance (seconds)
x-cache-tolerance: [seconds]

Proxy and Selector

# Specify proxy server
x-proxy-url: [proxy_url]

# Target element selector
x-target-selector: [css_selector]

# Wait for specific element to appear
x-wait-for-selector: [css_selector]

# Set timeout
x-timeout: [seconds]

Response Format Control

# Return Markdown format (bypass readability filtering)
x-respond-with: markdown

# Return raw HTML
x-respond-with: html

# Return plain text
x-respond-with: text

# Return webpage screenshot URL
x-respond-with: screenshot
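
To show how these headers might be combined in a client, here is a minimal Python sketch. The header names come from the lists above; the `ReaderOptions` class, its field names, and `build_headers` are hypothetical and cover only a subset of the headers.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Response formats listed under "Response Format Control".
RESPOND_WITH_VALUES = {"markdown", "html", "text", "screenshot"}


@dataclass
class ReaderOptions:
    """Hypothetical container for a subset of the request headers."""
    generated_alt: bool = False
    no_cache: bool = False
    timeout_seconds: Optional[int] = None
    target_selector: Optional[str] = None
    wait_for_selector: Optional[str] = None
    respond_with: Optional[str] = None


def build_headers(opts: ReaderOptions) -> Dict[str, str]:
    """Translate the options into request headers, omitting unset ones."""
    headers: Dict[str, str] = {}
    if opts.generated_alt:
        headers["x-with-generated-alt"] = "true"
    if opts.no_cache:
        headers["x-no-cache"] = "true"
    if opts.timeout_seconds is not None:
        headers["x-timeout"] = str(opts.timeout_seconds)
    if opts.target_selector:
        headers["x-target-selector"] = opts.target_selector
    if opts.wait_for_selector:
        headers["x-wait-for-selector"] = opts.wait_for_selector
    if opts.respond_with is not None:
        if opts.respond_with not in RESPOND_WITH_VALUES:
            raise ValueError(f"unsupported x-respond-with value: {opts.respond_with}")
        headers["x-respond-with"] = opts.respond_with
    return headers


print(build_headers(ReaderOptions(generated_alt=True, timeout_seconds=30)))
# prints {'x-with-generated-alt': 'true', 'x-timeout': '30'}
```

The resulting dictionary can be passed to any HTTP client that accepts custom headers.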

Output Format

Streaming Output

# Enable streaming mode
curl -H "Accept: text/event-stream" https://r.jina.ai/[URL]
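
A response with the text/event-stream content type arrives as a series of events separated by blank lines. The sketch below parses the `data:` fields of such a stream in Python. The sample payloads are invented for illustration, and the idea that a client keeps the last event as the most complete result is an assumption of this example.

```python
from typing import Iterable, List


def parse_sse_data(lines: Iterable[str]) -> List[str]:
    """Collect the data payload of each event from text/event-stream lines."""
    events: List[str] = []
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                events.append("\n".join(buffer))
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the payload.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comments are ignored here.
    if buffer:
        # Leniency of this sketch: keep a trailing event with no blank line after it.
        events.append("\n".join(buffer))
    return events


# Invented sample stream: two events, the second more complete than the first.
sample = ["data: partial content\n", "\n", "data: complete content\n", "\n"]
chunks = parse_sse_data(sample)
print(chunks[-1])  # prints complete content
```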

JSON Format

# Get JSON format response
curl -H "Accept: application/json" https://r.jina.ai/[URL]

JSON Response Structure:

{
  "url": "Original URL",
  "title": "Page Title",
  "content": "Extracted Content"
}
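
A client could map this structure onto a small typed object. The Python sketch below assumes the response body contains exactly the three fields shown above at the top level; if the live service nests them inside an outer object, the lookups would need adjusting. `ReaderResult` and `parse_reader_json` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class ReaderResult:
    url: str
    title: str
    content: str


def parse_reader_json(body: str) -> ReaderResult:
    """Parse a JSON body with url, title, and content fields."""
    payload = json.loads(body)
    # A missing field raises KeyError, which surfaces malformed responses early.
    return ReaderResult(
        url=payload["url"],
        title=payload["title"],
        content=payload["content"],
    )


# Placeholder body mirroring the structure above.
body = '{"url": "https://example.com/", "title": "Example", "content": "Hello"}'
result = parse_reader_json(body)
print(result.title)  # prints Example
```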

Special Scenario Handling

Single-Page Application (SPA) Support

Due to the special nature of SPAs, the following solutions are provided:

Hash Route Handling

For URLs containing #, use the POST method:

curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route'
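
The reason for using POST is that the part of a URL after # is a fragment, which HTTP clients do not send to the server. The Python sketch below first shows how the prefix form loses the route, then builds the equivalent of the curl command above without sending it. `build_hash_route_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

READER_ENDPOINT = "https://r.jina.ai/"
target = "https://example.com/#/route"

# With the prefix form, everything after "#" is parsed as a fragment,
# so the hash route never reaches the Reader service.
prefixed = urlsplit(READER_ENDPOINT + target)
print(prefixed.path)      # prints /https://example.com/
print(prefixed.fragment)  # prints /route


def build_hash_route_request(target_url: str) -> Request:
    """Build (but do not send) a POST request carrying the URL in the body."""
    body = urlencode({"url": target_url}).encode("ascii")
    return Request(READER_ENDPOINT, data=body, method="POST")


req = build_hash_route_request(target)
print(req.get_method())  # prints POST
print(req.data)          # prints b'url=https%3A%2F%2Fexample.com%2F%23%2Froute'
```

Passing `req` to `urllib.request.urlopen` would send the request.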

Pre-loaded Content Handling

For webpages that display pre-loaded (placeholder) content before the main content finishes loading:

  1. Specify a Timeout:
curl 'https://r.jina.ai/https://example.com/' -H 'x-timeout: 30'
  2. Wait for a Specific Element:
curl 'https://r.jina.ai/https://example.com/' -H 'x-wait-for-selector: #content'
  3. Use Streaming Mode:
curl -H "Accept: text/event-stream" https://r.jina.ai/https://example.com/

Deployment and Usage

Production Environment Usage

  • Service Status: Free, stable, and scalable production-grade service.
  • Maintenance Status: Actively maintained as one of Jina AI's core products.
  • Service Addresses: https://r.jina.ai/ and https://s.jina.ai/

Application Scenarios

AI Agent Systems

  • Provides structured webpage content for AI Agents.
  • Supports Agents in collecting and analyzing webpage information.
  • Offers real-time web search capabilities.

RAG Systems

  • Converts webpage content into a vector database-friendly format.
  • Supports knowledge acquisition for retrieval-augmented generation.
  • Provides high-quality external knowledge sources.

Content Analysis

  • Webpage content extraction and cleaning.
  • Multimedia content understanding (image descriptions).
  • Document format standardization.

Performance and Limitations

Response Performance

  • Processing Time: Typically processes URLs and returns content within 2 seconds.
  • Complex Pages: Complex or dynamic pages may require more time.

Usage Limitations

  • Rate limits exist (please refer to the official documentation for specific limits).
  • Returned content retains the original language; translation services are not provided.

Jina AI Reader is a powerful open-source tool specifically designed for modern AI systems, addressing the format and quality issues faced by LLMs when processing web content. By simply adding a URL prefix, you can obtain high-quality, structured web content, making it an ideal tool for building AI Agents and RAG systems.