
An open-source tool for quickly creating custom GPT assistants by crawling websites to generate knowledge files.

License: ISC · Language: TypeScript · Stars: 21.6k · Author: BuilderIO · Last Updated: 2025-01-23

GPT-Crawler Project Details

Project Overview

GPT-Crawler is an open-source project developed by Builder.io, designed to generate knowledge files by crawling specified websites, thereby enabling the rapid creation of custom GPT assistants. Given one or more starting URLs, it automatically scrapes website content and generates data files suitable for training custom GPTs.

Core Features

  • Website Content Crawling: Automatically scrapes content from specified websites.
  • Knowledge File Generation: Converts scraped content into a format suitable for GPT training.
  • Flexible Configuration: Supports various configuration options, including crawling rules and page selectors.
  • Multiple Deployment Options: Supports local execution, containerized deployment, and API server mode.

Installation and Usage

Prerequisites

  • Node.js >= 16

Quick Start

git clone https://github.com/builderio/gpt-crawler
cd gpt-crawler
npm i

Configuration File

Edit the url, match, and selector properties in the config.ts file to suit your needs.

Example Configuration:

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

Configuration Options Explained

type Config = {
  /** The URL to start crawling from. If a sitemap is provided, it will be used and all pages within it will be downloaded. */
  url: string;
  /** The pattern used to match links on the page for subsequent crawling. */
  match: string;
  /** The selector used to grab the inner text. */
  selector: string;
  /** Do not crawl more than this many pages. */
  maxPagesToCrawl: number;
  /** The filename for the completed data. */
  outputFileName: string;
  /** Optional resource types to exclude. */
  resourceExclusions?: string[];
  /** Optional max file size (in megabytes). */
  maxFileSize?: number;
  /** Optional max tokens. */
  maxTokens?: number;
};
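
For reference, here is a sketch of a configuration that uses the optional fields to skip non-text resources and cap output size. The values are illustrative, not recommendations:

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Skip media and font files so only textual content is downloaded
  resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "css", "js", "ico", "woff", "woff2", "ttf"],
  // Cap each output file at 1 megabyte
  maxFileSize: 1,
  // Cap each output file at 5,000 tokens
  maxTokens: 5000,
};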

Running the Crawler

npm start

This will generate an output.json file.
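
Each crawled page becomes one record in that file. A sketch of the output shape, with illustrative values (field names may differ between versions):

[
  {
    "title": "Builder.io developer docs",
    "url": "https://www.builder.io/c/docs/developers",
    "html": "...text extracted by the configured selector..."
  }
]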

Deployment Options

Containerized Deployment

Navigate to the containerapp directory and modify config.ts. The output file will be generated in the data folder.
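
If you prefer to drive the container directly and the directory provides a standard Dockerfile, a typical build-and-run sequence might look like the following. The image name and mount path here are hypothetical; check the scripts in containerapp for the canonical commands:

docker build -t gpt-crawler ./containerapp
docker run -v "$(pwd)/data:/home/data" gpt-crawler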

API Server Mode

npm run start:server
  • The server runs on port 3000 by default.
  • Send the crawl configuration as a JSON body in a POST request to the /crawl endpoint (see the sketch after this list).
  • API documentation is available at the /api-docs endpoint (using Swagger).
  • You can copy .env.example to .env to customize environment variables.
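
As a sketch, a client could submit a crawl job as follows, assuming the endpoint accepts the same Config shape used in config.ts; consult /api-docs for the authoritative request schema:

// Submit a crawl job to a locally running gpt-crawler API server
const response = await fetch("http://localhost:3000/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://www.builder.io/c/docs/developers",
    match: "https://www.builder.io/c/docs/**",
    selector: ".docs-builder-container",
    maxPagesToCrawl: 10,
    outputFileName: "output.json",
  }),
});

const result = await response.json();
console.log(result);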

OpenAI Integration

Creating a Custom GPT (UI Access)

  1. Go to https://chat.openai.com/
  2. Click on your username in the bottom left corner.
  3. Select "My GPTs" from the menu.
  4. Select "Create a GPT".
  5. Select "Configure".
  6. Under "Knowledge", select "Upload a file" and upload the generated file.

Note: A paid ChatGPT plan may be required to create and use custom GPTs.

Creating an Assistant (API Access)

  1. Go to https://platform.openai.com/assistants
  2. Click "+ Create".
  3. Select "upload" and upload the generated file.
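
This step can also be scripted with the official openai Node SDK. Below is a minimal sketch of the v2 Assistants flow (file upload, vector store, file_search tool); the Assistants API is in beta and these calls may change, and the assistant name, instructions, and model are placeholders:

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the crawler output as a knowledge file
const file = await openai.files.create({
  file: fs.createReadStream("output.json"),
  purpose: "assistants",
});

// Index the file in a vector store so the assistant can search it
const vectorStore = await openai.beta.vectorStores.create({ name: "docs-crawl" });
await openai.beta.vectorStores.files.create(vectorStore.id, { file_id: file.id });

// Create an assistant that answers questions from the uploaded knowledge
const assistant = await openai.beta.assistants.create({
  name: "Docs Assistant",
  instructions: "Answer questions using the crawled documentation.",
  model: "gpt-4o",
  tools: [{ type: "file_search" }],
  tool_resources: { file_search: { vector_store_ids: [vectorStore.id] } },
});

console.log("Created assistant:", assistant.id);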

Technical Features

  • TypeScript Development: Provides type safety and a better development experience.
  • Express.js Server: Provides a RESTful API interface.
  • Docker Support: Facilitates containerized deployment.
  • Flexible Selectors: Supports CSS selectors for precise content targeting.
  • Resource Filtering: Allows excluding unwanted resource types like images and videos.
  • Size Control: Supports limiting file size and token count.

Real-World Example

By crawling Builder.io's own documentation, the project author used this tool to create a Builder.io assistant that answers questions about how to use and integrate Builder.io.

Advantages and Use Cases

  • Rapid Deployment: Create professional knowledge assistants in minutes.
  • Cost-Effective: Quickly generate AI assistants based on existing documentation.
  • Highly Customizable: Supports knowledge bases for specific domains or products.
  • Easy to Maintain: Can be re-crawled periodically to update the knowledge base.

Important Considerations

  • Ensure you have permission to crawl the target website.
  • Large files may need to be split for uploading.
  • Consider the website's crawling frequency limits.
  • It is recommended to test a small-scale crawl first to verify the configuration (see the sketch below).
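
For example, the configuration can be verified with a deliberately small crawl before a full run. This is a minimal sketch; the URL, selector, and filename are the illustrative values from earlier on this page:

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  // Keep the test run small; raise this once the match pattern and selector are confirmed
  maxPagesToCrawl: 5,
  outputFileName: "test-output.json",
};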

Summary

GPT-Crawler provides a powerful, flexible way to create professional AI assistants quickly, and is especially well suited to question-answering systems grounded in existing documentation or website content.