An open-source tool for quickly creating custom GPT assistants by crawling websites to generate knowledge files.
GPT-Crawler Project Details
Project Overview
GPT-Crawler is an open-source project developed by Builder.io, designed to generate knowledge files by crawling specified websites, thereby enabling the rapid creation of custom GPT assistants. Given one or more starting URLs, the tool automatically scrapes site content and generates data files suitable for uploading as knowledge to a custom GPT.
Core Features
- Website Content Crawling: Automatically scrapes content from specified websites.
- Knowledge File Generation: Converts scraped content into a format suitable for GPT training.
- Flexible Configuration: Supports various configuration options, including crawling rules and page selectors.
- Multiple Deployment Options: Supports local execution, containerized deployment, and API server mode.
Installation and Usage
Prerequisites
- Node.js >= 16
Quick Start
git clone https://github.com/builderio/gpt-crawler
cd gpt-crawler
npm i
Configuration File
Edit the url and selector properties in the config.ts file to meet your needs.
Example Configuration:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
Configuration Options Explained
type Config = {
  /** The URL to start crawling from. If a sitemap is provided, it will be used and all pages within it will be downloaded. */
  url: string;
  /** The pattern used to match links on the page for subsequent crawling. */
  match: string;
  /** The selector used to grab the inner text. */
  selector: string;
  /** Do not crawl more than this many pages. */
  maxPagesToCrawl: number;
  /** The filename for the completed data. */
  outputFileName: string;
  /** Optional resource types to exclude. */
  resourceExclusions?: string[];
  /** Optional max file size (in megabytes). */
  maxFileSize?: number;
  /** Optional max tokens. */
  maxTokens?: number;
};
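As a hedged illustration of the optional fields, the following extends the earlier example. The exact values resourceExclusions expects (file extensions versus browser resource types) are an assumption here; check the project's documentation before relying on them:

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Assumed to be file extensions to skip while crawling (illustrative values)
  resourceExclusions: ["png", "jpg", "svg", "mp4", "woff"],
  maxFileSize: 5, // cap the output file size in megabytes (illustrative)
  maxTokens: 2000000, // cap the total tokens in the output (illustrative)
};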
Running the Crawler
npm start
This will generate an output.json file.
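Each entry in output.json corresponds to one crawled page. The field names below (title, url, html) are assumptions based on typical gpt-crawler output and may differ between versions; a small TypeScript sketch for inspecting the result:

import { readFileSync } from "node:fs";

// Assumed entry shape; verify against your generated file.
type CrawledPage = {
  title: string;
  url: string;
  html: string; // text extracted via the configured selector
};

const pages: CrawledPage[] = JSON.parse(readFileSync("output.json", "utf-8"));
console.log(`Crawled ${pages.length} pages`);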
Deployment Options
Containerized Deployment
Navigate to the containerapp directory and modify config.ts. The output file will be generated in the data folder.
API Server Mode
npm run start:server
- The server runs on port 3000 by default.
- Use the /crawl endpoint for POST requests (see the sketch after this list).
- API documentation is available at the /api-docs endpoint (using Swagger).
- You can copy .env.example to .env to customize environment variables.
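As a quick smoke test of the server, the hypothetical request below assumes the /crawl endpoint accepts a config.ts-style JSON body (Node 18+ for the global fetch):

// Sketch: POST a crawl config to the local API server.
const res = await fetch("http://localhost:3000/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://www.builder.io/c/docs/developers",
    match: "https://www.builder.io/c/docs/**",
    selector: ".docs-builder-container",
    maxPagesToCrawl: 10,
  }),
});
console.log(res.status, await res.json());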
OpenAI Integration
Creating a Custom GPT (UI Access)
- Go to https://chat.openai.com/
- Click on your username in the bottom left corner.
- Select "My GPTs" from the menu.
- Select "Create a GPT".
- Select "Configure".
- Under "Knowledge", select "Upload a file" and upload the generated file.
Note: A paid ChatGPT plan may be required to create and use custom GPTs.
Creating an Assistant (API Access)
- Go to https://platform.openai.com/assistants
- Click "+ Create".
- Select "upload" and upload the generated file.
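For scripted setups, a minimal sketch with the openai Node SDK follows. The Assistants API surface has changed over time, so the file_search and vector-store wiring below is an assumption to verify against the current OpenAI documentation:

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the crawler output as assistant knowledge.
const file = await openai.files.create({
  file: fs.createReadStream("output.json"),
  purpose: "assistants",
});

// Beta endpoints; names and parameters may have moved in newer SDK versions.
const store = await openai.beta.vectorStores.create({
  name: "gpt-crawler-knowledge",
  file_ids: [file.id],
});

const assistant = await openai.beta.assistants.create({
  name: "Docs Assistant",
  model: "gpt-4o",
  tools: [{ type: "file_search" }],
  tool_resources: { file_search: { vector_store_ids: [store.id] } },
});

console.log("Assistant created:", assistant.id);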
Technical Features
- TypeScript Development: Provides type safety and a better development experience.
- Express.js Server: Provides a RESTful API interface.
- Docker Support: Facilitates containerized deployment.
- Flexible Selectors: Supports CSS selectors for precise content targeting.
- Resource Filtering: Allows excluding unwanted resource types like images and videos.
- Size Control: Supports limiting file size and token count.
Real-World Example
The project author used this tool to create a Builder.io assistant, which answers questions about how to use and integrate Builder.io by crawling Builder.io's documentation.
Advantages and Use Cases
- Rapid Deployment: Create professional knowledge assistants in minutes.
- Cost-Effective: Quickly generate AI assistants based on existing documentation.
- Highly Customizable: Supports knowledge bases for specific domains or products.
- Easy to Maintain: Can be re-crawled periodically to update the knowledge base.
Important Considerations
- Ensure you have permission to crawl the target website.
- Large files may need to be split for uploading.
- Consider the website's crawling frequency limits.
- It is recommended to test small-scale crawls first to verify the configuration.
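On the last point, one way to dry-run a configuration is to temporarily cap the crawl in config.ts before the full run (values are illustrative):

// Trial settings: confirm match and selector behave before a full crawl.
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 5, // small cap for the dry run
  outputFileName: "test-output.json",
};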
Summary
GPT-Crawler provides a powerful and flexible solution for quickly creating professional AI assistants, especially suitable for scenarios requiring intelligent question-answering systems based on existing documentation or website content.