GPT-Crawler is an open-source project from Builder.io that crawls specified websites and generates knowledge files from them, so that a custom GPT assistant can be set up quickly. Given one or more URLs, the tool scrapes the site content and writes data files that can be uploaded to a custom GPT as its knowledge base.
git clone https://github.com/builderio/gpt-crawler
npm i
Edit the url and selector properties in the config.ts file to meet your needs.
Example Configuration:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
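To illustrate how a match pattern such as the one above selects links, here is a minimal sketch of glob matching. It assumes the common glob convention in which `**` matches any characters (including `/`) and `*` stays within one path segment; the crawler's actual matcher may differ in details. The helpers `globToRegExp` and `matchesPattern` are hypothetical names introduced only for this example.

```typescript
// Simplified glob-to-RegExp conversion (illustrative only; not the crawler's own matcher).
function globToRegExp(glob: string): RegExp {
  let pattern = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        pattern += ".*"; // "**" may span path separators
        i++; // skip the second "*"
      } else {
        pattern += "[^/]*"; // "*" stays within one path segment
      }
    } else {
      // Escape regex metacharacters so they match literally
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${pattern}$`);
}

// Returns true when the URL is covered by the glob pattern.
function matchesPattern(url: string, glob: string): boolean {
  return globToRegExp(glob).test(url);
}

const docsPattern = "https://www.builder.io/c/docs/**";
const insideDocs = matchesPattern("https://www.builder.io/c/docs/developers", docsPattern);
const outsideDocs = matchesPattern("https://www.builder.io/blog/some-post", docsPattern);
```

With this pattern, links under /c/docs/ are followed and links elsewhere on the site are skipped, which keeps the crawl focused on the documentation.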
type Config = {
  /** The URL to start crawling from. If a sitemap is provided, it will be used and all pages within it will be downloaded. */
  url: string;
  /** The pattern used to match links on the page for subsequent crawling. */
  match: string;
  /** The selector used to grab the inner text. */
  selector: string;
  /** Do not crawl more than this many pages. */
  maxPagesToCrawl: number;
  /** The filename for the completed data. */
  outputFileName: string;
  /** Optional resource types to exclude. */
  resourceExclusions?: string[];
  /** Optional max file size (in megabytes). */
  maxFileSize?: number;
  /** Optional max tokens. */
  maxTokens?: number;
};
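As a sketch of how the optional fields fit together with the required ones, the following example fills in a complete configuration and runs a few sanity checks on it before crawling. The optional values shown (excluded resource types, size and token limits) are illustrative assumptions, and `validateConfig` is a hypothetical helper that is not part of the project.

```typescript
// Local copy of the Config shape described above.
type Config = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
  resourceExclusions?: string[];
  maxFileSize?: number; // megabytes
  maxTokens?: number;
};

const docsConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  resourceExclusions: ["png", "jpg", "css"], // illustrative values
  maxFileSize: 1, // illustrative: cap the output at 1 MB
  maxTokens: 2000000, // illustrative token cap
};

// Hypothetical helper: returns a list of problems, empty when the config looks sane.
function validateConfig(config: Config): string[] {
  const problems: string[] = [];
  if (!/^https?:\/\//.test(config.url)) {
    problems.push("url must start with http:// or https://");
  }
  if (config.selector.trim() === "") {
    problems.push("selector must not be empty");
  }
  if (!Number.isInteger(config.maxPagesToCrawl) || config.maxPagesToCrawl < 1) {
    problems.push("maxPagesToCrawl must be a positive integer");
  }
  if (config.maxFileSize !== undefined && config.maxFileSize <= 0) {
    problems.push("maxFileSize must be greater than 0");
  }
  if (config.maxTokens !== undefined && config.maxTokens <= 0) {
    problems.push("maxTokens must be greater than 0");
  }
  return problems;
}

const configProblems = validateConfig(docsConfig);
```

Checking the configuration up front is a design choice for this sketch: a crawl can take a while, so catching a zero page limit or an empty selector before starting avoids a wasted run.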
npm start
This will generate an output.json file.
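Before uploading the generated file, it can help to check what was captured. The sketch below assumes that output.json holds a JSON array of page records with `title`, `url`, and `html` string fields; that shape is an assumption to verify against your own output. `parseOutput` and `summarizePages` are hypothetical helpers written for this example.

```typescript
// Assumed shape of one record in output.json (verify against your own file).
type CrawledPage = { title: string; url: string; html: string };

// Parses the file contents and checks that every entry has the assumed fields.
function parseOutput(json: string): CrawledPage[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) {
    throw new Error("expected a JSON array of pages");
  }
  return data.map((entry: unknown, index: number) => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`entry ${index} is not an object`);
    }
    const page = entry as Partial<CrawledPage>;
    if (
      typeof page.title !== "string" ||
      typeof page.url !== "string" ||
      typeof page.html !== "string"
    ) {
      throw new Error(`entry ${index} is missing title, url, or html`);
    }
    return { title: page.title, url: page.url, html: page.html };
  });
}

// Reports how many pages were captured and how much text they contain.
function summarizePages(pages: CrawledPage[]) {
  let totalChars = 0;
  let longestUrl: string | null = null;
  let longestLength = -1;
  for (const page of pages) {
    totalChars += page.html.length;
    if (page.html.length > longestLength) {
      longestLength = page.html.length;
      longestUrl = page.url;
    }
  }
  return { pageCount: pages.length, totalChars, longestUrl };
}

// Inline sample standing in for the contents of output.json.
const sampleOutput = JSON.stringify([
  { title: "Intro", url: "https://example.com/docs/intro", html: "abc" },
  { title: "Guide", url: "https://example.com/docs/guide", html: "hello" },
]);
const outputSummary = summarizePages(parseOutput(sampleOutput));
```

In practice the string passed to `parseOutput` would be the contents of output.json read from disk. A page count far below maxPagesToCrawl, or pages with almost no text, usually suggests that the match pattern or selector needs adjusting.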
To run the crawler in a container instead, navigate to the containerapp directory and modify the config.ts there. The output file will be generated in the data folder.
npm run start:server
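This starts the crawler as an API server.
Send POST requests to the /crawl endpoint to run a crawl.
API documentation is served at the /api-docs endpoint (using Swagger).
Copy .env.example to .env to customize environment variables.
Note: A paid ChatGPT plan may be required to create and use custom GPTs.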
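A request to the /crawl endpoint mentioned above could be assembled roughly as follows. This is a sketch under stated assumptions: the server is taken to listen on http://localhost:3000 and to accept the crawl configuration as a JSON request body. Check the /api-docs page and your .env settings for the actual port and request schema. `buildCrawlRequest` and `CrawlConfig` are hypothetical names used only here.

```typescript
// Minimal configuration shape sent to the server (mirrors the required config fields).
type CrawlConfig = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
};

type CrawlRequest = {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
};

// Builds the POST request for the /crawl endpoint.
// Assumption: base URL and JSON body format; confirm both against /api-docs.
function buildCrawlRequest(
  config: CrawlConfig,
  baseUrl: string = "http://localhost:3000",
): CrawlRequest {
  const endpoint = baseUrl.replace(/\/+$/, "") + "/crawl"; // avoid a double slash
  return {
    url: endpoint,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(config),
    },
  };
}

const crawlRequest = buildCrawlRequest({
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
});

// With the server running, the request could then be sent with fetch:
// const response = await fetch(crawlRequest.url, crawlRequest.init);
```

Separating request construction from sending keeps the example testable without a running server; the response format is not covered here and should be taken from the Swagger page.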
The project author used this tool to create a Builder.io assistant, which answers questions about how to use and integrate Builder.io by crawling Builder.io's documentation.
GPT-Crawler offers a quick and flexible way to create AI assistants, and it is especially suited to question-answering systems built on existing documentation or website content.