Easy Dataset - LLM Fine-tuning Dataset Creation Tool
A powerful tool designed specifically for creating fine-tuning datasets for large language models, supporting intelligent document processing, question generation, and multi-format export.
Project Overview
Easy Dataset is a professional tool designed specifically for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data, making the model fine-tuning process simple and efficient.
With Easy Dataset, you can transform your domain knowledge into structured datasets compatible with any LLM API that follows the OpenAI format.
Core Features
🧠 Intelligent Document Processing
- Supports uploading Markdown files and automatically splitting them into meaningful segments
- Intelligently identifies document structure and content hierarchy
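To illustrate the idea, here is a minimal sketch of header-based Markdown splitting in JavaScript. It is not the project's actual `lib/text-splitter` implementation, and the `minLength` heuristic is an assumption:

```javascript
// Minimal illustration of header-based Markdown splitting.
// NOT the project's actual splitter; minLength is an assumed heuristic.
function splitMarkdown(markdown, { minLength = 200 } = {}) {
  const lines = markdown.split('\n');
  const segments = [];
  let current = [];
  for (const line of lines) {
    // Start a new segment at each heading once the current one is long enough.
    if (/^#{1,6}\s/.test(line) && current.join('\n').length >= minLength) {
      segments.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length) segments.push(current.join('\n').trim());
  return segments.filter(Boolean);
}
```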
❓ Intelligent Question Generation
- Automatically extracts relevant questions from each text segment
- Supports batch question generation to improve processing efficiency
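A rough sketch of how batched generation might be structured; the `callLLM` helper and the prompt wording are hypothetical, not the project's internals:

```javascript
// Hypothetical sketch: generate questions for text chunks in concurrent batches.
// `callLLM` stands in for whatever client actually sends the prompt.
async function generateQuestionsForChunks(chunks, callLLM, batchSize = 5) {
  const results = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    // Each batch is processed concurrently to improve throughput.
    const questions = await Promise.all(
      batch.map(chunk => callLLM(`Generate 3 questions about:\n\n${chunk}`))
    );
    results.push(...questions);
  }
  return results;
}
```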
💬 Answer Generation
- Uses LLM APIs to generate comprehensive answers for each question
- Supports custom system prompts to guide model responses
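Because the tool targets OpenAI-format APIs, answer generation boils down to a chat-completions call with the custom system prompt in the `system` role. A minimal sketch, where the endpoint, model, and environment variable are placeholders rather than project defaults:

```javascript
// Minimal sketch of a chat-completions call with a custom system prompt.
// Endpoint, model, and env var are placeholders, not project defaults.
async function generateAnswer(question, systemPrompt) {
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: systemPrompt }, // guides the model's answers
        { role: 'user', content: question }
      ]
    })
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```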
✏️ Flexible Editing
- Allows editing of questions, answers, and datasets at any stage of the process
- Provides an intuitive user interface for content management
📤 Multi-Format Export
- Supports multiple dataset formats (Alpaca, ShareGPT)
- Supports multiple file types (JSON, JSONL)
🔧 Broad Model Support
- Compatible with all LLM APIs following the OpenAI format
- Supports Ollama local model deployment
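In practice, OpenAI-format compatibility means a provider is defined by little more than a base URL and a model name; Ollama, for example, exposes an OpenAI-compatible endpoint at `http://localhost:11434/v1`. A sketch of what such a provider table could look like (model names are placeholders):

```javascript
// Sketch of provider configs; any OpenAI-format API fits this shape.
// Model names are placeholders.
const providers = {
  openai: { baseURL: 'https://api.openai.com/v1', model: 'gpt-4o-mini' },
  ollama: { baseURL: 'http://localhost:11434/v1', model: 'llama3' } // local, no API key required
};
```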
👥 User-Friendly Interface
- Intuitive UI designed for both technical and non-technical users
- Full Chinese and English internationalization support
Technical Architecture
Tech Stack
- Frontend Framework: Next.js 14.1.0
- UI Library: React 18.2.0
- Component Library: Material UI 5.15.7
- Database: Local file database
- License: Apache License 2.0
Project Structure
easy-dataset/
├── app/ # Next.js application directory
│ ├── api/ # API routes
│ │ ├── llm/ # LLM API integration
│ │ │ ├── ollama/ # Ollama API integration
│ │ │ └── openai/ # OpenAI API integration
│ │ └── projects/ # Project management API
│ │ └── [projectId]/
│ │ ├── chunks/ # Text chunk operations
│ │ ├── datasets/ # Dataset generation and management
│ │ ├── questions/ # Question management
│ │ └── split/ # Text splitting operations
│ └── projects/ # Frontend project pages
│ └── [projectId]/
│ ├── datasets/ # Dataset management interface
│ ├── questions/ # Question management interface
│ ├── settings/ # Project settings interface
│ └── text-split/ # Text processing interface
├── components/ # React components
│ ├── datasets/ # Dataset-related components
│ ├── home/ # Home page component
│ ├── projects/ # Project management components
│ ├── questions/ # Question management components
│ └── text-split/ # Text processing components
├── lib/ # Core libraries and tools
│ ├── db/ # Database operations
│ ├── i18n/ # Internationalization
│ ├── llm/ # LLM integration
│ │ ├── common/ # LLM common tools
│ │ ├── core/ # Core LLM client
│ │ └── prompts/ # Prompt templates
│ └── text-splitter/ # Text splitting tool
├── locales/ # Internationalization resources
│ ├── en/ # English translation
│ └── zh-CN/ # Chinese translation
└── local-db/ # Local file database
└── projects/ # Project data storage
Installation and Deployment
System Requirements
- Node.js 18.x or higher
- pnpm (recommended) or npm
Local Development
- Clone the repository:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
- Install dependencies:
npm install
- Build and start the application:
npm run build
npm run start
Docker Deployment
- Clone the repository:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
- Build the Docker image:
docker build -t easy-dataset .
- Run the container:
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
Note: Replace {YOUR_LOCAL_DB_PATH} with the actual path where you want to store the local database.
- Access the application:
Open http://localhost:1717 in your browser.
Desktop Application Download
| Windows | MacOS | Linux |
| --- | --- | --- |
| Setup.exe | Intel / Apple Silicon | AppImage |
Usage Flow
1. Create a Project
- Click the "Create Project" button on the homepage
- Enter the project name and description
- Configure your preferred LLM API settings
2. Upload and Split Text
- Upload your Markdown file in the "Text Split" section
- Review the automatically split text segments
- Adjust the splitting results as needed
3. Generate Questions
- Navigate to the "Questions" section
- Select the text segments for which you want to generate questions
- Review and edit the generated questions
- Organize questions using the tag tree
4. Generate a Dataset
- Go to the "Datasets" section
- Select the questions to include in the dataset
- Generate answers using the configured LLM
- Review and edit the generated answers
5. Export the Dataset
- Click the "Export" button in the dataset section
- Select your preferred format (Alpaca or ShareGPT)
- Select the file format (JSON or JSONL)
- Add custom system prompts if needed
- Export your dataset
Feature Highlights
Intelligent Prompt System
The project has built-in professional prompt templates for different languages:
- Chinese question generation prompt
- English question generation prompt
- Chinese answer generation prompt
- English answer generation prompt
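For a sense of what such a template looks like, here is an illustrative English question-generation prompt. The project's built-in templates (under `lib/llm/prompts`) are more elaborate; this is only a sketch:

```javascript
// Illustrative question-generation template; the project's built-in
// prompts in lib/llm/prompts are more elaborate than this sketch.
const questionPrompt = (text, count = 3) => `
You are an expert at creating fine-tuning data.
Read the passage below and write ${count} clear, self-contained questions
that can be answered using only this passage.

Passage:
${text}

Return the questions as a JSON array of strings.
`.trim();
```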
Multi-LLM Support
- Supports OpenAI API
- Supports Ollama local deployment
- Compatible with all OpenAI-formatted APIs
Flexible Data Formats
- Alpaca format: Suitable for instruction fine-tuning
- ShareGPT format: Suitable for dialogue training
- JSON/JSONL output format selection
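The two export formats differ mainly in record shape. A sketch of converters following the widely used conventions (field names are the standard Alpaca and ShareGPT ones, not taken from the project's source):

```javascript
// Alpaca: one instruction/input/output triple per record.
function toAlpaca(qa) {
  return { instruction: qa.question, input: '', output: qa.answer };
}

// ShareGPT: a conversation as alternating human/gpt turns.
function toShareGPT(qa) {
  return {
    conversations: [
      { from: 'human', value: qa.question },
      { from: 'gpt', value: qa.answer }
    ]
  };
}

// JSONL is simply one JSON record per line; JSON would be a single array.
const toJSONL = (qaPairs, convert = toAlpaca) =>
  qaPairs.map(qa => JSON.stringify(convert(qa))).join('\n');
```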