Easy Dataset - LLM Fine-tuning Dataset Creation Tool
A powerful tool designed specifically for creating fine-tuning datasets for large language models, supporting intelligent document processing, question generation, and multi-format export.
Project Overview
Easy Dataset is a professional tool designed specifically for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data, making the model fine-tuning process simple and efficient.
With Easy Dataset, you can transform your domain knowledge into structured datasets compatible with any LLM API that follows the OpenAI format.
Core Features
🧠 Intelligent Document Processing
- Supports uploading Markdown files and automatically splitting them into meaningful segments
- Intelligently identifies document structure and content hierarchy
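To illustrate the idea, here is a minimal sketch of header-based Markdown splitting in JavaScript. It is not the project's actual `lib/text-splitter` implementation, and the `minLength` heuristic is an assumption:

```javascript
// Minimal illustration of header-based Markdown splitting.
// NOT the project's actual splitter; minLength is an assumed heuristic.
function splitMarkdown(markdown, { minLength = 200 } = {}) {
  const lines = markdown.split('\n');
  const segments = [];
  let current = [];
  for (const line of lines) {
    // Start a new segment at each heading once the current one is long enough.
    if (/^#{1,6}\s/.test(line) && current.join('\n').length >= minLength) {
      segments.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length) segments.push(current.join('\n').trim());
  return segments.filter(Boolean);
}
```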
❓ Intelligent Question Generation
- Automatically extracts relevant questions from each text segment
- Supports batch question generation to improve processing efficiency
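A rough sketch of how batched generation might be structured; the `callLLM` helper and the prompt wording are hypothetical, not the project's internals:

```javascript
// Hypothetical sketch: generate questions for text chunks in concurrent batches.
// `callLLM` stands in for whatever client actually sends the prompt.
async function generateQuestionsForChunks(chunks, callLLM, batchSize = 5) {
  const results = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    // Each batch is processed concurrently to improve throughput.
    const questions = await Promise.all(
      batch.map(chunk => callLLM(`Generate 3 questions about:\n\n${chunk}`))
    );
    results.push(...questions);
  }
  return results;
}
```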
💬 Answer Generation
- Uses LLM APIs to generate comprehensive answers for each question
- Supports custom system prompts to guide model responses
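Because the tool targets OpenAI-format APIs, answer generation boils down to a chat-completions call with the custom system prompt in the `system` role. A minimal sketch, where the endpoint, model, and environment variable are placeholders rather than project defaults:

```javascript
// Minimal sketch of a chat-completions call with a custom system prompt.
// Endpoint, model, and env var are placeholders, not project defaults.
async function generateAnswer(question, systemPrompt) {
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: systemPrompt }, // guides the model's answers
        { role: 'user', content: question }
      ]
    })
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```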
✏️ Flexible Editing
- Allows editing of questions, answers, and datasets at any stage of the process
- Provides an intuitive user interface for content management
📤 Multi-Format Export
- Supports multiple dataset formats (Alpaca, ShareGPT)
- Supports multiple file types (JSON, JSONL)
🔧 Broad Model Support
- Compatible with all LLM APIs following the OpenAI format
- Supports Ollama local model deployment
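In practice, OpenAI-format compatibility means a provider is defined by little more than a base URL and a model name; Ollama, for example, exposes an OpenAI-compatible endpoint at `http://localhost:11434/v1`. A sketch of what such a provider table could look like (model names are placeholders):

```javascript
// Sketch of provider configs; any OpenAI-format API fits this shape.
// Model names are placeholders.
const providers = {
  openai: { baseURL: 'https://api.openai.com/v1', model: 'gpt-4o-mini' },
  ollama: { baseURL: 'http://localhost:11434/v1', model: 'llama3' } // local, no API key required
};
```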
👥 User-Friendly Interface
- Intuitive UI designed for both technical and non-technical users
- Full Chinese and English internationalization support
Technical Architecture
Tech Stack
- Frontend Framework: Next.js 14.1.0
- UI Library: React 18.2.0
- Component Library: Material UI 5.15.7
- Database: Local file database
- License: Apache License 2.0
Project Structure
easy-dataset/
├── app/ # Next.js application directory
│ ├── api/ # API routes
│ │ ├── llm/ # LLM API integration
│ │ │ ├── ollama/ # Ollama API integration
│ │ │ └── openai/ # OpenAI API integration
│ │ └── projects/ # Project management API
│ │ └── [projectId]/
│ │ ├── chunks/ # Text chunk operations
│ │ ├── datasets/ # Dataset generation and management
│ │ ├── questions/ # Question management
│ │ └── split/ # Text splitting operations
│ └── projects/ # Frontend project pages
│ └── [projectId]/
│ ├── datasets/ # Dataset management interface
│ ├── questions/ # Question management interface
│ ├── settings/ # Project settings interface
│ └── text-split/ # Text processing interface
├── components/ # React components
│ ├── datasets/ # Dataset-related components
│ ├── home/ # Home page component
│ ├── projects/ # Project management components
│ ├── questions/ # Question management components
│ └── text-split/ # Text processing components
├── lib/ # Core libraries and tools
│ ├── db/ # Database operations
│ ├── i18n/ # Internationalization
│ ├── llm/ # LLM integration
│ │ ├── common/ # LLM common tools
│ │ ├── core/ # Core LLM client
│ │ └── prompts/ # Prompt templates
│ └── text-splitter/ # Text splitting tool
├── locales/ # Internationalization resources
│ ├── en/ # English translation
│ └── zh-CN/ # Chinese translation
└── local-db/ # Local file database
└── projects/ # Project data storage
Installation and Deployment
System Requirements
- Node.js 18.x or higher
- pnpm (recommended) or npm
Local Development
- Clone the repository:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
- Install dependencies:
npm install
- Build and start the application:
npm run build
npm run start
Docker Deployment
- Clone the repository:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
- Build the Docker image:
docker build -t easy-dataset .
- Run the container:
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
Note: Replace {YOUR_LOCAL_DB_PATH} with the actual path where you want to store the local database.
- Access the application:
Open http://localhost:1717 in your browser.
Desktop Application Download
| Windows | MacOS | Linux |
| --- | --- | --- |
| Setup.exe | Intel / Apple Silicon | AppImage |
Usage Flow
1. Create a Project
- Click the "Create Project" button on the homepage
- Enter the project name and description
- Configure your preferred LLM API settings
2. Upload and Split Text
- Upload your Markdown file in the "Text Split" section
- Review the automatically split text segments
- Adjust the splitting results as needed
3. Generate Questions
- Navigate to the "Questions" section
- Select the text segments for which you want to generate questions
- Review and edit the generated questions
- Organize questions using the tag tree
4. Generate a Dataset
- Go to the "Datasets" section
- Select the questions to include in the dataset
- Generate answers using the configured LLM
- Review and edit the generated answers
5. Export the Dataset
- Click the "Export" button in the dataset section
- Select your preferred format (Alpaca or ShareGPT)
- Select the file format (JSON or JSONL)
- Add custom system prompts if needed
- Export your dataset
Feature Highlights
Intelligent Prompt System
The project has built-in professional prompt templates for different languages:
- Chinese question generation prompt
- English question generation prompt
- Chinese answer generation prompt
- English answer generation prompt
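For a sense of what such a template looks like, here is an illustrative English question-generation prompt. The project's built-in templates (under `lib/llm/prompts`) are more elaborate; this is only a sketch:

```javascript
// Illustrative question-generation template; the project's built-in
// prompts in lib/llm/prompts are more elaborate than this sketch.
const questionPrompt = (text, count = 3) => `
You are an expert at creating fine-tuning data.
Read the passage below and write ${count} clear, self-contained questions
that can be answered using only this passage.

Passage:
${text}

Return the questions as a JSON array of strings.
`.trim();
```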
Multi-LLM Support
- Supports OpenAI API
- Supports Ollama local deployment
- Compatible with all OpenAI-formatted APIs
Flexible Data Formats
- Alpaca format: Suitable for instruction fine-tuning
- ShareGPT format: Suitable for dialogue training
- JSON/JSONL output format selection
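The two export formats differ mainly in record shape. A sketch of converters following the widely used conventions (field names are the standard Alpaca and ShareGPT ones, not taken from the project's source):

```javascript
// Alpaca: one instruction/input/output triple per record.
function toAlpaca(qa) {
  return { instruction: qa.question, input: '', output: qa.answer };
}

// ShareGPT: a conversation as alternating human/gpt turns.
function toShareGPT(qa) {
  return {
    conversations: [
      { from: 'human', value: qa.question },
      { from: 'gpt', value: qa.answer }
    ]
  };
}

// JSONL is simply one JSON record per line; JSON would be a single array.
const toJSONL = (qaPairs, convert = toAlpaca) =>
  qaPairs.map(qa => JSON.stringify(convert(qa))).join('\n');
```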