A powerful tool designed specifically for creating fine-tuning datasets for large language models, supporting intelligent document processing, question generation, and multi-format export.

ConardLi/easy-dataset · JavaScript · 9.1k · Last Updated: 2025-07-02

Easy Dataset - LLM Fine-tuning Dataset Creation Tool

Project Overview

Easy Dataset is a professional tool designed specifically for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data, making the model fine-tuning process simple and efficient.

With Easy Dataset, you can transform your domain knowledge into structured datasets compatible with any OpenAI-format LLM API, making the fine-tuning process more convenient and efficient.

Core Features

🧠 Intelligent Document Processing

  • Supports uploading Markdown files and automatically splitting them into meaningful segments
  • Intelligently identifies document structure and content hierarchy

❓ Intelligent Question Generation

  • Automatically extracts relevant questions from each text segment
  • Supports batch question generation to improve processing efficiency

💬 Answer Generation

  • Uses LLM APIs to generate comprehensive answers for each question
  • Supports custom system prompts to guide model responses

✏️ Flexible Editing

  • Allows editing of questions, answers, and datasets at any stage of the process
  • Provides an intuitive user interface for content management

📤 Multi-Format Export

  • Supports multiple dataset formats (Alpaca, ShareGPT)
  • Supports multiple file types (JSON, JSONL)

🔧 Broad Model Support

  • Compatible with all LLM APIs following the OpenAI format
  • Supports Ollama local model deployment

👥 User-Friendly Interface

  • Intuitive UI designed for both technical and non-technical users
  • Full Chinese and English internationalization support

Technical Architecture

Tech Stack

  • Frontend Framework: Next.js 14.1.0
  • UI Library: React 18.2.0
  • Component Library: Material UI 5.15.7
  • Database: Local file database
  • License: Apache License 2.0

Project Structure

easy-dataset/
├── app/                    # Next.js application directory
│   ├── api/               # API routes
│   │   ├── llm/          # LLM API integration
│   │   │   ├── ollama/   # Ollama API integration
│   │   │   └── openai/   # OpenAI API integration
│   │   └── projects/     # Project management API
│   │       └── [projectId]/
│   │           ├── chunks/     # Text chunk operations
│   │           ├── datasets/   # Dataset generation and management
│   │           ├── questions/  # Question management
│   │           └── split/      # Text splitting operations
│   └── projects/          # Frontend project pages
│       └── [projectId]/
│           ├── datasets/   # Dataset management interface
│           ├── questions/  # Question management interface
│           ├── settings/   # Project settings interface
│           └── text-split/ # Text processing interface
├── components/            # React components
│   ├── datasets/         # Dataset-related components
│   ├── home/            # Home page component
│   ├── projects/        # Project management components
│   ├── questions/       # Question management components
│   └── text-split/      # Text processing components
├── lib/                  # Core libraries and tools
│   ├── db/              # Database operations
│   ├── i18n/            # Internationalization
│   ├── llm/             # LLM integration
│   │   ├── common/      # LLM common tools
│   │   ├── core/        # Core LLM client
│   │   └── prompts/     # Prompt templates
│   └── text-splitter/   # Text splitting tool
├── locales/             # Internationalization resources
│   ├── en/             # English translation
│   └── zh-CN/          # Chinese translation
└── local-db/           # Local file database
    └── projects/       # Project data storage

Installation and Deployment

System Requirements

  • Node.js 18.x or higher
  • pnpm (recommended) or npm

Local Development

  1. Clone the repository:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
  2. Install dependencies:
npm install   # or: pnpm install
  3. Build and start the application:
npm run build
npm run start

Docker Deployment

  1. Clone the repository:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
  2. Build the Docker image:
docker build -t easy-dataset .
  3. Run the container:
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset

Note: Replace {YOUR_LOCAL_DB_PATH} with the actual path where you want to store the local database.

  4. Access the application: Open your browser and navigate to http://localhost:1717

Desktop Application Download

  • Windows: Setup.exe
  • MacOS: Intel / Apple Silicon (M series)
  • Linux: AppImage

Usage Flow

1. Create a Project

  • Click the "Create Project" button on the homepage
  • Enter the project name and description
  • Configure your preferred LLM API settings

2. Upload and Split Text

  • Upload your Markdown file in the "Text Split" section
  • Review the automatically split text segments (a simplified splitting sketch follows this list)
  • Adjust the splitting results as needed
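
The project's actual splitting logic lives in lib/text-splitter/; the snippet below is only a simplified sketch of the general idea, splitting a Markdown string at its headings (the heading levels and field names are illustrative assumptions, not the real implementation):

// Simplified sketch of heading-based Markdown splitting; the real logic lives in lib/text-splitter/.
function splitMarkdown(markdown) {
  const segments = [];
  let current = { heading: null, lines: [] };
  for (const line of markdown.split('\n')) {
    if (/^#{1,3}\s/.test(line)) {
      // Start a new segment at each heading (levels 1-3 here, purely as an example).
      if (current.heading !== null || current.lines.some(l => l.trim())) segments.push(current);
      current = { heading: line.replace(/^#+\s*/, ''), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  segments.push(current);
  return segments.map(s => ({ heading: s.heading, text: s.lines.join('\n').trim() }));
}

console.log(splitMarkdown('# Intro\nSome text.\n## Details\nMore text.'));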

3. Generate Questions

  • Navigate to the "Questions" section
  • Select the text segments for which you want to generate questions (see the prompt sketch after this list)
  • Review and edit the generated questions
  • Organize questions using the tag tree
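
Question generation is driven by the prompt templates in lib/llm/prompts/. As a rough sketch of what such a request could look like (the wording and field names below are illustrative assumptions, not the project's actual prompts):

// Illustrative only: the project's real templates live in lib/llm/prompts/.
function buildQuestionMessages(segmentText, count = 5) {
  return [
    { role: 'system', content: 'You generate training questions from reference text.' },
    {
      role: 'user',
      content: `Based on the following text, write ${count} standalone questions that it can answer. ` +
        `Return them as a JSON array of strings.\n\n${segmentText}`
    }
  ];
}

console.log(buildQuestionMessages('Easy Dataset splits Markdown files into segments.', 3));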

4. Generate a Dataset

  • Go to the "Datasets" section
  • Select the questions to include in the dataset
  • Generate answers using the configured LLM (see the API call sketch after this list)
  • Review and edit the generated answers
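
Answer generation goes through whichever OpenAI-compatible endpoint you have configured. A minimal sketch of such a call, assuming Node.js 18+ and placeholder endpoint, model, and key values (these are not values Easy Dataset sets for you):

// Minimal sketch of an OpenAI-format chat completion call (Node.js 18+, global fetch).
// BASE_URL, MODEL, and API_KEY are placeholders for your own provider settings.
const BASE_URL = 'https://api.openai.com/v1';
const MODEL = 'gpt-4o-mini';
const API_KEY = process.env.OPENAI_API_KEY;

async function generateAnswer(question, systemPrompt) {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${API_KEY}` },
    body: JSON.stringify({
      model: MODEL,
      messages: [
        { role: 'system', content: systemPrompt }, // the custom system prompt
        { role: 'user', content: question }
      ]
    })
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

generateAnswer('What formats does Easy Dataset export?', 'Answer in one sentence.').then(console.log);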

5. Export the Dataset

  • Click the "Export" button in the dataset section
  • Select your preferred format (Alpaca or ShareGPT)
  • Select the file format (JSON or JSONL; the difference is sketched after this list)
  • Add custom system prompts if needed
  • Export your dataset
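
The difference between the two file formats is only in serialization: JSON writes the whole dataset as a single array, while JSONL writes one JSON object per line. A small sketch with hypothetical records:

// The same records serialized both ways: JSON is one array, JSONL is one object per line.
const records = [
  { instruction: 'What is Easy Dataset?', input: '', output: 'A fine-tuning dataset creation tool.' },
  { instruction: 'Which formats can it export?', input: '', output: 'Alpaca and ShareGPT.' }
];

const asJson = JSON.stringify(records, null, 2);                // dataset.json
const asJsonl = records.map(r => JSON.stringify(r)).join('\n'); // dataset.jsonl

console.log(asJson);
console.log(asJsonl);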

Featured Functions

Intelligent Prompt System

The project has built-in professional prompt templates for different languages:

  • Chinese question generation prompt
  • English question generation prompt
  • Chinese answer generation prompt
  • English answer generation prompt

Multi-LLM Support

  • Supports OpenAI API
  • Supports Ollama local deployment
  • Compatible with all OpenAI-formatted APIs (see the configuration sketch below)
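
Because every provider is addressed through the same OpenAI-style chat completions interface, switching models is mostly a matter of changing the base URL and model name. The sketch below uses illustrative field names (not Easy Dataset's actual configuration schema); Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1 by default:

// Illustrative provider settings; the field names are examples, not Easy Dataset's actual schema.
const providers = {
  openai: {
    baseUrl: 'https://api.openai.com/v1',
    model: 'gpt-4o-mini',
    apiKey: process.env.OPENAI_API_KEY
  },
  ollama: {
    baseUrl: 'http://localhost:11434/v1', // Ollama's OpenAI-compatible endpoint
    model: 'llama3.1',
    apiKey: 'ollama' // Ollama ignores the key, but OpenAI-style clients expect one
  }
};

// Every provider is then called the same way: POST `${baseUrl}/chat/completions`.
console.log(Object.keys(providers));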

Flexible Data Formats

  • Alpaca format: Suitable for instruction fine-tuning
  • ShareGPT format: Suitable for dialogue training (record shapes for both formats are sketched below)
  • JSON/JSONL output format selection
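
For reference, the two export formats shape each record differently. The snippets below show the commonly used shapes in simplified form (the exact fields Easy Dataset writes may differ, e.g. by including a system prompt):

// Commonly used record shapes for the two export formats (simplified).
const alpacaRecord = {
  instruction: 'What does Easy Dataset do?',
  input: '',
  output: 'It turns domain documents into LLM fine-tuning datasets.'
};

const shareGptRecord = {
  conversations: [
    { from: 'human', value: 'What does Easy Dataset do?' },
    { from: 'gpt', value: 'It turns domain documents into LLM fine-tuning datasets.' }
  ]
};

console.log(alpacaRecord, shareGptRecord);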
