Stage 3: Data and Feature Engineering
An open-source natural language data management and annotation platform focused on data-centric NLP model building.
Refinery - An Open-Source Natural Language Data Management Tool for Data Scientists
Project Overview
Refinery is an open-source data labeling and training data management platform developed by Kern AI, specifically designed for Natural Language Processing (NLP) tasks. The project aims to help data scientists "build better NLP models with a data-centric approach" and "treat training data as a software artifact."
Core Features
1. Data Labeling Capabilities
- Manual and Programmatic Labeling: Supports classification and span labeling tasks
- Semi-Automated Labeling: Automates part of the labeling work through heuristics
- Multi-Task Support: A single project can handle multiple labeling tasks
2. Data Management Capabilities
- Smart Data Browser: Allows filtering, sorting, and searching data by dimensions such as confidence, heuristic overlap, user, and annotations
- Data Quality Monitoring: Identifies low-quality subsets within training data
- Project Metrics Overview: Provides confidence distribution, label distribution, and confusion matrix
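As a rough, hypothetical illustration of what such metrics capture, the sketch below compares manual labels against weakly supervised labels with scikit-learn; the label lists are made-up placeholders, not actual Refinery output.
# Hypothetical sketch: comparing manual labels with weak-supervision labels,
# similar in spirit to the confusion matrix in the project metrics overview.
from sklearn.metrics import confusion_matrix

manual_labels = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL"]  # made-up ground truth
weak_labels = ["POSITIVE", "NEGATIVE", "NEUTRAL", "NEUTRAL"]     # made-up heuristic output

print(confusion_matrix(manual_labels, weak_labels, labels=["POSITIVE", "NEUTRAL", "NEGATIVE"]))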
3. Machine Learning Integration
- 🤗 Hugging Face Integration: Automatically creates document-level and token-level embeddings
- spaCy Integration: Leverages pre-trained language models
- Neural Search: Similarity-based record retrieval and outlier detection, powered by the Qdrant vector database
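Refinery handles embedding creation and neural search automatically; purely as an illustrative sketch of the underlying idea (using the sentence-transformers wrapper around a Hugging Face model and plain cosine similarity instead of Qdrant), similarity retrieval might look like this:
# Illustrative sketch only: document embeddings plus cosine similarity.
# Refinery wires this up internally and uses Qdrant for the vector search;
# the model name and records below are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "T. Rowe Price (TROW) Dips More Than Broader Markets",
    "Example headline about interest rates",
    "Example headline about smartphone sales",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any Hugging Face sentence encoder
embeddings = model.encode(records)

query = model.encode(["Example query about falling stock prices"])
scores = cosine_similarity(query, embeddings)[0]
print(records[scores.argmax()])  # record most similar to the query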
4. Heuristics and Weak Supervision
- Labeling Functions: Create and manage rule-based automatic labeling logic
- Weak Supervision: Integrates multiple noisy and imperfect heuristics
- Knowledge Base Management: Create and manage lookup lists to support the labeling process
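As a concrete example of the rule-based logic above, a minimal labeling function could look like the sketch below. It assumes the record exposes a plain-string headline attribute; the exact record interface inside Refinery may differ.
# Hedged sketch of a rule-based labeling function (a heuristic).
# The attribute name "headline" and the term list are illustrative assumptions.
NEGATIVE_TERMS = ["dips", "falls", "drops", "misses"]

def negative_headline(record):
    headline = record["headline"].lower()
    if any(term in headline for term in NEGATIVE_TERMS):
        return "NEGATIVE"
    # Returning nothing means the heuristic abstains on this record.
Weak supervision then merges the possibly conflicting votes of many such heuristics (and of lookup lists from the knowledge bases) into one noisy label per record, together with a confidence score.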
5. Collaboration Features
- Team Workspaces: Multi-user environment (commercial version)
- Role-Based Access Control: Manages user permissions
- Crowdsourcing Integration: Supports external labeling workflows
Technical Architecture
Core Services
- embedder: Embedding generation service
- weak-supervisor: Weak supervision service
- tokenizer: Tokenization service
- neural-search: Neural search service
- ui: User interface
- gateway: API gateway
Third-Party Integrations
- PostgreSQL: Data storage
- MinIO: Object storage
- MailHog: Local email testing (captures outgoing mail during development)
- Ory Kratos: Identity management
- Ory Oathkeeper: Access control
Machine Learning Libraries
- scikit-learn: Traditional machine learning
- spaCy: Natural language processing
- Hugging Face Transformers: Pre-trained models
- Qdrant: Vector database
Installation and Usage
Quick Installation
# Install with pip
pip install kern-refinery
# Start the service
cd your-project-directory
refinery start
# Access the application
# Open http://localhost:4455 in your browser
Manual Installation
# Clone the repository
git clone https://github.com/code-kern-ai/refinery.git
cd refinery
# Start the service (Mac/Linux)
./start
# Start the service (Windows)
start.bat
# Stop the service
./stop # Or stop.bat (Windows)
Data Format Support
Input Formats
- JSON files
- CSV files
- Spreadsheets
- Text files
- Generic JSON format
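Purely as an illustration of the generic JSON input, a minimal upload file could be written like this; the field names are placeholders rather than a required schema:
# Sketch: writing a minimal JSON records file for upload (field names are examples).
import json

records = [
    {"headline": "T. Rowe Price (TROW) Dips More Than Broader Markets", "date": "Jun-30-22"},
    {"headline": "Example headline for a second record", "date": "Jul-01-22"},
]

with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)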
Output Formats
Example of an exported record (JSON):
[
  {
    "running_id": "0",
    "headline": "T. Rowe Price (TROW) Dips More Than Broader Markets",
    "date": "Jun-30-22 06:00PM ",
    "headline__sentiment__MANUAL": null,
    "headline__sentiment__WEAK_SUPERVISION": "NEGATIVE",
    "headline__sentiment__WEAK_SUPERVISION__confidence": 0.62,
    "headline__entities__MANUAL": null,
    "headline__entities__WEAK_SUPERVISION": [
      "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "O", "O", "O", "O", "O"
    ],
    "headline__entities__WEAK_SUPERVISION__confidence": [
      0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.00, 0.00, 0.00, 0.00, 0.00
    ]
  }
]
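A common follow-up step is to keep only the confidently weak-labeled records before training a model. The sketch below does this with plain Python over an export file like the one above; the file name and the 0.8 threshold are arbitrary choices.
# Sketch: filter a Refinery-style export by weak-supervision confidence.
import json

with open("export.json", encoding="utf-8") as f:
    records = json.load(f)

confident = [
    r for r in records
    if r.get("headline__sentiment__WEAK_SUPERVISION") is not None
    and r.get("headline__sentiment__WEAK_SUPERVISION__confidence", 0.0) >= 0.8
]
print(f"{len(confident)} of {len(records)} records kept for training")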
Python SDK
The project provides a comprehensive Python SDK, supporting:
- Data upload and download
- Project management
- Exporting labeled data
- Adapters for frameworks like Rasa
# Pull data
rsdk pull
# Push data
rsdk push <file_name>
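Once data has been pulled, it can be fed into any training framework. As a hedged sketch (the file name and column names are assumed to match the export example above), a quick scikit-learn baseline might look like this:
# Sketch: a TF-IDF + logistic regression baseline on exported labels.
# File name and column names are assumptions based on the export example above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_json("export.json")
df = df.dropna(subset=["headline__sentiment__WEAK_SUPERVISION"])

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(df["headline"], df["headline__sentiment__WEAK_SUPERVISION"])
print(model.predict(["Example headline about strong earnings"]))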
Open-Source Bricks Library
Refinery integrates the open-source bricks library, offering:
- Ready-to-use automated labeling functions
- Text metadata extraction (language detection, sentence complexity, etc.)
- Pre-built labeling function templates
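Bricks modules are essentially small, copyable functions. As a rough, hypothetical illustration of the language-detection idea (using the langdetect package, which is not necessarily what bricks uses internally):
# Hypothetical sketch of a language-detection helper, in the spirit of bricks.
from langdetect import detect

def detect_language(record):
    try:
        return detect(record["headline"])  # attribute name is an assumption
    except Exception:
        return "unknown"

print(detect_language({"headline": "T. Rowe Price (TROW) Dips More Than Broader Markets"}))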
Use Cases
Ideal User Groups
- Individual NLP Developers and Researchers: Practitioners who lack sufficient labeled data
- Team Collaboration Projects: Teams needing to manage and evaluate training data quality
- Resource-Constrained Projects: Projects requiring optimization of labeling resources (manpower, budget, time)
Primary Use Cases
- Sentiment Analysis
- Named Entity Recognition
- Text Classification
- Information Extraction
- Multilingual Text Processing
Business Model
- Open-Source Version: Single-user, completely free
- Commercial Version: Multi-user environment, offering team collaboration features
- Enterprise Solution: On-premise deployment and customized services
Community and Support
- Discord Community: Technical discussions and support
- GitHub Issues: Bug reports and feature requests
- Documentation Center: Detailed user guides and tutorials
- YouTube Channel: Video tutorials and demonstrations
Project Advantages
- Data-Centric Approach: Focuses on improving training data quality rather than merely increasing data volume
- Semi-Automated Labeling: Significantly reduces manual labeling effort
- Scalable Architecture: Microservices architecture supports flexible deployment
- Open-Source Transparency: Fully open-source, community-driven development
- Enterprise-Grade Features: Supports large-scale deployment and team collaboration
Summary
Refinery represents best practices in modern NLP data management, providing data scientists with a powerful and flexible tool to build high-quality training datasets.