Stage 3: Data and Feature Engineering

An open-source natural language data management and annotation platform focused on data-centric NLP model building.


Refinery - An Open-Source Natural Language Data Management Tool for Data Scientists

Project Overview

Refinery is an open-source data labeling and training data management platform developed by Kern AI, specifically designed for Natural Language Processing (NLP) tasks. The project aims to help data scientists "build better NLP models with a data-centric approach" and "treat training data as a software artifact."

Core Features

1. Data Labeling Capabilities

  • Manual and Programmatic Labeling: Supports classification and span labeling tasks
  • Semi-Automated Labeling: Automates part of the labeling work through heuristics
  • Multi-Task Support: A single project can handle multiple labeling tasks

2. Data Management Capabilities

  • Smart Data Browser: Allows filtering, sorting, and searching data by dimensions such as confidence, heuristic overlap, user, and annotations
  • Data Quality Monitoring: Identifies low-quality subsets within training data
  • Project Metrics Overview: Provides confidence distribution, label distribution, and confusion matrix
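A minimal sketch of what these project metrics boil down to, using hypothetical manual and weak-supervision labels (the label names and data are illustrative, not Refinery's internals): the label distribution is a simple count per class, and the confusion matrix counts agreements and disagreements between two labeling sources.

```python
from collections import Counter, defaultdict

# Hypothetical manual vs. weak-supervision labels for the same five records
manual = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL", "POSITIVE"]
weak   = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEUTRAL", "POSITIVE"]

# Label distribution: how often each class occurs in the manual reference
distribution = Counter(manual)

# Confusion matrix: rows are manual labels, columns are weak-supervision labels
confusion = defaultdict(Counter)
for m, w in zip(manual, weak):
    confusion[m][w] += 1

print(distribution)                       # Counter({'POSITIVE': 2, 'NEGATIVE': 2, 'NEUTRAL': 1})
print(confusion["NEGATIVE"]["POSITIVE"])  # 1 record where the heuristics disagree with the annotator
```

Cells off the diagonal of the confusion matrix point directly at the "low-quality subsets" the data browser helps you inspect.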

3. Machine Learning Integration

  • 🤗 Hugging Face Integration: Automatically creates document-level and token-level embeddings
  • spaCy Integration: Leverages pre-trained language models
  • Neural Search: Similar-record retrieval and anomaly detection backed by Qdrant
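Conceptually, similar-record retrieval ranks documents by the cosine similarity of their embedding vectors. The toy sketch below uses hand-written three-dimensional vectors in place of real Hugging Face embeddings and a brute-force scan in place of Qdrant's index; it illustrates the idea, not the actual implementation.

```python
import math

# Toy document embeddings (in practice these come from a language model)
embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(query_id):
    """Rank all other documents by cosine similarity to the query document."""
    q = embeddings[query_id]
    others = [(doc, cosine(q, vec)) for doc, vec in embeddings.items() if doc != query_id]
    return sorted(others, key=lambda t: t[1], reverse=True)

print(most_similar("doc_a")[0][0])  # doc_b is closest to doc_a
```

A vector database replaces the brute-force loop with an approximate nearest-neighbor index, which is what makes this usable at scale.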

4. Heuristics and Weak Supervision

  • Labeling Functions: Create and manage rule-based automatic labeling logic
  • Weak Supervision: Integrates multiple noisy and imperfect heuristics
  • Knowledge Base Management: Create and manage lookup lists to support the labeling process
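To make the heuristics idea concrete, here is a sketch of two keyword-based labeling functions and a naive majority-vote combiner. The cue lists, function names, and combiner are all illustrative assumptions; Refinery's actual weak supervision model is more sophisticated than a vote count.

```python
# Two hypothetical labeling functions for sentiment on financial headlines.
# Each returns a label or None (abstain).
NEGATIVE_CUES = {"dips", "falls", "plunges", "misses"}
POSITIVE_CUES = {"beats", "surges", "gains", "rallies"}

def lf_negative_cues(headline):
    return "NEGATIVE" if NEGATIVE_CUES & set(headline.lower().split()) else None

def lf_positive_cues(headline):
    return "POSITIVE" if POSITIVE_CUES & set(headline.lower().split()) else None

def weak_label(headline, lfs):
    """Naive combiner: majority vote over non-abstaining heuristics,
    with the vote share serving as a rough confidence score."""
    votes = [lf(headline) for lf in lfs if lf(headline) is not None]
    if not votes:
        return None, 0.0
    winner = max(set(votes), key=votes.count)
    return winner, votes.count(winner) / len(votes)

label, conf = weak_label("T. Rowe Price (TROW) Dips More Than Broader Markets",
                         [lf_negative_cues, lf_positive_cues])
print(label, conf)  # NEGATIVE 1.0
```

The point of weak supervision is exactly this: each individual heuristic is noisy and incomplete, but combining their votes yields a usable label with an attached confidence.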

5. Collaboration Features

  • Team Workspaces: Multi-user environment (commercial version)
  • Role-Based Access Control: Manages user permissions
  • Crowdsourcing Integration: Supports external labeling workflows

Technical Architecture

Core Services

- embedder: Embedding generation service
- weak-supervisor: Weak supervision service
- tokenizer: Tokenization service
- neural-search: Neural search service
- ui: User interface
- gateway: API gateway

Third-Party Integrations

- PostgreSQL: Data storage
- Minio: Object storage
- MailHog: Email testing service
- Ory Kratos: Identity management
- Ory Oathkeeper: Identity and access proxy

Machine Learning Libraries

- scikit-learn: Traditional machine learning
- spaCy: Natural language processing
- Hugging Face Transformers: Pre-trained models
- Qdrant: Vector database

Installation and Usage

Quick Installation

# Install with pip
pip install kern-refinery

# Start the service
cd your-project-directory
refinery start

# Access the application
# Open http://localhost:4455 in your browser

Manual Installation

# Clone the repository
git clone https://github.com/code-kern-ai/refinery.git
cd refinery

# Start the service (Mac/Linux)
./start

# Start the service (Windows)
start.bat

# Stop the service
./stop  # Or stop.bat (Windows)

Data Format Support

Input Formats

  • JSON files
  • CSV files
  • Spreadsheets
  • Text files
  • Generic JSON format
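One common shape for a records file is a flat list of JSON objects, one per record, with one key per attribute. The snippet below builds such a file; the field names and contents are illustrative, not a prescribed schema.

```python
import json

# A minimal records file: a list of flat objects, one per record
records = [
    {"headline": "T. Rowe Price (TROW) Dips More Than Broader Markets", "date": "Jun-30-22"},
    {"headline": "Tech Stocks Rally as Yields Fall", "date": "Jul-01-22"},
]

with open("records.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

The same tabular data could equally arrive as CSV or a spreadsheet; the key constraint is one record per row with consistently named attributes.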

Output Formats

Exported records flatten each attribute, labeling task, and labeling source into double-underscore keys (e.g. headline__sentiment__WEAK_SUPERVISION), with per-record or per-token confidence scores alongside:
[
  {
    "running_id": "0",
    "headline": "T. Rowe Price (TROW) Dips More Than Broader Markets",
    "date": "Jun-30-22 06:00PM  ",
    "headline__sentiment__MANUAL": null,
    "headline__sentiment__WEAK_SUPERVISION": "NEGATIVE",
    "headline__sentiment__WEAK_SUPERVISION__confidence": 0.62,
    "headline__entities__MANUAL": null,
    "headline__entities__WEAK_SUPERVISION": [
      "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "O", "O", "O", "O", "O"
    ],
    "headline__entities__WEAK_SUPERVISION__confidence": [
      0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.00, 0.00, 0.00, 0.00, 0.00
    ]
  }
]
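Downstream, one typical use of such an export is to keep only weakly supervised labels above a confidence threshold before training. A small sketch, using a trimmed-down version of the export above (the 0.5 threshold is an arbitrary illustrative choice):

```python
import json

# A trimmed-down export record, matching the flattened key scheme above
export = '''[{"running_id": "0",
              "headline__sentiment__WEAK_SUPERVISION": "NEGATIVE",
              "headline__sentiment__WEAK_SUPERVISION__confidence": 0.62}]'''

THRESHOLD = 0.5  # arbitrary cut-off for this sketch
records = json.loads(export)

# Keep (id, label) pairs whose confidence clears the threshold
accepted = [
    (r["running_id"], r["headline__sentiment__WEAK_SUPERVISION"])
    for r in records
    if r["headline__sentiment__WEAK_SUPERVISION__confidence"] >= THRESHOLD
]
print(accepted)  # [('0', 'NEGATIVE')]
```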

Python SDK

The project provides a comprehensive Python SDK, supporting:

  • Data upload and download
  • Project management
  • Exporting labeled data
  • Adapters for frameworks like Rasa

# Pull data
rsdk pull

# Push data
rsdk push <file_name>

Open-Source Bricks Library

Refinery integrates the open-source bricks library, offering:

  • Ready-to-use automated labeling functions
  • Text metadata extraction (language detection, sentence complexity, etc.)
  • Pre-built labeling function templates
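As a flavor of what a metadata extractor looks like, here is a minimal sentence-complexity function: average words per sentence. This is an illustrative stand-in written for this article, not actual bricks code.

```python
import re

def sentence_complexity(text):
    """Rough complexity score: average number of words per sentence.
    Sentences are split naively on ., !, and ? for this sketch."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

print(sentence_complexity("Short sentence. This one is a little bit longer."))  # 4.5
```

Scores like this become extra attributes on each record, which the data browser can then filter and sort on just like any labeled field.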

Use Cases

Ideal User Groups

  1. Individual NLP Project Developers: Researchers lacking sufficient labeled data
  2. Team Collaboration Projects: Teams needing to manage and evaluate training data quality
  3. Resource-Constrained Projects: Projects requiring optimization of labeling resources (manpower, budget, time)

Primary Use Cases

  • Sentiment Analysis
  • Named Entity Recognition
  • Text Classification
  • Information Extraction
  • Multilingual Text Processing

Business Model

  • Open-Source Version: Single-user version, completely free
  • Commercial Version: Multi-user environment, offering team collaboration features
  • Enterprise Solution: On-premise deployment and customized services

Community and Support

  • Discord Community: Technical discussions and support
  • GitHub Issues: Bug reports and feature requests
  • Documentation Center: Detailed user guides and tutorials
  • YouTube Channel: Video tutorials and demonstrations

Project Advantages

  1. Data-Centric Approach: Focuses on improving training data quality rather than merely increasing data volume
  2. Semi-Automated Labeling: Significantly reduces manual labeling effort
  3. Scalable Architecture: Microservices architecture supports flexible deployment
  4. Open-Source Transparency: Fully open-source, community-driven development
  5. Enterprise-Grade Features: Supports large-scale deployment and team collaboration

Summary

Refinery represents best practices in modern NLP data management, providing data scientists with a powerful and flexible tool to build high-quality training datasets.