Stage 3: Data and Feature Engineering
An open-source natural language data management and annotation platform focused on data-centric NLP model building.
Refinery - An Open-Source Natural Language Data Management Tool for Data Scientists
Project Overview
Refinery is an open-source data labeling and training data management platform developed by Kern AI, specifically designed for Natural Language Processing (NLP) tasks. The project aims to help data scientists "build better NLP models with a data-centric approach" and "treat training data as a software artifact."
Core Features
1. Data Labeling Capabilities
- Manual and Programmatic Labeling: Supports classification and span labeling tasks
- Semi-Automated Labeling: Automates part of the labeling work through heuristics
- Multi-Task Support: A single project can handle multiple labeling tasks
2. Data Management Capabilities
- Smart Data Browser: Allows filtering, sorting, and searching data by dimensions such as confidence, heuristic overlap, user, and annotations
- Data Quality Monitoring: Identifies low-quality subsets within training data
- Project Metrics Overview: Provides confidence distribution, label distribution, and confusion matrix
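As a rough, hypothetical illustration of what such metrics capture, the sketch below compares manual labels against weakly supervised labels with scikit-learn; the label lists are made-up placeholders, not actual Refinery output.
# Hypothetical sketch: comparing manual labels with weak-supervision labels,
# similar in spirit to the confusion matrix in the project metrics overview.
from sklearn.metrics import confusion_matrix

manual_labels = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL"]  # made-up ground truth
weak_labels = ["POSITIVE", "NEGATIVE", "NEUTRAL", "NEUTRAL"]     # made-up heuristic output

print(confusion_matrix(manual_labels, weak_labels, labels=["POSITIVE", "NEUTRAL", "NEGATIVE"]))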
3. Machine Learning Integration
- 🤗 Hugging Face Integration: Automatically creates document-level and token-level embeddings
- spaCy Integration: Leverages pre-trained language models
- Neural Search: Similarity-based record retrieval and outlier detection, powered by the Qdrant vector database
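Refinery handles embedding creation and neural search automatically; purely as an illustrative sketch of the underlying idea (using the sentence-transformers wrapper around a Hugging Face model and plain cosine similarity instead of Qdrant), similarity retrieval might look like this:
# Illustrative sketch only: document embeddings plus cosine similarity.
# Refinery wires this up internally and uses Qdrant for the vector search;
# the model name and records below are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "T. Rowe Price (TROW) Dips More Than Broader Markets",
    "Example headline about interest rates",
    "Example headline about smartphone sales",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any Hugging Face sentence encoder
embeddings = model.encode(records)

query = model.encode(["Example query about falling stock prices"])
scores = cosine_similarity(query, embeddings)[0]
print(records[scores.argmax()])  # record most similar to the query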
4. Heuristics and Weak Supervision
- Labeling Functions: Create and manage rule-based automatic labeling logic
- Weak Supervision: Integrates multiple noisy and imperfect heuristics
- Knowledge Base Management: Create and manage lookup lists to support the labeling process
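As a concrete example of the rule-based logic above, a minimal labeling function could look like the sketch below. It assumes the record exposes a plain-string headline attribute; the exact record interface inside Refinery may differ.
# Hedged sketch of a rule-based labeling function (a heuristic).
# The attribute name "headline" and the term list are illustrative assumptions.
NEGATIVE_TERMS = ["dips", "falls", "drops", "misses"]

def negative_headline(record):
    headline = record["headline"].lower()
    if any(term in headline for term in NEGATIVE_TERMS):
        return "NEGATIVE"
    # Returning nothing means the heuristic abstains on this record.
Weak supervision then merges the possibly conflicting votes of many such heuristics (and of lookup lists from the knowledge bases) into one noisy label per record, together with a confidence score.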
5. Collaboration Features
- Team Workspaces: Multi-user environment (commercial version)
- Role-Based Access Control: Manages user permissions
- Crowdsourcing Integration: Supports external labeling workflows
Technical Architecture
Core Services
- embedder: Embedding generation service
- weak-supervisor: Weak supervision service
- tokenizer: Tokenization service
- neural-search: Neural search service
- ui: User interface
- gateway: API gateway
Third-Party Integrations
- PostgreSQL: Data storage
- MinIO: Object storage
- MailHog: Local email testing (captures outgoing mail during development)
- Ory Kratos: Identity management
- Ory Oathkeeper: Access control
Machine Learning Libraries
- scikit-learn: Traditional machine learning
- spaCy: Natural language processing
- Hugging Face Transformers: Pre-trained models
- Qdrant: Vector database
Installation and Usage
Quick Installation
# Install with pip
pip install kern-refinery
# Start the service
cd your-project-directory
refinery start
# Access the application
# Open http://localhost:4455 in your browser
Manual Installation
# Clone the repository
git clone https://github.com/code-kern-ai/refinery.git
cd refinery
# Start the service (Mac/Linux)
./start
# Start the service (Windows)
start.bat
# Stop the service
./stop # Or stop.bat (Windows)
Data Format Support
Input Formats
- JSON files
- CSV files
- Spreadsheets
- Text files
- Generic JSON format
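Purely as an illustration of the generic JSON input, a minimal upload file could be written like this; the field names are placeholders rather than a required schema:
# Sketch: writing a minimal JSON records file for upload (field names are examples).
import json

records = [
    {"headline": "T. Rowe Price (TROW) Dips More Than Broader Markets", "date": "Jun-30-22"},
    {"headline": "Example headline for a second record", "date": "Jul-01-22"},
]

with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)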
Output Formats
Example of an exported record (JSON):
[
  {
    "running_id": "0",
    "headline": "T. Rowe Price (TROW) Dips More Than Broader Markets",
    "date": "Jun-30-22 06:00PM ",
    "headline__sentiment__MANUAL": null,
    "headline__sentiment__WEAK_SUPERVISION": "NEGATIVE",
    "headline__sentiment__WEAK_SUPERVISION__confidence": 0.62,
    "headline__entities__MANUAL": null,
    "headline__entities__WEAK_SUPERVISION": [
      "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "O", "O", "O", "O", "O"
    ],
    "headline__entities__WEAK_SUPERVISION__confidence": [
      0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.00, 0.00, 0.00, 0.00, 0.00
    ]
  }
]
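A common follow-up step is to keep only the confidently weak-labeled records before training a model. The sketch below does this with plain Python over an export file like the one above; the file name and the 0.8 threshold are arbitrary choices.
# Sketch: filter a Refinery-style export by weak-supervision confidence.
import json

with open("export.json", encoding="utf-8") as f:
    records = json.load(f)

confident = [
    r for r in records
    if r.get("headline__sentiment__WEAK_SUPERVISION") is not None
    and r.get("headline__sentiment__WEAK_SUPERVISION__confidence", 0.0) >= 0.8
]
print(f"{len(confident)} of {len(records)} records kept for training")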
Python SDK
The project provides a comprehensive Python SDK, supporting:
- Data upload and download
- Project management
- Exporting labeled data
- Adapters for frameworks like Rasa
# Pull data
rsdk pull
# Push data
rsdk push <file_name>
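Once data has been pulled, it can be fed into any training framework. As a hedged sketch (the file name and column names are assumed to match the export example above), a quick scikit-learn baseline might look like this:
# Sketch: a TF-IDF + logistic regression baseline on exported labels.
# File name and column names are assumptions based on the export example above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_json("export.json")
df = df.dropna(subset=["headline__sentiment__WEAK_SUPERVISION"])

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(df["headline"], df["headline__sentiment__WEAK_SUPERVISION"])
print(model.predict(["Example headline about strong earnings"]))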
Open-Source Bricks Library
Refinery integrates the open-source bricks library, offering:
- Ready-to-use automated labeling functions
- Text metadata extraction (language detection, sentence complexity, etc.)
- Pre-built labeling function templates
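Bricks modules are essentially small, copyable functions. As a rough, hypothetical illustration of the language-detection idea (using the langdetect package, which is not necessarily what bricks uses internally):
# Hypothetical sketch of a language-detection helper, in the spirit of bricks.
from langdetect import detect

def detect_language(record):
    try:
        return detect(record["headline"])  # attribute name is an assumption
    except Exception:
        return "unknown"

print(detect_language({"headline": "T. Rowe Price (TROW) Dips More Than Broader Markets"}))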
Use Cases
Ideal User Groups
- Individual NLP Developers and Researchers: Practitioners who lack sufficient labeled data
- Team Collaboration Projects: Teams needing to manage and evaluate training data quality
- Resource-Constrained Projects: Projects requiring optimization of labeling resources (manpower, budget, time)
Primary Use Cases
- Sentiment Analysis
- Named Entity Recognition
- Text Classification
- Information Extraction
- Multilingual Text Processing
Business Model
- Open-Source Version: Single-user, completely free
- Commercial Version: Multi-user environment, offering team collaboration features
- Enterprise Solution: On-premise deployment and customized services
Community and Support
- Discord Community: Technical discussions and support
- GitHub Issues: Bug reports and feature requests
- Documentation Center: Detailed user guides and tutorials
- YouTube Channel: Video tutorials and demonstrations
Project Advantages
- Data-Centric Approach: Focuses on improving training data quality rather than merely increasing data volume
- Semi-Automated Labeling: Significantly reduces manual labeling effort
- Scalable Architecture: Microservices architecture supports flexible deployment
- Open-Source Transparency: Fully open-source, community-driven development
- Enterprise-Grade Features: Supports large-scale deployment and team collaboration
Summary
Refinery represents best practices in modern NLP data management, providing data scientists with a powerful and flexible tool to build high-quality training datasets.