A lightweight local AI inference server: a single ~5MB binary that exposes an OpenAI API-compatible interface and supports GGUF models and LoRA adapters.
Shimmy - Lightweight Local AI Inference Server
Project Overview
Shimmy is a 5.1MB single-binary local inference server that provides OpenAI API-compatible endpoints for GGUF models. It is designed as "invisible infrastructure" to make local AI development frictionless.
Core Features
🚀 Extremely Lightweight
- Binary Size: Only 5.1MB (compared to Ollama's 680MB)
- Startup Time: <100ms (compared to Ollama's 5-10 seconds)
- Memory Overhead: <50MB (compared to Ollama's 200MB+)
🔧 Zero-Configuration Operation
- Automatic Port Allocation: Avoids port conflicts
- Model Auto-Discovery: Supports multiple model sources
- Hugging Face Cache: ~/.cache/huggingface/hub/
- Ollama Models: ~/.ollama/models/
- Local Directory: ./models/
- Environment Variable: SHIMMY_BASE_GGUF=path/to/model.gguf (see the example below)
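For example, to pin Shimmy to one specific model file through the environment variable (the path below is just a placeholder):
# Point Shimmy at a specific GGUF file (placeholder path)
export SHIMMY_BASE_GGUF=~/models/Phi-3-mini-4k-instruct-q4.gguf
shimmy serve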
🎯 Perfect Compatibility
- 100% OpenAI API Compatible: a drop-in replacement for existing tools
- Works Out of the Box: no changes needed for tools like VSCode, Cursor, or Continue.dev
- Cross-Platform Support: Linux, macOS, Windows
Technical Architecture
Core Technology Stack
- Language: Rust + Tokio (memory safety, asynchronous performance)
- Inference Engine: llama.cpp backend (industry-standard GGUF inference)
- API Design: OpenAI compatible (plug-and-play replacement)
Supported Model Formats
- GGUF Models: Primary supported format
- SafeTensors: Native support, 2x faster loading speed
- LoRA Adapters: First-class support, from training to production API in just 30 seconds
Installation and Usage
Quick Installation
Method 1: Install via Cargo
cargo install shimmy
Method 2: Download Pre-built Binary (Windows)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe -o shimmy.exe
Method 3: macOS Installation
# Install dependencies
brew install cmake rust
# Install shimmy
cargo install shimmy
Basic Usage
1. Download Models
# Download compatible models
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
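With the files in ./models/, Shimmy's auto-discovery should pick them up; you can confirm this before starting the server:
# Refresh discovery and list what Shimmy found
shimmy discover
shimmy list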
2. Start the Server
# Automatic port allocation
shimmy serve
# Manually specify port
shimmy serve --bind 127.0.0.1:11435
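Once the server is running (here bound to 127.0.0.1:11435 as in the example above), a quick way to verify it is responding is the health endpoint:
# Expect a healthy response from the running server
curl http://localhost:11435/health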
3. Configure AI Tools
VSCode Configuration:
{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}
Continue.dev Configuration:
{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}
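The "model" value should match a name reported by shimmy list; you can also check which models the server exposes through its OpenAI-compatible endpoint:
# List the models the server currently serves
curl http://localhost:11435/v1/models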
Command Line Tools
Basic Commands
shimmy serve # Start server (automatic port allocation)
shimmy serve --bind 127.0.0.1:8080 # Manual port binding
shimmy list # Show available models
shimmy discover # Refresh model discovery
shimmy generate --name X --prompt "Hi" # Test generation
shimmy probe model-name # Validate model loading
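Putting these together, a quick smoke test might look like the following (the model name is a placeholder for whatever shimmy list reports):
# Refresh discovery, confirm the model loads, then run a test prompt
shimmy discover
shimmy probe your-model-name
shimmy generate --name your-model-name --prompt "Say hello in one sentence"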
API Endpoints
Core Endpoints
- GET /health: Health check
- POST /v1/chat/completions: OpenAI-compatible chat
- GET /v1/models: List available models
- POST /api/generate: Shimmy native API
- GET /ws/generate: WebSocket streaming
Usage Example
# Test API
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
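Because responses follow the standard OpenAI chat completions schema, the assistant's reply can be pulled out with jq (if installed):
# Extract just the assistant message from the JSON response
curl -s -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'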
Performance Comparison
| Metric | Shimmy | Ollama | llama.cpp |
|---|---|---|---|
| Binary Size | 5.1MB 🏆 | 680MB | 89MB |
| Startup Time | <100ms 🏆 | 5-10s | 1-2s |
| Memory Usage | <50MB 🏆 | 200MB+ | 100MB |
| OpenAI API | 100% 🏆 | Partial support | None |
Key Advantages
🔒 Privacy First
- Code remains on your local machine
- No data exfiltration risk
- Fully offline operation
💰 Cost-Effective
- No per-token billing
- Unlimited queries
- Install once, use forever
⚡ Excellent Performance
- Local inference, sub-second response times
- Low memory footprint
- Fast startup
🔄 Flexible Deployment
- Single binary file
- No external dependencies
- Cross-platform compatible
Extended Features
LoRA Adapter Support
Shimmy offers first-class LoRA adapter support, enabling rapid deployment from trained models to production APIs:
# Load LoRA adapter
shimmy serve --lora-adapter path/to/adapter
Hot Model Switching
Supports dynamic model switching at runtime without restarting the server.
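The mechanism is not spelled out here, but with an OpenAI-compatible server this usually means the model named in each request is loaded on demand; the sketch below assumes that behavior, with placeholder model names:
# Sketch (assumed behavior): two requests naming different models against the same running server
curl -s -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-a", "messages": [{"role": "user", "content": "Hi"}]}'
curl -s -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-b", "messages": [{"role": "user", "content": "Hi"}]}'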
GPU Acceleration
- macOS: Automatic Metal GPU acceleration
- Cross-platform: Supports various GPU backends
Community and Support
Community Resources
- Bug Reports: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: docs/
- Sponsorship: GitHub Sponsors
Summary
Shimmy is a local AI inference solution built on the idea that less is often more. Through its extremely lightweight design and zero-configuration philosophy, it gives developers a truly ready-to-use local AI infrastructure while maintaining enterprise-grade performance and compatibility. Whether you are an AI application developer, a researcher, or a privacy-conscious user, Shimmy is well worth considering.