A lightweight local AI inference server: a single ~5MB binary that exposes an OpenAI API-compatible interface and supports GGUF models and LoRA adapters.
Shimmy - Lightweight Local AI Inference Server
Project Overview
Shimmy is a 5.1MB single-binary local inference server that provides OpenAI API-compatible endpoints for GGUF models. It is designed as "invisible infrastructure" to make local AI development frictionless.
Core Features
🚀 Extremely Lightweight
- Binary Size: Only 5.1MB (compared to Ollama's 680MB)
- Startup Time: <100ms (compared to Ollama's 5-10 seconds)
- Memory Overhead: <50MB (compared to Ollama's 200MB+)
🔧 Zero-Configuration Operation
- Automatic Port Allocation: Avoids port conflicts
- Model Auto-Discovery: Supports multiple model sources
- Hugging Face Cache: ~/.cache/huggingface/hub/
- Ollama Models: ~/.ollama/models/
- Local Directory: ./models/
- Environment Variable: SHIMMY_BASE_GGUF=path/to/model.gguf (see the example below)
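For example, to pin Shimmy to one specific model file through the environment variable (the path below is just a placeholder):
# Point Shimmy at a specific GGUF file (placeholder path)
export SHIMMY_BASE_GGUF=~/models/Phi-3-mini-4k-instruct-q4.gguf
shimmy serve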
🎯 Perfect Compatibility
- 100% OpenAI API Compatible: a drop-in replacement for existing tools
- Works Out of the Box: no changes needed for tools like VSCode, Cursor, or Continue.dev
- Cross-Platform Support: Linux, macOS, Windows
Technical Architecture
Core Technology Stack
- Language: Rust + Tokio (memory safety, asynchronous performance)
- Inference Engine: llama.cpp backend (industry-standard GGUF inference)
- API Design: OpenAI compatible (plug-and-play replacement)
Supported Model Formats
- GGUF Models: Primary supported format
- SafeTensors: Native support, 2x faster loading speed
- LoRA Adapters: First-class support, from training to production API in just 30 seconds
Installation and Usage
Quick Installation
Method 1: Install via Cargo
cargo install shimmy
Method 2: Download Pre-built Binary (Windows)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe -o shimmy.exe
Method 3: macOS Installation
# Install dependencies
brew install cmake rust
# Install shimmy
cargo install shimmy
Basic Usage
1. Download Models
# Download compatible models
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
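With the files in ./models/, Shimmy's auto-discovery should pick them up; you can confirm this before starting the server:
# Refresh discovery and list what Shimmy found
shimmy discover
shimmy list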
2. Start the Server
# Automatic port allocation
shimmy serve
# Manually specify port
shimmy serve --bind 127.0.0.1:11435
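Once the server is running (here bound to 127.0.0.1:11435 as in the example above), a quick way to verify it is responding is the health endpoint:
# Expect a healthy response from the running server
curl http://localhost:11435/health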
3. Configure AI Tools
VSCode Configuration:
{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}
Continue.dev Configuration:
{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}
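The "model" value should match a name reported by shimmy list; you can also check which models the server exposes through its OpenAI-compatible endpoint:
# List the models the server currently serves
curl http://localhost:11435/v1/models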
Command Line Tools
Basic Commands
shimmy serve # Start server (automatic port allocation)
shimmy serve --bind 127.0.0.1:8080 # Manual port binding
shimmy list # Show available models
shimmy discover # Refresh model discovery
shimmy generate --name X --prompt "Hi" # Test generation
shimmy probe model-name # Validate model loading
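Putting these together, a quick smoke test might look like the following (the model name is a placeholder for whatever shimmy list reports):
# Refresh discovery, confirm the model loads, then run a test prompt
shimmy discover
shimmy probe your-model-name
shimmy generate --name your-model-name --prompt "Say hello in one sentence"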
API Endpoints
Core Endpoints
- GET /health: Health check
- POST /v1/chat/completions: OpenAI-compatible chat
- GET /v1/models: List available models
- POST /api/generate: Shimmy native API
- GET /ws/generate: WebSocket streaming
Usage Example
# Test API
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
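Because responses follow the standard OpenAI chat completions schema, the assistant's reply can be pulled out with jq (if installed):
# Extract just the assistant message from the JSON response
curl -s -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'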
Performance Comparison
| Metric | Shimmy | Ollama | llama.cpp |
|---|---|---|---|
| Binary Size | 5.1MB 🏆 | 680MB | 89MB |
| Startup Time | <100ms 🏆 | 5-10s | 1-2s |
| Memory Usage | <50MB 🏆 | 200MB+ | 100MB |
| OpenAI API | 100% 🏆 | Partial support | None |
Key Advantages
🔒 Privacy First
- Code remains on your local machine
- No data exfiltration risk
- Fully offline operation
💰 Cost-Effective
- No per-token billing
- Unlimited queries
- Install once, use forever
⚡ Excellent Performance
- Local inference, sub-second response times
- Low memory footprint
- Fast startup
🔄 Flexible Deployment
- Single binary file
- No external dependencies
- Cross-platform compatible
Extended Features
LoRA Adapter Support
Shimmy offers first-class LoRA adapter support, enabling rapid deployment from trained models to production APIs:
# Load LoRA adapter
shimmy serve --lora-adapter path/to/adapter
Hot Model Switching
Supports dynamic model switching at runtime without restarting the server.
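The mechanism is not spelled out here, but with an OpenAI-compatible server this usually means the model named in each request is loaded on demand; the sketch below assumes that behavior, with placeholder model names:
# Sketch (assumed behavior): two requests naming different models against the same running server
curl -s -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-a", "messages": [{"role": "user", "content": "Hi"}]}'
curl -s -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-b", "messages": [{"role": "user", "content": "Hi"}]}'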
GPU Acceleration
- macOS: Automatic Metal GPU acceleration
- Cross-platform: Supports various GPU backends
Community and Support
Community Resources
- Bug Reports: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: docs/
- Sponsorship: GitHub Sponsors
Summary
Shimmy is a local AI inference solution built on the idea that less is often more. Through its extremely lightweight design and zero-configuration philosophy, it gives developers a truly ready-to-use local AI infrastructure while maintaining enterprise-grade performance and compatibility. Whether you are an AI application developer, a researcher, or a privacy-conscious user, Shimmy is well worth considering.