Lightweight local AI inference server: a single binary of roughly 5MB that provides an OpenAI API-compatible interface and supports GGUF models and LoRA adapters.

License: MIT · Language: Rust · Repository: Michael-A-Kuykendall/shimmy · 2.8k stars · Last Updated: October 04, 2025

Shimmy - Lightweight Local AI Inference Server

Project Overview

Shimmy is a 5.1MB single-binary local inference server that provides OpenAI API-compatible endpoints for GGUF models. It is designed as "invisible infrastructure" to make local AI development frictionless.

Core Features

🚀 Extremely Lightweight

  • Binary Size: Only 5.1MB (compared to Ollama's 680MB)
  • Startup Time: <100ms (compared to Ollama's 5-10 seconds)
  • Memory Overhead: <50MB (compared to Ollama's 200MB+)

🔧 Zero-Configuration Operation

  • Automatic Port Allocation: Avoids port conflicts
  • Model Auto-Discovery: Supports multiple model sources
    • Hugging Face Cache: ~/.cache/huggingface/hub/
    • Ollama Models: ~/.ollama/models/
    • Local Directory: ./models/
    • Environment Variable: SHIMMY_BASE_GGUF=path/to/model.gguf
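
For example, when auto-discovery is not wanted, the environment variable above can point Shimmy at a single GGUF file before starting the server. A minimal sketch (the model path is a placeholder):

# Point Shimmy at one specific GGUF file, then start the server
export SHIMMY_BASE_GGUF=/path/to/model.gguf
shimmy serve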

🎯 Perfect Compatibility

  • 100% OpenAI API Compatible: drops in as a replacement backend for existing tools
  • Out-of-the-Box: No modifications needed for tools like VSCode, Cursor, Continue.dev
  • Cross-Platform Support: Linux, macOS, Windows

Technical Architecture

Core Technology Stack

  • Language: Rust + Tokio (memory safety, asynchronous performance)
  • Inference Engine: llama.cpp backend (industry-standard GGUF inference)
  • API Design: OpenAI compatible (plug-and-play replacement)

Supported Model Formats

  • GGUF Models: Primary supported format
  • SafeTensors: Native support, 2x faster loading speed
  • LoRA Adapters: First-class support, from training to production API in just 30 seconds

Installation and Usage

Quick Installation

Method 1: Install via Cargo

cargo install shimmy

Method 2: Download Pre-built Binary (Windows)

curl -L -o shimmy.exe https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe

Method 3: macOS Installation

# Install dependencies
brew install cmake rust
# Install shimmy
cargo install shimmy

Basic Usage

1. Download Models

# Download compatible models
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/

2. Start the Server

# Automatic port allocation
shimmy serve

# Manually specify port
shimmy serve --bind 127.0.0.1:11435
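
Once the server is up, a quick sanity check against the health endpoint confirms it is reachable (the port is assumed from the manual binding above):

# Verify the server is responding
curl http://localhost:11435/health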

3. Configure AI Tools

VSCode Configuration:

{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}

Continue.dev Configuration:

{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}

Command Line Tools

Basic Commands

shimmy serve                        # Start server (automatic port allocation)
shimmy serve --bind 127.0.0.1:8080 # Manual port binding
shimmy list                         # Show available models
shimmy discover                     # Refresh model discovery
shimmy generate --name X --prompt "Hi" # Test generation
shimmy probe model-name             # Validate model loading

API Endpoints

Core Endpoints

  • GET /health - Health check
  • POST /v1/chat/completions - OpenAI-compatible chat
  • GET /v1/models - List available models
  • POST /api/generate - Shimmy native API
  • GET /ws/generate - WebSocket streaming
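
Since the API is OpenAI-compatible, the model list endpoint can be queried directly; a small sketch (port assumed from the earlier serve example):

# List the models the server exposes
curl http://localhost:11435/v1/models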

Usage Example

# Test API
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
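
If the server honors the standard OpenAI "stream" flag (an assumption here; the documented alternative is the WebSocket endpoint above), a streamed variant looks like this:

# Streamed chat completion (assumes the standard OpenAI "stream": true convention)
curl -N -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'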

Performance Comparison

Metric       | Shimmy    | Ollama          | llama.cpp
Binary Size  | 5.1MB 🏆  | 680MB           | 89MB
Startup Time | <100ms 🏆 | 5-10s           | 1-2s
Memory Usage | 50MB 🏆   | 200MB+          | 100MB
OpenAI API   | 100% 🏆   | Partial Support | None

Key Advantages

🔒 Privacy First

  • Code remains on your local machine
  • No data exfiltration risk
  • Fully offline operation

💰 Cost-Effective

  • No per-token billing
  • Unlimited queries
  • Install once, use forever

⚡ Excellent Performance

  • Local inference, sub-second response times
  • Low memory footprint
  • Fast startup

🔄 Flexible Deployment

  • Single binary file
  • No external dependencies
  • Cross-platform compatible

Extended Features

LoRA Adapter Support

Shimmy offers first-class LoRA adapter support, enabling rapid deployment from trained models to production APIs:

# Load LoRA adapter
shimmy serve --lora-adapter path/to/adapter
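
A minimal end-to-end sketch, assuming a base GGUF model plus an adapter file produced by a LoRA fine-tuning run (both paths are placeholders, and combining --lora-adapter with --bind is assumed to work as with a plain serve):

# Serve a base model with the adapter applied
export SHIMMY_BASE_GGUF=/path/to/base-model.gguf
shimmy serve --bind 127.0.0.1:11435 --lora-adapter /path/to/adapter

# The adapted model is then reachable through the same OpenAI-compatible API
curl http://localhost:11435/v1/models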

Hot Model Switching

Supports dynamic model switching at runtime without restarting the server.

GPU Acceleration

  • macOS: Automatic Metal GPU acceleration
  • Cross-platform: Supports various GPU backends

Summary

Shimmy is a local AI inference solution built on the principle that "less is often more." Through its extremely lightweight design and zero-configuration philosophy, Shimmy gives developers a truly ready-to-use local AI infrastructure while maintaining enterprise-grade performance and compatibility. Whether you are an AI application developer, a researcher, or a privacy-conscious user, Shimmy is an excellent choice worth considering.
