Multi-LLM collaboration tool that queries multiple AI models, enables peer review, and synthesizes responses through a chairman model

llm-council · karpathy · Python · 11.2k stars · Last Updated: November 22, 2025

LLM Council - Multi-Model AI Collaboration Platform

Project Overview

LLM Council is an innovative open-source project created by Andrej Karpathy that transforms single-model AI interactions into collaborative, multi-model consensus systems. Instead of relying on a single LLM provider, this tool orchestrates multiple frontier AI models to work together, review each other's outputs, and produce synthesized responses through a democratic process.

Core Concept

The fundamental idea behind LLM Council is to leverage the strengths of different AI models while minimizing individual model biases. By creating an "AI advisory board," users receive more comprehensive, peer-reviewed answers to complex questions rather than depending on a single model's perspective.

Architecture & Workflow

Three-Stage Process

Stage 1: First Opinions

  • User query is dispatched simultaneously to all council member models via OpenRouter API
  • Each LLM generates its independent response without seeing others' outputs
  • Individual responses are displayed in a tab view for side-by-side comparison
  • Default council includes: GPT-5.1, Gemini 3.0 Pro, Claude Sonnet 4.5, and Grok 4

Stage 2: Anonymous Peer Review

  • Each model receives anonymized responses from all other council members
  • Models evaluate and rank each response based on accuracy and insight
  • Identity anonymization prevents bias and favoritism in evaluations
  • Cross-model evaluation reveals surprising patterns (models often rank competitors higher)

Stage 3: Chairman Synthesis

  • A designated Chairman LLM (configurable) reviews all original responses
  • Considers peer review rankings and evaluations
  • Produces a final synthesized answer incorporating the best elements
  • Delivers a comprehensive response to the user
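The three stages above can be sketched in a few lines of async Python. This is a minimal illustration, not the repo's actual code: `query_model` is a hypothetical stand-in for the real OpenRouter call (which the project makes with async httpx), and the prompt wording is invented.

```python
import asyncio

# Hypothetical stand-in for an OpenRouter chat-completions call.
async def query_model(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt[:40]}"

async def council_round(members: list[str], chairman: str, user_query: str) -> str:
    # Stage 1: dispatch the query to every council member in parallel.
    answers = await asyncio.gather(*(query_model(m, user_query) for m in members))

    # Stage 2: anonymize the answers (A, B, C, ...) and have each member rank them.
    labels = [chr(ord("A") + i) for i in range(len(answers))]
    anonymized = "\n".join(f"Response {l}: {a}" for l, a in zip(labels, answers))
    review_prompt = "Rank these responses by accuracy and insight:\n" + anonymized
    reviews = await asyncio.gather(*(query_model(m, review_prompt) for m in members))

    # Stage 3: the chairman sees the originals plus the reviews and synthesizes.
    synthesis_prompt = (
        f"Question: {user_query}\n\nCandidates:\n{anonymized}\n\n"
        "Reviews:\n" + "\n".join(reviews) + "\n\nWrite the final answer."
    )
    return await query_model(chairman, synthesis_prompt)

print(asyncio.run(council_round(["model-a", "model-b"], "chair", "What is entropy?")))
```

The key structural point is that stages 1 and 2 are fan-out/fan-in (`asyncio.gather`), while stage 3 is a single call that receives everything as context.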

Technical Stack

Backend

  • Framework: FastAPI (Python 3.10+)
  • HTTP Client: async httpx for non-blocking API calls
  • API Integration: OpenRouter API for multi-model access
  • Storage: JSON-based conversation persistence in data/conversations/
  • Package Management: uv for modern Python dependency management
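Concretely, OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so one request shape covers every council member. The request-building half is sketched below; the endpoint URL and `OPENROUTER_API_KEY` env var match the setup section, while the helper name is illustrative:

```python
import os

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str) -> tuple[dict, dict]:
    # One header/payload shape works for every model behind OpenRouter.
    headers = {"Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return headers, payload

# Sending it non-blockingly with httpx would look roughly like:
#   async with httpx.AsyncClient() as client:
#       r = await client.post(OPENROUTER_URL, headers=headers, json=payload)
#       text = r.json()["choices"][0]["message"]["content"]
```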

Frontend

  • Framework: React with Vite for fast development and builds
  • Rendering: react-markdown for formatted output
  • UI: ChatGPT-like interface with tab views for model comparison
  • Dev Server: Vite dev server on port 5173

Key Features

Multi-Model Dispatching

  • Simultaneous query execution across multiple frontier models
  • Configurable council membership through backend/config.py
  • Support for models from OpenAI, Google, Anthropic, xAI, and more

Objective Peer Review

  • Anonymized response evaluation prevents model bias
  • Quantitative ranking system for accuracy and insight
  • Reveals interesting patterns in model preferences and strengths
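The app surfaces the raw rankings themselves; as one hypothetical way to collapse them into a single consensus score (not necessarily what the repo does), a simple Borda count works:

```python
def aggregate_rankings(rankings: list[list[str]]) -> dict[str, int]:
    # Borda count: in a ranking of n responses, 1st place earns n points
    # and last earns 1; summing over all reviewers gives a consensus order.
    scores: dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - position)
    return scores

# Two reviewers ranking three anonymized responses:
print(aggregate_rankings([["A", "B", "C"], ["B", "A", "C"]]))  # → {'A': 5, 'B': 5, 'C': 2}
```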

Synthesized Consensus

  • Chairman model aggregates diverse perspectives
  • Produces coherent final answers incorporating multiple viewpoints
  • Balances verbosity, insight, and conciseness

Transparent Comparison

  • Side-by-side view of all individual responses
  • Complete visibility into peer review rankings
  • Users can form their own judgments alongside AI consensus

Conversation Persistence

  • Automatic saving of conversation history
  • JSON-based storage for easy data portability
  • Ability to review and analyze past council sessions
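A minimal version of that persistence layer might look like the following; the `data/conversations/` path comes from the backend description, while the function names and file schema are illustrative assumptions:

```python
import json
from pathlib import Path

def save_conversation(conv_id: str, messages: list[dict],
                      base: Path = Path("data/conversations")) -> Path:
    # One self-describing JSON file per conversation keeps history portable.
    base.mkdir(parents=True, exist_ok=True)
    path = base / f"{conv_id}.json"
    path.write_text(json.dumps({"id": conv_id, "messages": messages}, indent=2))
    return path

def load_conversation(conv_id: str,
                      base: Path = Path("data/conversations")) -> list[dict]:
    return json.loads((base / f"{conv_id}.json").read_text())["messages"]
```

Flat JSON files trade query power for portability: past council sessions can be inspected with any text editor or piped through `jq`.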

Installation & Setup

Prerequisites

  • Python 3.10 or higher
  • Node.js and npm
  • OpenRouter API key (requires purchased credits)

Backend Setup

# Install dependencies using uv
uv sync

Frontend Setup

# Navigate to frontend directory
cd frontend

# Install npm dependencies
npm install

cd ..

Configuration

  1. Create a .env file in the project root:
OPENROUTER_API_KEY=sk-or-v1-your-key-here
  2. Configure the council in backend/config.py:
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]
CHAIRMAN_MODEL = "google/gemini-3-pro-preview"

Running the Application

Option 1: Quick Start Script

./start.sh

Option 2: Manual Start

# Terminal 1 - Backend
uv run python -m backend.main

# Terminal 2 - Frontend
cd frontend
npm run dev

Access the application at: http://localhost:5173

Use Cases

Reading & Literature Analysis

  • Karpathy's original use case: reading books with multiple AI perspectives
  • Different models emphasize different literary aspects
  • Comparative analysis of interpretation styles

Research & Analysis

  • Complex questions requiring multiple viewpoints
  • Technical documentation evaluation
  • Business strategy assessment

Content Evaluation

  • Legal document analysis
  • Scientific paper interpretation
  • Code review and technical writing

Model Comparison

  • Benchmarking different LLM capabilities
  • Understanding model strengths and weaknesses
  • Identifying bias patterns across providers

Interesting Findings

Model Self-Assessment

  • Models frequently select competitors' responses as superior to their own
  • Demonstrates surprising objectivity in peer review process
  • Reveals genuine differences in approach and quality

Ranking Patterns

In Karpathy's testing with book chapters:

  • Consensus Winner: GPT-5.1 consistently rated as most insightful
  • Consensus Loser: Claude consistently ranked lowest
  • Middle Tier: Gemini 3 Pro and Grok-4 fell between the extremes

Human vs. AI Judgment Divergence

  • AI consensus may not align with human preferences
  • GPT-5.1 praised for insight but criticized by Karpathy as "too wordy"
  • Claude ranked lowest by peers but preferred by creator for terseness
  • Gemini appreciated for condensed, processed outputs
  • Suggests models may favor verbosity over conciseness

Project Philosophy

"Vibe Coded" Approach

  • Described by Karpathy as a "99% vibe coded" Saturday hack project
  • Rapid development with AI assistance
  • No long-term support commitment from creator
  • "Code is ephemeral now and libraries are over" philosophy

Open Source & Inspiration

  • Provided as-is for community inspiration
  • Users are encouraged to fork and adapt it, potentially with help from their own LLMs
  • Represents reference architecture for AI orchestration
  • Demonstrates ensemble learning applied to language models

Enterprise Implications

Orchestration Middleware

  • Reveals the architecture of multi-model coordination
  • Addresses vendor lock-in concerns
  • Demonstrates feasibility of model-agnostic applications

Quality Control Layer

  • Peer review adds validation absent in single-model systems
  • Reduces individual model biases
  • Provides transparency in AI decision-making

Reference Implementation

  • Shows minimum viable architecture for ensemble AI
  • Guides build vs. buy decisions for enterprise platforms
  • Demystifies multi-model orchestration complexity

Limitations & Considerations

Cost

  • Requires OpenRouter API credits for all council members plus chairman
  • Multiple model calls per query increase operational costs
  • No free tier operation available

Speed

  • Three-stage process slower than single-model queries
  • Multiple API calls add latency
  • Trade-off between speed and quality/consensus

Model Availability

  • Dependent on OpenRouter model catalog
  • Requires active API keys and credits
  • Subject to model provider rate limits

Maintenance

  • Creator explicitly states no ongoing support
  • Community-driven improvements only
  • Users responsible for adaptations and updates

Technical Considerations

Anonymization Strategy

  • Random IDs (A, B, C, D) assigned to responses
  • Prevents identity-based bias in peer review
  • Maintains objectivity in evaluation process
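That anonymization step can be sketched as follows; the letter labels match the description above, while the shuffle and the server-side mapping are assumptions about the implementation:

```python
import random

def anonymize(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    # Shuffle the models so reviewers cannot infer identity from ordering,
    # then assign letter IDs (A, B, C, ...).
    models = list(responses)
    random.shuffle(models)
    labeled = {chr(ord("A") + i): responses[m] for i, m in enumerate(models)}
    # The label -> model mapping stays server-side so rankings can be
    # de-anonymized after the reviews come back.
    mapping = {chr(ord("A") + i): m for i, m in enumerate(models)}
    return labeled, mapping
```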

API Integration

  • Single point of integration via OpenRouter
  • Abstracts away individual provider APIs
  • Simplifies multi-model coordination

Data Privacy

  • Local web application runs on user's machine
  • Conversations stored locally as JSON
  • API calls go through OpenRouter (third-party)

Community & Ecosystem

Related Projects

  • Swarms Framework: Implements LLMCouncil class inspired by this project
  • Hugging Face Spaces: Community deployments available
  • Medium/VentureBeat Coverage: Enterprise analysis and implications

Similar Approaches

  • Ensemble learning in machine learning
  • Mixture of Experts architectures
  • Multi-agent AI systems
  • Consensus protocols in distributed systems

Future Directions

While Karpathy explicitly states no planned improvements, potential community extensions could include:

  • Extended Model Support: Adding more council members from emerging providers
  • Custom Ranking Criteria: User-defined evaluation dimensions
  • Streaming Responses: Real-time display of model outputs
  • Advanced Synthesis: More sophisticated chairman algorithms
  • Cost Optimization: Intelligent model selection based on query type
  • Performance Analytics: Tracking model accuracy and preference patterns
  • Integration APIs: Embedding council functionality in other applications

Getting Started

  1. Clone the repository: git clone https://github.com/karpathy/llm-council
  2. Follow installation instructions above
  3. Configure your preferred council models
  4. Start querying and compare perspectives
  5. Experiment with different model combinations
  6. Analyze peer review patterns

Conclusion

LLM Council represents a pragmatic approach to addressing single-model limitations through ensemble orchestration. While presented as a casual weekend project, it offers valuable insights into multi-model architecture, peer review mechanisms, and the future of AI orchestration middleware. For developers, researchers, and enterprises exploring beyond single-provider solutions, this project provides both inspiration and a concrete reference implementation for building more robust, consensus-driven AI systems.

The project's minimalist approach—a few hundred lines of code achieving sophisticated multi-model coordination—demonstrates that the technical barriers to ensemble AI are lower than many assume. The real challenges lie not in routing prompts, but in governance, cost management, and determining when consensus truly improves outcomes over individual model responses.
