autoresearch — Autonomous AI Agent LLM Research Framework
An autonomous AI research framework by Andrej Karpathy that lets an AI agent (Claude/Codex) iteratively modify, train, and evaluate a small LLM on a single GPU overnight — running ~100 experiments while you sleep.
Overview
autoresearch is an experimental framework by Andrej Karpathy that automates the process of AI/ML research iteration. The core idea is elegantly simple: give an AI agent (such as Claude or Codex) a real, working LLM training codebase, and let it autonomously propose changes, run 5-minute training experiments, evaluate results, and iterate — all without human intervention.
Think of it as a minimal, self-contained AI research lab that runs overnight on a single GPU, producing a log of ~100 experiments and (hopefully) a progressively better language model by morning.
Background & Motivation
Karpathy opens the README with a sardonic vision of the future — a world where AI research is no longer done by humans but by "autonomous swarms of AI agents running across compute cluster megastructures." The repo, he writes, is "the story of how it all began."
The practical motivation is more grounded: research iteration is slow because humans need to eat, sleep, and attend meetings. This project replaces the human in the loop with an AI agent that can run experiments around the clock, dramatically shortening the research iteration cycle.
How It Works
The workflow is a tight edit → train → evaluate → keep/discard loop:
- The agent reads program.md — a Markdown file containing research instructions and context, written and maintained by the human researcher.
- The agent modifies train.py — the single Python file containing the GPT model architecture, optimizer (Muon + AdamW), and training loop. The agent can change anything: architecture, hyperparameters, optimizer settings, batch size, etc.
- Training runs for exactly 5 minutes (wall-clock time, excluding startup/compilation).
- The metric val_bpb (validation bits per byte) is computed — lower is better. It is vocab-size-independent, so architectural changes can be fairly compared.
- The agent decides whether to keep the change or discard it, then repeats.
This yields approximately 12 experiments/hour and ~100 experiments overnight.
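The keep/discard logic above can be sketched as a small driver loop. This is purely illustrative: in the real framework the agent itself makes these decisions by editing train.py, and `propose_change`/`evaluate` here are hypothetical stand-ins for "edit the file" and "train for 5 minutes, then measure val_bpb":

```python
def run_experiment(propose_change, evaluate):
    """One iteration of the edit -> train -> evaluate -> keep/discard loop."""
    change = propose_change()       # e.g. tweak a hyperparameter in train.py
    val_bpb = evaluate(change)      # 5-minute training run, then measure val_bpb
    return change, val_bpb

def research_loop(propose_change, evaluate, n_experiments=100):
    """Run a batch of experiments, keeping only changes that improve val_bpb."""
    best_bpb = float("inf")
    kept = []
    for _ in range(n_experiments):
        change, val_bpb = run_experiment(propose_change, evaluate)
        if val_bpb < best_bpb:      # lower bits-per-byte is better
            best_bpb = val_bpb
            kept.append(change)     # keep the change
        # otherwise discard it (revert train.py to the previous best)
    return best_bpb, kept
```

At ~5 minutes per iteration, 12 such experiments fit in an hour, and an 8-hour night yields roughly 100.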
Repository Structure
autoresearch/
├── prepare.py # Fixed: one-time data prep, BPE tokenizer training, dataloader, eval utils
├── train.py # Editable by agent: GPT model, optimizer, training loop
├── program.md # Editable by human: agent instructions and research context
├── analysis.ipynb # Notebook for analyzing experiment results
├── pyproject.toml # Dependencies (managed via uv)
└── progress.png # Teaser image showing training progress
Key Files Explained
| File | Owner | Purpose |
|---|---|---|
| prepare.py | Human (fixed) | Downloads data shards, trains BPE tokenizer, provides dataloader & evaluation utilities |
| train.py | AI agent | Full GPT implementation + training loop — the agent's sandbox |
| program.md | Human (iterated over time) | Research "skill" / instructions for the agent |
Design Philosophy
1. Single File to Modify
The agent only touches train.py. This keeps the scope manageable and diffs easy to review. It also constrains the agent's action space to meaningful ML changes rather than infrastructure/tooling changes.
2. Fixed Time Budget
Every experiment runs for exactly 5 minutes of wall-clock training time (startup excluded). This ensures:
- All experiments are directly comparable regardless of architectural changes (model size, batch size, etc.)
- The agent finds the best model configuration for your specific hardware
- Predictable throughput: ~12 experiments/hour
The trade-off: results are not portable across different compute platforms (an H100 run cannot be compared to an A100 run).
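The fixed-budget pattern can be sketched as follows. Names like `warmup_fn` and `step_fn` are illustrative assumptions, not the actual train.py API:

```python
import time

def train_for_budget(warmup_fn, step_fn, budget_s=300.0):
    """Run training steps for a fixed wall-clock budget.

    Startup and compilation happen in warmup_fn and are excluded from
    the budget, so experiments stay comparable across model sizes.
    """
    warmup_fn()                     # e.g. compile the model, run a first step
    t0 = time.monotonic()
    steps = 0
    while time.monotonic() - t0 < budget_s:
        step_fn()                   # one optimizer step
        steps += 1
    return steps                    # bigger/slower models fit fewer steps
```

This is why the metric is "best model reachable in 5 minutes on this GPU" rather than "best model per training step": a larger model takes fewer steps in the same budget, and the trade-off is resolved empirically.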
3. Self-Contained
No distributed training, no complex config files, no external research infrastructure. Just PyTorch + a handful of small packages, one GPU, one file, one metric. This makes the project easy to understand, fork, and build upon.
4. program.md as the Human Interface
Rather than coding a research agent from scratch, Karpathy uses program.md as a lightweight "skill" — a Markdown file that gives the AI agent context, goals, and constraints. The human iterates on program.md over time to improve the "research org code."
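To make the idea concrete, a program.md might look something like the sketch below. This is purely illustrative and is not the repository's actual file:

```markdown
# Research program (illustrative sketch)

Goal: minimize val_bpb after a single 5-minute training run on this GPU.

Rules:
- Only edit train.py; prepare.py is fixed and must not be touched.
- Make one change per experiment and log the result before the next run.
- Keep a change only if val_bpb improves; otherwise revert it.
```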
Technical Details
Model & Training
- Based on a simplified single-GPU version of nanochat
- Uses Muon + AdamW optimizer by default (though the agent can change this)
- Trains a GPT-style model from scratch on downloaded text data shards
- BPE tokenizer trained on the data itself via prepare.py
- Evaluation metric: val_bpb (validation bits per byte)
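Bits per byte normalizes the loss by the raw byte size of the text rather than by token count, which is what makes it vocab-size-independent. A sketch of the conversion, assuming the summed cross-entropy loss (in nats) over the validation set:

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert summed cross-entropy loss (in nats) to bits per byte.

    Dividing by the byte count instead of the token count means a model
    with a larger vocab (fewer, longer tokens per byte of text) is not
    unfairly rewarded for a cheaper-looking per-token loss.
    """
    total_bits = total_loss_nats / math.log(2)   # nats -> bits
    return total_bits / total_bytes
```

For example, a summed loss of 100·ln(2) nats over 100 bytes of validation text comes out to exactly 1 bit per byte.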
Requirements
- GPU: Single NVIDIA GPU (tested on H100)
- Python: 3.10+
- Package manager: uv
Quick Start
# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Install dependencies
uv sync
# 3. One-time data prep (~2 min)
uv run prepare.py
# 4. Run a single training experiment (~5 min)
uv run train.py
Running the Agent
Start Claude, Codex, or any capable coding agent in the repo directory (with file write permissions), then prompt:
Have a look at program.md and let's kick off a new experiment! Let's do the setup first.
The agent will read program.md, propose a change to train.py, run the training, evaluate the result, and iterate.
Significance & Impact
autoresearch is a proof-of-concept for automated machine learning research at the micro scale. While large-scale AutoML systems exist, this project is notable for:
- Simplicity: The entire meaningful codebase is ~3 files
- Transparency: Every agent decision is logged and reviewable
- Accessibility: Runs on a single consumer/research GPU
- Vision: It demonstrates the feasibility of AI agents conducting genuine ML research autonomously
It also serves as a template for "programming your AI research org via Markdown" — a paradigm that may become standard as AI coding agents grow more capable.
Notable Forks
- miolini/autoresearch-macos — macOS/MPS support
Summary
| Property | Value |
|---|---|
| Type | Autonomous AI Research Agent Framework |
| Primary Use | Overnight automated LLM training experimentation |
| Agent Interface | program.md (Markdown instructions) |
| Agent Action Space | train.py (GPT model + training loop) |
| Experiment Duration | 5 minutes (fixed) |
| Throughput | ~12 experiments/hour, ~100 overnight |
| Metric | val_bpb (validation bits per byte) |
| Hardware | Single NVIDIA GPU (H100 tested) |
| License | MIT |