autoresearch — Autonomous AI Agent LLM Research Framework
An autonomous AI research framework by Andrej Karpathy that lets an AI agent (Claude/Codex) iteratively modify, train, and evaluate a small LLM on a single GPU overnight — running ~100 experiments while you sleep.
Overview
autoresearch is an experimental framework by Andrej Karpathy that automates the process of AI/ML research iteration. The core idea is elegantly simple: give an AI agent (such as Claude or Codex) a real, working LLM training codebase, and let it autonomously propose changes, run 5-minute training experiments, evaluate results, and iterate — all without human intervention.
Think of it as a minimal, self-contained AI research lab that runs overnight on a single GPU, producing a log of ~100 experiments and (hopefully) a progressively better language model by morning.
Background & Motivation
Karpathy opens the README with a sardonic vision of the future — a world where AI research is no longer done by humans but by "autonomous swarms of AI agents running across compute cluster megastructures." The repo, he writes, is "the story of how it all began."
The practical motivation is more grounded: research iteration is slow because humans need to eat, sleep, and attend meetings. This project replaces the human in the loop with an AI agent that can run experiments around the clock, dramatically shortening the research iteration cycle.
How It Works
The workflow is a tight edit → train → evaluate → keep/discard loop:
- The agent reads program.md — a Markdown file containing research instructions and context, written and maintained by the human researcher.
- The agent modifies train.py — the single Python file containing the GPT model architecture, optimizer (Muon + AdamW), and training loop. The agent can change anything: architecture, hyperparameters, optimizer settings, batch size, etc.
- Training runs for exactly 5 minutes (wall-clock time, excluding startup/compilation).
- The metric val_bpb (validation bits per byte) is computed — lower is better. It is vocab-size-independent, so architectural changes can be fairly compared.
- The agent decides whether to keep the change or discard it, then repeats.
This yields approximately 12 experiments/hour and ~100 experiments overnight.
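The keep/discard logic above can be sketched as a small driver loop. This is purely illustrative: in the real framework the agent itself makes these decisions by editing train.py, and `propose_change`/`evaluate` here are hypothetical stand-ins for "edit the file" and "train for 5 minutes, then measure val_bpb":

```python
def run_experiment(propose_change, evaluate):
    """One iteration of the edit -> train -> evaluate -> keep/discard loop."""
    change = propose_change()       # e.g. tweak a hyperparameter in train.py
    val_bpb = evaluate(change)      # 5-minute training run, then measure val_bpb
    return change, val_bpb

def research_loop(propose_change, evaluate, n_experiments=100):
    """Run a batch of experiments, keeping only changes that improve val_bpb."""
    best_bpb = float("inf")
    kept = []
    for _ in range(n_experiments):
        change, val_bpb = run_experiment(propose_change, evaluate)
        if val_bpb < best_bpb:      # lower bits-per-byte is better
            best_bpb = val_bpb
            kept.append(change)     # keep the change
        # otherwise discard it (revert train.py to the previous best)
    return best_bpb, kept
```

At ~5 minutes per iteration, 12 such experiments fit in an hour, and an 8-hour night yields roughly 100.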
Repository Structure
autoresearch/
├── prepare.py # Fixed: one-time data prep, BPE tokenizer training, dataloader, eval utils
├── train.py # Editable by agent: GPT model, optimizer, training loop
├── program.md # Editable by human: agent instructions and research context
├── analysis.ipynb # Notebook for analyzing experiment results
├── pyproject.toml # Dependencies (managed via uv)
└── progress.png # Teaser image showing training progress
Key Files Explained
| File | Owner | Purpose |
|---|---|---|
| prepare.py | Human (fixed) | Downloads data shards, trains BPE tokenizer, provides dataloader & evaluation utilities |
| train.py | AI agent | Full GPT implementation + training loop — the agent's sandbox |
| program.md | Human (iterated over time) | Research "skill" / instructions for the agent |
Design Philosophy
1. Single File to Modify
The agent only touches train.py. This keeps the scope manageable and diffs easy to review. It also constrains the agent's action space to meaningful ML changes rather than infrastructure/tooling changes.
2. Fixed Time Budget
Every experiment runs for exactly 5 minutes of wall-clock training time (startup excluded). This ensures:
- All experiments are directly comparable regardless of architectural changes (model size, batch size, etc.)
- The agent finds the best model configuration for your specific hardware
- Predictable throughput: ~12 experiments/hour
The trade-off: results are not portable across different compute platforms (an H100 run cannot be compared to an A100 run).
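The fixed-budget pattern can be sketched as follows. Names like `warmup_fn` and `step_fn` are illustrative assumptions, not the actual train.py API:

```python
import time

def train_for_budget(warmup_fn, step_fn, budget_s=300.0):
    """Run training steps for a fixed wall-clock budget.

    Startup and compilation happen in warmup_fn and are excluded from
    the budget, so experiments stay comparable across model sizes.
    """
    warmup_fn()                     # e.g. compile the model, run a first step
    t0 = time.monotonic()
    steps = 0
    while time.monotonic() - t0 < budget_s:
        step_fn()                   # one optimizer step
        steps += 1
    return steps                    # bigger/slower models fit fewer steps
```

This is why the metric is "best model reachable in 5 minutes on this GPU" rather than "best model per training step": a larger model takes fewer steps in the same budget, and the trade-off is resolved empirically.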
3. Self-Contained
No distributed training, no complex config files, no external research infrastructure. Just PyTorch + a handful of small packages, one GPU, one file, one metric. This makes the project easy to understand, fork, and build upon.
4. program.md as the Human Interface
Rather than coding a research agent from scratch, Karpathy uses program.md as a lightweight "skill" — a Markdown file that gives the AI agent context, goals, and constraints. The human iterates on program.md over time to improve the "research org code."
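To make the idea concrete, a program.md might look something like the sketch below. This is purely illustrative and is not the repository's actual file:

```markdown
# Research program (illustrative sketch)

Goal: minimize val_bpb after a single 5-minute training run on this GPU.

Rules:
- Only edit train.py; prepare.py is fixed and must not be touched.
- Make one change per experiment and log the result before the next run.
- Keep a change only if val_bpb improves; otherwise revert it.
```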
Technical Details
Model & Training
- Based on a simplified single-GPU version of nanochat
- Uses Muon + AdamW optimizer by default (though the agent can change this)
- Trains a GPT-style model from scratch on downloaded text data shards
- BPE tokenizer trained on the data itself via prepare.py
- Evaluation metric: val_bpb (validation bits per byte)
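Bits per byte normalizes the loss by the raw byte size of the text rather than by token count, which is what makes it vocab-size-independent. A sketch of the conversion, assuming the summed cross-entropy loss (in nats) over the validation set:

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert summed cross-entropy loss (in nats) to bits per byte.

    Dividing by the byte count instead of the token count means a model
    with a larger vocab (fewer, longer tokens per byte of text) is not
    unfairly rewarded for a cheaper-looking per-token loss.
    """
    total_bits = total_loss_nats / math.log(2)   # nats -> bits
    return total_bits / total_bytes
```

For example, a summed loss of 100·ln(2) nats over 100 bytes of validation text comes out to exactly 1 bit per byte.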
Requirements
- GPU: Single NVIDIA GPU (tested on H100)
- Python: 3.10+
- Package manager: uv
Quick Start
# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Install dependencies
uv sync
# 3. One-time data prep (~2 min)
uv run prepare.py
# 4. Run a single training experiment (~5 min)
uv run train.py
Running the Agent
Start Claude, Codex, or any capable coding agent in the repo directory (with file write permissions), then prompt:
Have a look at program.md and let's kick off a new experiment! Let's do the setup first.
The agent will read program.md, propose a change to train.py, run the training, evaluate the result, and iterate.
Significance & Impact
autoresearch is a proof-of-concept for automated machine learning research at the micro scale. While large-scale AutoML systems exist, this project is notable for:
- Simplicity: The entire meaningful codebase is ~3 files
- Transparency: Every agent decision is logged and reviewable
- Accessibility: Runs on a single consumer/research GPU
- Vision: It demonstrates the feasibility of AI agents conducting genuine ML research autonomously
It also serves as a template for "programming your AI research org via Markdown" — a paradigm that may become standard as AI coding agents grow more capable.
Notable Forks
- miolini/autoresearch-macos — macOS/MPS support
Summary
| Property | Value |
|---|---|
| Type | Autonomous AI Research Agent Framework |
| Primary Use | Overnight automated LLM training experimentation |
| Agent Interface | program.md (Markdown instructions) |
| Agent Action Space | train.py (GPT model + training loop) |
| Experiment Duration | 5 minutes (fixed) |
| Throughput | ~12 experiments/hour, ~100 overnight |
| Metric | val_bpb (validation bits per byte) |
| Hardware | Single NVIDIA GPU (H100 tested) |
| License | MIT |