Karpathy's Autoresearch lets AI run 100 ML experiments while you sleep
Andrej Karpathy, the former Tesla AI lead and OpenAI co-founder, has released Autoresearch. In one line, it is a system where AI keeps running ML experiments for you while you sleep, and it has already drawn a lot of attention.
This article explains what it does and why it matters in plain language, even if you do not know much about machine learning.
What an ML experiment actually is
ML, or machine learning, is the technique used to build AI models by learning patterns from large amounts of data. LLMs such as ChatGPT and Claude are also trained by repeatedly feeding them huge text corpora so they learn to predict the next token.
That training process has countless knobs to turn.
| Knob | Meaning |
|---|---|
| Learning rate | How much the model updates itself after each batch of data. Too high and it becomes unstable; too low and learning is slow |
| Batch size | How many samples are processed at once. Larger batches are more stable but use more memory |
| Model architecture | The number of layers, layer widths, and attention heads in the neural network |
| Optimizer | The algorithm that decides how weights are updated. AdamW and Muon are examples |
An ML experiment is the act of changing those knobs, training, and comparing the results to see which configuration works best. Because each run can take minutes or hours, a human can only try a limited number of combinations in a day.
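In code, that manual loop is essentially a grid search over knob settings. A toy sketch (`train_and_evaluate` is a hypothetical stand-in for a real training run, and its scoring formula is invented purely for illustration):

```python
import itertools

def train_and_evaluate(learning_rate, batch_size):
    # Hypothetical stand-in for a real training run, which would take
    # minutes or hours. Returns a validation score (lower is better);
    # the formula below is a made-up toy.
    return abs(learning_rate - 3e-4) * 1000 + abs(batch_size - 32) / 100

best_config, best_score = None, float("inf")

# Try every combination of a few knob settings (a grid search).
for lr, bs in itertools.product([1e-4, 3e-4, 1e-3], [16, 32, 64]):
    score = train_and_evaluate(lr, bs)
    if score < best_score:
        best_config, best_score = (lr, bs), score

print(best_config)  # the knob settings with the lowest validation score
```

Nine combinations of two knobs is already nine training runs; add a third knob and the count multiplies again, which is why a human can only cover a small corner of the space per day.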
Autoresearch hands that whole loop to an AI agent.
How Autoresearch works
The overall flow
```mermaid
flowchart TD
    A["The researcher writes the exploration plan in<br/>program.md"] --> B["The AI agent edits train.py"]
    B --> C["Training runs on GPU<br/>for 5 minutes"]
    C --> D["Results are evaluated<br/>val_bpb is measured"]
    D --> E{Did the score<br/>improve?}
    E -- Yes --> F["Keep the change and move<br/>to the next experiment"]
    E -- No --> G["Discard the change and try<br/>a different approach"]
    F --> B
    G --> B
```
Because each experiment is fixed at five minutes, the system can run about 12 experiments per hour and roughly 100 while you sleep for eight hours. In the morning, the researcher only needs to check the log and see which configuration won.
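The keep-or-discard loop in the diagram is essentially greedy hill climbing. A toy sketch of one night of runs (both functions are hypothetical stand-ins; the score is a made-up proxy for val_bpb, lower is better):

```python
import random

random.seed(0)

def propose_change(config):
    # Hypothetical agent step: perturb one knob of the current config.
    new = dict(config)
    new["lr"] = config["lr"] * random.choice([0.5, 2.0])
    return new

def run_experiment(config):
    # Hypothetical 5-minute training run; here just a toy score.
    return abs(config["lr"] - 3e-4)

config = {"lr": 1e-3}
best = run_experiment(config)

for _ in range(100):  # roughly one night of 5-minute runs
    candidate = propose_change(config)
    score = run_experiment(candidate)
    if score < best:            # improved: keep the change
        config, best = candidate, score
    # otherwise: discard the change and try something different

print(config["lr"], best)
```

The real system replaces the random perturbation with an LLM agent that reads the training code and proposes deliberate edits, but the accept/reject skeleton is the same.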
Evaluation metric: val_bpb
Autoresearch uses val_bpb (validation bits per byte) as the score.
This measures how efficiently the model predicts text. Lower is better. For example, if a model can guess that the next letter after “Thank y” is probably “o”, its bpb becomes lower.
The advantage of bpb is that it does not depend on vocabulary size. A model trained with an 8,000-token vocabulary and another trained with 4,000 tokens can still be compared fairly. That makes it possible to evaluate experiments with very different architectures side by side.
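Concretely, bpb is the model's summed cross-entropy loss converted from nats to bits, divided by the byte length of the validation text. A minimal sketch (the numbers are illustrative):

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert a summed cross-entropy loss (in nats, i.e. natural log)
    over a validation set into bits per byte of the underlying text."""
    total_bits = total_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# A uniform guess over 256 possible byte values costs ln(256) nats per
# byte, which works out to 8 bits per byte.
print(bits_per_byte(math.log(256) * 1000, 1000))  # ≈ 8.0
```

Because the denominator is bytes of raw text rather than tokens, the metric is unaffected by how the tokenizer carves up that text, which is what makes cross-vocabulary comparisons fair.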
The base: nanochat / nanoGPT
Autoresearch does not train a giant model from scratch. It uses nanochat, a lightweight GPT implementation.
| Project | What it is |
|---|---|
| nanoGPT | Karpathy’s earlier project for training GPT with minimal code. It showed that a GPT-2-scale model, which OpenAI trained in 2019, can be reproduced in about 48 hours of GPU time |
| nanochat | The successor to nanoGPT. It includes not only training, but also chat tuning and a chat UI |
Autoresearch simplifies nanochat’s training code into a single file on a single GPU so that an AI agent can edit it easily. It is a “real LLM training environment, but small” setup where the agent can still try meaningful experiments.
A simple three-file structure
The repository core is only three files.
| File | Role | Who touches it |
|---|---|---|
| prepare.py | Downloads and preprocesses the data | Left untouched |
| train.py | All model architecture and training logic | Edited by the AI agent |
| program.md | Research instructions for the agent | Written by the human |
The decision to keep changes inside train.py is deliberate. If all changes are confined to one file, humans can review the diff more easily and understand exactly what the agent changed.
program.md is where the researcher writes natural-language instructions such as:
- “Try 4, 8, and 16 attention heads and compare the bpb results.”
- “Compare the effect of doubling and halving the learning rate.”
- “Compare runs with and without dropout, the regularization method that randomly disables neurons during training.”
That leaves only program.md as the human judgment point; everything else is delegated to the agent. Karpathy describes it as being like writing code for a research organization.
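Assembled from the kinds of instructions above, a hypothetical program.md might look like this (the wording and rules are an invented example, not the repository's actual file):

```markdown
# Research program (hypothetical example)

Goal: lower val_bpb within the fixed 5-minute training budget.

Experiments to try, one change at a time:
1. Attention heads: 4, 8, 16 — keep everything else fixed.
2. Learning rate: half and double the current value.
3. Dropout: compare runs with and without it.

Rules:
- Only edit train.py.
- Log val_bpb for every run and keep a running leaderboard.
```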
What the agent is allowed to change
Inside train.py, the agent can freely modify the following:
| Item | Meaning |
|---|---|
| Optimizer choice | Which algorithm is used to update the weights |
| Batch size | How much data is processed at once |
| Learning rate and schedule | How fast learning proceeds and how it changes across training stages |
| Model architecture | Depth, layer size, and number of attention heads |
| Activation function | The transformation used between neurons, such as ReLU or GELU |
| Normalization method | The technique used to stabilize outputs layer by layer |
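To make the activation-function row concrete: ReLU and the tanh approximation of GELU commonly used in GPT-2-style models are each a one-liner.

```python
import math

def relu(x):
    # Pass positive values through unchanged; zero out negatives.
    return max(0.0, x)

def gelu(x):
    # Tanh approximation of GELU, common in GPT-2-style models.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(relu(-1.0), relu(2.0))  # → 0.0 2.0
```

Swapping one function for the other inside train.py is exactly the scale of edit the agent makes, then measures.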
What an optimizer is
Training a neural network means nudging millions of parameters closer to the right answer. The optimizer decides in which direction, and by how much, those parameters move.
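The simplest such rule is plain gradient descent: step each parameter against its gradient. A toy sketch minimizing f(w) = (w − 3)², with an invented starting point and learning rate:

```python
def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0               # start far from the optimum
learning_rate = 0.1   # one of the knobs from the tables above

for _ in range(100):
    w -= learning_rate * grad(w)  # move against the gradient

print(round(w, 4))  # → 3.0 (converged to the minimum)
```

Optimizers like AdamW and Muon elaborate on this rule with momentum and per-parameter step-size adaptation, but the "which direction, and by how much" core is the same.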
| Optimizer | Characteristics |
|---|---|
| AdamW | The standard optimizer used in most LLM training runs |
| Muon | A newer optimizer that appeared around 2025 and is said to converge faster than AdamW |
Autoresearch’s default setup uses a hybrid approach: Muon for the model’s weight matrices, and AdamW for embeddings, biases, and the rest.
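That split amounts to routing each parameter to an optimizer by its shape and role. A plain-Python sketch of the routing rule (the parameter names and shapes are hypothetical, and the rule is an assumption based on the description above):

```python
# Hypothetical parameter inventory: name -> tensor shape.
params = {
    "embedding.weight": (8000, 512),   # embeddings -> AdamW
    "attn.w_qkv":       (512, 1536),   # 2-D weight matrix -> Muon
    "mlp.w_in":         (512, 2048),   # 2-D weight matrix -> Muon
    "ln.bias":          (512,),        # bias vector -> AdamW
}

def route(name, shape):
    # Muon is designed for 2-D weight matrices; everything else
    # (embeddings, biases, norms) stays on AdamW.
    if len(shape) == 2 and "embedding" not in name:
        return "muon"
    return "adamw"

groups = {name: route(name, shape) for name, shape in params.items()}
print(groups)
```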
The design philosophy: 1 GPU, 1 file, 1 metric
What Karpathy keeps emphasizing here is simplicity.
If you use distributed training across multiple GPUs, you can train bigger models faster. But once communication between GPUs and data sharding enter the picture, it becomes harder to trace cause and effect: what the agent changed and what actually happened.
Limiting the system to one GPU gives three benefits:
- The relationship between the agent’s change and the result is obvious
- The environment is simple, so bugs are less likely to creep in
- Fixed 5-minute runs make experiments comparable
The constraints are what make it practical.
The README opens with a characteristically Karpathy-esque prediction:
> Research used to be done by flesh computers that ate and slept, occasionally synchronizing through a strange acoustic ritual called a group meeting. That era is long gone. Research is now done by swarms of AI agents operating autonomously on compute clusters.
It is half joke, half serious vision, and Autoresearch is presented as the first step in that direction.
Setup and how to run it
What you need:
- An NVIDIA GPU. H100 has been verified, though other GPUs can run it with different throughput
- Python 3.10 or later
- uv, the Python package manager
```bash
# Install dependencies
uv sync

# Download data and train the tokenizer (first run only, about 2 minutes)
uv run prepare.py

# Run one training pass manually to verify the setup (about 5 minutes)
uv run train.py
```
Once that works, you can launch an AI agent such as Claude Code or Codex in the repository and tell it to read program.md and start experimenting.
If you want to try it on a smaller GPU such as a MacBook, community forks exist for macOS and Windows. Those versions usually swap the training data to GPT-4-generated short stories (TinyStories) and reduce the model size.
The license is MIT. Because this is code, not a pretrained model release, you are free to run experiments on your own GPU.
If you want to train on your own writing, use a different approach
You might think that swapping the dataset would let you train on your own writing. In theory, yes, but Autoresearch trains a model from scratch, so it needs at least tens or hundreds of megabytes of text. A personal blog or a few drafts are nowhere near enough. TinyStories works only because it is a special case: very simple English sentences that can be learned from relatively little data.
If your real goal is “I want a model that writes like me,” fine-tuning is more practical: take a pretrained model and continue training it on your own text. With LoRA or QLoRA, even a small dataset and a single GPU are enough to shift a model’s style.
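LoRA’s core idea: instead of updating the full weight matrix W, learn a low-rank delta B·A, so the effective weight is W + B·A with far fewer trainable numbers. A toy plain-Python sketch with invented 4×4 shapes and rank 1:

```python
# Toy shapes: a 4x4 "pretrained" weight plus a rank-1 LoRA delta.
d = 4
r = 1  # LoRA rank, normally much smaller than d

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.5] for _ in range(d)]       # d x r, trainable
A = [[0.1, 0.2, 0.3, 0.4]]          # r x d, trainable

# Effective weight: W + B @ A.
# Trainable numbers: d*r + r*d = 8, versus d*d = 16 for full fine-tuning.
W_eff = [
    [W[i][j] + sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
    for i in range(d)
]
print(W_eff[0])  # first row of the adapted weight
```

At realistic sizes (d in the thousands, r of 8 or 16) the savings are what let a single consumer GPU adapt a large pretrained model.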
The point of Autoresearch is not style learning. It is to automatically search for the best combination of model structure and hyperparameters. That is a different problem, so it should not be confused with fine-tuning.