
Karpathy's Autoresearch lets AI run 100 ML experiments while you sleep


Andrej Karpathy, the former Tesla AI lead and OpenAI co-founder, has released Autoresearch. In one line, it is a system where AI keeps running ML experiments for you while you sleep, and it has already drawn a lot of attention.

This article explains what it does and why it matters in plain language, even if you do not know much about machine learning.

What an ML experiment actually is

ML, or machine learning, is the technique used to build AI models by learning patterns from large amounts of data. LLMs such as ChatGPT and Claude are also trained by repeatedly feeding them huge text corpora so they learn to predict the next token.
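Next-token prediction can be illustrated with a toy character-level model. This is nothing like a real LLM internally, but the core idea is the same: count which symbol tends to follow which, then predict the most frequent follower.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every character, which character tends to follow it."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, ch):
    """Predict the most frequently observed follower of `ch`."""
    return counts[ch].most_common(1)[0][0]

model = train_bigram("thank you thank you thank you")
print(predict_next(model, "y"))  # 'o' always follows 'y' in the training text
```

Real models replace the counting table with a neural network and characters with tokens, but the training objective is still "predict what comes next."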

That training process has countless knobs to turn.

| Knob | Meaning |
| --- | --- |
| Learning rate | How much the model updates itself after each batch of data. Too high and it becomes unstable; too low and learning is slow |
| Batch size | How many samples are processed at once. Larger batches are more stable but use more memory |
| Model architecture | The number of layers, layer widths, and attention heads in the neural network |
| Optimizer | The algorithm that decides how weights are updated. AdamW and Muon are examples |

An ML experiment is the act of changing those knobs, training, and comparing the results to see which configuration works best. Because each run can take minutes or hours, a human can only try a limited number of combinations in a day.
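The experiment loop described above is, at its simplest, a search over knob settings. A minimal hand-rolled sketch, with a hypothetical `train_and_score` standing in for a real training run:

```python
import itertools

def train_and_score(lr, batch_size):
    """Stand-in for a real training run; returns a validation score
    (lower is better). The formula is made up just to keep this runnable."""
    return abs(lr - 0.003) * 100 + abs(batch_size - 32) * 0.01

grid = {
    "lr": [0.001, 0.003, 0.01],
    "batch_size": [16, 32, 64],
}

# Try every combination and keep the best-scoring one.
best = None
for lr, bs in itertools.product(grid["lr"], grid["batch_size"]):
    score = train_and_score(lr, bs)
    if best is None or score < best[0]:
        best = (score, {"lr": lr, "batch_size": bs})

print(best[1])  # {'lr': 0.003, 'batch_size': 32}
```

Nine combinations is trivial here, but when each `train_and_score` call takes minutes or hours on a GPU, a human can only get through a handful per day.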

Autoresearch hands that whole loop to an AI agent.

How Autoresearch works

The overall flow

flowchart TD
    A["The researcher writes the exploration plan in<br/>program.md"] --> B["The AI agent edits train.py"]
    B --> C["Training runs on GPU<br/>for 5 minutes"]
    C --> D["Results are evaluated<br/>val_bpb is measured"]
    D --> E{Did the score<br/>improve?}
    E -- Yes --> F["Keep the change and move<br/>to the next experiment"]
    E -- No --> G["Discard the change and try<br/>a different approach"]
    F --> B
    G --> B

Because each experiment is fixed at five minutes, the system can run about 12 experiments per hour and roughly 100 while you sleep for eight hours. In the morning, the researcher only needs to check the log and see which configuration won.
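The keep-or-discard loop in the flowchart can be sketched in a few lines. `propose_change` and `run_experiment` below are hypothetical stand-ins for the agent editing train.py and for a five-minute training run:

```python
import copy
import random

def propose_change(config):
    """Stand-in for the agent editing train.py: tweak one knob at random."""
    new = copy.deepcopy(config)
    new["lr"] *= random.choice([0.5, 2.0])
    return new

def run_experiment(config):
    """Stand-in for a 5-minute run returning val_bpb (lower is better)."""
    return abs(config["lr"] - 0.004) + 1.0

random.seed(0)
config = {"lr": 0.001}
best_bpb = run_experiment(config)
for _ in range(100):                  # roughly one night of experiments
    candidate = propose_change(config)
    bpb = run_experiment(candidate)
    if bpb < best_bpb:                # score improved: keep the change
        config, best_bpb = candidate, bpb
    # otherwise: discard the change and try a different tweak

print(config, round(best_bpb, 4))
```

The real system is richer (the agent reasons about what to try next rather than tweaking at random), but the accept-if-improved skeleton is the same.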

Evaluation metric: val_bpb

Autoresearch uses val_bpb (validation bits per byte) as the score.

This measures how efficiently the model predicts text. Lower is better. For example, if a model can guess that the next letter after “Thank y” is probably “o”, its bpb becomes lower.

The advantage of bpb is that it does not depend on vocabulary size. A model trained with an 8,000-token vocabulary and another trained with 4,000 tokens can still be compared fairly. That makes it possible to evaluate experiments with very different architectures side by side.
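Concretely, bpb converts the model's summed prediction loss into bits and normalizes by the raw byte length of the text rather than by token count, which is what makes different vocabularies comparable. A minimal sketch:

```python
import math

def val_bpb(total_loss_nats, total_bytes):
    """Convert summed cross-entropy loss (in nats, over the validation set)
    into bits per byte of the underlying UTF-8 text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Two models with different tokenizers, same 1,000-byte validation text:
print(round(val_bpb(2000, 1000), 3))  # 2.885
print(round(val_bpb(1500, 1000), 3))  # 2.164 -- lower loss, better model
```

Because the denominator counts bytes of the original text, a tokenizer that splits the text into fewer, larger tokens gains no unfair advantage.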

The base: nanochat / nanoGPT

Autoresearch does not train a giant model from scratch. It uses nanochat, a lightweight GPT implementation.

| Project | What it is |
| --- | --- |
| nanoGPT | Karpathy’s earlier project for training GPT with minimal code. It showed that a GPT-2-sized model, which OpenAI trained in 2019 for about $43,000, could be reproduced for roughly $48 of GPU time |
| nanochat | The successor to nanoGPT. It includes not only training, but also chat tuning and a chat UI |

Autoresearch simplifies nanochat’s training code into a single file on a single GPU so that an AI agent can edit it easily. It is a “real LLM training environment, but small” setup where the agent can still try meaningful experiments.

A simple three-file structure

The repository core is only three files.

| File | Role | Who touches it |
| --- | --- | --- |
| prepare.py | Downloads and preprocesses the data | Left untouched |
| train.py | All model architecture and training logic | Edited by the AI agent |
| program.md | Research instructions for the agent | Written by the human |

The decision to keep changes inside train.py is deliberate. If all changes are confined to one file, humans can review the diff more easily and understand exactly what the agent changed.

program.md is where the researcher writes natural-language instructions such as:

  • “Try 4, 8, and 16 attention heads and compare the bpb results.”
  • “Compare the effect of doubling and halving the learning rate.”
  • “Compare runs with and without dropout, the regularization method that randomly disables neurons during training.”

That leaves only program.md as the human judgment point; everything else is delegated to the agent. Karpathy describes it as being like writing code for a research organization.

What the agent is allowed to change

Inside train.py, the agent can freely modify the following:

| Item | Meaning |
| --- | --- |
| Optimizer choice | Which algorithm is used to update the weights |
| Batch size | How much data is processed at once |
| Learning rate and schedule | How fast learning proceeds and how it changes across training stages |
| Model architecture | Depth, layer size, and number of attention heads |
| Activation function | The transformation used between neurons, such as ReLU or GELU |
| Normalization method | The technique used to stabilize outputs layer by layer |
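In train.py terms, this editable surface might look like a block of constants near the top of the file. The names below are illustrative, not the actual identifiers in the repository:

```python
# Hypothetical knob section an agent might edit (illustrative names only,
# not the actual variables in Autoresearch's train.py).
OPTIMIZER     = "adamw"    # or "muon"
BATCH_SIZE    = 32         # samples per step
LEARNING_RATE = 3e-4
LR_SCHEDULE   = "cosine"   # e.g. warmup then cosine decay
N_LAYER       = 6          # depth
N_EMBD        = 384        # layer width
N_HEAD        = 6          # attention heads
ACTIVATION    = "gelu"     # or "relu"
NORM          = "rmsnorm"  # or "layernorm"
```

Keeping every knob in one file is what lets a reviewer see the agent's entire change as a single diff.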

What an optimizer is

Training a neural network means nudging millions of parameters closer to the right answer. The optimizer decides in which direction, and by how much, those parameters move.

| Optimizer | Characteristics |
| --- | --- |
| AdamW | The standard optimizer used in most LLM training runs |
| Muon | A newer optimizer, introduced in late 2024, that is said to converge faster than AdamW |

Autoresearch’s default setup uses a hybrid approach: Muon for the model’s weight matrices, and AdamW for embeddings, biases, and the rest.
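That hybrid split can be sketched as routing each parameter by its shape: 2-D weight matrices go to Muon, while embeddings, biases, and 1-D gains go to AdamW. A shape-only sketch (parameter names and the `"embed"` naming convention are assumptions for illustration):

```python
def split_param_groups(named_shapes):
    """Route each parameter to an optimizer by tensor rank:
    2-D weight matrices -> Muon; embeddings, biases, gains -> AdamW.
    `named_shapes` maps parameter name -> shape tuple."""
    groups = {"muon": [], "adamw": []}
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2 and "embed" not in name
        groups["muon" if is_matrix else "adamw"].append(name)
    return groups

params = {
    "embed.weight":     (8000, 384),  # embedding table -> AdamW
    "block0.attn.w_q":  (384, 384),   # weight matrix   -> Muon
    "block0.mlp.w_in":  (1536, 384),  # weight matrix   -> Muon
    "block0.norm.gain": (384,),       # 1-D gain        -> AdamW
    "lm_head.bias":     (8000,),      # bias            -> AdamW
}
print(split_param_groups(params))
```

The intuition behind the split is that Muon's update rule is designed for matrix-shaped weights, so everything that is not a matrix falls back to the well-understood AdamW.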

The design philosophy: 1 GPU, 1 file, 1 metric

What Karpathy keeps emphasizing here is simplicity.

If you use distributed training across multiple GPUs, you can train bigger models faster. But once communication between GPUs and data sharding enter the picture, it becomes harder to trace cause and effect: what the agent changed and what actually happened.

Limiting the system to one GPU gives three benefits:

  • The relationship between the agent’s change and the result is obvious
  • The environment is simple, so bugs are less likely to creep in
  • Fixed 5-minute runs make experiments comparable
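A fixed time budget, rather than a fixed step count, is straightforward to implement: keep taking training steps until the wall clock runs out. A sketch, with `train_step` as a hypothetical stand-in for one optimizer step:

```python
import time

def train_step():
    """Stand-in for one optimizer step; a real step takes far longer."""
    time.sleep(0.001)

def train_for(budget_seconds):
    """Run training steps until the wall-clock budget is spent, so every
    experiment consumes the same amount of compute time."""
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        train_step()
        steps += 1
    return steps

print(train_for(0.05), "steps completed")  # 300 s in the real setup
```

A side effect of this design is that a faster architecture gets more steps within the same budget, so speed improvements show up directly in the score.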

The constraints are what make it practical.

The README opens with a characteristically Karpathy-esque prediction:

Research used to be done by flesh computers that ate and slept, occasionally synchronizing through a strange acoustic ritual called a group meeting. That era is long gone. Research is now done by swarms of AI agents operating autonomously on compute clusters.

It is half joke, half serious vision, and Autoresearch is presented as the first step in that direction.

Setup and how to run it

What you need:

  • An NVIDIA GPU. H100 has been verified, though other GPUs can run it with different throughput
  • Python 3.10 or later
  • uv, the Python package manager
# Install dependencies
uv sync

# Download data and train the tokenizer (first run only, about 2 minutes)
uv run prepare.py

# Run one training pass manually to verify the setup (about 5 minutes)
uv run train.py

Once that works, you can launch an AI agent such as Claude Code or Codex in the repository and tell it to read program.md and start experimenting.

If you want to try it on a smaller machine such as a MacBook, community forks exist for macOS and Windows. Those versions usually swap the training data to TinyStories, a set of GPT-4-generated short stories, and reduce the model size.

The license is MIT. Because this is code, not a pretrained model release, you are free to run experiments on your own GPU.

If you want a model trained on your own writing, use a different approach

You might think that swapping the dataset would let you train on your own writing. In theory, yes, but Autoresearch trains a model from scratch, so it needs at least tens or hundreds of megabytes of text. A personal blog or a few drafts are nowhere near enough. TinyStories works only because it is a special case: very simple English sentences that can be learned from relatively little data.

If your real goal is “I want a model that writes like me,” fine-tuning is more practical. Take a pretrained model and continue training it on your own text. With LoRA or QLoRA, even a small dataset and a single GPU can shift the style.

The point of Autoresearch is not style learning. It is to automatically search for the best combination of model structure and hyperparameters. That is a different problem, so it should not be confused with fine-tuning.