karpathy/autoresearch — Repo Appraisal

Overview

What it is — A minimal autonomous ML research loop that instructs an AI agent to repeatedly modify a single training file (train.py), run a fixed 5-minute experiment, evaluate on val_bpb, and keep or discard the change — overnight, unattended.
Problem — Manual ML experimentation requires constant researcher attention for each iteration; this automates the modify-train-evaluate cycle so ~100 experiments can run while the researcher sleeps.
Who it's for — ML practitioners and researchers with a single NVIDIA GPU who want unattended architecture and hyperparameter search on a small LLM training setup, guided by a plain-text program.md brief.
Notable — Karpathy's signature three-file minimalism applied to autonomous research: the human edits program.md (the "research org"), the agent edits train.py, and a fixed wall-clock time budget makes all experiments directly comparable regardless of what the agent changes.
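
The modify-train-evaluate cycle described above amounts to a greedy keep-or-discard search. The following is a minimal sketch of that pattern, not the repo's actual code: hill_climb, toy_propose, and toy_evaluate are hypothetical names, and the toy objective stands in for a real 5-minute training run that reports val_bpb (lower is better).

```python
import math
import random
from typing import Callable, Tuple


def hill_climb(
    initial: dict,
    propose: Callable[[dict, random.Random], dict],
    evaluate: Callable[[dict], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[dict, float, int]:
    """Greedy keep-or-discard loop; a lower score is better (as with val_bpb)."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose(best, rng)   # the "agent" modifies the current best
        score = evaluate(candidate)      # stands in for one fixed-budget run
        if score < best_score:           # keep only strict improvements
            best, best_score = candidate, score
            kept += 1
        # otherwise the candidate is discarded and the previous best is reused
    return best, best_score, kept


# Hypothetical stand-ins so the sketch runs without a GPU.
def toy_propose(config: dict, rng: random.Random) -> dict:
    new_config = dict(config)  # copy so the caller's config is not mutated
    new_config["lr"] = config["lr"] * rng.choice([0.5, 2.0])
    return new_config


def toy_evaluate(config: dict) -> float:
    # Pretend the metric is minimized at lr = 1e-3.
    return abs(math.log10(config["lr"]) + 3.0) + 1.0
```

Calling hill_climb({"lr": 0.1}, toy_propose, toy_evaluate, n_experiments=20) returns the best config found, its score, and how many changes were kept. Because every evaluation gets the same budget, scores from different candidates can be compared directly.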

Verdict

	Rating	Summary
Quality	decent (15/24)	Exceptional adoption signal and thorough README undercut by no releases, no license, no tests, and activity that stalled about three weeks after creation, with no commits in the last two months.
PAI Relevance	WATCH (0.38)	The fixed-budget eval loop and `program.md` skill interface are worth studying, but Python-only with no callable CLI makes direct integration into PAI's TypeScript/Bun stack a non-starter.

Quality Assessment

15/24 — stale-risk / adequately-documented / solid

Health: 3/8 (stale-risk)

Failed:

H1: FAIL — No tagged releases exist; "Latest Release: none (none)"
H2: FAIL — No release exists, so the release-recency check cannot pass
H4: FAIL — Last commit 2026-03-26; today is 2026-05-27, ~62 days ago, exceeds 30-day threshold
H7: FAIL — License field is "none"; no LICENSE file referenced in README
H8: FAIL — README contains no CI badge, no .github/workflows/ reference

Passed:

H3: PASS — Last commit ~62 days ago, well within the 6-month threshold
H5: PASS — archived: false confirmed
H6: PASS — 52 open issues: above zero (active project) and below 100 (not overwhelmed)
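
The commit-age figures used in H3 and H4 can be reproduced with a short date calculation. This is a sketch only; treating "6 months" as 183 days is an assumption, since the rubric's exact day count is not stated here.

```python
from datetime import date

last_commit = date(2026, 3, 26)
today = date(2026, 5, 27)

# Days elapsed since the last commit.
age_days = (today - last_commit).days

# H4 threshold: commit within the last 30 days.
within_30_days = age_days <= 30

# H3 threshold: commit within the last 6 months (assumed to be 183 days).
within_6_months = age_days <= 183
```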

Documentation: 6/8 (adequately-documented)

Failed:

D5: FAIL — No heading matching API, Configuration, Options, Reference, Commands, or Parameters; "Design choices" is the nearest section but does not match the keywords
D7: FAIL — README links to tweets and a sibling GitHub repo (nanochat), not a dedicated docs site, wiki, or /docs directory

Passed:

D1: PASS — README is present and non-empty
D2: PASS — README is about 8 KB, well above the 1000-byte threshold
D3: PASS — "Quick start" section provides step-by-step install instructions using uv
D4: PASS — Multiple fenced code blocks appear under the Quick start and "Running the agent" headings
D6: PASS — Purpose is stated clearly in the opening paragraph: "give an AI agent a small but real LLM training setup and let it experiment autonomously overnight"
D8: PASS — "Platform support" section explicitly calls out NVIDIA-GPU-only limitation and why CPU/MPS are not supported

Engineering Signals: 6/8 (solid)

Failed:

E1: FAIL — Primary language is Python; not in the typed-language list (TypeScript, Rust, Go, Java, Kotlin, C#, Swift, Scala, Haskell)
E4: FAIL — README contains no mention of tests, no test script in manifest, no CI config

Passed:

E2: PASS — pyproject.toml is listed explicitly in the project structure section
E3: PASS — README states "no external dependencies beyond PyTorch and a few small packages," well under the 30-dep threshold
E5: PASS — 83,713 stars far exceeds the ≥50 threshold
E6: PASS — 83,713 stars over ~2.7 months ≈ 31,000 stars/month, far exceeds the ≥2/month threshold
E7: PASS — 12,162 forks far exceeds the ≥5 threshold
E8: PASS — Description is 67 characters and meaningfully describes the tool
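
The star-velocity estimate in E6 can be checked as follows. This sketch assumes the repo was created on the date of the earliest commit mentioned in the Notes (2026-03-06) and uses an average month length of 30.44 days; both are assumptions rather than values taken from the rubric.

```python
from datetime import date

created = date(2026, 3, 6)   # assumption: creation date equals first commit date
today = date(2026, 5, 27)
stars = 83_713

age_days = (today - created).days
age_months = age_days / 30.44          # average month length, an assumption
stars_per_month = stars / age_months   # roughly 31,000 under these assumptions
```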

PAI Relevance

Dimension	Score	Assessment
Harvest Value	1	The fixed-budget eval loop (5-min wall-clock, compare val_bpb) and `program.md` as a human-iterable "research org brief" are patterns worth studying for PAI's `Optimize` and `Science` skills, though neither is novel enough to directly rewrite a PAI subsystem.
Integration Readiness	0	Python-only with no exported CLI or structured output; invoked via `uv run train.py` as a standalone process. PAI's TypeScript/Bun stack uses a subprocess-first integration model, but with no structured output to parse, no clean wrapping path exists without significant glue.
Overlap Risk	1	Partial overlap with PAI's existing `Optimize` (metric hill-climbing) and `Science` (hypothesis-driven investigation) skills; the ML training domain is distinct but the abstract loop pattern is already represented.
Gap Fill	1	PAI has no ML training infrastructure whatsoever — this addresses a genuine functional gap — but automated ML experiment loops sit at the far periphery of PAI's personal-AI-infrastructure mission.

Composite: 0.38

What Next

Evaluating the fixed-budget experiment discipline on any existing small-scale training job: Set a hard wall-clock cap (5 or 10 minutes) on your next hyperparameter trial instead of a fixed step count, log your validation metric after each run, and compare trials directly. This borrows autoresearch's core comparability insight without touching the repo at all, and it tells you whether the discipline is useful in your workflow before the repo itself is stable enough to depend on.
Monitoring autoresearch's experiment logging maturity: The repo currently lacks structured logging of the agent's kept/discarded decisions, which is the main gap between "interesting prototype" and "usable research infrastructure." Watch the repo for a release or a structured experiment history file (a JSONL log of attempt → metric → decision); that addition is the signal that the loop produces auditable, replayable results rather than ephemeral overnight runs.
Bookmarking program.md as a research brief format: The idea of a plain-text, human-editable research brief that scopes an autonomous loop is transferable regardless of what happens to this repo. Save a copy of the current program.md template and revisit it in 3 months — if the repo has accumulated real experiment histories from the community, the brief format will have been stress-tested and refined, making it safe to adopt for your own research direction without having to infer the right structure from a single example.
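
The first two suggestions above (a hard wall-clock cap and a JSONL history of attempt → metric → decision) can be tried together in a few lines. This is a minimal sketch under stated assumptions: run_with_budget, append_record, and read_records are hypothetical helpers, and the record fields are illustrative rather than a format autoresearch defines.

```python
import json
import time
from pathlib import Path
from typing import Callable, List


def run_with_budget(step: Callable[[], None], budget_seconds: float) -> int:
    """Call step() until the wall-clock budget is spent; return the step count.

    The budget is checked between steps, so a single slow step can overrun it.
    """
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        step()
        steps += 1
    return steps


def append_record(log_path: Path, record: dict) -> None:
    """Append one experiment record as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_records(log_path: Path) -> List[dict]:
    """Load every record back, skipping blank lines."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A trial would then look like: steps = run_with_budget(train_one_step, 300), followed by append_record(Path("experiments.jsonl"), {"attempt": 1, "steps": steps, "val_metric": metric, "decision": "keep"}), where train_one_step and metric come from your own training job. An append-only file keeps earlier runs intact, which is what makes the history auditable afterwards.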

Landscape Position

Category: AI Research & Papers

In this category: first entry — no prior repos appraised in this category

Standing: Establishes the category; the only appraised AI research infrastructure repo in the vault, making it the default reference point for autonomous ML research loop patterns.

Evidence Base

Density: 6/10 — Available: README (full 8KB), stars, forks, open issues, creation/commit timestamps, language, archived status, description, fork list with platform variants. Missing: dependency manifest contents, CI config, release notes, contributor graph, actual experiment logs or benchmark results.

Notes

The repo's viral adoption (83K stars in under 3 months from a high-credibility author) is a strong signal that the core idea resonates, but the implementation is explicitly a single-developer prototype: no releases, no license, no tests, and a commit history that spans only 20 days (2026-03-06 to 2026-03-26). The community has forked it across platforms (macOS, Windows, AMD), which confirms genuine interest but also means the canonical repo may have already served its purpose as a seed. The program.md pattern, a plain-text brief that humans iterate on instead of code, is the most transferable concept and worth tracking as it appears in other repos.