AI agents running research on single-GPU nanochat training automatically
Alpha-grade research demo. Active development window was roughly 20 days (2026-03-06 to 2026-03-26), with no formal releases and no commits in the ~47 days since. No versioning scheme, no changelog, no CI. The README's own framing ("this is the story of how it all began") signals it was always intended as a proof-of-concept snapshot rather than a maintained product. Consistent with Karpathy's prior pattern (nanoGPT, llm.c): release a clean, influential demo and move on. The README states the license is MIT, but the repo metadata records none: a minor but real hygiene gap.
README is unusually thorough for a research demo: concept rationale, architecture walkthrough (three files, what each does), quick-start commands, design decision justifications (fixed time budget, single-file scope), platform guidance for lower-compute setups with specific hyperparameter knobs, and a curated notable-forks section. No docs beyond the README and no API reference, but the codebase is intentionally minimal so this is appropriate. program.md as the human-facing "skill" file is a clever abstraction that is self-documenting.
Minimal surface area is a genuine quality signal: three files, single metric (val_bpb), no distributed training complexity. Uses uv for dependency management, but no dependency manifest was available for inspection, so dependency hygiene is unknown. No tests, no CI. The "agent modifies train.py" contract is clean and reviewable. Code quality is hard to rate without seeing the source, but the architectural constraints enforce simplicity.
No commits after 2026-03-26. 51 open issues against 80K+ stars is a surprisingly low ratio, suggesting either very good early triage or that issue engagement has stopped. No recently merged PRs are visible, and there is no response-cadence data. The repo is effectively frozen: stable as a reference but not evolving. Active development is happening in the forks (four listed, covering macOS, Windows, AMD).
80,555 stars and 11,725 forks are exceptional signals for a demo built over roughly three weeks. Multiple organized community forks covering major platforms appeared within days. Referenced directly in two Karpathy tweets with wide reach. The community-generated "Dummy's Guide" linked in the README demonstrates organic knowledge diffusion. This is one of the most-starred repos created in 2026 and clearly seeded a subfield.
Overall: 3.0/5
Category: Autonomous ML Research
Known alternatives in vault: geeknik--HyperTune (LLM Hyperparameter Optimization)
Differentiation: autoresearch operates at a higher abstraction level than HyperTune. Rather than tuning hyperparameters within a fixed training pipeline, it hands the entire train.py (architecture, optimizer, batch size, everything) to an AI agent to modify freely within a fixed wall-clock budget. The val_bpb metric makes experiments across different architectures directly comparable. The program.md "research org code" pattern is a novel interface primitive not present in HyperTune. HyperTune, in contrast, offers more structured search strategies and is designed for human-in-the-loop tuning rather than overnight autonomous runs.
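To illustrate why a bits-per-byte metric stays comparable across tokenizers and architectures, here is a minimal sketch using the standard definition of bits per byte (total negative log-likelihood in nats, divided by ln 2 and by the byte count of the evaluated text). This is not taken from the repo's source, which was not inspected; the function name `bits_per_byte` and the two example models are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    Normalizing by bytes of raw text, not by token count, removes the
    dependence on how the tokenizer splits the text.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Hypothetical numbers: two models score the same 1000-byte validation text.
# Model A uses a coarse tokenizer (250 tokens, 2.772 nats per token).
# Model B uses a finer tokenizer (500 tokens, 1.386 nats per token).
bpb_a = bits_per_byte(250 * 2.772, 1000)
bpb_b = bits_per_byte(500 * 1.386, 1000)

# Per-token losses differ by a factor of two, yet both models spend the
# same number of bits per byte of text.
print(f"model A: {bpb_a:.4f} bpb, model B: {bpb_b:.4f} bpb")
```

Per-token loss would rank model B far ahead of model A here, while the per-byte view shows they compress the text equally well, which is the property that lets an agent change the tokenizer or vocabulary without invalidating the comparison.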
Gap or crowd: This category has one prior entry (HyperTune, rated 2.9) and is explicitly flagged as needing alternatives. autoresearch fills the gap from a distinct angle — agentic free-form code modification rather than structured hyperparameter search — making it complementary rather than redundant.
Score: 4/5
Harvestable: (1) The program.md skill file pattern — using a Markdown file as a lightweight agent instruction set that the human iterates on while the agent iterates on code — is directly applicable to PAI skill authoring. (2) The fixed-budget experiment loop (modify → train 5 min → compare metric → keep/discard) is a general-purpose autonomous research loop template extractable for any evaluation-driven agent task. (3) The single-metric, vocab-size-independent evaluation (val_bpb) design discipline is a model for PAI benchmark design.
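The fixed-budget loop in item (2) can be sketched as a simple keep-or-discard search. This is an illustrative template only, not the repo's implementation: `research_loop`, `Result`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the toy evaluator returns a made-up score where a real one would train for the fixed budget and report val_bpb.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    candidate: str   # opaque stand-in for the contents of the mutable file
    val_bpb: float   # lower is better


def research_loop(
    baseline: str,
    propose: Callable[[str], str],
    evaluate: Callable[[str, float], float],
    rounds: int,
    budget_seconds: float,
) -> tuple[Result, list[Result]]:
    """Modify, evaluate under a fixed budget, compare, keep or discard."""
    best = Result(baseline, evaluate(baseline, budget_seconds))
    history = [best]
    for _ in range(rounds):
        candidate = propose(best.candidate)         # agent edits the current best
        score = evaluate(candidate, budget_seconds) # same budget for every run
        result = Result(candidate, score)
        history.append(result)
        if result.val_bpb < best.val_bpb:           # keep only strict improvements
            best = result
    return best, history


# Toy stand-ins (assumptions for illustration, no real training happens).
rng = random.Random(0)


def toy_propose(source: str) -> str:
    return source + rng.choice(["a", "bb", "ccc"])


def toy_evaluate(source: str, budget_seconds: float) -> float:
    return 1.0 + (len(source) % 7) / 10.0


best, history = research_loop("x", toy_propose, toy_evaluate,
                              rounds=10, budget_seconds=300.0)
print(f"ran {len(history)} evaluations, best val_bpb = {best.val_bpb:.2f}")
```

Because every candidate gets the same budget and is judged on one number, the loop needs no knowledge of what the agent changed, which is what makes the template reusable for other evaluation-driven tasks.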
Integration path: Most immediately useful as a reference implementation for designing agent-driven self-improvement loops within a PAI system. The program.md pattern could be adopted as a PAI skill template format. With an H100 or equivalent, it could run as a scheduled overnight research job surfacing results to the vault each morning. The three-file contract (fixed utilities, agent-modified logic, human-authored instructions) maps cleanly onto PAI's tool/skill/hook architecture.
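The three-file contract can be made checkable with a small guard that compares a set of changed paths against who is allowed to write what. This is a sketch under assumptions: only train.py and program.md are named in this note, so everything else is treated as read-only, and `WRITABLE_BY` and `violations` are hypothetical names rather than anything the repo provides.

```python
# Which files each role may modify. Anything not listed is treated as
# fixed (read-only) for that role. The mapping is an assumption drawn
# from the roles described in this note.
WRITABLE_BY: dict[str, set[str]] = {
    "agent": {"train.py"},     # agent-modified logic
    "human": {"program.md"},   # human-authored instructions
}


def violations(actor: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths that the given actor is not allowed to touch."""
    allowed = WRITABLE_BY.get(actor, set())
    return sorted(p for p in changed_paths if p not in allowed)


# An agent run that stayed inside its lane.
print(violations("agent", ["train.py"]))

# An agent run that also rewrote its own instructions should be rejected.
print(violations("agent", ["train.py", "program.md"]))
```

A check like this could run before an experiment's result is accepted, so that a run which edited its own instructions or the fixed utilities is discarded regardless of its metric.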
Overlap with existing: Partial conceptual overlap with geeknik--HyperTune on the "improving LLM training outcomes autonomously" axis, but the mechanisms are distinct enough that both warrant retention.
Adoption cost: Moderate. Running autoresearch as-is requires an H100-class GPU. Extracting the program.md pattern and the fixed-budget loop design for PAI integration is trivial. Adapting the full pipeline to smaller compute requires consulting the community forks and the README's well-documented tuning guide.
This is a high-signal, low-maintenance reference repo. Its primary value to a PAI vault is conceptual and structural rather than operational: the program.md as agent instruction primitive, the fixed-budget experiment contract, and the three-role file split (immutable utilities / agent-mutable logic / human-mutable instructions) are all patterns worth internalizing. The star/fork velocity confirms community resonance. The absence of a formal license in repo metadata (contradicting README's MIT claim) should be resolved before any derivative use. The repo is effectively feature-complete by design — its "abandoned" appearance is intentional, not neglect. Recommend tagging as a reference implementation rather than an active dependency.