No description
Alpha: the repo was 20 days old at its last commit, has no tagged releases, and explicitly files its roadmap items (ConvoMem, LoCoMo) as TODO. The benchmark suite itself is functional and reproducible (full 500-question LongMemEval_s run published 2026-05-07), but the project is pre-v1 by its own versioning (0.1.0). There are two open issues, and the repo is too young to assess issue-triage cadence.
README is detailed and actionable: purpose statement, results table with linked dated reports, 5-minute quickstart with exact shell commands, fictional corpus descriptions with size/content breakdowns, full repo layout, and a 12-category BrainBench Cat catalog with thresholds and status. It is supplemented by dated markdown reports in docs/benchmarks/, cross-system baselines in docs/comparison-systems.md, and a CLAUDE.md spec. The only gaps are the absence of a contributing guide and of a CI badge.
TypeScript on Bun — a modern, well-typed stack. Dependency manifest is clean: two runtime deps (@anthropic-ai/sdk, gbrain via GitHub URL) and two dev deps. Schemas directory contains portable JSON Schema contracts. Test directory (test/eval/) exists. Primary language registers as HTML because benchmark reports dominate file count, obscuring the actual TypeScript runtime. Cannot assess CI status or coverage — no badge visible and no workflow files referenced in README.
Commits span the full repo lifetime (2026-04-22 to 2026-05-07) with visible dated benchmark reports as commit milestones. Author is clearly working actively. Pattern of dated report files (2026-05-07-longmemeval-s.md, 2026-04-25-brainbench-cat13b-source-swamp.md) suggests disciplined cadence. Too young (3 weeks) to measure long-term maintenance reliability.
112 stars in ~3 weeks is above-average for a niche eval tool targeting personal knowledge agent developers. 19 forks suggest practitioners are cloning to run benchmarks against their own stacks. No downstream dependents visible. Audience ceiling is narrow — this is tooling for builders of gbrain-compatible systems specifically.
Overall: 3.1/5
Category: Personal Knowledge Retrieval Benchmarking
Known alternatives in vault: NorthwoodsSentinel--loam (Personal AI Memory, 2.2/5), UnluckyMycologist68--palimpsest (Personal AI Memory, 1.1/5)
Differentiation: gbrain-evals occupies a distinct layer: it is a benchmark harness, not a knowledge system. loam and palimpsest are storage/retrieval implementations; this repo tests implementations like them.
Unique features: sealed qrels enforcement (Day 9 protocol preventing gold-data leakage), 12-category BrainBench Cat taxonomy with explicit thresholds, reproducible fictional corpus generation (amara-life-v1 seeded at 42), multi-adapter comparison pattern across grep-only / vector / RRF-fusion / hybrid, and published cross-system baselines against MemPalace, Hindsight, Mastra, Stella, Contriever, and BM25. No vault repo provides evaluation infrastructure of any kind.
Gap or crowd: Fills a genuine gap. The vault has zero retrieval evaluation tooling. This is the only repo in the vault that answers "how do you know your knowledge retrieval is any good?"
Score: 4/5
Harvestable: The 12-category BrainBench Cat taxonomy (with threshold values for P@5, R@5, recall, citation accuracy, latency p95, WER) is directly usable as a quality rubric for any PAI knowledge subsystem. The multi-adapter benchmarking pattern — running the same query set through grep-only, vector, and hybrid adapters in parallel — is extractable. The amara-life-v1 corpus generator (amara-life-gen.ts, seed=42, ~$4 Opus) is a reproducible synthetic-life fixture applicable to any personal memory evaluation. The sealed-qrels pattern (gold metadata never crossing the adapter boundary) is an architectural discipline worth adopting. LongMemEval runner and stratified sampling approach (--stratify 10) are directly applicable to any session-memory system.
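The multi-adapter and sealed-qrels patterns described above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the repo's actual harness: the names Adapter, Qrels, precisionAtK, recallAtK, and compareAdapters are all hypothetical, and the only point is that adapters receive query text while the gold labels stay on the scoring side.

```typescript
// Hypothetical types for illustration; none of these names come from gbrain-evals.
type Query = { id: string; text: string };

type Adapter = {
  name: string;
  // The adapter sees only the query text and k, never the gold labels.
  search: (queryText: string, k: number) => Promise<string[]>;
};

// Gold relevance judgments: query id -> set of relevant document ids.
type Qrels = Map<string, Set<string>>;

// Count distinct relevant ids in the top k, so a repeated id is counted once.
function countHits(ranked: string[], relevant: Set<string>, k: number): number {
  const topK = Array.from(new Set(ranked.slice(0, k)));
  return topK.filter((id) => relevant.has(id)).length;
}

// Precision@k: fraction of the top-k slots filled by relevant documents.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return k > 0 ? countHits(ranked, relevant, k) / k : 0;
}

// Recall@k: fraction of all relevant documents found in the top k.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return relevant.size > 0 ? countHits(ranked, relevant, k) / relevant.size : 0;
}

type AdapterScore = { adapter: string; pAtK: number; rAtK: number };

// Run the same query set through every adapter and score each one.
// The qrels map is read only here, on the scorer's side of the boundary.
async function compareAdapters(
  adapters: Adapter[],
  queries: Query[],
  qrels: Qrels,
  k = 5,
): Promise<AdapterScore[]> {
  return Promise.all(
    adapters.map(async (adapter) => {
      let pSum = 0;
      let rSum = 0;
      for (const q of queries) {
        const ranked = await adapter.search(q.text, k);
        const relevant = qrels.get(q.id) ?? new Set<string>();
        pSum += precisionAtK(ranked, relevant, k);
        rSum += recallAtK(ranked, relevant, k);
      }
      const n = Math.max(queries.length, 1);
      return { adapter: adapter.name, pAtK: pSum / n, rAtK: rSum / n };
    }),
  );
}
```

Wrapping an existing retrieval layer then amounts to supplying one object with a name and a search function; the scoring code does not change between grep-only, vector, and hybrid variants.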
Integration path: Two routes. (1) Conceptual harvest: adopt the BrainBench Cat taxonomy as a quality rubric for the PAI knowledge vault, run BrainBench Cat 1–4 against whatever retrieval layer the PAI uses — requires writing one adapter shim. (2) Direct run: if PAI uses gbrain as its retrieval backend, bun install + bun run eval:brainbench:smoke works out of the box. Route 1 is the realistic path for a system with its own retrieval stack.
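For Route 1, using the taxonomy as a rubric comes down to comparing measured metrics against per-metric thresholds, some of which are floors (P@5, R@5) and some ceilings (latency p95, WER). The sketch below assumes a hypothetical Threshold shape and uses placeholder numbers; the real threshold values are the ones listed in the repo's Cat catalog and are not reproduced here.

```typescript
// "min": the measured value must be at least the threshold (e.g. P@5, R@5).
// "max": the measured value must be at most the threshold (e.g. latency p95, WER).
type Direction = "min" | "max";

type Threshold = { metric: string; direction: Direction; value: number };

type RubricResult = { metric: string; measured: number | undefined; pass: boolean };

// Placeholder numbers for illustration only, NOT the repo's actual thresholds.
const exampleRubric: Threshold[] = [
  { metric: "p_at_5", direction: "min", value: 0.5 },
  { metric: "r_at_5", direction: "min", value: 0.8 },
  { metric: "latency_p95_ms", direction: "max", value: 500 },
];

// Evaluate each rubric line; a metric that was never measured counts as a failure.
function checkRubric(
  rubric: Threshold[],
  measured: Partial<Record<string, number>>,
): RubricResult[] {
  return rubric.map(({ metric, direction, value }) => {
    const m = measured[metric];
    const pass = m !== undefined && (direction === "min" ? m >= value : m <= value);
    return { metric, measured: m, pass };
  });
}
```

Treating a missing metric as a failure rather than skipping it keeps a partially instrumented retrieval layer from passing the rubric by omission.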
Overlap with existing: loam and palimpsest overlap in subject matter (personal AI memory) but not function — neither provides evaluation infrastructure, so there is no functional duplication in the vault.
Adoption cost: Moderate. Harvesting the Cat taxonomy and metrics methodology is low-effort (documentation read + rubric transcription). Writing a new adapter to run BrainBench against a non-gbrain retrieval system requires implementing the runCatN harness interface, an estimated 1–3 days of engineering. Full LongMemEval integration requires an OpenAI embeddings key and a one-time 278MB dataset download but is otherwise scripted.
The gbrain dependency points at github:garrytan/gbrain#master, a floating ref with no version lock. This is a stability risk for anyone cloning and expecting reproducible runs weeks from now: changes on upstream gbrain master could silently shift benchmark results. Worth noting if the PAI vault tracks this for longitudinal comparison.
The fictional corpus (amara-life-v1) is a particularly clever asset: its 50 emails, 300 Slack messages, planted contradictions, and poison items mirror realistic personal data messiness in ways synthetic QA datasets don't. The cost to regenerate ($4 Opus, 15 min, deterministic) is low enough to make it a practical test fixture for other PAI memory experiments.
The fork-to-star ratio (19/112 ≈ 17%) is high for an eval repo, suggesting actual practitioner use rather than passive interest.
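The floating-ref risk on the gbrain dependency is easy to check mechanically. The helper below is a conservative heuristic sketch, not part of the repo: isFloatingGitRef is a hypothetical name, and it treats anything other than a full 40-character commit SHA after the "#" as floating, which also flags tags and short SHAs.

```typescript
// Returns true when a "github:owner/repo#ref" dependency spec does not end in a
// full 40-character hex commit SHA. Branch names, tags, and short SHAs are all
// reported as floating; that is deliberately strict, not a package-manager rule.
function isFloatingGitRef(spec: string): boolean {
  const hashIndex = spec.indexOf("#");
  if (hashIndex === -1) return true; // no ref at all: resolves to the default branch
  const ref = spec.slice(hashIndex + 1);
  return !/^[0-9a-f]{40}$/i.test(ref);
}

// The spec quoted in the review is a branch name, so it is flagged.
const gbrainSpec = "github:garrytan/gbrain#master";
console.log(`${gbrainSpec} floating: ${isFloatingGitRef(gbrainSpec)}`);
```

For longitudinal comparison, recording the resolved gbrain commit alongside each benchmark result would serve the same purpose without requiring a change upstream.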