garrytan/gbrain-evals

No description

HTML | 112 stars | Personal Knowledge Retrieval Benchmarking | GitHub

Standalone Assessment

Maturity: 2/5

Alpha: the repo was 20 days old at last commit, has no tagged releases, and lists roadmap items (ConvoMem, LoCoMo) explicitly as TODO. The benchmark suite itself is functional and reproducible (a full 500-question LongMemEval_s run was published 2026-05-07), but the project is pre-v1 by its own versioning (0.1.0). There are two open issues; the repo is too young to assess issue-triage cadence.

Documentation: 4/5

README is detailed and actionable: purpose statement, results table with linked dated reports, 5-minute quickstart with exact shell commands, fictional corpus descriptions with size/content breakdowns, full repo layout, and a 12-category BrainBench Cat catalog with thresholds and status. It is supplemented by dated markdown reports in docs/benchmarks/, cross-system baselines in docs/comparison-systems.md, and a CLAUDE.md spec. The only gaps are the absence of a contributing guide and of a CI badge.

Code Quality: 3/5

TypeScript on Bun — a modern, well-typed stack. Dependency manifest is clean: two runtime deps (@anthropic-ai/sdk, gbrain via GitHub URL) and two dev deps. Schemas directory contains portable JSON Schema contracts. Test directory (test/eval/) exists. Primary language registers as HTML because benchmark reports dominate file count, obscuring the actual TypeScript runtime. Cannot assess CI status or coverage — no badge visible and no workflow files referenced in README.

Maintenance: 4/5

Commits span the full repo lifetime (2026-04-22 to 2026-05-07) with visible dated benchmark reports as commit milestones. Author is clearly working actively. Pattern of dated report files (2026-05-07-longmemeval-s.md, 2026-04-25-brainbench-cat13b-source-swamp.md) suggests disciplined cadence. Too young (3 weeks) to measure long-term maintenance reliability.

Adoption: 3/5

112 stars in ~3 weeks is above-average for a niche eval tool targeting personal knowledge agent developers. 19 forks suggest practitioners are cloning to run benchmarks against their own stacks. No downstream dependents visible. Audience ceiling is narrow — this is tooling for builders of gbrain-compatible systems specifically.

Overall: 3.1/5

Competitive Positioning

Category: Personal Knowledge Retrieval Benchmarking

Known alternatives in vault: NorthwoodsSentinel--loam (Personal AI Memory, 2.2/5), UnluckyMycologist68--palimpsest (Personal AI Memory, 1.1/5)

Differentiation: gbrain-evals occupies a distinct layer: it is a benchmark harness, not a knowledge system. loam and palimpsest are storage/retrieval implementations; this repo tests implementations like them.

Unique features: sealed qrels enforcement (Day 9 protocol preventing gold-data leakage), 12-category BrainBench Cat taxonomy with explicit thresholds, reproducible fictional corpus generation (amara-life-v1 seeded at 42), multi-adapter comparison pattern across grep-only / vector / RRF-fusion / hybrid, and published cross-system baselines against MemPalace, Hindsight, Mastra, Stella, Contriever, and BM25. No vault repo provides evaluation infrastructure of any kind.

Gap or crowd: Fills a genuine gap. The vault has zero retrieval evaluation tooling. This is the only repo in the vault that answers "how do you know your knowledge retrieval is any good?"
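For readers unfamiliar with the RRF-fusion adapter type named among the unique features, the following is a minimal sketch of reciprocal rank fusion in general. It is not taken from gbrain-evals: the function name `rrfFuse`, the sample document ids, and the constant k = 60 (a commonly used default) are assumptions for illustration, and the repo's own adapter may differ.

```typescript
// Reciprocal Rank Fusion (RRF): a generic sketch, not gbrain-evals' code.
// Each input is a ranked list of document ids, best first.
// k = 60 is a commonly used smoothing constant; the value used by the
// repo's RRF adapter is not stated in this review.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first; break ties by id for determinism.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([docId]) => docId);
}

// Hypothetical results from a keyword adapter and a vector adapter.
const keywordHits = ["doc-a", "doc-b", "doc-c"];
const vectorHits = ["doc-b", "doc-d", "doc-a"];
const fused = rrfFuse([keywordHits, vectorHits]);
console.log(fused);
```

With these two lists, doc-b comes out first because it sits near the top of both rankings, which is the behavior that makes fusion a useful baseline alongside grep-only and vector-only adapters.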

PAI Fit

Score: 4/5

Harvestable: The 12-category BrainBench Cat taxonomy (with threshold values for P@5, R@5, recall, citation accuracy, latency p95, WER) is directly usable as a quality rubric for any PAI knowledge subsystem. The multi-adapter benchmarking pattern (running the same query set through grep-only, vector, and hybrid adapters in parallel) is extractable. The amara-life-v1 corpus generator (amara-life-gen.ts, seed=42, ~$4 Opus) is a reproducible synthetic-life fixture applicable to any personal memory evaluation. The sealed-qrels pattern (gold metadata never crossing the adapter boundary) is an architectural discipline worth adopting. The LongMemEval runner and its stratified sampling approach (--stratify 10) are directly applicable to any session-memory system.

Integration path: Two routes. (1) Conceptual harvest: adopt the BrainBench Cat taxonomy as a quality rubric for the PAI knowledge vault and run BrainBench Cat 1–4 against whatever retrieval layer the PAI uses; this requires writing one adapter shim. (2) Direct run: if PAI uses gbrain as its retrieval backend, bun install + bun run eval:brainbench:smoke works out of the box. Route 1 is the realistic path for a system with its own retrieval stack.

Overlap with existing: loam and palimpsest overlap in subject matter (personal AI memory) but not in function. Neither provides evaluation infrastructure, so there is no functional duplication in the vault.

Adoption cost: Moderate. Harvesting the Cat taxonomy and metrics methodology is low-effort (documentation read + rubric transcription). Writing a new adapter to run BrainBench against a non-gbrain retrieval system requires implementing the runCatN harness interface, estimated at 1–3 days of engineering. Full LongMemEval integration requires an OpenAI embeddings key and a one-time 278MB dataset download but is otherwise scripted.
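The sealed-qrels discipline and the multi-adapter P@5 / R@5 comparison described under Harvestable can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's runCatN interface: the `Adapter` and `Qrel` types, the `evaluate` function, and the toy adapters and document ids are all hypothetical.

```typescript
// Sketch only: these types and names are hypothetical, not the repo's
// actual runCatN harness API.
interface Adapter {
  name: string;
  // Adapters receive only the query text, never the gold labels (sealed qrels).
  search(query: string): string[];
}

interface Qrel {
  query: string;
  relevant: Set<string>; // gold document ids, held by the harness only
}

function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Runs every adapter over the same query set and averages P@k and R@k.
function evaluate(adapters: Adapter[], qrels: Qrel[], k = 5) {
  return adapters.map((adapter) => {
    let p = 0;
    let r = 0;
    for (const { query, relevant } of qrels) {
      const ranked = adapter.search(query); // only the query crosses the boundary
      p += precisionAtK(ranked, relevant, k);
      r += recallAtK(ranked, relevant, k);
    }
    return { name: adapter.name, pAtK: p / qrels.length, rAtK: r / qrels.length };
  });
}

// Toy data: one query with two relevant documents, two stand-in adapters.
const qrels: Qrel[] = [{ query: "q1", relevant: new Set(["d1", "d2"]) }];
const adapters: Adapter[] = [
  { name: "grep-only", search: () => ["d1", "x1", "x2", "x3", "x4"] },
  { name: "vector", search: () => ["d2", "d1", "x1", "x2", "x3"] },
];
const results = evaluate(adapters, qrels);
console.log(results);
```

Keeping the gold set inside `evaluate` and passing adapters nothing but the query string is the point of the design: an adapter cannot accidentally tune itself to the answers, and a new retrieval stack only has to implement `search` to be compared.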

Notes

The gbrain dependency points at github:garrytan/gbrain#master, a floating branch ref with no version lock. This is a stability risk for anyone cloning and expecting reproducible runs weeks from now: changes on upstream gbrain master could silently shift benchmark results. Worth noting if the PAI vault tracks this for longitudinal comparison.

The fictional corpus (amara-life-v1) is a particularly clever asset: its mix of 50 emails, 300 Slack messages, and planted contradictions and poison items mirrors the messiness of real personal data in ways synthetic QA datasets do not. The cost to regenerate ($4 Opus, 15 min, deterministic) is low enough to make it a practical test fixture for other PAI memory experiments.

The fork-to-star ratio (19/112 ≈ 17%) is high for an eval repo, suggesting actual practitioner use rather than passive interest.
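For the floating-ref risk raised in the first note, a small check along these lines could flag dependency specs that are not locked to a commit. It is a conservative heuristic sketch: `isFloatingGitRef` is a hypothetical helper, it treats only a full 40-character commit SHA as locked, and the SHA in the example is a placeholder, not a real gbrain commit.

```typescript
// Heuristic sketch: flags git-style dependency specs (for example
// "github:owner/repo#ref") that do not end in a full 40-character commit SHA.
// Branch names such as "master" can move, and tags can be re-pointed, so only
// a full commit SHA is treated as locked here. Pass git-style specs only;
// registry version ranges are out of scope for this check.
function isFloatingGitRef(spec: string): boolean {
  const hashIndex = spec.indexOf("#");
  if (hashIndex === -1) return true; // no ref at all: resolves to the default branch
  const ref = spec.slice(hashIndex + 1);
  return !/^[0-9a-f]{40}$/i.test(ref);
}

// The spec quoted in this review is a branch ref, so it is flagged.
console.log(isFloatingGitRef("github:garrytan/gbrain#master"));

// Placeholder SHA for illustration only; not a real gbrain commit.
console.log(isFloatingGitRef("github:garrytan/gbrain#0123456789abcdef0123456789abcdef01234567"));
```

Recording the resolved commit next to each dated benchmark report would serve the same purpose for longitudinal comparison, even if the dependency spec itself stays on a branch.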