garrytan/gbrain-evals — Repo Appraisal

Overview

What it is — A reproducible benchmark suite for personal-knowledge agent stacks, pairing an in-house 12-category BrainBench (run against a committed 240-page fictional corpus) with public benchmark runners for LongMemEval, all evaluated across multiple retrieval adapters.
Problem — Personal AI memory and knowledge retrieval systems lack shared, reproducible benchmarks with committed corpora and sealed gold labels; gbrain-evals solves this by shipping corpora, qrels, and multi-adapter runners in a single installable package.
Who it's for — Developers building or comparing personal knowledge agent stacks. The primary audience is garrytan/gbrain users, though the cross-system comparison table also covers MemPalace, Mastra, Stella, and Contriever baselines.
Notable — Claims SOTA 97.60% R@5 on the public LongMemEval _s split without an LLM in the retrieval loop, and demonstrates zero retrieval regression across 20 consecutive gbrain releases.

Verdict

	Rating	Summary
Quality	solid (17/24)	Actively maintained, adequately documented benchmark suite with strong adoption signals; the main gaps are the absence of formal releases and HTML as the GitHub-reported primary language
PAI Relevance	integrate (0.50)	BrainBench's sealed-qrels enforcement, planted-perturbation corpus design, and multi-adapter harness pattern offer concrete methodology to extract for PAI's Evals skill

The composite sits exactly at the 0.50 INTEGRATE threshold. The practical integration action is methodology study and pattern extraction rather than running gbrain-evals as-is against PAI: every adapter calls into gbrain/* subpath exports, so the adapter layer would need full replacement.

Quality Assessment

17/24 — maintained / adequately-documented / solid

Health: 5/8 (maintained)

Failed:

H1: FAIL — no tagged releases; Latest Release is "none (none)"
H2: FAIL — no releases exist, so recency of release cannot be assessed
H8: FAIL — README contains no CI badge and no reference to .github/workflows/

Passed:

H3: PASS — last commit 2026-05-24, three days before appraisal date, well within 6 months
H4: PASS — last commit 2026-05-24 is within 30 days of appraisal date 2026-05-27
H5: PASS — archived: false
H6: PASS — 3 open issues; greater than 0 and fewer than 100
H7: PASS — MIT license present

Documentation: 6/8 (adequately-documented)

Failed:

D5: FAIL — no heading matching API, Configuration, Options, Reference, Commands, or Parameters; the eval:* scripts appear only in package.json, not a README heading
D8: FAIL — no Limitations, Caveats, Known Issues, or Trade-offs section; open TODOs are noted inline but not collected in a dedicated caveat block

Passed:

D1: PASS — README is present with extensive content
D2: PASS — README is several thousand words, far exceeding the 1000-byte threshold
D3: PASS — "5-minute quickstart" section gives explicit git clone and bun install steps
D4: PASS — multiple shell code blocks appear under "Run LongMemEval" and "Run BrainBench" headings
D6: PASS — opening sentence is "Public benchmarks for personal-knowledge agent stacks" — purpose clear within first 500 characters
D7: PASS — README links repeatedly into the docs/benchmarks/ directory tree and to individual benchmark reports

Engineering Signals: 6/8 (solid)

Failed:

E1: FAIL — GitHub reports primary language as HTML; committed benchmark report pages dominate the byte count even though all runner scripts are TypeScript
E8: FAIL — GitHub description field is "No description" — empty and not meaningful per the probe

Passed:

E2: PASS — package.json is committed and fully provided
E3: PASS — 2 direct deps (@anthropic-ai/sdk, gbrain) and 2 dev deps; well under the 30-dep ceiling
E4: PASS — package.json defines "test": "bun test test/eval/" and multiple eval:* runner scripts
E5: PASS — 186 stars exceeds the 50-star threshold
E6: PASS — created 2026-04-22, ~35 days before the 2026-05-27 appraisal; 186 stars over that span is approximately 159 stars/month, far above the ≥2 threshold
E7: PASS — 33 forks exceeds the 5-fork threshold
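
The E6 growth rate can be reproduced from the repo metadata quoted in this section. The sketch below assumes a 30-day month and uses a hypothetical helper name, `starsPerMonth`; the probe's exact formula is not shown in this appraisal, so treat this as one plausible reading rather than the probe's definition.

```typescript
// Hypothetical helper: average stars per 30-day month between two ISO dates.
// Date-only ISO strings are parsed as UTC, so the difference is a whole number of days.
function starsPerMonth(stars: number, createdAt: string, appraisedAt: string): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  const days = (Date.parse(appraisedAt) - Date.parse(createdAt)) / msPerDay;
  if (days <= 0) {
    throw new Error("appraisal date must be after creation date");
  }
  return (stars / days) * 30;
}

// Values taken from E5 and E6: 186 stars, created 2026-04-22, appraised 2026-05-27.
const elapsedDays = (Date.parse("2026-05-27") - Date.parse("2026-04-22")) / 86_400_000;
const rate = starsPerMonth(186, "2026-04-22", "2026-05-27");

console.log(elapsedDays);      // 35
console.log(Math.round(rate)); // 159
```

Because the repo is barely a month old, the rate is sensitive to the month length chosen; it clears the threshold by a wide margin either way.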

PAI Relevance

Dimension	Score	Assessment
Harvest Value	1	The sealed-qrels enforcement pattern (gold `_facts` metadata that never crosses the adapter boundary), planted-perturbation corpus design (10 contradictions, 5 stale facts, 5 paraphrased-injection poison items), and 12-category BrainBench structure each offer concrete techniques worth extracting for PAI's Evals skill, which currently handles general agent evaluation rather than retrieval-quality benchmarking specifically.
Integration Readiness	1	All runner scripts are TypeScript and Bun-native (stack match), but every adapter imports from `gbrain/*` subpath exports; adapting the harness to PAI's Knowledge subsystem requires replacing the adapter layer — moderate glue, not drop-in.
Overlap Risk	1	Partial overlap with PAI's existing Evals skill (AI agent evaluation); gbrain-evals focuses specifically on retrieval-quality metrics (P@5, R@5, per-type F1, latency p95), a narrower sub-domain the Evals skill does not deeply cover.
Gap Fill	1	PAI has no dedicated retrieval-quality benchmark harness for its Knowledge subsystem; gbrain-evals addresses that functional gap but requires adaptation away from gbrain internals before it can measure PAI's retrieval pipeline.
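
The sealed-qrels idea in the Harvest Value row can be sketched as a boundary enforced both by types and at runtime. Everything below is hypothetical (`CorpusPage`, `AdapterPage`, `sealForAdapter`); the appraisal did not have access to the gbrain-evals source, so this illustrates the pattern rather than the repo's implementation. Only the `_facts` field name comes from the README.

```typescript
// Hypothetical corpus shape: `_facts` holds gold labels and is harness-only.
interface CorpusPage {
  slug: string;
  body: string;
  _facts?: Record<string, unknown>;
}

// The adapter-facing type has no `_facts` key, so adapter code cannot read it by accident.
type AdapterPage = Omit<CorpusPage, "_facts">;

// Strip gold metadata before any page crosses the adapter boundary.
// Returns new objects; the original corpus is left untouched for scoring.
function sealForAdapter(pages: CorpusPage[]): AdapterPage[] {
  return pages.map(({ _facts, ...visible }) => visible);
}

// Illustrative data only, not taken from the BrainBench corpus.
const corpus: CorpusPage[] = [
  { slug: "people/ada", body: "Ada joined Acme in 2021.", _facts: { employer: "Acme" } },
];

const sealed = sealForAdapter(corpus);
console.log("_facts" in sealed[0]);          // false
console.log(corpus[0]._facts !== undefined); // true
```

The type alone is not enough, since an object that still carries `_facts` is assignable to `AdapterPage` at runtime; the explicit strip is what keeps the labels on the harness side.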

Composite: 0.50
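
For readers unfamiliar with the retrieval metrics named in the Overlap Risk row, the sketch below shows textbook definitions of P@k, R@k, and a nearest-rank p95. The function names are hypothetical, and gbrain-evals may define these differently (for example, some LongMemEval reports count a query as a hit if any relevant item appears in the top k), so this is a reference point rather than the repo's scoring code.

```typescript
// Precision@k: fraction of the top-k results that are relevant.
function precisionAtK(ranked: string[], relevant: string[], k: number): number {
  const gold = new Set(relevant);
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / k;
}

// Recall@k: fraction of the relevant items that appear in the top-k results.
function recallAtK(ranked: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(ranked.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// Nearest-rank percentile; `p` is an integer percent such as 95.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Illustrative data only.
const ranked = ["a", "x", "b", "y", "z"];
const relevant = ["a", "b", "c"];
const latenciesMs = Array.from({ length: 20 }, (_, i) => (i + 1) * 10); // 10, 20, ..., 200

console.log(precisionAtK(ranked, relevant, 5));     // 0.4
console.log(recallAtK(ranked, relevant, 5));        // 2 of 3 relevant items retrieved
console.log(percentile(latenciesMs, 95));           // 190
```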

What Next

Capture-to-Knowledge Pipeline (currently at manual validation stage): Run gbrain-evals' BrainBench suite against the pipeline's retrieval layer now, before implementation solidifies. Clone the repo, replace the gbrain/* adapter layer with one that calls the pipeline's retrieval layer, substitute your own committed corpus (with matching gold qrels) for the included 240-page fictional one, and capture R@5 as a baseline. This replaces manual validation with a repeatable numeric gate that runs on every architecture change rather than on an ad-hoc schedule.
Capture-to-Knowledge Pipeline (retrieval adapter selection): Use the multi-adapter comparison runners included in gbrain-evals to benchmark at least two candidate adapters (e.g., Contriever vs. the dense retriever the pipeline currently targets) against your actual knowledge corpus before finalising the architecture. The SOTA 97.60% R@5 LongMemEval result is claimed without an LLM in the retrieval loop; replicating or refuting that on your own document distribution produces a concrete, defensible architecture decision rather than a design assumption.
Capture-to-Knowledge Pipeline (regression gate for future releases): Wire the BrainBench runner into the pipeline's CI as a required check, mirroring gbrain's own zero-regression record across 20 releases. Any future change to chunking, embedding model, or index configuration that drops R@5 below the baseline then fails the check loudly. This converts the current clean-room Haiku validation step from a one-time sanity check into a durable quality floor.
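
The regression gate described in the last item could look roughly like the sketch below. It assumes per-adapter R@5 scores have already been computed and stored as plain name-to-score maps; `Scores`, `findRegressions`, `exitCodeFor`, the adapter names, and the numbers are all hypothetical and are not gbrain-evals APIs or measured results.

```typescript
// Hypothetical score map: adapter name -> mean R@5 in [0, 1].
type Scores = Record<string, number>;

interface Regression {
  adapter: string;
  baseline: number;
  current: number; // NaN when the adapter produced no score
}

// Flag every adapter whose current score falls below baseline minus tolerance.
// A missing score counts as a regression so a silently skipped adapter cannot pass.
function findRegressions(baseline: Scores, current: Scores, tolerance = 0): Regression[] {
  const regressions: Regression[] = [];
  for (const [adapter, base] of Object.entries(baseline)) {
    const now: number | undefined = current[adapter];
    if (now === undefined || now < base - tolerance) {
      regressions.push({ adapter, baseline: base, current: now ?? NaN });
    }
  }
  return regressions;
}

// Map the result to a CI exit code: 0 passes the required check, 1 fails it.
function exitCodeFor(regressions: Regression[]): number {
  return regressions.length > 0 ? 1 : 0;
}

// Illustrative numbers only, not measured results.
const baselineScores: Scores = { "adapter-a": 0.9, "adapter-b": 0.8 };
const currentScores: Scores = { "adapter-a": 0.91, "adapter-b": 0.75 };

const regressions = findRegressions(baselineScores, currentScores, 0.01);
for (const r of regressions) {
  console.error(`${r.adapter}: R@5 ${r.current} is below baseline ${r.baseline}`);
}
console.log(exitCodeFor(regressions)); // 1
```

In a CI script the returned code would be handed to the runtime's exit call. A small tolerance keeps run-to-run noise from failing builds, at the cost of letting slow drift through unless the baseline is refreshed deliberately.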

Landscape Position

Category: LLM & Prompt Tooling

In this category: mattpocock--evalite (decent, 15/24)

Standing: gbrain-evals scores higher than evalite on every dimension and addresses a more specific domain — personal knowledge agent retrieval benchmarking with committed corpora — versus evalite's general LLM output evaluation; both lack formal releases.

Evidence Base

Density: 7/10 — Available: full README, complete package.json, repo metadata (stars, forks, dates, license, open issues, archived flag), dependency manifest, scripts catalog, benchmark results table. Missing: actual TypeScript source files, CI configuration, CHANGELOG, contributor graph, test output examples.

Notes

The HTML-as-primary-language tag is misleading: it reflects committed benchmark report files rather than the TypeScript evaluation codebase; actual runner and adapter code is entirely .ts.
The version string "v0.40.6.0" in results tracks the parent garrytan/gbrain release, not a gbrain-evals release tag.
The claimed SOTA LongMemEval _s result (97.60% R@5) compares against MemPalace's published 96.6% rather than a fresh same-environment re-run of MemPalace, which is a methodology caveat worth noting when reading the cross-system comparison table.