mattpocock/evalite — Repo Appraisal

Overview

What it is — evalite is a TypeScript-native LLM evaluation framework built on vitest that lets developers write scorer-based test suites for LLM-powered functions, track per-run scores over time, and inspect traces in a local web UI.
Problem — LLM outputs are non-deterministic and shift with every prompt change, making it difficult to know whether a change is genuinely an improvement without a quantitative, repeatable eval harness.
Who it's for — TypeScript developers building LLM-powered applications who want a code-first, local, CI-integrable eval loop without leaving the Node/TypeScript ecosystem.
Notable — Built by Matt Pocock (Total TypeScript), reached v0.19.0 and 1572 stars in roughly 18.5 months (since 2024-11-12), and ships a dedicated web UI for browsing eval history, which is uncommon for a TypeScript-native eval library.

Verdict

Metric	Rating	Summary
Quality	decent (15/24)	Actively maintained with excellent engineering signals, but the unavailable README zeroes the documentation score (0/8) and drags down the total.
PAI Relevance	watch (0.38)	TypeScript-native and drop-in compatible, but PAI already has a dedicated Evals skill covering the same function.

Condition 5 (SKIP on overlap=2, gap_fill=0) also matches and is logically compelling, but Condition 3 (WATCH, standalone >= 12) is evaluated first and the standalone score of 15 satisfies it, so the verdict is WATCH. The repo is useful as a design reference even though it doesn't fill a PAI gap.
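
The ordering effect can be sketched as a first-match rule list. This is a minimal illustration, assuming the verdict formula walks its conditions in numeric order and stops at the first match; only Conditions 3 and 5 are stated in this appraisal, so the other conditions are omitted, and the names `Inputs`, `rules`, and `decide` are hypothetical.

```typescript
type Verdict = "WATCH" | "SKIP" | "UNDECIDED";

interface Inputs {
  standalone: number; // quality score out of 24
  overlap: number;    // Overlap Risk score
  gapFill: number;    // Gap Fill score
}

interface Rule {
  id: number;
  verdict: Verdict;
  when: (i: Inputs) => boolean;
}

// Only the two conditions named in the text, in the order they are evaluated.
// Conditions 1, 2, and 4 are not described in the source and are left out.
const rules: Rule[] = [
  { id: 3, verdict: "WATCH", when: (i) => i.standalone >= 12 },
  { id: 5, verdict: "SKIP", when: (i) => i.overlap === 2 && i.gapFill === 0 },
];

// First matching rule wins; later rules are never consulted.
function decide(i: Inputs): { id: number | null; verdict: Verdict } {
  for (const rule of rules) {
    if (rule.when(i)) return { id: rule.id, verdict: rule.verdict };
  }
  return { id: null, verdict: "UNDECIDED" };
}

// This repo: standalone 15/24, overlap 2, gap fill 0.
const result = decide({ standalone: 15, overlap: 2, gapFill: 0 });
console.log(result);
```

Under these assumptions both rules match the repo's inputs, and the outcome depends only on which one is listed first.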

Quality Assessment

15/24 — actively-maintained / undocumented / high-discipline

Health: 7/8 (actively-maintained)

Failed:

H8: FAIL — No README available to check for CI badges or .github/workflows/ references; "ci" script in package.json is suggestive but not confirmatory.

Passed:

H1: PASS — evalite@0.19.0 released 2025-11-06.
H2: PASS — Latest release 6.7 months ago (Nov 2025), well within 12-month threshold.
H3: PASS — Last commit 2026-04-28, ~29 days ago, within 6 months.
H4: PASS — 2026-04-28 is 29 days before 2026-05-27, just under the 30-day cutoff.
H5: PASS — archived: false.
H6: PASS — 43 open issues; >0 and <100, indicates active triage.
H7: PASS — MIT license.
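
The date arithmetic behind H2 to H4 can be reproduced with a short sketch. It assumes the appraisal date of 2026-05-27 given in H4; the day equivalents used for the 12-month and 6-month thresholds (365 and 183 days) are assumptions made here for illustration, not values stated by the rubric.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between two ISO dates (YYYY-MM-DD). Date-only ISO strings
// are parsed as UTC, so daylight-saving shifts do not affect the result.
function daysBetween(fromIso: string, toIso: string): number {
  return Math.round((Date.parse(toIso) - Date.parse(fromIso)) / MS_PER_DAY);
}

const appraisalDate = "2026-05-27";
const releaseAgeDays = daysBetween("2025-11-06", appraisalDate); // latest release
const commitAgeDays = daysBetween("2026-04-28", appraisalDate);  // last commit

// Assumed day equivalents for the month-based thresholds.
const h2ReleaseWithin12Months = releaseAgeDays <= 365;
const h3CommitWithin6Months = commitAgeDays <= 183;
const h4CommitWithin30Days = commitAgeDays < 30;

console.log({ releaseAgeDays, commitAgeDays });
console.log({ h2ReleaseWithin12Months, h3CommitWithin6Months, h4CommitWithin30Days });
```

The release is 202 days old and the last commit 29 days old, so H4 passes by a single day and would flip on a slightly later appraisal date.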

Documentation: 0/8 (undocumented)

Failed:

D1: FAIL — "No README available" per repository data.
D2: FAIL — No README content to assess length.
D3: FAIL — No README to check for install/setup instructions.
D4: FAIL — No README to check for usage examples.
D5: FAIL — No README to check for API/config reference headings.
D6: FAIL — No README first paragraph to evaluate.
D7: FAIL — No README to confirm external docs link, despite clear evidence of an apps/evalite-docs package in the monorepo.
D8: FAIL — No README to check for limitations or caveats section.

Passed:

(none)

Engineering Signals: 8/8 (high-discipline)

Failed:

(none)

Passed:

E1: PASS — Primary language is TypeScript.
E2: PASS — package.json (monorepo root manifest) is present and well-structured.
E3: PASS — Root manifest lists ~9 direct dependencies (changesets, tsconfig, types/node, husky, prettier, tsx, typescript, vitest, lint-staged); well under 30.
E4: PASS — "test" and "ci" scripts present; vitest in dependencies; dedicated evalite-tests workspace package confirms test infrastructure.
E5: PASS — 1572 stars.
E6: PASS — ~85 stars/month (1572 stars over ~18.5 months since 2024-11-12).
E7: PASS — 92 forks.
E8: PASS — "Evaluate your LLM-powered apps with TypeScript" is clear and descriptive.
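
The E6 growth figure follows from the star count and the repository age. A minimal sketch, assuming the 2026-05-27 appraisal date used in the Health checks and an average month length of 365.25 / 12 days (an assumption made here, since the rubric's month convention is not stated):

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const AVG_DAYS_PER_MONTH = 365.25 / 12; // assumed month length

const stars = 1572;
const ageDays = (Date.parse("2026-05-27") - Date.parse("2024-11-12")) / MS_PER_DAY;
const ageMonths = ageDays / AVG_DAYS_PER_MONTH;
const starsPerMonth = stars / ageMonths;

console.log(ageDays, ageMonths.toFixed(1), Math.round(starsPerMonth));
```

With this convention the repository is about 18.4 months old, and the rate rounds to the same ~85 stars/month reported above.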

PAI Relevance

Dimension	Score	Assessment
Harvest Value	1	The vitest-integration pattern (treating evals as test suites) and per-run trace capture with a local UI are design choices worth studying for PAI's existing Evals skill, but neither is a novel architecture PAI couldn't derive independently.
Integration Readiness	2	TypeScript-native; the monorepo publishes an npm package (`bun add -D evalite`) and exposes a CLI (`evalite watch`), which aligns with PAI's TypeScript+Bun+CLI-first stack.
Overlap Risk	2	PAI's Capability Manifest explicitly lists "Evals (AI agent evaluation)" under Development skills; this is near-complete functional overlap.
Gap Fill	0	PAI already has an Evals skill; evalite covers the same ground (LLM output scoring, test-driven prompt evaluation) without adding a capability PAI lacks.

Composite: 0.38
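
The appraisal does not state how the composite is derived. One formula that happens to reproduce 0.38 from the four scores is sketched below; it is purely an assumption, not the confirmed PAI formula. It supposes each dimension is scored 0 to 2, that Overlap Risk counts against the result (so it is inverted), and that the total is normalized by the maximum of 8. The names `Relevance` and `composite` are hypothetical.

```typescript
interface Relevance {
  harvest: number;     // Harvest Value, assumed range 0-2
  integration: number; // Integration Readiness, assumed range 0-2
  overlap: number;     // Overlap Risk, assumed range 0-2 (higher is worse)
  gapFill: number;     // Gap Fill, assumed range 0-2
}

// ASSUMED formula: sum the favourable points, invert overlap, divide by the maximum.
function composite(r: Relevance): number {
  const maxPerDimension = 2;
  const points =
    r.harvest + r.integration + (maxPerDimension - r.overlap) + r.gapFill;
  return points / (4 * maxPerDimension);
}

const value = composite({ harvest: 1, integration: 2, overlap: 2, gapFill: 0 });
console.log(value.toFixed(2));
```

Other weightings could produce the same rounded value, so this should be read as a consistency check on the table rather than a specification.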

What Next

Any TypeScript project that calls an LLM API: Run `bun add -D evalite` and wrap one existing LLM function (a summarizer, classifier, or prompt template) in an evalite suite with a string-similarity or custom scorer. Run `bunx evalite watch` during prompt iteration. Every change to the system prompt then produces a recorded score delta in the local UI instead of manual spot-checks, making regressions visible before they ship.
CI pipeline for an LLM-backed feature: Add an evalite suite to the repo alongside unit tests and wire `npx evalite` into the CI step. Configure a score threshold (e.g., fail if semantic similarity drops below 0.8) to gate merges. The outcome is an automated quality gate on LLM output that runs on every PR, the same way type-checks and linting do.
Designing a custom eval harness from scratch: Review evalite's scorer interface and per-run trace storage as a reference design before building anything bespoke. The pattern — a scorer function returns a { score, metadata } object, results accumulate by run ID, a local UI queries the SQLite store — is reusable as an architecture even if you implement it independently in a different runtime or language.
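
The scorer-and-store pattern and the CI threshold described above can be sketched independently of evalite. The example below is a hypothetical, self-contained illustration and does not use evalite's actual API: the types `ScoreResult` and `Scorer`, the helpers `record`, `meanScore`, and `passesGate`, and the stubbed `classify` task are all invented for this sketch, and an in-memory `Map` stands in for the SQLite store.

```typescript
// Hypothetical types for illustration; not evalite's API.
type ScoreResult = { score: number; metadata?: Record<string, unknown> };
type Scorer = (output: string, expected: string) => ScoreResult;

// Custom scorer: 1 for a case-insensitive exact match, otherwise 0.
const exactMatch: Scorer = (output, expected) => {
  const matched = output.trim().toLowerCase() === expected.trim().toLowerCase();
  return { score: matched ? 1 : 0, metadata: { matched } };
};

// In-memory stand-in for a persistent store; results accumulate by run ID.
const runs = new Map<string, ScoreResult[]>();

function record(runId: string, result: ScoreResult): void {
  const existing = runs.get(runId) ?? [];
  existing.push(result);
  runs.set(runId, existing);
}

// Mean score for one run; 0 when the run has no results.
function meanScore(runId: string): number {
  const results = runs.get(runId) ?? [];
  if (results.length === 0) return 0;
  return results.reduce((sum, r) => sum + r.score, 0) / results.length;
}

// CI-style gate: passes only when the run's mean score meets the threshold.
function passesGate(runId: string, threshold: number): boolean {
  return meanScore(runId) >= threshold;
}

// Stubbed task standing in for a real LLM call.
const classify = (input: string): string =>
  input.includes("refund") ? "billing" : "general";

const cases = [
  { input: "I want a refund", expected: "billing" },
  { input: "How do I log in?", expected: "general" },
  { input: "Cancel my plan", expected: "billing" },
];

for (const c of cases) {
  record("run-1", exactMatch(classify(c.input), c.expected));
}

console.log(meanScore("run-1").toFixed(2), passesGate("run-1", 0.8));
```

Here two of the three cases match, so the mean score is about 0.67 and a 0.8 gate fails the run. In a CI step, a failed gate would translate to a non-zero exit code.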

Landscape Position

Category: LLM & Prompt Tooling

In this category: first entry in this vault

Standing: evalite is the sole LLM & Prompt Tooling entry; no intra-category comparison is possible, though it clusters with garrytan--gbrain-evals in the cross-category evaluation overlap group.

Evidence Base

Density: 8/10 — Available: repository metadata (stars, forks, dates, license, topic tags), monorepo package.json manifest, latest release tag, language classification, prior appraisal context, landscape summary. Missing: README content (the primary gap, responsible for the 0/8 documentation score), individual package manifests for evalite and evalite-ui, CI configuration files.

Notes

The 0/8 documentation score is an artifact of README unavailability, not a genuine lack of documentation: the monorepo includes a dedicated apps/evalite-docs documentation site, and Matt Pocock typically ships thorough docs for his TypeScript tooling. Real-world documentation quality is very likely higher than the score reflects; a re-appraisal with README content would likely push the standalone score to 18-20 (solid to excellent). The WATCH verdict is appropriate for a design reference: evalite's scorer API and trace model are clean exemplars even though the tool itself overlaps with PAI's Evals skill.