geeknik/HyperTune — Repo Appraisal

Overview

What it is — HyperTune is a Python CLI tool that systematically sweeps LLM hyperparameter combinations (temperature, top_p, penalties, etc.) across OpenAI, Anthropic, Google Gemini, and OpenRouter, scores each response using sentence-transformer embeddings, and renders visualisation dashboards of parameter impact.
Problem — Selecting optimal LLM hyperparameters for a given prompt is manual trial-and-error; this tool automates the search and surfaces which combinations actually improve coherence, relevance, and complexity.
Who it's for — Developers and researchers who regularly call LLM APIs and want empirical evidence for hyperparameter choices rather than relying on defaults or folklore.
Notable — The three-dimensional semantic scoring pipeline (coherence 40%, relevance 40%, complexity 20% via all-MiniLM-L6-v2) with automatic degenerate-output penalisation is a tighter evaluation design than most similar tools; JSON export and dual visualisation dashboards make results reproducible.
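
To make the weighting concrete, here is a minimal sketch of how a 40/40/20 composite could be computed. It is not HyperTune's code: the function name `composite_score`, the assumption that each sub-score is already normalised to [0, 1], and the flat `degenerate_penalty` multiplier are all hypothetical. The appraisal only confirms the weights and the fact that degenerate outputs are penalised.

```python
# Illustrative only: the weights come from the appraisal; everything else is assumed.
WEIGHTS = {"coherence": 0.4, "relevance": 0.4, "complexity": 0.2}


def composite_score(scores, degenerate=False, degenerate_penalty=0.5):
    """Combine per-axis scores (each assumed to lie in [0, 1]) into one number.

    `degenerate_penalty` is a hypothetical multiplier applied when the output
    has been flagged as degenerate (for example empty or highly repetitive).
    """
    for axis in WEIGHTS:
        if not 0.0 <= scores[axis] <= 1.0:
            raise ValueError(f"{axis} score must be in [0, 1]")
    total = sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)
    return total * degenerate_penalty if degenerate else total


example = {"coherence": 0.8, "relevance": 0.9, "complexity": 0.5}
print(round(composite_score(example), 2))                   # 0.78
print(round(composite_score(example, degenerate=True), 2))  # 0.39
```

Because coherence and relevance carry equal weight, a response has to do well on both to score highly; complexity can only move the result by at most 0.2.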

Verdict

	Rating	Summary
Quality	decent (15/24)	Well-documented and modestly adopted, held back by zero releases, a quiet three-month commit gap, and no CI infrastructure.
PAI Relevance	watch (0.50)	Addresses a real gap (systematic LLM hyperparameter search) but the Python-only runtime and partial overlap with PAI's Evals and Optimize skills keep it at arm's length.

Quality Assessment

15/24 — stale-risk / adequately-documented / solid
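
The headline figure is the sum of the three category scores. A small tally, using the pass/fail results listed in the sections that follow and assuming each probe is worth one point (the 8-point maximum per category suggests this, but the rubric itself is not quoted in this appraisal):

```python
# Pass/fail results transcribed from this appraisal.
# One point per PASS is an assumption about the rubric.
results = {
    "Health": {"H1": False, "H2": False, "H3": True, "H4": False,
               "H5": True, "H6": False, "H7": True, "H8": False},
    "Documentation": {"D1": True, "D2": True, "D3": True, "D4": True,
                      "D5": True, "D6": True, "D7": False, "D8": False},
    "Engineering Signals": {"E1": False, "E2": False, "E3": True, "E4": True,
                            "E5": True, "E6": True, "E7": True, "E8": True},
}

# True counts as 1 and False as 0 when summed.
category_scores = {name: sum(probes.values()) for name, probes in results.items()}
total = sum(category_scores.values())
maximum = sum(len(probes) for probes in results.values())

print(category_scores)       # {'Health': 3, 'Documentation': 6, 'Engineering Signals': 6}
print(f"{total}/{maximum}")  # 15/24
```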

Health: 3/8 (stale-risk)

Failed:

H1: FAIL — No tagged releases exist after nearly two years of development.
H2: FAIL — No releases, so recency criterion is moot.
H4: FAIL — Last commit 2026-02-27; ~89 days ago, well beyond the 30-day window.
H6: FAIL — 0 open issues; probe requires >0 as a sign of active triage.
H8: FAIL — README mentions make test but contains no CI badge and no reference to .github/workflows/.

Passed:

H3: PASS — Last commit 2026-02-27 is ~3 months ago, within the 6-month threshold.
H5: PASS — Repository is not archived.
H7: PASS — MIT licence is present.
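
H3 and H4 apply two different windows to the same commit date, which is why one passes and the other fails. A sketch of the check, assuming an appraisal date of 2026-05-27 (back-computed from the "~89 days" figure, not stated in the source) and treating six months as 183 days:

```python
from datetime import date

last_commit = date(2026, 2, 27)
appraised_on = date(2026, 5, 27)  # assumption: inferred from "~89 days ago"

age_days = (appraised_on - last_commit).days

ACTIVE_WINDOW_DAYS = 30   # H4 window quoted in the appraisal
STALE_LIMIT_DAYS = 183    # assumption: "6 months" approximated as 183 days

h4_pass = age_days <= ACTIVE_WINDOW_DAYS
h3_pass = age_days <= STALE_LIMIT_DAYS

print(age_days, h3_pass, h4_pass)  # 89 True False
```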

Documentation: 6/8 (adequately-documented)

Failed:

D7: FAIL — No link to an external docs site, wiki, or /docs directory.
D8: FAIL — No Limitations, Caveats, or Known Issues section; the Disclaimer covers misuse liability only.

Passed:

D1: PASS — README is present and substantive.
D2: PASS — README is several kilobytes with detailed prose, tables, and images.
D3: PASS — Dedicated Installation section with clone, venv, pip, and env-var steps.
D4: PASS — Multiple python cli.py code blocks under a Usage/Examples heading.
D5: PASS — Full CLI Options table listing every flag with descriptions; per-provider parameter lists included.
D6: PASS — First sentence: "HyperTune is an advanced tool for optimizing and analyzing text generation using multiple LLM providers."

Engineering Signals: 6/8 (solid)

Failed:

E1: FAIL — Python is not in the typed-language list.
E2: FAIL — Dependency manifest (requirements.txt / pyproject.toml) not available in the evidence; only a pip install command in the README.

Passed:

E3: PASS — Nine pip dependencies visible (openai, anthropic, google-genai, scikit-learn, nltk, matplotlib, seaborn, pandas, sentence-transformers); < 15 for a CLI tool.
E4: PASS — README explicitly documents make setup and make test targets.
E5: PASS — 102 stars exceeds the 50-star threshold.
E6: PASS — 102 stars over ~22.4 months ≈ 4.6 stars/month, above the 2/month floor.
E7: PASS — 22 forks exceeds the 5-fork threshold.
E8: PASS — Description "Helps you tune LLM hyperparameters" is 34 characters and meaningfully describes the tool.
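
The E6 growth figure is simple division. A quick check using the numbers quoted above; the 2/month floor is the one stated in E6, and the variable names are illustrative:

```python
stars = 102
repo_age_months = 22.4   # approximate repository age quoted in E6
floor_per_month = 2.0    # E6 threshold

stars_per_month = stars / repo_age_months

print(round(stars_per_month, 1))          # 4.6
print(stars_per_month > floor_per_month)  # True
```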

PAI Relevance

Dimension	Score	Assessment
Harvest Value	1	The multi-dimensional semantic scoring design (coherence + relevance + complexity via sentence-transformer embeddings, with degenerate-output penalties) is worth studying as a reference for improving PAI's `Evals` skill, which currently lacks structured multi-axis LLM output scoring.
Integration Readiness	1	Python-only runtime requires a virtualenv wrapper; not `bun add`-able. However, the clean CLI (`python cli.py ... --output results.json`) and structured JSON export mean a PAI skill could invoke it as a subprocess with moderate adapter glue.
Overlap Risk	1	Partial overlap with PAI's `Evals` skill (AI agent output evaluation) and `Optimize` skill (metric hill-climbing); neither covers systematic API hyperparameter sweep across providers, so the overlap is incomplete.
Gap Fill	1	PAI has no dedicated hyperparameter-search capability for LLM API calls; `Optimize` handles general metric hill-climbing but does not enumerate provider-specific parameter spaces or generate cross-parameter visualisation.

Composite: 0.50
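
The subprocess route mentioned under Integration Readiness could look roughly like the sketch below. Only `python cli.py` and the `--output` flag appear in this appraisal; the helper names `build_command` and `run_sweep`, the `sweep_args` pass-through, and the assumption that the script is run from the repository root are hypothetical. The JSON schema is not described here, so the result is returned as whatever `json.load` produces.

```python
import json
import subprocess
from pathlib import Path


def build_command(sweep_args, output_name, python="python"):
    """Assemble the CLI call.

    `sweep_args` are caller-supplied HyperTune flags (not enumerated here);
    `python` should point at the interpreter inside HyperTune's virtualenv.
    """
    return [python, "cli.py", *sweep_args, "--output", str(output_name)]


def run_sweep(repo_dir, sweep_args, output_name="results.json", python="python"):
    """Run cli.py inside `repo_dir` and return the parsed JSON export."""
    cmd = build_command(sweep_args, output_name, python=python)
    # check=True raises CalledProcessError if the CLI exits non-zero.
    subprocess.run(cmd, cwd=repo_dir, check=True)
    with open(Path(repo_dir) / output_name, encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping command construction separate from execution lets the adapter be tested without API keys or network access.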

What Next

Any prompt you call repeatedly with default hyperparameters: Run HyperTune against one real, high-stakes prompt (summarisation, classification, or generation) using a narrow sweep — two temperature steps and two top_p steps against a single provider — then inspect the JSON output. This gives you empirical evidence for whether your current defaults are near-optimal or not, with no commitment to the tool as persistent infrastructure.
Evaluating LLM scoring pipeline design: Study the three-dimensional semantic scoring implementation (coherence 40%, relevance 40%, complexity 20% via all-MiniLM-L6-v2) with its degenerate-output penalty logic. The weighting rationale and penalty heuristics are documented enough to inform how you'd structure any bespoke response quality metric — without adopting HyperTune as a dependency.
Bookmark for re-evaluation in 3 months: HyperTune addresses a genuine gap — replacing hyperparameter folklore with reproducible sweeps — but it is a single-developer prototype with no visible CI or release history. Revisit when it has a tagged release, test coverage, and evidence of multi-user validation before treating it as a reliable evaluation component.
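
The narrow sweep suggested in the first item amounts to a 2 × 2 grid. A sketch of the enumeration, with placeholder values (the specific temperature and top_p values are illustrative, not recommendations from the README):

```python
from itertools import product

# Placeholder values: two steps per parameter, as in the suggested narrow sweep.
temperatures = [0.2, 0.8]
top_ps = [0.5, 1.0]

# Cartesian product of the two parameter lists.
grid = [
    {"temperature": t, "top_p": p}
    for t, p in product(temperatures, top_ps)
]

for combo in grid:
    print(combo)

print(len(grid))  # 4
```

Assuming one generation per combination, that is four API calls per prompt against a single provider, which keeps the cost of the experiment small.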

Landscape Position

Category: LLM & Prompt Tooling

In this category: mattpocock--evalite (decent, 15/24) is the only prior entry; karpathy--autoresearch (decent, 14/24) in AI Research overlaps on the "model tuning" axis.

Standing: HyperTune matches evalite on overall score (15/24) but targets a narrower problem (API hyperparameter search) than general LLM app evaluation; within the category it is effectively tied for the lead with evalite despite a weaker health profile.

Evidence Base

Density: 7/10 — Full README available (strong); star/fork/commit/creation timestamps present; language and licence confirmed; dependency manifest explicitly listed as not available; no release tags; no CI config visible; no code structure or file tree; no external usage data beyond stars and forks.

Notes

The model list in the README (GPT-5.2, GPT-5-nano, Claude Opus 4.5, Gemini 3 Pro, etc.) suggests the author actively tracked frontier model releases through early 2026, which is a positive signal for ongoing relevance. The complete absence of releases after nearly two years is the main structural concern: it is unclear whether this reflects a "works for me, ship it" posture or quiet abandonment. Having 0 open issues and no associated release lifecycle leans toward the latter. Worth revisiting if a v1.0 tag appears.