cactus-compute/needle — Repo Appraisal

Overview

What it is: Needle is a 26-million-parameter encoder-decoder language model distilled from Gemini 3.1. It uses a novel "Simple Attention Network" architecture and is purpose-built for single-shot tool/function calling on severely resource-constrained hardware.
Problem: Running capable AI tool dispatch locally on consumer edge devices (phones, watches, glasses) without cloud inference latency or cost.
Who it's for: Developers building on-device personal AI applications that need structured function-call output with sub-second latency and a finetunable base, particularly within the Cactus mobile runtime ecosystem.
Notable: At 26M parameters it reportedly outperforms models roughly 10–23× its size (FunctionGemma-270m, Qwen-0.6B, Granite-350m, LFM2.5-350m) on single-shot function calling, achieves 6000 tok/s prefill on-device, and ships a one-command playground (needle playground) that generates synthetic training data via Gemini, then finetunes and evaluates locally.

Verdict

Dimension	Rating	Summary
Quality	solid (16/24)	Excellent documentation and explosive early adoption undercut by no releases, no manifest, and no test infrastructure — a promising research artifact not yet a hardened library.
PAI Relevance	integrate (0.50)	Fills PAI's offline inference gap for tool routing; integrable via subprocess CLI with clean JSON output, though Python setup friction and mobile-first framing limit practical lift.

Quality Assessment

16/24 — maintained / well-documented / early-or-minimal
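
The headline figure is the sum of the three subscores itemized in the sections that follow. A minimal sketch of that tally, assuming one point per passed check and eight checks per section; the dictionary layout and names are illustrative and not part of any appraisal tooling:

```python
# Pass/fail results as reported in this appraisal; True means PASS.
# The data layout is an illustrative assumption.
results = {
    "health": {"H1": False, "H2": False, "H3": True, "H4": True,
               "H5": True, "H6": True, "H7": True, "H8": False},
    "documentation": {"D1": True, "D2": True, "D3": True, "D4": True,
                      "D5": True, "D6": True, "D7": True, "D8": False},
    "engineering": {"E1": False, "E2": False, "E3": False, "E4": False,
                    "E5": True, "E6": True, "E7": True, "E8": True},
}


def section_score(checks):
    """Count passed checks; each pass is assumed to be worth one point."""
    return sum(1 for passed in checks.values() if passed)


scores = {name: section_score(checks) for name, checks in results.items()}
total = sum(scores.values())
max_total = sum(len(checks) for checks in results.values())

print(scores)
print(f"{total}/{max_total}")
```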

Health: 5/8 (maintained)

Failed:

H1: FAIL — Latest Release listed as "none (none)"; no tagged release exists.
H2: FAIL — Derived from H1; no release to date.
H8: FAIL — README contains no CI badge, no reference to .github/workflows/, no CI pipeline mentioned.

Passed:

H3: PASS — Last commit 2026-05-16, eleven days before appraisal date; well within 6 months.
H4: PASS — Same commit qualifies for the 30-day window.
H5: PASS — archived: false confirmed.
H6: PASS — 7 open issues; above zero (triaged) and well below 100 (not neglected).
H7: PASS — MIT license declared.
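
The recency and issue-count checks above reduce to simple threshold comparisons. A rough sketch, assuming the appraisal date is the last commit plus eleven days and approximating "6 months" as 183 days (the exact cutoff used by the rubric is not stated here):

```python
from datetime import date, timedelta


def days_between(earlier, later):
    """Whole days elapsed from `earlier` to `later`."""
    return (later - earlier).days


last_commit = date(2026, 5, 16)
# Inferred from "eleven days before appraisal date"; not stated explicitly.
appraisal_date = last_commit + timedelta(days=11)
open_issues = 7

age = days_between(last_commit, appraisal_date)

h3_pass = age <= 183             # assumed 6-month window, in days
h4_pass = age <= 30              # 30-day window
h6_pass = 0 < open_issues < 100  # above zero and below 100

print(age, h3_pass, h4_pass, h6_pass)
```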

Documentation: 7/8 (well-documented)

Failed:

D8: FAIL — Caveats about model scope ("small models can be finicky," "Those models have more scope/capacity") appear inline in prose; no formal Limitations, Caveats, Known Issues, or Trade-offs section heading exists.

Passed:

D1: PASS — README is present and substantive.
D2: PASS — README is several KB; far exceeds 1000-byte threshold.
D3: PASS — Explicit git clone + source ./setup quickstart in second section.
D4: PASS — Python usage code block under "Usage (Python)" heading with full generate() call.
D5: PASS — "CLI" section enumerates all commands with parameters; "Data format" section defines schemas.
D6: PASS — First sentence names the architecture, parameter count, and lineage (Gemini 3.1 distillation).
D7: PASS — Links to docs/simple_attention_networks.md and HuggingFace weights page.

Engineering Signals: 4/8 (early-or-minimal)

Failed:

E1: FAIL — Primary language is Python; not in the typed-language list.
E2: FAIL — Dependency manifest explicitly noted as "Not available" in repo data.
E3: FAIL — No manifest available; dependency count cannot be verified.
E4: FAIL — README references needle eval (model accuracy evaluation) but no software test suite, test script, or CI test runner is mentioned.

Passed:

E5: PASS — 2502 stars; far exceeds the 50-star threshold.
E6: PASS — Created 2026-02-24; ~3.1 months of history yields ~807 stars/month, well above the 2/month floor.
E7: PASS — 165 forks; exceeds the 5-fork threshold.
E8: PASS — Description "26m function call model that runs on incredibly small devices" is specific, meaningful, and well above 20 characters.
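
The E6 growth figure can be reproduced approximately from the dates given. This sketch assumes the appraisal date is 2026-05-27 (last commit plus eleven days) and uses 30-day months; a different month-length convention shifts the result slightly, which is why it lands near, rather than exactly on, the reported ~807 stars/month:

```python
from datetime import date

created = date(2026, 2, 24)
appraisal_date = date(2026, 5, 27)  # inferred: last commit 2026-05-16 plus eleven days
stars = 2502
MIN_STARS_PER_MONTH = 2  # the floor cited in E6

age_days = (appraisal_date - created).days
age_months = age_days / 30  # 30-day months: an assumed convention
stars_per_month = stars / age_months

e6_pass = stars_per_month >= MIN_STARS_PER_MONTH
print(age_days, round(age_months, 1), round(stars_per_month), e6_pass)
```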

PAI Relevance

Dimension	Score	Assessment
Harvest Value	1	The encoder-decoder SAN architecture (no FFN in encoder, cross-attention routing, gated residuals, tied embeddings) is a novel approach to minimal function dispatch worth studying as a design reference for PAI's Delegation skill, which currently routes to heavyweight cloud agents with no lightweight local fallback. Not directly portable to TypeScript but the structural ideas are extractable.
Integration Readiness	1	Python-only, but the `needle run --query "..." --tools '[...]'` CLI emits clean structured JSON and is trivially subprocess-callable from a PAI skill. Requires Python environment bootstrap alongside Bun, which is moderate adapter work but not a rewrite.
Overlap Risk	1	PAI's 27-agent roster (Claude-family, Forge/GPT-5.4, Anvil/Kimi-K2.6) already handles tool-calling via the Agents and Delegation skills. Needle overlaps on function dispatch but is differentiated by offline/local execution with no API dependency — partial rather than full overlap.
Gap Fill	1	PAI has no on-device or offline inference capability; all agents are cloud-dependent. A locally-hosted sub-30M routing model that selects tools before hitting a heavier cloud agent could reduce latency and cloud token spend for high-frequency dispatch decisions — a functional area with limited current coverage.
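
The subprocess integration described under Integration Readiness might look like the following sketch. Only the `needle run --query ... --tools ...` invocation comes from the README as quoted above; the tool entry fields, the assumption that stdout is a single JSON document, and all function names are hypothetical. It is written in Python for brevity; a Bun-based skill would spawn the same command.

```python
import json
import subprocess


def build_needle_command(query, tools):
    """Build argv for `needle run`, passing the tool list as a JSON array string."""
    return ["needle", "run", "--query", query, "--tools", json.dumps(tools)]


def parse_needle_output(stdout):
    """Parse CLI output, assuming it is a single JSON document on stdout."""
    return json.loads(stdout)


def run_needle(query, tools, timeout=10.0):
    """Invoke the CLI; raises CalledProcessError on a non-zero exit status."""
    completed = subprocess.run(
        build_needle_command(query, tools),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return parse_needle_output(completed.stdout)


# Hypothetical tool entry; the real schema is defined in the README's "Data format" section.
example_tools = [{"name": "set_timer", "description": "Start a countdown timer"}]
command = build_needle_command("set a timer for five minutes", example_tools)
print(command)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with the embedded JSON.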

Composite: 0.50
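
The appraisal does not state how the composite is derived. One reading consistent with the table, offered purely as an assumption, is that each dimension is scored on a 0–2 scale and the composite is the sum divided by the maximum possible sum:

```python
# Assumed scale: each dimension scored 0, 1, or 2. This is not stated in the appraisal.
MAX_PER_DIMENSION = 2

dimension_scores = {
    "harvest_value": 1,
    "integration_readiness": 1,
    "overlap_risk": 1,
    "gap_fill": 1,
}

composite = sum(dimension_scores.values()) / (MAX_PER_DIMENSION * len(dimension_scores))
print(f"{composite:.2f}")
```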

What Next

Fabric Recommender (fab) pattern-dispatch step: The fab CLI's core operation, mapping a content snippet plus an intent to a Fabric pattern, is a single-shot function-calling problem with a bounded output set, which is exactly Needle's target task. Set up Needle via the README quickstart (git clone, then source ./setup) and launch needle playground. Define the Fabric pattern catalog as the tool schema (one entry per pattern with name and description), let the playground generate synthetic (content, intent) → pattern-call training pairs via Gemini, finetune locally, then replace the current LLM inference hop with a Needle call. Pattern dispatch should then drop to roughly 50ms (an estimate, not a measured figure) and stop consuming API credits for a task that doesn't require a frontier model.
Capture-to-Knowledge Pipeline routing classifier: The pipeline's triage step — deciding which knowledge node a capture belongs to — is a structured dispatch problem with a known, finite output space (the pipeline's capture taxonomy). Use needle playground to generate (raw capture, context) → destination-function training pairs from the existing taxonomy, finetune, and insert Needle as the first-pass classifier before the Haiku validation step. High-confidence captures route instantly with no API call; ambiguous ones still escalate to Haiku, reducing validation load to genuine edge cases and shrinking per-capture cost.
PAI local intent router for repetitive skill dispatch: Any PAI interaction that routes a natural-language command to a known skill currently makes a full cloud inference call even for commands that are structurally identical across invocations. Map PAI's skill registry to a Needle tool schema, run needle playground against a sample of logged skill-dispatch interactions to generate training pairs, finetune, and deploy Needle as a local first-pass router. High-confidence, high-frequency intents resolve in under 100ms with no network hop; the cloud model only handles novel or low-confidence requests, shrinking both latency and token spend for the most common interaction patterns.
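
All three proposals share the same shape: a local first-pass model answers when it is confident and escalates otherwise. A minimal sketch of that gating logic follows. It assumes the local model can return a confidence score, which the appraisal does not confirm for Needle; the `Prediction` type, the 0.9 threshold, and the stub routers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Prediction:
    tool: str
    confidence: float  # assumed to be available from the local model


def route(
    command: str,
    local_router: Callable[[str], Optional[Prediction]],
    cloud_router: Callable[[str], str],
    threshold: float = 0.9,
) -> Tuple[str, str]:
    """Return (tool, source); use the local answer only above the threshold."""
    prediction = local_router(command)
    if prediction is not None and prediction.confidence >= threshold:
        return prediction.tool, "local"
    # Low-confidence or missing local answer: escalate to the cloud model.
    return cloud_router(command), "cloud"


# Stub routers standing in for a finetuned local model and a cloud agent.
def fake_local(command: str) -> Optional[Prediction]:
    if "summarize" in command:
        return Prediction(tool="summarize", confidence=0.97)
    return Prediction(tool="unknown", confidence=0.40)


def fake_cloud(command: str) -> str:
    return "extract_wisdom"


print(route("summarize this article", fake_local, fake_cloud))
print(route("do something novel", fake_local, fake_cloud))
```

The threshold trades cloud spend against misroutes and would need tuning against logged dispatch decisions.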

Landscape Position

Category: LLM & Prompt Tooling

In this category: mattpocock--evalite (decent, 15/24, skip)

Standing: First model-as-artifact entry in this category; evalite addresses evaluation tooling around existing LLMs while Needle is the LLM itself plus its finetuning/deployment toolchain — functionally non-overlapping within the category.

Evidence Base

Density: 8/10 — README (full 8KB, high signal), repo metadata (stars, forks, issues, dates, license, archived status, topics all present), landscape context and prior appraisals available. Missing: dependency manifest (explicitly absent), CI configuration, test infrastructure details, release history, and actual model benchmark numbers beyond the qualitative comparisons in the README.

Notes

The repository is three months old and has no formal release, yet it has 2502 stars; adoption appears to be driven primarily by the Cactus ecosystem and the novelty of beating models 10× or more its size on function calling. The SAN architecture (encoder over tools + cross-attention into decoder) is the genuine research contribution; the finetuning playground is unusually polished for a research prototype of this age. The README's citation block lists eight authors, which signals a team effort with a publication trajectory and improves the chance of continued maintenance. The missing dependency manifest and test suite are the most significant engineering red flags at this stage.