NorthwoodsSentinel/pii-scrub

Overview

What it is — A Python CLI tool that detects PII in text via regex and spaCy NER, replaces it with deterministic Faker pseudonyms, and restores originals from a saved session file.
Problem — Sending meeting transcripts, call recordings, or documents to AI services exposes real names, emails, and organizations; this tool scrubs that data before transit and preserves round-trip restore.
Who it's for — Developers and practitioners who need to pre-process documents or transcripts before sending to LLMs or AI APIs, particularly those working with Gong, Otter.ai, or Teams output formats.
Notable — Transcript-aware chunking splits timestamped speaker-turn formats before running NER, solving a known spaCy limitation where long-form input returns zero entities; the Faker seed is deterministic per entity type + text, so pseudonyms are consistent across a session.
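
The chunking idea noted above can be sketched as follows. This is a minimal illustration, not the repository's code: the timestamp pattern, the `split_turns` helper, and the sample transcript are all assumptions, and real Gong, Otter.ai, or Teams exports may need different patterns. Each resulting chunk would be passed to the NER model on its own rather than as part of one long string.

```python
import re

# Hypothetical pattern for the start of a speaker turn, e.g.
# "[00:01:15] Alice Smith:" or "00:01:15 Alice Smith:".
TURN_START = re.compile(
    r"^\[?\d{1,2}:\d{2}(?::\d{2})?\]?\s+[^:\n]+:", re.MULTILINE
)

def split_turns(transcript: str) -> list[str]:
    """Split a transcript into one chunk per speaker turn."""
    starts = [m.start() for m in TURN_START.finditer(transcript)]
    if not starts:
        # No recognizable turns: fall back to a single chunk (or none).
        return [transcript] if transcript.strip() else []
    chunks = []
    # Keep any preamble before the first turn as its own chunk.
    preamble = transcript[:starts[0]].strip()
    if preamble:
        chunks.append(preamble)
    # Each turn runs from its own start to the start of the next turn.
    for begin, end in zip(starts, starts[1:] + [len(transcript)]):
        chunks.append(transcript[begin:end].strip())
    return chunks

sample = (
    "[00:00:05] Alice Smith: Thanks for joining.\n"
    "[00:00:09] Bob Jones: Happy to be here.\n"
    "We should loop in Acme Corp.\n"
    "[00:00:15] Alice Smith: Agreed.\n"
)
chunks = split_turns(sample)
```

Continuation lines without a timestamp stay attached to the turn that precedes them, so a multi-line turn is still one chunk.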

Verdict

	Rating	Summary
Quality	weak (10/24)	Competent, well-documented README, but no releases, no visible tests, no confirmed dependency manifest, 1 star, and a last commit 58 days before the appraisal.
PAI Relevance	NOTE (0.75)	Fills a genuine PAI gap — no existing skill covers PII scrubbing before AI sends — but Python-only and too immature to act on now.

The composite score (0.75) is the highest of any NOTE-rated repo in the vault. The PAI gap is real and specific: PAI routes documents and transcripts to AI agents constantly but has zero PII handling. If this repo matures or a TypeScript equivalent appears, it would immediately become an INTEGRATE candidate.

Quality Assessment

10/24 — stale-risk / adequately-documented / no-signals

Health: 3/8 (stale-risk)

Failed:

H1: FAIL — no tagged release exists
H2: FAIL — no release at all
H4: FAIL — last commit 2026-03-30, 58 days before appraisal date, exceeds 30-day window
H6: FAIL — 0 open issues; probe requires >0 to indicate active triage
H8: FAIL — README contains no CI badge and no reference to .github/workflows/

Passed:

H3: PASS — last commit 2026-03-30 is within 6 months of 2026-05-27
H5: PASS — repo is not archived
H7: PASS — MIT license declared
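
The two recency probes (H3 and H4) come down to date arithmetic on the dates quoted above. The check below is a quick sketch: the 30-day window is taken from the H4 note, while treating "6 months" as 183 days is an assumption made only for this illustration.

```python
from datetime import date

last_commit = date(2026, 3, 30)
appraised = date(2026, 5, 27)

# Age of the last commit on the appraisal date.
age_days = (appraised - last_commit).days

H4_WINDOW_DAYS = 30    # from the H4 note
H3_WINDOW_DAYS = 183   # assumption: "6 months" approximated in days

h4_pass = age_days <= H4_WINDOW_DAYS
h3_pass = age_days <= H3_WINDOW_DAYS
```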

Documentation: 6/8 (adequately-documented)

Failed:

D5: FAIL — no heading containing "API", "Configuration", "Options", "Reference", "Commands", or "Parameters"; usage section uses narrative subheadings only
D7: FAIL — README links to blog and Substack but not a docs site, wiki, or /docs directory

Passed:

D1: PASS — README is present and substantial
D2: PASS — README clearly exceeds 1000 bytes with multiple sections and code blocks
D3: PASS — explicit "## Install" section with pip and spacy download commands
D4: PASS — multiple code blocks appear under "## Usage" subheadings
D6: PASS — first sentence: "Detect, pseudonymize, and restore personally identifiable information (PII) in text" clearly states purpose
D8: PASS — explicit "## Known limitations" section covering NER model accuracy, first-name collisions, and skip-term requirements

Engineering Signals: 1/8 (no-signals)

Failed:

E1: FAIL — Python is not in the typed-language list (TypeScript, Rust, Go, Java, Kotlin, C#, Swift, Scala, Haskell)
E2: FAIL — dependency manifest section reports "Not available"; no pyproject.toml or requirements.txt confirmed
E3: FAIL — no manifest to evaluate
E4: FAIL — README contains no mention of tests, test script, or CI configuration referencing tests
E5: FAIL — 1 star; needs ≥50
E6: FAIL — ~0.48 stars/month over 2.1-month lifespan; needs ≥2
E7: FAIL — 0 forks; needs ≥5

Passed:

E8: PASS — description is meaningful, specific, and well over 20 characters
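
The section totals and the E6 star-velocity figure can be re-derived from the probe results listed above. The snippet is only a cross-check of this report's arithmetic; the dictionary layout and the `passed` helper are illustrative, not part of any appraisal tooling.

```python
# Probe outcomes as listed in this report (True = PASS).
health = {"H1": False, "H2": False, "H3": True, "H4": False,
          "H5": True, "H6": False, "H7": True, "H8": False}
documentation = {"D1": True, "D2": True, "D3": True, "D4": True,
                 "D5": False, "D6": True, "D7": False, "D8": True}
engineering = {"E1": False, "E2": False, "E3": False, "E4": False,
               "E5": False, "E6": False, "E7": False, "E8": True}

def passed(probes: dict[str, bool]) -> int:
    """Count the probes that passed."""
    return sum(probes.values())

total = passed(health) + passed(documentation) + passed(engineering)
max_total = len(health) + len(documentation) + len(engineering)

# E6: stars per month, using the figures quoted in the E6 note.
stars, lifespan_months = 1, 2.1
stars_per_month = stars / lifespan_months
```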

PAI Relevance

Dimension	Score	Assessment
Harvest Value	1	The transcript-aware NER chunking pattern (splitting timestamped speaker-turn blocks before entity extraction) is a concrete technique worth studying for any PAI skill that processes long-form documents; the deterministic Faker seed approach for session-stable pseudonyms is tidy but not novel.
Integration Readiness	1	Python-only with a CLI interface (`python3 pii_scrub.py`) that outputs structured JSON session files; could be subprocess-called from a PAI skill with adapter code, but requires Python on the host and is not `bun add`-able.
Overlap Risk	0	No PAI skill, tool, or hook covers PII detection or pseudonymization; the security infrastructure (Silas agent, security hooks) handles access control and audit, not content scrubbing before AI transit.
Gap Fill	2	PAI routes transcripts and documents to AI agents as a core workflow and has no mechanism to scrub PII before those sends; this addresses a clear functional gap in the Capability Manifest.

Composite: 0.75
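
This report does not state how the composite is derived. One formula that reproduces 0.75 from the four scores is sketched below, assuming each dimension is scored 0 to 2 and that overlap risk is inverted so that less overlap counts in the repo's favor. It is a guess that fits the numbers, not the appraisal's documented method.

```python
# Scores from the table above.
scores = {
    "harvest_value": 1,
    "integration_readiness": 1,
    "overlap_risk": 0,
    "gap_fill": 2,
}

MAX_PER_DIMENSION = 2  # ASSUMPTION: every dimension is scored 0..2

def composite(s: dict[str, int]) -> float:
    """Hypothetical composite: mean of the scores with overlap risk inverted."""
    positive = s["harvest_value"] + s["integration_readiness"] + s["gap_fill"]
    inverted_overlap = MAX_PER_DIMENSION - s["overlap_risk"]
    return (positive + inverted_overlap) / (4 * MAX_PER_DIMENSION)

value = composite(scores)
```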

What Next

Pre-processing Gong, Otter.ai, or Teams transcripts before LLM summarization or analysis: Follow the README's install steps (pip plus the spaCy model download), run `pii-scrub scrub <transcript.txt> --session session.json` against a real export, then inspect the session JSON to count detected versus missed entities on your specific format. Integration Readiness above cites the entry point as `python3 pii_scrub.py`, so confirm the exact invocation against the README. The transcript-aware chunking pattern targets spaCy's zero-entity problem on long speaker-turn input, but with no visible test suite the false-negative rate on your transcript dialect is unknown until you measure it yourself. The result is a go/no-go signal on whether the tool's regex + NER coverage matches your data before you commit to it.
Round-trip pseudonymization for document review workflows: Use the CLI's two-step pattern, `pii-scrub scrub` before sending to an LLM API and `pii-scrub restore` on the output to re-inject real names, as a shell-script wrapper around any existing LLM call that currently receives raw sensitive documents. Because the Faker seed is deterministic, pseudonyms are stable across a session, so coreference in the model output survives restore. Validate the round trip on five real documents before relying on it: restore only works for entities the scrub step caught, so any missed entity reaches the LLM in plain text.
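
The scrub/restore round trip and the missed-entity check can be illustrated with a small stand-alone sketch. It is not the tool's implementation: a hash-indexed lookup into tiny hypothetical name pools stands in for Faker so the example runs on the standard library alone, the entity list is supplied by hand instead of coming from regex + NER, and the in-memory `session` dict is only a stand-in for the saved session file, whose real schema is not described here.

```python
import hashlib

# Stand-in for Faker (an assumption): a small fixed pool per entity type.
POOLS = {
    "PERSON": ["Dana Lee", "Sam Ortiz", "Kim Novak", "Ravi Patel"],
    "ORG": ["Birchline Ltd", "Northgate Co", "Harbor Systems"],
}

def pseudonym(entity_type: str, text: str) -> str:
    """Pick a replacement determined only by entity type + original text."""
    digest = hashlib.sha256(f"{entity_type}:{text}".encode("utf-8")).digest()
    pool = POOLS[entity_type]
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]

def scrub(text: str, entities: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Replace each detected entity and record pseudonym -> original."""
    session: dict[str, str] = {}
    for entity_type, original in entities:
        fake = pseudonym(entity_type, original)
        # Two originals can collide on one pseudonym in pools this small;
        # a real tool has to detect and resolve that. This sketch does not.
        session[fake] = original
        text = text.replace(original, fake)
    return text, session

def restore(text: str, session: dict[str, str]) -> str:
    """Swap every recorded pseudonym back to its original."""
    for fake, original in session.items():
        text = text.replace(fake, original)
    return text

document = "Alice Smith from Acme Corp met Bob Jones."
# Pretend detection caught two entities and missed "Bob Jones".
detected = [("PERSON", "Alice Smith"), ("ORG", "Acme Corp")]
hand_labeled = {"Alice Smith", "Acme Corp", "Bob Jones"}

scrubbed, session = scrub(document, detected)
missed = hand_labeled - set(session.values())  # entities sent in plain text
restored = restore(scrubbed, session)
```

Comparing a hand-labeled entity set against the originals recorded in the session is the measurement both use cases call for: anything in `missed` went out unmasked, and restore cannot help with it.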

Landscape Position

Category: Security & Privacy

In this category: elder-plinius--ST3GG (decent 14/24, watch) — steganography suite

Standing: First PII-handling entry in the category; ST3GG covers steganography for a different privacy use case, so there is no functional overlap within the category.

Evidence Base

Density: 6/10 — README (full, 8KB), repo metadata (stars, forks, dates, license, language, archive status), landscape context, and prior appraisal score available; dependency manifest not available, no CI config, no source file listing, no release notes, no commit history detail.

Notes

The gap fill score (2) combined with zero overlap makes this the strongest PAI relevance signal of any NOTE-rated repo in the vault. The constraint is entirely on quality: single-developer, no releases, no tests, no manifest, 6 days of commit history. If NorthwoodsSentinel ships a TypeScript/Bun port (consistent with their loam and brook projects) or wraps this in a clean subprocess-friendly binary, it would clear the INTEGRATE threshold without requiring any re-evaluation of PAI fit.