NorthwoodsSentinel/pii-scrub

Detect, pseudonymize, and restore PII in text. Python port of jcfischer/pii-pseudonymizer with transcript-aware chunking.

Python · 1 star · PII Scrubbing and Pseudonymization · GitHub
Quality: note 11/24
PAI: note 0.63

Verdict

NOTE: catalog as a harvestable preprocessing utility for scrubbing PII before LLM submission; revisit if the author adds packaging or tests, or if community uptake pushes it toward WATCH.

Borderline NOTE/WATCH — standalone score of 11 sits one point below the WATCH threshold of 12. A dependency manifest or CI setup would close that gap.
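
The borderline call can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the standalone score is a plain sum of the three section scores reported under Standalone Assessment, and that a total of 12 or more maps to WATCH. The function names are illustrative, and any tiers above WATCH are not modeled because the report does not state them.

```python
# Section scores as reported in the Standalone Assessment (each out of 8).
SECTION_SCORES = {"health": 3, "documentation": 6, "engineering": 2}
WATCH_THRESHOLD = 12  # stated in the verdict; treated as inclusive (assumption)


def standalone_total(scores: dict[str, int]) -> int:
    """Sum the per-section probe passes into the standalone score."""
    return sum(scores.values())


def tier(total: int) -> str:
    """Map a total to NOTE or WATCH using the single threshold the report gives."""
    return "WATCH" if total >= WATCH_THRESHOLD else "NOTE"


current = standalone_total(SECTION_SCORES)
# Hypothetical: probe E2 flips to PASS once a dependency manifest exists.
with_manifest = standalone_total({**SECTION_SCORES, "engineering": 3})

print(current, tier(current))              # 11 NOTE
print(with_manifest, tier(with_manifest))  # 12 WATCH
```

A passing H8 (CI setup) would have the same effect through the health score, since either change adds exactly one point.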

Standalone Assessment

11/24 — stale-risk / adequately-documented / no-signals

Health: 3/8 (stale-risk)

Failed:
H1: FAIL - no tagged release exists (latest_release: none)
H2: FAIL - no release at all, so the recency condition cannot be satisfied
H4: FAIL - last commit 2026-03-30 is ~48 days before today (2026-05-17), which exceeds the 30-day window
H6: FAIL - 0 open issues; the probe requires >0 as a sign of active triage
H8: FAIL - README contains no CI badge, GitHub Actions reference, or .github/workflows mention

Passed:
H3: PASS - last commit 2026-03-30 is ~48 days ago, well within the 6-month window
H5: PASS - archived: false
H7: PASS - MIT license present
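
The same commit date fails H4 and passes H3 because the two probes apply different windows. A small sketch of that check, using the dates from this report; the helper names are illustrative, and approximating six months as 183 days is an assumption.

```python
from datetime import date

TODAY = date(2026, 5, 17)        # assessment date given in the report
LAST_COMMIT = date(2026, 3, 30)  # last commit date given in the report


def days_since(event: date, today: date) -> int:
    """Whole days elapsed between an event and the assessment date."""
    return (today - event).days


def within_window(event: date, today: date, window_days: int) -> bool:
    """True when the event falls inside the probe's recency window."""
    return days_since(event, today) <= window_days


age = days_since(LAST_COMMIT, TODAY)
h4_pass = within_window(LAST_COMMIT, TODAY, 30)   # H4: 30-day window
h3_pass = within_window(LAST_COMMIT, TODAY, 183)  # H3: six months, assumed to be 183 days

print(age, h4_pass, h3_pass)  # 48 False True
```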

Documentation: 6/8 (adequately-documented)

Failed:
D5: FAIL - no heading matching "API", "Configuration", "Options", "Reference", "Commands", or "Parameters"; "Filtering false positives" and "Session files" are reference-style, but their headings don't match the probe keywords
D7: FAIL - README links to the author's blog and Substack, but there is no dedicated docs site, wiki, or /docs directory for the tool itself

Passed:
D1: PASS - README content is present and substantial
D2: PASS - README is clearly several KB with multiple sections and code blocks
D3: PASS - "pip install spacy faker" and "python3 -m spacy download en_core_web_sm" installation block present
D4: PASS - ## Usage section contains five distinct CLI code blocks covering pseudo, transcript, restore, detect, and pattern-only modes
D6: PASS - first sentence "Detect, pseudonymize, and restore personally identifiable information (PII) in text. Built for processing meeting transcripts..." clearly describes purpose within the first 200 chars
D8: PASS - "## Known limitations" section explicitly lists four caveats including model accuracy, first-name collision, false-positive terms, and Faker seed behavior
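
D5 fails on heading wording rather than on missing content, which a keyword match over headings makes easy to see. The sketch below assumes the probe does a case-insensitive whole-word match against Markdown heading text; the real matching rule is not stated here. The sample headings are the titles named in this report, with heading levels assumed.

```python
import re

PROBE_KEYWORDS = ("API", "Configuration", "Options", "Reference", "Commands", "Parameters")
KEYWORD_RE = re.compile(r"\b(" + "|".join(PROBE_KEYWORDS) + r")\b", re.IGNORECASE)
HEADING_RE = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def reference_headings(readme: str) -> list[str]:
    """Return the text of Markdown headings that contain a probe keyword."""
    hits = []
    for line in readme.splitlines():
        match = HEADING_RE.match(line)
        if match and KEYWORD_RE.search(match.group(1)):
            hits.append(match.group(1))
    return hits


# Heading titles mentioned in this report; the "##" level is an assumption.
sample = "\n".join([
    "## Usage",
    "## Filtering false positives",
    "## Session files",
    "## Known limitations",
])

print(reference_headings(sample))                         # []
print(reference_headings(sample + "\n## Configuration"))  # ['Configuration']
```

Under this reading, renaming one of the reference-style sections would be enough to flip the probe.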

Engineering Signals: 2/8 (no-signals)

Failed:
E1: FAIL - primary language is Python, not in the typed-language list (TypeScript, Rust, Go, Java, Kotlin, C#, Swift, Scala, Haskell)
E2: FAIL - dependency manifest listed as "Not available"; no pyproject.toml, requirements.txt, or equivalent found
E4: FAIL - README contains no mention of testing, pytest, or a CI test step
E5: FAIL - 1 star, requires ≥50
E6: FAIL - ~0.55 stars/month (1 star over ~1.8 months since 2026-03-24), requires ≥2
E7: FAIL - 0 forks, requires ≥5

Passed:
E3: PASS - README explicitly states "No other dependencies. Python 3.10+" with only spacy and faker as runtime deps, well under the CLI threshold of fewer than 15 dependencies
E8: PASS - description "Detect, pseudonymize, and restore PII in text. Python port of jcfischer/pii-pseudonymizer with transcript-aware chunking." is meaningful, >20 characters, and not a restatement of the repo name
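
The E6 star-velocity figure is a simple rate. The sketch below reproduces it from the star count and the start date quoted in the probe. The average month length of 30.44 days is an assumption, since the report does not say how it converts days to months, so the result matches the quoted ~0.55 only approximately.

```python
from datetime import date

DAYS_PER_MONTH = 30.44  # average month length (assumption; convention not stated)
E6_MINIMUM = 2.0        # stars/month required by probe E6


def stars_per_month(stars: int, since: date, today: date) -> float:
    """Average stars gained per month between the start date and today."""
    months = (today - since).days / DAYS_PER_MONTH
    return stars / months


months_elapsed = (date(2026, 5, 17) - date(2026, 3, 24)).days / DAYS_PER_MONTH
rate = stars_per_month(1, date(2026, 3, 24), date(2026, 5, 17))

print(round(months_elapsed, 1))  # 1.8
print(rate >= E6_MINIMUM)        # False
```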

PAI Fit

| Dimension | Score | Assessment |
| --- | --- | --- |
| Harvest Value | 1 | Transcript-aware chunking to work around NER's limitation on long-form input is a concrete pattern worth studying; deterministic Faker seeding (by entity type + original text) is also interesting for reproducible pseudonymization pipelines. Neither rises to novel architecture but both are immediately applicable design patterns. |
| Integration Readiness | 1 | CLI is clean (python3 pii_scrub.py pseudo/restore/detect) and the module is importable, but the repo ships no pip-installable package; consuming it requires cloning and managing the script directly, adding moderate friction beyond the two-dep install. |
| Overlap Risk | 0 | No existing vault repo covers PII detection, pseudonymization, or round-trip restore; crowding index is 0 and the landscape summary lists no related category. This is a clear functional gap in the vault. |
| Gap Fill | 1 | PII scrubbing is not a declared gap in the landscape Gaps section, but the vault contains several personal-AI memory and AI preprocessing repos (loam, mempalace) where sanitizing sensitive text before LLM submission is a natural complementary step; the need is real even if not explicitly flagged. |
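
To make the transcript-aware chunking pattern concrete, here is a minimal sketch of the general idea: split a transcript at speaker turns, then pack whole turns into chunks small enough to run through NER one at a time. It is not the repo's implementation. The "Name: text" turn format, the regex, the function names, and the character limit are all assumptions for illustration.

```python
import re

# Assumed turn format: a line that starts with a capitalized label and a colon.
# This heuristic can misfire on ordinary lines such as "Note: ..." (sketch only).
SPEAKER_TURN = re.compile(r"^[A-Z][\w .'-]{0,40}:\s")


def split_turns(transcript: str) -> list[str]:
    """Group lines into speaker turns; continuation lines join the previous turn."""
    turns: list[str] = []
    for line in transcript.splitlines():
        if SPEAKER_TURN.match(line) or not turns:
            turns.append(line)
        else:
            turns[-1] += "\n" + line
    return turns


def chunk_turns(turns: list[str], max_chars: int) -> list[str]:
    """Pack whole turns into chunks of at most max_chars, never splitting a turn.

    A single turn longer than max_chars is kept intact as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for turn in turns:
        candidate = turn if not current else current + "\n" + turn
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = turn
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


transcript = (
    "Alice: Hi Bob, did you email Carol?\n"
    "Bob: Yes, I sent it Monday.\n"
    "and she replied.\n"
    "Alice: Great."
)
chunks = chunk_turns(split_turns(transcript), max_chars=60)
print(len(chunks))  # 2
```

Because chunks break only at turn boundaries, joining them with newlines reproduces the original transcript, so entity offsets found per chunk can be mapped back.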

Composite: 0.63

Competitive Positioning

Category: PII Scrubbing and Pseudonymization
Crowding: 0 repos in vault (first-in-category)
Alternatives: first in this category
vs. top alternative: no prior vault entry exists for comparison; closest contextual neighbor is NorthwoodsSentinel--loam (personal AI memory substrate), which handles data storage but has no sanitization layer
Landscape impact: filling a gap; no PII scrubbing tool exists in the vault, and the broader PAI ecosystem lacks a preprocessing step for removing sensitive data before LLM submission

Evidence Base

Density: 6/10 — Available: README (detailed, multi-section), description, language, license, stars, forks. Missing: topics (none set), dependency manifest (not available), release metadata (no releases), CI/test configuration signals.

Notes

The repo is a Python port of a TypeScript/Bun original (jcfischer/pii-pseudonymizer), which means the core design is borrowed rather than invented here. The genuine addition is the transcript-aware chunking for speaker-turn formats, a pragmatic fix for a real NER limitation. The NER_SKIP_TERMS extensibility point is a clean pattern for domain-specific false-positive suppression. Session files containing the full translation table are a sensible round-trip mechanism, and the security warning to delete them is appropriately surfaced. The author (NorthwoodsSentinel) has two other vault entries (loam, brook), suggesting a coherent personal toolkit in progress rather than a throwaway script.
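
The round trip described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's code: a SHA-256 digest stands in for the seeded Faker call, so replacements are opaque tokens rather than realistic names; the entity list is supplied by hand instead of coming from spaCy; and the NER_SKIP_TERMS contents and function names are hypothetical.

```python
import hashlib

# Hypothetical contents; the real NER_SKIP_TERMS set lives in the tool's source.
NER_SKIP_TERMS = {"Zoom", "Slack"}


def fake_for(entity_type: str, original: str) -> str:
    """Derive a stable replacement token from entity type + original text.

    Stand-in for deterministic Faker seeding: same inputs, same output.
    """
    digest = hashlib.sha256(f"{entity_type}:{original}".encode("utf-8")).hexdigest()
    return f"{entity_type}_{digest[:8]}"


def pseudonymize(text: str, entities: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Replace each (entity_type, original) pair and return the translation table."""
    table: dict[str, str] = {}
    # Longest originals first, so a short entity never clobbers part of a longer one.
    for entity_type, original in sorted(entities, key=lambda e: -len(e[1])):
        if original in NER_SKIP_TERMS:
            continue  # known false positive: leave it in the text
        token = fake_for(entity_type, original)
        table[token] = original
        text = text.replace(original, token)
    return text, table


def restore(text: str, table: dict[str, str]) -> str:
    """Invert pseudonymize() using the saved translation table."""
    for token, original in table.items():
        text = text.replace(token, original)
    return text


source = "Alice Johnson met Acme Corp on Zoom."
found = [("PERSON", "Alice Johnson"), ("ORG", "Acme Corp"), ("ORG", "Zoom")]
scrubbed, table = pseudonymize(source, found)
print(restore(scrubbed, table) == source)  # True
```

The table maps every replacement back to the original PII, which is why a persisted copy of it is as sensitive as the unscrubbed text.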