Product
A calibrated prevalence bound you can defend.
Not "we find your PII" (every scanner does that, cheaper). The artifact is a misclassification-corrected corpus-prevalence estimate with a confidence interval, stratified and drift-corrected. It is a calibrated, documented bound, built to align with EU regulator guidance on identifiability (Art. 29 WP216) and AI-audit accuracy (ICO), and one a DPO can stand behind once it is validated on your corpus. Zero of thirteen incumbent scanners emit this.
Packaging
Three tiers. The first two are the flagship engagement.
Prevalence Diagnostic
Sampled, stratified, misclassification-corrected corpus-prevalence audit with a CI, the GDPR 3-tier taxonomy (direct / quasi-indirect / Art. 9), a document-level singling-out verdict, and local/BYOK execution. A DPA-defensible report.
Drift-Monitoring Retainer
The drift sentinel as a service. We re-run the sample on a set cadence, detect when the live judge-score distribution diverges from the calibration distribution, re-calibrate, and re-issue the bound. Your audit number stays valid as the corpus shifts.
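One way to picture the divergence check is a two-sample Kolmogorov-Smirnov test between the judge scores seen at calibration and the scores seen live. This is a minimal sketch under our own assumptions, not the sentinel itself: the function names, the choice of the KS statistic, and the 0.05 significance level are illustrative.

```python
import bisect
import math


def ks_statistic(calibration_scores, live_scores):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    cal = sorted(calibration_scores)
    live = sorted(live_scores)
    gap = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a sample point.
    for x in cal + live:
        f_cal = bisect.bisect_right(cal, x) / len(cal)
        f_live = bisect.bisect_right(live, x) / len(live)
        gap = max(gap, abs(f_cal - f_live))
    return gap


def drifted(calibration_scores, live_scores, alpha=0.05):
    """True when the KS statistic exceeds the large-sample critical value.

    alpha is an assumed significance level, not a documented setting.
    """
    n, m = len(calibration_scores), len(live_scores)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(calibration_scores, live_scores) > critical
```

A `True` result would be the trigger to re-calibrate and re-issue the bound. The critical value here is the usual large-sample approximation, so small samples would need an exact test instead.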
audit-that-ships
Per-commit fairness artifact: a tamper-evident, commit-SHA-keyed, framework-mapped pass/fail record generated on every push. This tier sells the artifact and the cadence, not the math.
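To make "tamper-evident, commit-SHA-keyed" concrete, here is a minimal sketch of a hash-chained record log. It is an illustration under our own assumptions, not the shipped record format: the field names, the SHA-256 chain, and the example framework label are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel for "no previous record"


def _digest(body):
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(log, commit_sha, framework, passed):
    """Append a pass/fail record that commits to the previous record's hash."""
    body = {
        "commit_sha": commit_sha,
        "framework": framework,  # hypothetical framework label
        "passed": passed,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    record = dict(body, hash=_digest(body))
    log.append(record)
    return record


def verify(log):
    """Recompute every hash and link; any edited record breaks the chain."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Editing any earlier record changes its hash and invalidates every later link. A chain like this is only tamper-evident if the latest hash is also anchored somewhere the editor cannot rewrite, for example by signing it or publishing it externally.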
Buy (Stripe, placeholder): buy.stripe.com/test_PLACEHOLDER_REPLACE_ME. Replace with the real link.
Do NOT wire a live Stripe backend here.
Why this, not Google SDP
A raw scan count is uncorrected. The corrected bound is the statement you can defend.
Google Cloud Sensitive Data Protection profiles at $0.03/GB and returns a per-finding
likelihood. At petabyte scale that is billions of individual findings you cannot adjudicate,
and a DPO cannot stand behind count(findings). We sample, correct the observed rate for the
detector's own false-positive and false-negative rates, and bound the corpus prevalence with
a CI. The human-adjudication cost is roughly fixed regardless of corpus size. That is the
edge, and it only exists at scale.
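The correction step can be sketched with the standard Rogan-Gladen estimator. This is a minimal illustration under stated assumptions, not the production method: the function name and arguments are ours, the detector's sensitivity and specificity are assumed to come from a gold-labelled set separate from the prevalence sample, and the interval is a simple delta-method (Wald) approximation.

```python
import math


def corrected_prevalence(flagged, sampled, tp, gold_pos, tn, gold_neg, z=1.96):
    """Misclassification-corrected prevalence with an approximate CI.

    flagged / sampled : detector hits in the random corpus sample
    tp / gold_pos     : detector hits among gold-labelled positives (sensitivity)
    tn / gold_neg     : detector misses among gold-labelled negatives (specificity)
    """
    q = flagged / sampled   # raw, uncorrected positive rate
    se = tp / gold_pos      # sensitivity = 1 - false-negative rate
    sp = tn / gold_neg      # specificity = 1 - false-positive rate
    j = se + sp - 1.0
    if j <= 0:
        raise ValueError("detector is no better than chance; cannot correct")
    # Rogan-Gladen point estimate, clipped to the valid range.
    p = min(1.0, max(0.0, (q + sp - 1.0) / j))
    # Delta-method variance: sampling error plus uncertainty in se and sp.
    var = (
        q * (1 - q) / sampled
        + p ** 2 * se * (1 - se) / gold_pos
        + (1 - p) ** 2 * sp * (1 - sp) / gold_neg
    ) / j ** 2
    half = z * math.sqrt(var)
    return p, max(0.0, p - half), min(1.0, p + half)
```

With 120 hits in a sample of 1,000 and a detector measured at 0.90 sensitivity and 0.95 specificity, the raw 12% rate corrects to about 8.2%. The inputs are sample counts, not corpus size, which is why the adjudication cost stays roughly flat as the corpus grows.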
Scope a diagnostic
Tell us the corpus and the regulatory exposure. We come back with a sampling plan, a gold-label budget, and a fixed price.
All demo data is synthetic. We never ship real PII to detect PII.
Validated so far on dev splits of synthetic and Presidio-style corpora, not yet on a live customer corpus. The first engagement is that validation, and we sell it as such. The interval bounds sampling error under the stated method; it does not bound disagreement over what counts as personal data under the GDPR tiers, which is settled with your annotators in the diagnostic.