**Is your feature request related to a problem? Please describe.**
Users often struggle to get started with customizing Microsoft Presidio for their specific data types and use cases, which makes it difficult to evaluate and adapt the tool for production scenarios. For example, someone working with chat logs containing financial data, or with clinical notes, may not know the best way to configure Presidio for optimal performance. This lack of guidance creates friction and slows down adoption. It also obscures the intended way to use Presidio: customized for a given domain, rather than in its vanilla form.
**Describe the solution you'd like**
We’d like to provide a curated list of “recipes” tailored to common data privacy and de-identification scenarios (e.g., chat conversations with financial data, clinical notes, REST API logs in JSON format). Each recipe would be an end-to-end, reproducible example and would include:
- A method for generating synthetic data that mimics the real-world scenario.
- Evaluation metrics (e.g., precision, recall, F₂-score, and latency) for different Presidio configurations, such as out-of-the-box, with custom recognizers, or with best-effort tuning (a minimal sketch of such a recipe follows this list).
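
To make this concrete, here is a minimal sketch of what one recipe's skeleton could look like. It assumes the Faker library for synthetic data and uses exact span matching for scoring; the message template, sample count, and scoring logic are illustrative only:

```python
# Sketch of a "financial chat" recipe: generate synthetic messages with known
# PII spans, then score Presidio's out-of-the-box analyzer against them.
import time

from faker import Faker
from presidio_analyzer import AnalyzerEngine

fake = Faker()

def make_sample():
    """Build one synthetic chat message plus its gold PII spans."""
    name, iban = fake.name(), fake.iban()
    text = f"Hi, I'm {name}. Please wire the refund to {iban}."
    return text, {
        ("PERSON", text.index(name), text.index(name) + len(name)),
        ("IBAN_CODE", text.index(iban), text.index(iban) + len(iban)),
    }

analyzer = AnalyzerEngine()  # tier 1: out-of-the-box, spaCy-based pipeline
tp = fp = fn = 0
latencies = []
for _ in range(100):  # illustrative sample size
    text, gold = make_sample()
    start = time.perf_counter()
    predicted = {
        (r.entity_type, r.start, r.end)
        for r in analyzer.analyze(text=text, language="en")
    }
    latencies.append(time.perf_counter() - start)
    tp += len(gold & predicted)
    fn += len(gold - predicted)
    fp += len(predicted - gold)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f2 = 5 * precision * recall / (4 * precision + recall) if precision + recall else 0.0
print(f"P {precision:.2f}  R {recall:.2f}  F2 {f2:.2f}  "
      f"latency {1000 * sum(latencies) / len(latencies):.1f} ms/sample")
```

A real recipe would likely use more forgiving span matching (e.g., the evaluation tooling in the presidio-research repo) and a richer dataset, but the overall shape (generate, analyze, score, time) stays the same.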
Illustrative example:
Presidio's performance on common scenarios
This table benchmarks Microsoft Presidio’s de-identification accuracy and performance across representative data domains and implementation levels—from out-of-the-box use to custom pipelines with transformers or LLMs. Each cell includes precision, recall, F₂ score, and average latency per sample.
Clicking on Notebook takes you to a reproducible Jupyter notebook containing the exact configuration, dataset (if available), evaluation logic, and performance metrics used for that experiment.
| Domain / Scenario | 1. Out-of-the-box (spaCy) | 2. Augmented (+ custom recognizers) | 3. Custom Model (own ML/Transformer) | 4. Hybrid “Best-Effort” (ensemble/LLM + GLiNER) |
|---|---|---|---|---|
| Financial (Chatbot) | P 0.78 · R 0.65 · F₂ 0.71 · 12 ms/sample · Notebook | P 0.85 · R 0.78 · F₂ 0.81 · 18 ms/sample · Notebook | P 0.92 · R 0.88 · F₂ 0.90 · 45 ms/sample · Notebook | P 0.95 · R 0.93 · F₂ 0.94 · 150 ms/sample · Notebook |
| Medical (Clinical Notes) | P 0.70 · R 0.60 · F₂ 0.65 · 15 ms/sample · Notebook | P 0.80 · R 0.75 · F₂ 0.77 · 22 ms/sample · Notebook | P 0.88 · R 0.82 · F₂ 0.85 · 50 ms/sample · Notebook | P 0.93 · R 0.90 · F₂ 0.91 · 160 ms/sample · Notebook |
| Retail (JSON REST) | P 0.82 · R 0.70 · F₂ 0.75 · 10 ms/sample · Notebook | P 0.88 · R 0.80 · F₂ 0.84 · 16 ms/sample · Notebook | P 0.93 · R 0.87 · F₂ 0.90 · 40 ms/sample · Notebook | P 0.96 · R 0.94 · F₂ 0.95 · 140 ms/sample · Notebook |
| Multilingual (e.g., Spanish) | P 0.65 · R 0.55 · F₂ 0.60 · 20 ms/sample · Notebook | P 0.75 · R 0.70 · F₂ 0.72 · 28 ms/sample · Notebook | P 0.85 · R 0.80 · F₂ 0.82 · 55 ms/sample · Notebook | P 0.92 · R 0.88 · F₂ 0.90 · 170 ms/sample · Notebook |
Legend:
- P = Precision
- R = Recall
- F₂ = F₂-score (recall-weighted F-score, β = 2)
- Latency = Average processing time per sample (milliseconds per record)
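
For reference, F₂ is the standard $F_\beta$ score with $\beta = 2$, which weights recall more heavily than precision:

$$
F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}, \qquad F_2 = \frac{5PR}{4P + R}
$$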
This would give users a concrete starting point for customization and performance benchmarking.
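
To illustrate how small the step from tier 1 to tier 2 can be, here is a sketch of the “Augmented” tier: registering one custom regex recognizer with Presidio. The `ACCOUNT_NUMBER` entity and its pattern are hypothetical, made up for this sketch:

```python
# Hypothetical "Augmented" tier: extend the default analyzer with a custom
# regex recognizer for an internal account-number format.
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

account_recognizer = PatternRecognizer(
    supported_entity="ACCOUNT_NUMBER",  # hypothetical entity type
    patterns=[Pattern(name="acct", regex=r"\bACCT-\d{8}\b", score=0.8)],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(account_recognizer)  # now part of the pipeline

print(analyzer.analyze(text="Please charge ACCT-12345678 for the order.",
                       language="en"))
```

Each recipe's notebook would pair a change like this with the synthetic dataset and the evaluation loop above, so every table cell stays reproducible.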
**Describe alternatives you've considered**
- Expanding the current documentation and samples with more general usage guidance, but this lacks the contextual depth and reproducibility of recipe-based examples.
- Providing pre-trained models or templates, but they may not align closely with users' specific domains without example-driven guidance.
**Additional context**
The goal is to help users bridge the gap between generic documentation and production-ready deployment. These recipes would serve as educational tools and performance baselines for different domains. Ideally, they would live in a dedicated section of the Presidio GitHub repo or documentation site, and we could encourage contributions from the community over time.