Create a generic dataset that includes some PHI/PII to evaluate the first three modes #1810

@RonShakutai

Description

Is your feature request related to a problem? Please describe.
Create a curated benchmark dataset (benchmark_samples.yaml or similar) containing:

15-20 annotated text samples with varying complexity (simple one-liners, medium clinical notes, long discharge summaries)
Diverse PHI/PII entity types: PERSON, EMAIL, PHONE, SSN, dates, medical IDs (MRN, NPI, accession numbers), locations, organizations
Ground truth annotations with expected entity spans and types
Samples specifically designed to test edge cases: inverted names ("dr nakamura kenji"), hyphenated names, inline IDs, HIPAA-specific entities (ages 90+)
Alongside the dataset, a benchmark script that reports precision/recall/F1 per mode with a side-by-side comparison
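
As a rough illustration of the per-mode reporting described above, here is a minimal sketch that scores predicted spans against ground truth using exact span-and-type matching. The `Span` type, the function names, and the mode names `mode_a` / `mode_b` are hypothetical placeholders (this issue does not name the modes), and exact matching is only one possible choice; partial-overlap scoring may be preferable for long documents.

```python
from collections import namedtuple

# Hypothetical span type: character offsets plus an entity label.
Span = namedtuple("Span", ["start", "end", "entity_type"])


def score(expected, predicted):
    """Exact-match precision, recall, and F1 over two collections of spans."""
    expected, predicted = set(expected), set(predicted)
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare_modes(expected, predictions_by_mode):
    """Return one score dictionary per mode, keyed by mode name."""
    return {mode: score(expected, spans) for mode, spans in predictions_by_mode.items()}


def format_table(results):
    """Render the per-mode scores as a plain-text side-by-side table."""
    lines = [f"{'mode':<10}{'precision':>10}{'recall':>10}{'f1':>10}"]
    for mode, m in results.items():
        lines.append(
            f"{mode:<10}{m['precision']:>10.2f}{m['recall']:>10.2f}{m['f1']:>10.2f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Ground truth for the text "Contact Jane Doe at jane@example.com".
    expected = [Span(8, 16, "PERSON"), Span(20, 36, "EMAIL")]
    # Placeholder predictions; real ones would come from each mode under test.
    predictions = {
        "mode_a": [Span(8, 16, "PERSON")],
        "mode_b": [Span(8, 16, "PERSON"), Span(20, 36, "EMAIL"), Span(0, 7, "ORGANIZATION")],
    }
    print(format_table(compare_modes(expected, predictions)))
```

Scoring on `(start, end, entity_type)` tuples keeps the sketch independent of any particular analyzer output format; an adapter per mode would convert its results into `Span` values.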

Additional context
Sample categories should include:

Simple: single entity detection
Medium: clinical text with multiple entity types
Long/Complex: full medical documents (ultrasound reports, discharge summaries)
LLM-specific: entities only detectable by SLM (e.g., ages over 89 for HIPAA)
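
To make the categories above concrete, here is one possible in-memory shape for an annotated sample, with a check that each ground-truth span actually matches the text. All names here (`Sample`, `Entity`, `validate`, the category keys, and the `AGE` label) are assumptions for illustration, not an agreed schema; the same fields could be serialized to `benchmark_samples.yaml`.

```python
from dataclasses import dataclass, field

# Hypothetical category keys mirroring the four sample categories listed above.
CATEGORIES = {"simple", "medium", "long_complex", "llm_specific"}


@dataclass
class Entity:
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    entity_type: str  # e.g. "EMAIL"; label names are placeholders
    text: str         # surface form, used to sanity-check the offsets


@dataclass
class Sample:
    sample_id: str
    category: str
    text: str
    entities: list = field(default_factory=list)


def validate(sample):
    """Return a list of problems found in one sample (empty if consistent)."""
    problems = []
    if sample.category not in CATEGORIES:
        problems.append(f"{sample.sample_id}: unknown category {sample.category!r}")
    for e in sample.entities:
        if sample.text[e.start:e.end] != e.text:
            problems.append(
                f"{sample.sample_id}: span {e.start}-{e.end} does not match {e.text!r}"
            )
    return problems


# Two synthetic examples; the data is invented and contains no real PHI.
SAMPLES = [
    Sample("simple-01", "simple", "Email me at jane@example.com",
           [Entity(12, 28, "EMAIL", "jane@example.com")]),
    Sample("llm-01", "llm_specific", "The patient is a 93-year-old woman.",
           [Entity(17, 28, "AGE", "93-year-old")]),
]

if __name__ == "__main__":
    for s in SAMPLES:
        print(s.sample_id, validate(s) or "ok")
```

Storing the surface text next to the offsets is redundant by design: it lets a loader catch off-by-one annotation errors before any benchmark run.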
