Description
Is your feature request related to a problem? Please describe.
Create a curated benchmark dataset (benchmark_samples.yaml or similar) containing:
15-20 annotated text samples with varying complexity (simple one-liners, medium clinical notes, long discharge summaries)
Diverse PHI/PII entity types: PERSON, EMAIL, PHONE, SSN, dates, medical IDs (MRN, NPI, accession numbers), locations, organizations
Ground truth annotations with expected entity spans and types
Samples specifically designed to test edge cases: inverted names ("dr nakamura kenji"), hyphenated names, inline IDs, HIPAA-specific entities (ages 90+)
A benchmark script that reports precision/recall/F1 per mode with side-by-side comparison
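As a rough illustration of the per-mode scoring, here is a minimal sketch that assumes exact-match scoring (a prediction counts only if start, end, and type all agree with the ground truth). The `Span` class, the `span_of` helper, and the mode names `"regex"` and `"slm"` are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotated entity: character offsets plus entity type (hypothetical schema)."""
    start: int
    end: int
    entity_type: str


def span_of(text, value, entity_type):
    """Build a Span for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return Span(start, start + len(value), entity_type)


def score(expected, predicted):
    """Exact-match precision/recall/F1 between ground truth and predictions."""
    exp, pred = set(expected), set(predicted)
    tp = len(exp & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(exp) if exp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare_modes(expected, predictions_by_mode):
    """Score every mode against the same ground truth for side-by-side output."""
    return {mode: score(expected, preds) for mode, preds in predictions_by_mode.items()}


# Example using the inverted-name edge case from the list above.
text = "Contact dr nakamura kenji at kenji@example.org"
truth = [
    span_of(text, "nakamura kenji", "PERSON"),
    span_of(text, "kenji@example.org", "EMAIL"),
]
predictions = {
    "regex": [span_of(text, "kenji@example.org", "EMAIL")],  # misses the name
    "slm": list(truth),                                      # finds both
}
report = compare_modes(truth, predictions)
for mode, metrics in report.items():
    print(f"{mode:6s} P={metrics['precision']:.2f} R={metrics['recall']:.2f} F1={metrics['f1']:.2f}")
```

Exact matching is the strictest option; a real benchmark might also want partial-overlap scoring, which is left out here.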
Describe the solution you'd like
A clear and concise description of what you want to happen.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Sample categories should include:
Simple: single entity detection
Medium: clinical text with multiple entity types
Long/Complex: full medical documents (ultrasound reports, discharge summaries)
LLM-specific: entities detectable only by the SLM (e.g., ages over 89 for HIPAA)
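To keep the ground truth trustworthy, each sample could be checked before it enters the dataset. The sketch below assumes a hypothetical sample layout (`category`, `text`, and `entities` with `value`, `start`, `end`, `type`); the category keys and the `AGE` label are illustrative names for the four categories above, not an existing schema.

```python
# Hypothetical category keys mirroring the four sample categories listed above.
ALLOWED_CATEGORIES = {"simple", "medium", "long", "llm_specific"}


def validate_sample(sample):
    """Return a list of problems found in one annotated sample (empty if none)."""
    errors = []
    if sample.get("category") not in ALLOWED_CATEGORIES:
        errors.append(f"unknown category: {sample.get('category')!r}")
    text = sample.get("text", "")
    for ent in sample.get("entities", []):
        start, end = ent["start"], ent["end"]
        if not (0 <= start < end <= len(text)):
            errors.append(f"span out of range: {start}-{end}")
        elif text[start:end] != ent["value"]:
            # The offsets must point at exactly the annotated value.
            errors.append(
                f"span {start}-{end} is {text[start:end]!r}, expected {ent['value']!r}"
            )
    return errors


good = {
    "category": "llm_specific",
    "text": "Patient is 93 years old.",
    "entities": [{"value": "93", "start": 11, "end": 13, "type": "AGE"}],
}
bad = {
    "category": "llm_specific",
    "text": "Patient is 93 years old.",
    "entities": [{"value": "93", "start": 10, "end": 12, "type": "AGE"}],
}
print(validate_sample(good))
print(validate_sample(bad))
```

Storing the expected `value` next to the offsets makes off-by-one annotation mistakes easy to catch when samples are loaded.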