Add guide for customizing Presidio Docker images #1792
base: main
This document provides a comprehensive guide on how to build and customize Presidio Docker images to support additional languages and configurations, including prerequisites, steps for modification, and troubleshooting tips.
@SAIRAMSSSS please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
omri374 left a comment
Thanks! This is a great start! Left some comments for discussion.
|
|
||
| Navigate to `presidio-analyzer/Dockerfile` and add your desired spaCy language models. | ||
|
|
||
| ### Example: Adding Spanish Support |
Presidio supports installing spaCy, Stanza, and transformers models through the NLP configuration, so there is no need to add them explicitly to the Dockerfile. Have you given this a try?
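To make the suggestion above concrete, here is a minimal sketch of declaring the models in an NLP configuration rather than in the Dockerfile. It assumes the `spacy` engine and uses the `NlpEngineProvider` and `AnalyzerEngine` classes from `presidio-analyzer`; the language and model pairs are illustrative choices, not taken from the PR, and the helper names `supported_languages` and `build_analyzer` are hypothetical.

```python
# Sketch: declare the models in an NLP configuration instead of the Dockerfile.
# The language/model pairs below are illustrative assumptions, not from the PR.
NLP_CONFIGURATION = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "en", "model_name": "en_core_web_lg"},
        {"lang_code": "es", "model_name": "es_core_news_lg"},
    ],
}


def supported_languages(configuration):
    """Return the language codes declared in the configuration (hypothetical helper)."""
    return [model["lang_code"] for model in configuration["models"]]


def build_analyzer(configuration):
    """Create an AnalyzerEngine from the configuration (requires presidio-analyzer)."""
    # Imported lazily so the configuration can be inspected without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
    return AnalyzerEngine(
        nlp_engine=nlp_engine,
        supported_languages=supported_languages(configuration),
    )
```

Calling `build_analyzer(NLP_CONFIGURATION)` inside the container would then load the declared models. The same structure can also be kept in a YAML file and passed to the provider by path, which may be easier to mount into an image than code.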
| **Problem**: Adding 10+ languages at once can cause the Docker image to run out of memory during build or runtime. | ||
|
|
||
| **Solutions**: | ||
| - Use smaller spaCy models (e.g., `es_core_news_sm` instead of `es_core_news_lg`) |
Please add a caveat that smaller models are likely to be less accurate at detecting PII in the text.
| docker run -d -p 5002:3000 --memory="4g" presidio-analyzer-custom:latest | ||
| ``` | ||
| - Build images with only the languages you actually need | ||
| - Consider using transformers models which can be more memory-efficient |
I'm not sure this is true. Do you have a concrete example?
| - `md` (medium): ~40MB, balanced | ||
| - `lg` (large): ~500MB+, most accurate but resource-intensive | ||
|
|
||
| **Recommendation**: Start with `md` models for a good balance. |
Our recommendation is to start with the large models.
| # Install spaCy language models | ||
| RUN python -m spacy download en_core_web_lg | ||
| RUN python -m spacy download es_core_news_md | ||
| RUN python -m spacy download fr_core_news_md |
If you download models but do not reference them in the NER model configuration, Presidio will ignore those models.
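One way to catch the mismatch described above is to compare the models a Dockerfile downloads against the models the NLP configuration declares. The following is a rough sketch, not part of Presidio: the helpers `downloaded_models` and `unconfigured_models` are hypothetical, the example configuration is invented for illustration, and the check assumes spaCy-style configurations where `model_name` is a plain string.

```python
import re

# Matches Dockerfile lines of the form: RUN python -m spacy download <model>
DOWNLOAD_PATTERN = re.compile(r"^\s*RUN\s+python\s+-m\s+spacy\s+download\s+(\S+)")


def downloaded_models(dockerfile_text):
    """Collect model names from spaCy download lines in a Dockerfile."""
    models = []
    for line in dockerfile_text.splitlines():
        match = DOWNLOAD_PATTERN.match(line)
        if match:
            models.append(match.group(1))
    return models


def unconfigured_models(dockerfile_text, nlp_configuration):
    """Return downloaded models that the NLP configuration never references."""
    configured = {m["model_name"] for m in nlp_configuration.get("models", [])}
    return [name for name in downloaded_models(dockerfile_text) if name not in configured]


# The three download lines quoted in the diff under review.
DOCKERFILE_SNIPPET = """\
RUN python -m spacy download en_core_web_lg
RUN python -m spacy download es_core_news_md
RUN python -m spacy download fr_core_news_md
"""

# Hypothetical configuration that only declares the English model.
EXAMPLE_CONFIGURATION = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
}

print(unconfigured_models(DOCKERFILE_SNIPPET, EXAMPLE_CONFIGURATION))
# prints ['es_core_news_md', 'fr_core_news_md']
```

With this example configuration, the Spanish and French models would be downloaded into the image but never loaded, which is the situation the comment warns about.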
Summary
This PR adds comprehensive documentation for building and customizing Presidio Docker images to support additional languages.
Changes
`docs/docker_customization.md` with detailed instructions on:
Addresses Issue
Closes #1663
This documentation fulfills the request for more elaborate instructions on building custom Docker images for Presidio, specifically covering:
Testing
Documentation has been reviewed for accuracy and completeness. All code examples follow Presidio's existing patterns.