EDS-Pseudo is a hybrid model for detecting personally identifying entities in clinical reports
The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).
The model is built on top of edsnlp and consists of a hybrid model (rule-based + deep learning), for which we provide rules (`eds-pseudo/pipes`) and a training recipe (`train.py`).
We also provide some fictitious templates (`templates.txt`) and a script to generate a synthetic dataset (`generate_dataset.py`).
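To illustrate the general idea of template-based generation, here is a minimal sketch that fills placeholders in a fictitious template and records the character offsets of every inserted value. The `{LABEL}` placeholder syntax, the `fill_template` helper, and the sample values are all assumptions made for this example. They are not taken from `templates.txt` or `generate_dataset.py`, which may work differently.

```python
import random
import re

# Hypothetical sample values per label (illustrative only).
FAKE_VALUES = {
    "NOM": ["Martin", "Dubois"],
    "PRENOM": ["Jeanne", "Louis"],
    "VILLE": ["Paris", "Lyon"],
}

# Hypothetical placeholder syntax: {LABEL}
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template, rng):
    """Replace each {LABEL} placeholder and return (text, entities)."""
    parts = []
    entities = []
    cursor = 0  # position in the template
    length = 0  # length of the generated text so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        label = match.group(1)
        value = rng.choice(FAKE_VALUES[label])
        entities.append(
            {"start": length, "end": length + len(value), "label": label}
        )
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), entities


text, entities = fill_template(
    "Patient : {PRENOM} {NOM}, domicilié à {VILLE}.", random.Random(0)
)
for ent in entities:
    print(text[ent["start"]:ent["end"]], ent["label"])
```

Because the offsets are recorded while the text is assembled, each generated document comes with exact annotations and needs no manual labelling.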
The entities that are detected are listed below.
Label | Description |
---|---|
ADRESSE | Street address, e.g. 33 boulevard de Picpus |
DATE | Any absolute date other than a birthdate |
DATE_NAISSANCE | Birthdate |
HOPITAL | Hospital name, e.g. Hôpital Rothschild |
IPP | Internal AP-HP identifier for patients, displayed as a number |
MAIL | Email address |
NDA | Internal AP-HP identifier for visits, displayed as a number |
NOM | Any last name (patients, doctors, third parties) |
PRENOM | Any first name (patients, doctors, etc) |
SECU | Social security number |
TEL | Any phone number |
VILLE | Any city |
ZIP | Any zip code |
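Once entities with these labels have been detected, a common next step is to mask them. The sketch below is one simple way to do so, not part of EDS-Pseudo itself: the `mask_entities` helper and the sample sentence are hypothetical, and the spans are hard-coded `(start, end, label)` character offsets. With spaCy-style entity spans, these values would typically come from `ent.start_char`, `ent.end_char`, and `ent.label_`.

```python
def mask_entities(text, entities):
    """Replace each (start, end, label) span with a <LABEL> placeholder."""
    result = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            raise ValueError("overlapping entities are not supported in this sketch")
        result.append(text[cursor:start])
        result.append(f"<{label}>")
        cursor = end
    result.append(text[cursor:])
    return "".join(result)


text = "Mme Jeanne Martin, née le 12/03/1950, habite à Paris."
entities = [
    (4, 10, "PRENOM"),
    (11, 17, "NOM"),
    (26, 36, "DATE_NAISSANCE"),
    (47, 52, "VILLE"),
]
print(mask_entities(text, entities))
# Mme <PRENOM> <NOM>, née le <DATE_NAISSANCE>, habite à <VILLE>.
```

Processing the spans in order of their start offset keeps the original offsets valid, since the output is rebuilt from slices of the untouched input.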
The public pretrained model is available on the HuggingFace model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see `generate_dataset.py`). You can also test it directly on the demo.
Install the latest version of edsnlp
pip install "edsnlp[ml]" -U
Get access to the model at AP-HP/eds-pseudo-public
Create and copy a huggingface token: https://huggingface.co/settings/tokens?new_token=true
Register the token (only once) on your machine

    import huggingface_hub

    huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)

Load the model

    import edsnlp

    nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
    doc = nlp(
        "En 2015, M. Charles-François-Bienvenu "
        "Myriel était évêque de Digne. C’était un vieillard "
        "d’environ soixante-quinze ans ; il occupait le "
        "siège de Digne depuis 2006."
    )

    for ent in doc.ents:
        print(ent, ent.label_, str(ent._.date))

To apply the model on many documents using one or more GPUs, refer to the documentation of edsnlp.
If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:

    git clone https://github.com/aphp/eds-pseudo.git
    cd eds-pseudo

Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.
poetry install

    import edsnlp

    nlp = edsnlp.blank("eds")

    # Some text cleaning
    nlp.add_pipe("eds.normalizer")

    # Various simple rules
    nlp.add_pipe(
        "eds_pseudo.simple_rules",
        config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
    )

    # Address detection
    nlp.add_pipe("eds_pseudo.addresses")

    # Date detection
    nlp.add_pipe("eds_pseudo.dates")

    # Contextual rules (requires a dict of info about the patient)
    nlp.add_pipe("eds_pseudo.context")

    # Apply it to a text
    doc = nlp(
        "En 2015, M. Charles-François-Bienvenu "
        "Myriel était évêque de Digne. C’était un vieillard "
        "d’environ soixante-quinze ans ; il occupait le "
        "siège de Digne depuis 2006."
    )

    for ent in doc.ents:
        print(ent, ent.label_)

    # 2015 DATE
    # Charles-François-Bienvenu NOM
    # Myriel PRENOM
    # 2006 DATE

Before training a model, you should update the `configs/config.cfg` and `pyproject.toml` files to fit your needs.
Put your data in the `data/dataset` folder (or edit the paths in the `configs/config.cfg` file to point to `data/gen_dataset/train.jsonl`).
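As a rough illustration of what a JSONL training file can look like, the sketch below writes one annotated record per line, reads it back, and checks that every span falls inside its text. The field names (`note_id`, `note_text`, `entities`, `start`, `end`, `label`) and the `write_jsonl` and `read_jsonl` helpers are assumptions made for this example. Check the readers configured in `configs/config.cfg` for the schema the project actually expects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record layout (field names are assumptions, see above).
records = [
    {
        "note_id": "doc-1",
        "note_text": "Rendez-vous avec le Dr Martin à Paris.",
        "entities": [
            {"start": 23, "end": 29, "label": "NOM"},
            {"start": 32, "end": 37, "label": "VILLE"},
        ],
    }
]


def write_jsonl(path, rows):
    """Write one JSON object per line, keeping accents readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read back a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    write_jsonl(path, records)
    loaded = read_jsonl(path)

# Sanity check: every span must lie inside its text.
for row in loaded:
    for ent in row["entities"]:
        assert 0 <= ent["start"] < ent["end"] <= len(row["note_text"])
        print(row["note_text"][ent["start"]:ent["end"]], ent["label"])
```

Running a check like this before training can catch offset errors early, for instance when annotations were produced on a differently normalized version of the text.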
Then, run the training script
python scripts/train.py --config configs/config.cfg --seed 43
This will train a model and save it in `artifacts/model-last`. You can evaluate it on the test set (defaults to `data/dataset/test.jsonl`) with:
python scripts/evaluate.py --config configs/config.cfg
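For intuition about what an entity-level evaluation measures, here is a minimal sketch of exact-match precision, recall, and F1 over `(start, end, label)` spans. This is a generic formulation, assumed for illustration. The metrics computed by `scripts/evaluate.py` may differ (for example token-level or per-label scores), and the `span_prf` helper and sample spans are hypothetical.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 4, "DATE"), (11, 17, "NOM"), (30, 35, "VILLE")]
# The second prediction has the right boundaries but the wrong label,
# so it counts as both a false positive and a false negative.
pred = [(0, 4, "DATE"), (11, 17, "PRENOM"), (30, 35, "VILLE")]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For de-identification, recall usually matters most, since a missed entity leaves identifying information in the document.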
To package it, run:
python scripts/package.py
This will create a `dist/eds-pseudo-aphp-***.whl` file that you can install with `pip install dist/eds-pseudo-aphp-***`.
You can use it in your code:

    import edsnlp

    # Either from the model path directly
    nlp = edsnlp.load("artifacts/model-last")

    # Or from the wheel file
    import eds_pseudo_aphp

    nlp = eds_pseudo_aphp.load()

Visit the documentation for more information!
Please find our publication at the following link: https://doi.org/mkfv.
If you use EDS-Pseudo, please cite us as below:

    @article{eds_pseudo,
      title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
      author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
      journal={Methods of Information in Medicine},
      year={2024},
      publisher={Georg Thieme Verlag KG}
    }

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.