EDS-Pseudo is a hybrid model for detecting personally identifying entities in clinical reports
aphp/eds-pseudo
The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).
The model is built on top of edsnlp, and consists of a hybrid model (rule-based + deep learning) for which we provide rules (`eds-pseudo/pipes`) and a training recipe (`train.py`).
We also provide some fictitious templates (`templates.txt`) and a script to generate a synthetic dataset (`generate_dataset.py`).
The entities that are detected are listed below.
Label | Description |
---|---|
ADRESSE | Street address, e.g. 33 boulevard de Picpus |
DATE | Any absolute date other than a birthdate |
DATE_NAISSANCE | Birthdate |
HOPITAL | Hospital name, e.g. Hôpital Rothschild |
IPP | Internal AP-HP identifier for patients, displayed as a number |
MAIL | Email address |
NDA | Internal AP-HP identifier for visits, displayed as a number |
NOM | Any last name (patients, doctors, third parties) |
PRENOM | Any first name (patients, doctors, etc) |
SECU | Social security number |
TEL | Any phone number |
VILLE | Any city |
ZIP | Any zip code |
The public pretrained model is available on the HuggingFace model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see `generate_dataset.py`). You can also test it directly on the demo.
Install the latest version of edsnlp
pip install "edsnlp[ml]" -U
Get access to the model at AP-HP/eds-pseudo-public
Create and copy a Hugging Face token at https://huggingface.co/settings/tokens?new_token=true
Register the token (only once) on your machine
import huggingface_hub

huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
Load the model
import edsnlp

nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))
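Once entities are detected, a common next step is to mask them in the text. The following is a minimal sketch of that idea in plain Python, not part of EDS-Pseudo itself: the `redact` helper is hypothetical, and the spans are written by hand to stand in for `(ent.start_char, ent.end_char, ent.label_)` triples collected from `doc.ents`. They are not actual model output.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each labelled character span with a <LABEL> placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already replaced
            continue
        out.append(text[cursor:start])
        out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Hand-written spans (assumption), standing in for
# [(e.start_char, e.end_char, e.label_) for e in doc.ents]
text = "En 2015, M. Myriel était évêque de Digne."
spans = [(3, 7, "DATE"), (12, 18, "NOM")]
redacted = redact(text, spans)
print(redacted)
```

Spans are processed in order of their start offset so that the output can be assembled in a single left-to-right pass over the text.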
To apply the model on many documents using one or more GPUs, refer to the documentation of edsnlp.
If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:
git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo
Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.
poetry install
The rule-based components can also be used on their own, without a trained model:
import edsnlp

nlp = edsnlp.blank("eds")

# Some text cleaning
nlp.add_pipe("eds.normalizer")

# Various simple rules
nlp.add_pipe(
    "eds_pseudo.simple_rules",
    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)

# Address detection
nlp.add_pipe("eds_pseudo.addresses")

# Date detection
nlp.add_pipe("eds_pseudo.dates")

# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")

# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_)

# 2015 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# 2006 DATE
Before training a model, you should update the `configs/config.cfg` and `pyproject.toml` files to fit your needs.
Put your data in the `data/dataset` folder (or edit the paths in the `configs/config.cfg` file to point to `data/gen_dataset/train.jsonl`).
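Before launching a training run, it can help to confirm that the data file parses at all. The sketch below is a hypothetical helper, not a script shipped with the repository; it assumes only that a `.jsonl` file holds one JSON object per line and makes no assumption about the fields EDS-Pseudo expects. The sample records are arbitrary placeholders.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


def check_jsonl(path: str) -> Tuple[int, List[int]]:
    """Return (number of valid records, line numbers that failed to parse)."""
    n_valid = 0
    bad_lines = []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                n_valid += 1
            except json.JSONDecodeError:
                bad_lines.append(lineno)
    return n_valid, bad_lines


# Demo on a throwaway file with placeholder records (not the project's schema)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"a": 1}\n\nnot json\n{"b": 2}\n', encoding="utf-8")
    result = check_jsonl(str(sample))
print(result)  # (2, [3])
```

In practice you would call `check_jsonl` on the training file referenced in `configs/config.cfg`.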
Then, run the training script
python scripts/train.py --config configs/config.cfg --seed 43
This will train a model and save it in `artifacts/model-last`. You can evaluate it on the test set (defaults to `data/dataset/test.jsonl`) with:
python scripts/evaluate.py --config configs/config.cfg
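To make the idea of entity-level evaluation concrete, here is a minimal sketch of exact-match precision, recall and F1 over labelled spans. It is an illustration only: `span_prf` is a hypothetical helper, the spans are made up, and the metrics reported by `scripts/evaluate.py` may be defined differently.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over labelled spans."""
    tp = len(gold & pred)  # spans with identical offsets and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up annotations (assumption): one span has the right offsets
# but the wrong label, so it counts as both a miss and a false alarm.
gold = {(3, 7, "DATE"), (12, 18, "NOM"), (35, 40, "VILLE")}
pred = {(3, 7, "DATE"), (12, 18, "PRENOM"), (35, 40, "VILLE")}
scores = span_prf(gold, pred)
print(scores)
```

Under this strict definition, a span with correct boundaries and an incorrect label earns no credit, which is one reason NOM/PRENOM confusions can weigh heavily on per-label scores.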
To package it, run:
python scripts/package.py
This will create a `dist/eds-pseudo-aphp-***.whl` file that you can install with `pip install dist/eds-pseudo-aphp-***`.
You can use it in your code:
import edsnlp

# Either from the model path directly
nlp = edsnlp.load("artifacts/model-last")

# Or from the wheel file
import eds_pseudo_aphp

nlp = eds_pseudo_aphp.load()
Visit the documentation for more information!
Please find our publication at the following link: https://doi.org/mkfv.
If you use EDS-Pseudo, please cite us as below:
@article{eds_pseudo,
  title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
  author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
  journal={Methods of Information in Medicine},
  year={2024},
  publisher={Georg Thieme Verlag KG}
}
We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.