
EDS-Pseudo is a hybrid model for detecting personally identifying entities in clinical reports


aphp/eds-pseudo


EDS-Pseudo

The EDS-Pseudo project aims to detect identifying entities in clinical documents, and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of edsnlp, and consists of a hybrid model (rule-based + deep learning) for which we provide rules (eds-pseudo/pipes) and a training recipe, train.py.

We also provide some fictitious templates (templates.txt) and a script to generate a synthetic dataset, generate_dataset.py.

The entities that are detected are listed below.

Label           Description
ADRESSE         Street address, e.g. 33 boulevard de Picpus
DATE            Any absolute date other than a birthdate
DATE_NAISSANCE  Birthdate
HOPITAL         Hospital name, e.g. Hôpital Rothschild
IPP             Internal AP-HP identifier for patients, displayed as a number
MAIL            Email address
NDA             Internal AP-HP identifier for visits, displayed as a number
NOM             Any last name (patients, doctors, third parties)
PRENOM          Any first name (patients, doctors, etc.)
SECU            Social security number
TEL             Any phone number
VILLE           Any city
ZIP             Any zip code

Downloading the public pre-trained model

The public pretrained model is available on the HuggingFace model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see generate_dataset.py). You can also test it directly on the demo.

  1. Install the latest version of edsnlp

    pip install "edsnlp[ml]" -U

  2. Get access to the model at AP-HP/eds-pseudo-public

  3. Create and copy a huggingface token: https://huggingface.co/settings/tokens?new_token=true

  4. Register the token (only once) on your machine

    import huggingface_hub

    huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)

  5. Load the model

    import edsnlp

    nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
    doc = nlp(
        "En 2015, M. Charles-François-Bienvenu "
        "Myriel était évêque de Digne. C’était un vieillard "
        "d’environ soixante-quinze ans ; il occupait le "
        "siège de Digne depuis 2006."
    )

    for ent in doc.ents:
        print(ent, ent.label_, str(ent._.date))

To apply the model on many documents using one or more GPUs, refer to the documentation of edsnlp.
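Whichever way the documents are processed, the detected entities usually end up flattened into one record per entity. The sketch below is not part of eds-pseudo: `iter_entity_records` and the `note_id` / `note_text` keys are hypothetical names chosen for illustration. It only assumes that `nlp(text)` returns a document whose `ents` expose the spaCy-style attributes `start_char`, `end_char`, `label_` and `text`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator


def iter_entity_records(
    nlp: Callable[[str], Any],
    notes: Iterable[Dict[str, str]],
) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per entity detected in each note.

    `notes` is an iterable of dicts with the (hypothetical) keys
    "note_id" and "note_text"; adapt them to your own storage format.
    """
    for note in notes:
        doc = nlp(note["note_text"])
        for ent in doc.ents:
            yield {
                "note_id": note["note_id"],
                "start": ent.start_char,  # character offset, inclusive
                "end": ent.end_char,  # character offset, exclusive
                "label": ent.label_,
                "text": ent.text,
            }
```

This loop runs documents one at a time, so it is mainly useful for small batches or for inspecting results; the edsnlp documentation remains the reference for parallel and GPU processing.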

Installation to reproduce

If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:

    git clone https://github.com/aphp/eds-pseudo.git
    cd eds-pseudo

Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.

poetry install

How to use without machine learning

    import edsnlp

    nlp = edsnlp.blank("eds")

    # Some text cleaning
    nlp.add_pipe("eds.normalizer")

    # Various simple rules
    nlp.add_pipe(
        "eds_pseudo.simple_rules",
        config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
    )

    # Address detection
    nlp.add_pipe("eds_pseudo.addresses")

    # Date detection
    nlp.add_pipe("eds_pseudo.dates")

    # Contextual rules (requires a dict of info about the patient)
    nlp.add_pipe("eds_pseudo.context")

    # Apply it to a text
    doc = nlp(
        "En 2015, M. Charles-François-Bienvenu "
        "Myriel était évêque de Digne. C’était un vieillard "
        "d’environ soixante-quinze ans ; il occupait le "
        "siège de Digne depuis 2006."
    )

    for ent in doc.ents:
        print(ent, ent.label_)

    # 2015 DATE
    # Charles-François-Bienvenu NOM
    # Myriel PRENOM
    # 2006 DATE
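Once entities are detected, a common next step is to replace them in the text. The following is a minimal sketch of that step, not something shipped with eds-pseudo: `mask_entities` is a hypothetical helper, and the `<LABEL>` placeholder format is an arbitrary choice. It works on plain `(start_char, end_char, label)` tuples, which could be built with `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def mask_entities(text: str, spans: Iterable[Span]) -> str:
    """Replace each span of `text` with a "<LABEL>" placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already masked
            continue
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


# Hand-written spans, for illustration only
text = "En 2015, M. Myriel était évêque de Digne."
spans = [(3, 7, "DATE"), (12, 18, "NOM")]
print(mask_entities(text, spans))
# En <DATE>, M. <NOM> était évêque de Digne.
```

Rebuilding the string from left to right keeps the original character offsets valid while iterating, which would not hold if the text were edited in place.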

How to train

Before training a model, you should update the configs/config.cfg and pyproject.toml files to fit your needs.

Put your data in the data/dataset folder (or edit the paths in the configs/config.cfg file to point to data/gen_dataset/train.jsonl).
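Before launching a training run, it can help to check that the annotations only use the labels listed earlier in this README. The sketch below is an illustration under a loud assumption: it supposes that each line of the `.jsonl` file is a JSON object with an `entities` list whose items carry a `label` key. That layout is not confirmed here, so adjust the field names to the actual files. `count_labels` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable

# Labels taken from the entity table of this README
KNOWN_LABELS = {
    "ADRESSE", "DATE", "DATE_NAISSANCE", "HOPITAL", "IPP", "MAIL", "NDA",
    "NOM", "PRENOM", "SECU", "TEL", "VILLE", "ZIP",
}


def count_labels(lines: Iterable[str]) -> Counter:
    """Count entity labels in JSONL lines, failing on unknown labels.

    Assumes (hypothetically) records shaped like
    {"entities": [{"label": "NOM", ...}, ...], ...}.
    """
    counts: Counter = Counter()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for ent in record.get("entities", []):
            label = ent["label"]
            if label not in KNOWN_LABELS:
                raise ValueError(f"line {line_no}: unknown label {label!r}")
            counts[label] += 1
    return counts
```

Since an open file iterates line by line, the helper can be called as `count_labels(open(path, encoding="utf-8"))` on whichever training file the config points to.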

Then, run the training script

python scripts/train.py --config configs/config.cfg --seed 43

This will train a model and save it in artifacts/model-last. You can evaluate it on the test set (defaults to data/dataset/test.jsonl) with:

python scripts/evaluate.py --config configs/config.cfg
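To make evaluation results easier to read, here is a minimal sketch of exact-match, span-level precision, recall and F1. It is not necessarily what scripts/evaluate.py computes; `span_prf` is a hypothetical helper, and the spans in the example are invented for illustration.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 between two sets of spans.

    A prediction counts as correct only if its offsets and its label
    both match a gold span exactly.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: one exact match, one wrong label, one missed entity
gold = {(3, 7, "DATE"), (12, 18, "NOM"), (35, 40, "VILLE")}
pred = {(3, 7, "DATE"), (12, 18, "PRENOM")}
scores = span_prf(gold, pred)
print(scores)
```

Exact matching is strict: a span with the right offsets but the wrong label (NOM vs PRENOM above) counts as both a false positive and a false negative.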

To package it, run:

python scripts/package.py

This will create a dist/eds-pseudo-aphp-***.whl file that you can install with pip install dist/eds-pseudo-aphp-***.

You can use it in your code:

    import edsnlp

    # Either from the model path directly
    nlp = edsnlp.load("artifacts/model-last")

    # Or from the wheel file
    import eds_pseudo_aphp

    nlp = eds_pseudo_aphp.load()

Documentation

Visit the documentation for more information!

Publication

Please find our publication at the following link: https://doi.org/mkfv.

If you use EDS-Pseudo, please cite us as below:

    @article{eds_pseudo,
      title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
      author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
      journal={Methods of Information in Medicine},
      year={2024},
      publisher={Georg Thieme Verlag KG}
    }

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.

