- Notifications
You must be signed in to change notification settings - Fork3
racerandom/JaMIE
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
In the field of Japanese medical information extraction, few analyzing tools are available and relation extraction is still an under-explored topic. In this paper, we first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We design a system with three components for jointly recognizing medical entities, classifying entity modalities, and extracting relations.
git clonehttps://github.com/racerandom/JaMIE.git
cd JaMIE \
pip install -r requirements.txt
mecab (juman-dict) by default
jumanpp
NICT-BERT (NICT_BERT-base_JapaneseWikipedia_32K_BPE)
The Train/Test phrases require all train, dev, test file converted to CONLL-style before Train/Test.You also need to convert raw text to CONLL-style for prediction, but please make sure the file extension is .xml.
python data_converter.py \
--mode xml2conll \
--xml $XML_FILES_DIR \
--conll $OUTPUT_CONLL_DIR \
--cv_num 0 \ # 0 presents to generate single conll file, 5 presents 5-fold cross-validation
--doc_level \ # generate document-level ([SEP] denotes sentence boundaries) or sentence-level conll files
--segmenter mecab \ # please use mecab and NICT bert currently
--bert_dir $PRETRAINED_BERT # Pre-trained BERT or Trained model
CUDA_VISIBLE_DEVICES=$GPU_ID python clinical_joint.py \
--pretrained_model $PRETRAINED_BERT \ # downloaded pre-trained NICT BERT
--train_file $TRAIN_FILE \
--dev_file $DEV_FILE \
--dev_output $DEV_OUT \
--saved_model $MODEL_DIR_TO_SAVE \ # the place to save the model
--enc_lr 2e-5 \
--batch_size 4 \ # depends on your GPU memory
--warmup_epoch 2 \
--num_epoch 20 \
--do_train \
--fp16 (apex required)
We share the models trained on radiography interpretation reports of Lung Cancer (LC) and general medical reports of Idiopathic Pulmonary Fibrosis (IPF):
- The trained model of radiography interpretation reports of Lung Cancer (肺がん読影所見)
- The trained model of case reports of Idiopathic Pulmonary Fibrosis (IPF診療録)
You can either train a new model on your own training data or use our shared model for test.
CUDA_VISIBLE_DEVICES=$GPU_ID python clinical_joint.py \
--saved_model $SAVED_MODEL \ # Where the trained model placed
--test_file $TEST_FILE \
--test_output $TEST_OUT \
--batch_size 4
python data_converter.py \
--mode conll2xml \
--xml $XML_OUT_DIR \
--conll $TEST_OUT
We offer the links of both English and Japanese annotation guidelines.
Recognition accuracy can be improved by leverage more training data or more robust pre-trained models. We are working on making the code compatible withJapanese DeBERTa
If you have any questions related to the code or papers, please feel free to send a mail to Fei Cheng:feicheng@i.kyoto-u.ac.jp orkevin.cheng.fei@gmail.com
If you use our code in your research, please cite the following papers:
@inproceedings{cheng-etal-2022-jamie,title={JaMIE: A Pipeline Japanese Medical Information Extraction System with Novel Relation Annotation},author={Fei Cheng, Shuntaro Yada, Ribeka Tanaka, Eiji Aramaki, Sadao Kurohashi},booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022)},year={2022}}@inproceedings{cheng2021jamie,title={JaMIE: A Pipeline Japanese Medical Information Extraction System},author={Fei Cheng, Shuntaro Yada, Ribeka Tanaka, Eiji Aramaki, Sadao Kurohashi},booktitle={arXiv},year={2021}}@inproceedings{yada-etal-2020-towards,title={Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases},author={Shuntaro Yada, Ayami Joh, Ribeka Tanaka, Fei Cheng, Eiji Aramaki, Sadao Kurohashi},booktitle={Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020)},year={2020}}@inproceedings{cheng-etal-2020-dynamically,title={Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning},author={Fei Cheng, Masayuki Asahara, Ichiro Kobayashi, Sadao Kurohashi},booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Findings Volume},year={2020}}
About
A Japanese Medical Information Extraction Toolkit
Resources
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
Packages0
Uh oh!
There was an error while loading.Please reload this page.