Trials of pre-trained BERT models for the medical domain in Japanese.
The models are designed to be adapted to the Japanese medical domain.
The medical corpora were scraped for academic use from Today's diagnosis and treatment: premium, a collection of 15 digital references for clinicians in Japanese published by IGAKU-SHOIN Ltd.
The general corpora were extracted from a Wikipedia dump file (jawiki-20190901) from https://dumps.wikimedia.org/jawiki/.
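For reference, a common way to turn such a dump into plain text is the wikiextractor tool; the sketch below shows that approach, which is not necessarily the authors' pipeline. The dump filename is the standard name for the jawiki-20190901 snapshot.

```python
# Sketch: extracting plain text from the jawiki dump with wikiextractor
# (pip install wikiextractor). Not necessarily the authors' pipeline.
import subprocess

subprocess.run(
    [
        "python", "-m", "wikiextractor.WikiExtractor",
        "jawiki-20190901-pages-articles.xml.bz2",  # standard dump filename
        "--output", "extracted",  # extracted text shards go here
        "--json",                 # one JSON object per article
    ],
    check=True,
)
```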
- medBERTjp - MeCab-IPAdic
  - pre-trained model following the MeCab-IPAdic-tokenized Japanese BERT model
  - Japanese tokenizer: MeCab + Byte Pair Encoding (BPE); see the tokenization sketch after this list
  - ipadic-py or a manual install of IPAdic is required
  - max_seq_length=128
- medBERTjp - Unidic-2.3.0
- medBERTjp - MeCab-IPAdic-NEologd-JMeDic
  - Japanese tokenizer: MeCab + BPE
  - installation of both mecab-ipadic-NEologd and J-MeDic (MANBYO_201907_Dic-utf8.dic) is required
  - max_seq_length=128
- medBERTjp - SentencePiece (Old: v0.1-sp)
  - Japanese tokenizer: SentencePiece, following the Sentencepiece Japanese BERT model
  - uses SentencePiece tokenization customized for the medical domain
  - max_seq_length=128
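As a rough illustration of the first tokenization stage in the variants above, the sketch below segments a sentence with MeCab (via fugashi and ipadic-py) and with SentencePiece. It is not the repository's exact pipeline (the BPE step after MeCab is omitted), and medbertjp-sp.model is a hypothetical filename for the released SentencePiece model; see tokenization_example.ipynb for the real setup.

```python
# Minimal sketch of the first-stage tokenizations named above; the BPE step
# that follows MeCab in the real pipeline is omitted here.
import fugashi        # Cython wrapper for MeCab
import ipadic         # pip-installable IPAdic (ipadic-py)
import sentencepiece as spm

text = "急性心筋梗塞の診断と治療"  # "diagnosis and treatment of acute myocardial infarction"

# MeCab word segmentation with IPAdic (the MeCab-IPAdic variants).
tagger = fugashi.GenericTagger(ipadic.MECAB_ARGS)
print([word.surface for word in tagger(text)])

# SentencePiece subword segmentation (the SentencePiece variant);
# "medbertjp-sp.model" is a hypothetical path to the released model file.
sp = spm.SentencePieceProcessor()
sp.load("medbertjp-sp.model")
print(sp.encode_as_pieces(text))
```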
For just using the models:
- Transformers (>=2.11.0)
- fugashi, a Cython wrapper for MeCab
- ipadic, unidic-py, mecab-ipadic-NEologd, and J-MeDic, as required by the chosen variant
- SentencePiece is installed automatically with Transformers.
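Once a released checkpoint has been downloaded and extracted, it can be loaded with the usual Transformers pattern. Below is a minimal sketch assuming a hypothetical local directory ./medbertjp; whether the stock BertJapaneseTokenizer matches a given variant's BPE vocabulary depends on the variant, so the notebooks referenced below are the authoritative tokenizer setup.

```python
# Minimal loading sketch; "./medbertjp" is a hypothetical local path to an
# extracted checkpoint, and the tokenizer must match the chosen variant
# (see tokenization_example.ipynb for the exact setup).
import torch
from transformers import BertJapaneseTokenizer, BertModel

tokenizer = BertJapaneseTokenizer.from_pretrained("./medbertjp")
model = BertModel.from_pretrained("./medbertjp")
model.eval()

inputs = tokenizer.encode_plus("急性心筋梗塞の診断と治療", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs[0].shape)  # (batch, seq_len, hidden): contextual token embeddings
```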
Please check the code examples in tokenization_example.ipynb, or try example_google_colab.ipynb on Google Colab.
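For a quick sanity check of a downloaded checkpoint, masked-token prediction works the same way as with any BERT in Transformers. A hedged sketch, again with the hypothetical ./medbertjp path and the same tokenizer caveat as above:

```python
# Hedged sketch: probe the checkpoint with masked-token prediction.
import torch
from transformers import BertForMaskedLM, BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("./medbertjp")  # hypothetical path
model = BertForMaskedLM.from_pretrained("./medbertjp")
model.eval()

inputs = tokenizer.encode_plus("急性[MASK]梗塞の診断と治療", return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()

with torch.no_grad():
    logits = model(**inputs)[0]
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # most likely fillers for [MASK]
```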
This work was supported by the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (funding agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).
The pretrained models are distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
They are freely available for academic purposes or individual research, but restricted for commercial use.
The code in this repository is licensed under the Apache License, Version 2.0.