hitachi-nlp/compare-ja-tokenizer

How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese - ACL SRW 2023

This is the official implementation of the paper titled "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese". To reproduce our results, please follow the instructions below.

1. Requirements

  • Python >= 3.9
  • PyTorch 1.8.1
  • Transformers 4.24.0.dev0

2. Installation

2.1 PyTorch

pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

2.2 Transformers

git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
cd ..

2.3 Other Python packages

pip install -r requirements.txt

2.4 Japanese Morphological Analyzers

Here, we install the required packages under ${HOME}/usr, but you can choose your preferred location by modifying --prefix.

2.4.1 MeCab

  • Model

    git clone https://github.com/taku910/mecab.git
    cd mecab/mecab
    ./configure --prefix=${HOME}/usr --with-charset=UTF8
    make
    make install
    cd ../..
  • Dictionary

    wget"https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM" -O mecab-ipadic-2.70-20070801.tar.gztar xvzf mecab-ipadic-2.7.0-20070801.tar.gzcd mecab-ipadic-2.7.0-20070801./configure --with-mecab-config=$HOME/usr/bin/mecab-config --with-charset=UTF8 --prefix=$HOME/usrmakemake installcd ..

2.4.2 Juman++

wget"https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz"tar xvJf jumanpp-2.0.0-rc3.tar.xzcd jumanpp-2.0.0-rc3mkdir build&&cd buildcurl -LO https://github.com/catchorg/Catch2/releases/download/v2.13.8/catch.hppmv catch.hpp ../libs/cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/usrmakemake installecho'export PATH=$PATH:$HOME/usr'>>~/.bashrcecho'export PATH=$PATH:$HOME/usr/bin'>>~/.bashrccd ..

2.4.3 Sudachi

pip install sudachipy
pip install sudachidict_core
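If you want to verify the installation (this check is not part of the original instructions), SudachiPy and the core dictionary can be exercised directly from Python:

# Optional sanity check (not in the original instructions): confirm that
# SudachiPy and sudachidict_core can tokenize a short Japanese sentence.
from sudachipy import dictionary

sudachi = dictionary.Dictionary().create()
print([m.surface() for m in sudachi.tokenize("日本語の形態素解析の例です。")])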

2.4.4 Vaporetto

See https://github.com/daac-tools/vaporetto for more details.

cd data/dict
wget https://github.com/daac-tools/vaporetto/releases/download/v0.5.0/bccwj-suw+unidic+tag.tar.xz
tar xf ./bccwj-suw+unidic+tag.tar.xz
cd ../..

3. Preprocessing data for tokenizer training

Please see preprocessing_for_tokenizers.

4. Training tokenizers

Please see tokenizer.

5. Preprocessing data for pretraining

Please see preprocessing_for_pretraining.

6. Pretraining

Please see pretraining.

7. Fine-tuning

7.1 JGLUE

First, please clone the JGLUE repository and download the JGLUE dataset under ./data, following https://github.com/yahoojapan/JGLUE.

7.1.1 MARC-ja

Please see marc-ja.

7.1.2 JSTS

Please see jsts.

7.1.3 JNLI

Please see jnli.

7.1.4 JSQuAD

Please see jsquad.

7.1.5 JCommonsenseQA

Please see jcommonsenseqa.

7.2 NER

Please see ner.

7.3 UD

Please see dependency_parsing.

Pretrained Weights

The pretrained weights are available on the Hugging Face Hub.

|           | BPE                              | Unigram                              | WordPiece                              |
|-----------|----------------------------------|--------------------------------------|----------------------------------------|
| MeCab     | bert-base-japanese_mecab-bpe     | bert-base-japanese_mecab-unigram     | bert-base-japanese_mecab-wordpiece     |
| Juman++   | bert-base-japanese_jumanpp-bpe   | bert-base-japanese_jumanpp-unigram   | bert-base-japanese_jumanpp-wordpiece   |
| Sudachi   | bert-base-japanese_sudachi-bpe   | bert-base-japanese_sudachi-unigram   | bert-base-japanese_sudachi-wordpiece   |
| Vaporetto | bert-base-japanese_vaporetto-bpe | bert-base-japanese_vaporetto-unigram | bert-base-japanese_vaporetto-wordpiece |
| Nothing   | bert-base-japanese_nothing-bpe   | bert-base-japanese_nothing-unigram   | bert-base-japanese_nothing-wordpiece   |
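The model weights themselves can be loaded with the standard Transformers API; only the tokenizer needs the custom loading procedure described in "How to load our dictionary files" below. A minimal sketch, assuming the weights are hosted under the hitachi-nlp organisation with the names listed above:

# Minimal sketch: load one of the pretrained models from the Hugging Face Hub.
# The repository id below assumes the naming "hitachi-nlp/<model name>" for the
# entries in the table above; the matching tokenizer must be built separately
# with build_tokenizer() (see "How to load our dictionary files").
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base-japanese_jumanpp-bpe")
print(model.config.vocab_size)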

Dictionary files

The trained dictionary files are available from this repository.

|           | BPE                | Unigram                | WordPiece                |
|-----------|--------------------|------------------------|--------------------------|
| MeCab     | mecab_bpe.json     | mecab_unigram.json     | mecab_wordpiece.json     |
| Juman++   | jumanpp_bpe.json   | jumanpp_unigram.json   | jumanpp_wordpiece.json   |
| Sudachi   | sudachi_bpe.json   | sudachi_unigram.json   | sudachi_wordpiece.json   |
| Vaporetto | vaporetto_bpe.json | vaporetto_unigram.json | vaporetto_wordpiece.json |
| Nothing   | nothing_bpe.json   | nothing_unigram.json   | nothing_wordpiece.json   |

How to load our dictionary files

Because we use customised tokenizers, AutoTokenizer.from_pretrained() cannot be used to load a dictionary file.
To load the file and construct a tokenizer, please use the following script. You must call build_tokenizer() to generate a tokenizer.

from typing import Optional
from tokenizers import Tokenizer
from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast
from pyknp import Juman
from MeCab import Tagger
from sudachipy import tokenizer
from sudachipy import dictionary
import vaporetto
import mojimoji
import traceback
import textspan


class JumanPreTokenizer:
    def __init__(self):
        self.juman = Juman("jumanpp", multithreading=True)

    def tokenize(self, sequence: str) -> list[str]:
        text = mojimoji.han_to_zen(sequence).rstrip()
        try:
            result = self.juman.analysis(text)
        except:
            traceback.print_exc()
            text = ""
            result = self.juman.analysis(text)
        return [mrph.midasi for mrph in result.mrph_list()]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class MecabPreTokenizer:
    def __init__(self, mecab_dict_path: Optional[str] = None):
        mecab_option = (f"-Owakati -d {mecab_dict_path}"
                        if mecab_dict_path is not None
                        else "-Owakati")
        self.mecab = Tagger(mecab_option)

    def tokenize(self, sequence: str) -> list[str]:
        return self.mecab.parse(sequence).strip().split(" ")

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class SudachiPreTokenizer:
    def __init__(self, mecab_dict_path: Optional[str] = None):
        self.sudachi = dictionary.Dictionary().create()

    def tokenize(self, sequence: str) -> list[str]:
        return [token.surface() for token in self.sudachi.tokenize(sequence)]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class VaporettoPreTokenizer:
    def __init__(self, unidic_path: str):
        with open(unidic_path, 'rb') as fp:
            model = fp.read()
        self.tokenizer = vaporetto.Vaporetto(model, predict_tags=False)

    def tokenize(self, sequence: str) -> list[str]:
        tokens = self.tokenizer.tokenize(sequence)
        return [token.surface() for token in tokens]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


def build_tokenizer(dict_path: str,
                    pretokenizer_type: str = None,
                    vaporetto_model_path: str = None) -> PreTrainedTokenizerFast:
    # load a tokenizer
    tokenizer = Tokenizer.from_file(dict_path)

    # load a pre-tokenizer
    if pretokenizer_type == 'mecab':
        pre_tokenizer = MecabPreTokenizer()
    elif pretokenizer_type == 'jumanpp':
        pre_tokenizer = JumanPreTokenizer()
    elif pretokenizer_type == 'vaporetto':
        pre_tokenizer = VaporettoPreTokenizer(vaporetto_model_path)
    elif pretokenizer_type == 'sudachi':
        pre_tokenizer = SudachiPreTokenizer()
    elif pretokenizer_type == 'nothing':
        pre_tokenizer = None
    else:
        raise NotImplementedError()

    tokenizer.post_processor = BertProcessing(
        cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
        sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
    )

    # convert to PreTrainedTokenizerFast
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token='[UNK]',
        cls_token='[CLS]',
        sep_token='[SEP]',
        pad_token='[PAD]',
        mask_token='[MASK]'
    )

    # set a pre-tokenizer
    if pre_tokenizer is not None:
        tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(pre_tokenizer)

    return tokenizer
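As a minimal usage sketch, once mecab_bpe.json (from the "Dictionary files" table above) has been downloaded, a tokenizer can be built and applied as follows; the file path is a placeholder for wherever you saved the dictionary file:

# Minimal usage sketch: build a tokenizer from the MeCab + BPE dictionary file
# and encode a sentence. "mecab_bpe.json" is a placeholder path for the file
# downloaded from this repository.
tokenizer = build_tokenizer(
    dict_path="mecab_bpe.json",
    pretokenizer_type="mecab"
)
encoding = tokenizer("日本語のトークン化の例です。")
print(encoding.input_ids)
print(tokenizer.convert_ids_to_tokens(encoding.input_ids))

For the Vaporetto variants, additionally pass vaporetto_model_path pointing to the model downloaded in step 2.4.4; for the "Nothing" variants, use pretokenizer_type='nothing'.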

Citation

@inproceedings{fujii-etal-2023-how,
    title = "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese",
    author = "Takuro Fujii and Koki Shibata and Atsuki Yamaguchi and Terufumi Morishita and Yasuhiro Sogawa",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License unless otherwise specified.
