# How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
This is the official implementation of our ACL SRW 2023 paper "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese". To reproduce our results, please follow the instructions below.
## Requirements

- Python >= 3.9
- PyTorch 1.8.1
- Transformers 4.24.0.dev0
## Installation

```bash
pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
cd ..

pip install -r requirements.txt
```
Here, we install the required packages under `${HOME}/usr`, but you can choose your preferred location by modifying `--prefix`.
### MeCab

#### Model
```bash
git clone https://github.com/taku910/mecab.git
cd mecab/mecab
./configure --prefix=${HOME}/usr --with-charset=UTF8
make
make install
cd ../..
```
#### Dictionary
```bash
wget "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM" -O mecab-ipadic-2.7.0-20070801.tar.gz
tar xvzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-mecab-config=$HOME/usr/bin/mecab-config --with-charset=UTF8 --prefix=$HOME/usr
make
make install
cd ..
```
### Juman++

```bash
wget "https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz"
tar xvJf jumanpp-2.0.0-rc3.tar.xz
cd jumanpp-2.0.0-rc3
mkdir build && cd build
curl -LO https://github.com/catchorg/Catch2/releases/download/v2.13.8/catch.hpp
mv catch.hpp ../libs/
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/usr
make
make install
echo 'export PATH=$PATH:$HOME/usr' >> ~/.bashrc
echo 'export PATH=$PATH:$HOME/usr/bin' >> ~/.bashrc
cd ..
```
### Sudachi

```bash
pip install sudachipy
pip install sudachidict_core
```
### Vaporetto

See https://github.com/daac-tools/vaporetto for more details.

```bash
cd data/dict
wget https://github.com/daac-tools/vaporetto/releases/download/v0.5.0/bccwj-suw+unidic+tag.tar.xz
tar xf ./bccwj-suw+unidic+tag.tar.xz
cd ../..
```
The data preprocessing, tokenizer training, and pretraining steps each have their own instructions:

- Preprocessing for tokenizer training: please see `preprocessing_for_tokenizers`.
- Tokenizer training: please see `tokenizer`.
- Preprocessing for pretraining: please see `preprocessing_for_pretraining`.
- Pretraining: please see `pretraining`.
## Fine-tuning

First, please clone the JGLUE repository and download the JGLUE dataset under `./data`, following https://github.com/yahoojapan/JGLUE.
Then follow the per-task instructions in the corresponding directories:

- MARC-ja: please see `marc-ja`.
- JSTS: please see `jsts`.
- JNLI: please see `jnli`.
- JSQuAD: please see `jsquad`.
- JCommonsenseQA: please see `jcommonsenseqa`.
- NER: please see `ner`.
- Dependency parsing: please see `dependency_parsing`.
## Pretrained Models and Tokenizers

The pretrained weights are available on the Hugging Face Hub. The trained dictionary files are available from this repository.
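A model checkpoint can be loaded with the standard `transformers` API; the sketch below uses a hypothetical repository ID, so substitute the actual name of the checkpoint published on the Hub:

```python
from transformers import AutoModelForMaskedLM

# The repository ID below is a placeholder; replace it with the actual
# checkpoint name published on the Hugging Face Hub.
model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/<checkpoint-name>")
```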
Because we use customised tokenizers, `AutoTokenizer.from_pretrained()` cannot be used to load a dictionary file. To load the file and construct a tokenizer, please use the following script: you must call `build_tokenizer()` to generate a tokenizer.
```python
from typing import Optional

from tokenizers import Tokenizer
from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast
from pyknp import Juman
from MeCab import Tagger
from sudachipy import tokenizer
from sudachipy import dictionary
import vaporetto
import mojimoji
import traceback
import textspan


class JumanPreTokenizer:
    def __init__(self):
        self.juman = Juman("jumanpp", multithreading=True)

    def tokenize(self, sequence: str) -> list[str]:
        text = mojimoji.han_to_zen(sequence).rstrip()
        try:
            result = self.juman.analysis(text)
        except:
            traceback.print_exc()
            text = ""
            result = self.juman.analysis(text)
        return [mrph.midasi for mrph in result.mrph_list()]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class MecabPreTokenizer:
    def __init__(self, mecab_dict_path: Optional[str] = None):
        mecab_option = (f"-Owakati -d {mecab_dict_path}"
                        if mecab_dict_path is not None else "-Owakati")
        self.mecab = Tagger(mecab_option)

    def tokenize(self, sequence: str) -> list[str]:
        return self.mecab.parse(sequence).strip().split(" ")

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class SudachiPreTokenizer:
    def __init__(self, mecab_dict_path: Optional[str] = None):
        self.sudachi = dictionary.Dictionary().create()

    def tokenize(self, sequence: str) -> list[str]:
        return [token.surface() for token in self.sudachi.tokenize(sequence)]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class VaporettoPreTokenizer:
    def __init__(self, unidic_path: str):
        with open(unidic_path, 'rb') as fp:
            model = fp.read()
        self.tokenizer = vaporetto.Vaporetto(model, predict_tags=False)

    def tokenize(self, sequence: str) -> list[str]:
        tokens = self.tokenizer.tokenize(sequence)
        return [token.surface() for token in tokens]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


def build_tokenizer(dict_path: str,
                    pretokenizer_type: str = None,
                    vaporetto_model_path: str = None) -> PreTrainedTokenizerFast:
    # load a tokenizer
    tokenizer = Tokenizer.from_file(dict_path)
    # load a pre-tokenizer
    if pretokenizer_type == 'mecab':
        pre_tokenizer = MecabPreTokenizer()
    elif pretokenizer_type == 'jumanpp':
        pre_tokenizer = JumanPreTokenizer()
    elif pretokenizer_type == 'vaporetto':
        pre_tokenizer = VaporettoPreTokenizer(vaporetto_model_path)
    elif pretokenizer_type == 'sudachi':
        pre_tokenizer = SudachiPreTokenizer()
    elif pretokenizer_type == 'nothing':
        pre_tokenizer = None
    else:
        raise NotImplementedError()
    tokenizer.post_processor = BertProcessing(
        cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
        sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
    )
    # convert to PreTrainedTokenizerFast
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token='[UNK]',
        cls_token='[CLS]',
        sep_token='[SEP]',
        pad_token='[PAD]',
        mask_token='[MASK]'
    )
    # set a pre-tokenizer
    if pre_tokenizer is not None:
        tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(pre_tokenizer)
    return tokenizer
```
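As a minimal usage sketch, the dictionary path below is a placeholder for one of the trained dictionary files distributed with this repository:

```python
# Assumed arguments for illustration only: point dict_path at one of the trained
# dictionary files from this repository and pick the matching pre-tokenizer type
# (mecab, jumanpp, sudachi, vaporetto, or nothing).
tokenizer = build_tokenizer(
    dict_path="path/to/trained_dictionary.json",
    pretokenizer_type="mecab",
)

# Tokenize a Japanese sentence and inspect the resulting subwords.
encoding = tokenizer("日本語のトークナイザを比較します。")
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```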
## Citation

```bibtex
@inproceedings{fujii-etal-2023-how,
    title = "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese",
    author = "Takuro Fujii and Koki Shibata and Atsuki Yamaguchi and Terufumi Morishita and Yasuhiro Sogawa",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}
```
## License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License unless otherwise specified.