# How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
This is the official implementation of our ACL SRW 2023 paper "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese". To reproduce our results, please follow the instructions below.
## Requirements

- Python >= 3.9
- PyTorch 1.8.1
- Transformers 4.24.0.dev0
## Installation

```bash
pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
cd ..

pip install -r requirements.txt
```
Here, we install the required packages under `${HOME}/usr`, but you can choose your preferred location by modifying `--prefix`.
### MeCab

#### Model
```bash
git clone https://github.com/taku910/mecab.git
cd mecab/mecab
./configure --prefix=${HOME}/usr --with-charset=UTF8
make
make install
cd ../..
```
#### Dictionary
```bash
wget "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM" -O mecab-ipadic-2.7.0-20070801.tar.gz
tar xvzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-mecab-config=$HOME/usr/bin/mecab-config --with-charset=UTF8 --prefix=$HOME/usr
make
make install
cd ..
```
### Juman++

```bash
wget "https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz"
tar xvJf jumanpp-2.0.0-rc3.tar.xz
cd jumanpp-2.0.0-rc3
mkdir build && cd build
curl -LO https://github.com/catchorg/Catch2/releases/download/v2.13.8/catch.hpp
mv catch.hpp ../libs/
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/usr
make
make install
echo 'export PATH=$PATH:$HOME/usr' >> ~/.bashrc
echo 'export PATH=$PATH:$HOME/usr/bin' >> ~/.bashrc
cd ..
```
### Sudachi

```bash
pip install sudachipy
pip install sudachidict_core
```
### Vaporetto

See https://github.com/daac-tools/vaporetto for more details.

```bash
cd data/dict
wget https://github.com/daac-tools/vaporetto/releases/download/v0.5.0/bccwj-suw+unidic+tag.tar.xz
tar xf ./bccwj-suw+unidic+tag.tar.xz
cd ../..
```
The data preprocessing, tokenizer training, and pretraining steps each have their own instructions:

- Preprocessing for tokenizer training: please see `preprocessing_for_tokenizers`.
- Tokenizer training: please see `tokenizer`.
- Preprocessing for pretraining: please see `preprocessing_for_pretraining`.
- Pretraining: please see `pretraining`.
## Fine-tuning

First, please clone the JGLUE repository and download the JGLUE dataset under `./data`, following https://github.com/yahoojapan/JGLUE.
Then follow the per-task instructions in the corresponding directories:

- MARC-ja: please see `marc-ja`.
- JSTS: please see `jsts`.
- JNLI: please see `jnli`.
- JSQuAD: please see `jsquad`.
- JCommonsenseQA: please see `jcommonsenseqa`.
- NER: please see `ner`.
- Dependency parsing: please see `dependency_parsing`.
## Pretrained Models and Tokenizers

The pretrained weights are available on the Hugging Face Hub. The trained dictionary files are available from this repository.
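A model checkpoint can be loaded with the standard `transformers` API; the sketch below uses a hypothetical repository ID, so substitute the actual name of the checkpoint published on the Hub:

```python
from transformers import AutoModelForMaskedLM

# The repository ID below is a placeholder; replace it with the actual
# checkpoint name published on the Hugging Face Hub.
model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/<checkpoint-name>")
```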
Because we use customised tokenizers, `AutoTokenizer.from_pretrained()` cannot be used to load a dictionary file. To load the file and construct a tokenizer, please use the following script: you must call `build_tokenizer()` to generate a tokenizer.
```python
from typing import Optional

from tokenizers import Tokenizer
from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast
from pyknp import Juman
from MeCab import Tagger
from sudachipy import tokenizer
from sudachipy import dictionary
import vaporetto
import mojimoji
import traceback
import textspan


class JumanPreTokenizer:
    def __init__(self):
        self.juman = Juman("jumanpp", multithreading=True)

    def tokenize(self, sequence: str) -> list[str]:
        text = mojimoji.han_to_zen(sequence).rstrip()
        try:
            result = self.juman.analysis(text)
        except:
            traceback.print_exc()
            text = ""
            result = self.juman.analysis(text)
        return [mrph.midasi for mrph in result.mrph_list()]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class MecabPreTokenizer:
    def __init__(self, mecab_dict_path: Optional[str] = None):
        mecab_option = (f"-Owakati -d {mecab_dict_path}"
                        if mecab_dict_path is not None else "-Owakati")
        self.mecab = Tagger(mecab_option)

    def tokenize(self, sequence: str) -> list[str]:
        return self.mecab.parse(sequence).strip().split(" ")

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class SudachiPreTokenizer:
    def __init__(self, mecab_dict_path: Optional[str] = None):
        self.sudachi = dictionary.Dictionary().create()

    def tokenize(self, sequence: str) -> list[str]:
        return [token.surface() for token in self.sudachi.tokenize(sequence)]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


class VaporettoPreTokenizer:
    def __init__(self, unidic_path: str):
        with open(unidic_path, 'rb') as fp:
            model = fp.read()
        self.tokenizer = vaporetto.Vaporetto(model, predict_tags=False)

    def tokenize(self, sequence: str) -> list[str]:
        tokens = self.tokenizer.tokenize(sequence)
        return [token.surface() for token in tokens]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


def build_tokenizer(dict_path: str,
                    pretokenizer_type: str = None,
                    vaporetto_model_path: str = None) -> PreTrainedTokenizerFast:
    # load a tokenizer
    tokenizer = Tokenizer.from_file(dict_path)
    # load a pre-tokenizer
    if pretokenizer_type == 'mecab':
        pre_tokenizer = MecabPreTokenizer()
    elif pretokenizer_type == 'jumanpp':
        pre_tokenizer = JumanPreTokenizer()
    elif pretokenizer_type == 'vaporetto':
        pre_tokenizer = VaporettoPreTokenizer(vaporetto_model_path)
    elif pretokenizer_type == 'sudachi':
        pre_tokenizer = SudachiPreTokenizer()
    elif pretokenizer_type == 'nothing':
        pre_tokenizer = None
    else:
        raise NotImplementedError()
    tokenizer.post_processor = BertProcessing(
        cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
        sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
    )
    # convert to PreTrainedTokenizerFast
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token='[UNK]',
        cls_token='[CLS]',
        sep_token='[SEP]',
        pad_token='[PAD]',
        mask_token='[MASK]'
    )
    # set a pre-tokenizer
    if pre_tokenizer is not None:
        tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(pre_tokenizer)
    return tokenizer
```
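As a minimal usage sketch, the dictionary path below is a placeholder for one of the trained dictionary files distributed with this repository:

```python
# Assumed arguments for illustration only: point dict_path at one of the trained
# dictionary files from this repository and pick the matching pre-tokenizer type
# (mecab, jumanpp, sudachi, vaporetto, or nothing).
tokenizer = build_tokenizer(
    dict_path="path/to/trained_dictionary.json",
    pretokenizer_type="mecab",
)

# Tokenize a Japanese sentence and inspect the resulting subwords.
encoding = tokenizer("日本語のトークナイザを比較します。")
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```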
## Citation

```bibtex
@inproceedings{fujii-etal-2023-how,
    title = "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese",
    author = "Takuro Fujii and Koki Shibata and Atsuki Yamaguchi and Terufumi Morishita and Yasuhiro Sogawa",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}
```
## License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License unless otherwise specified.