informatix-inc/bert
This repository provides snippets for using RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25 GB in total. The released model is built on top of the RoBERTa implementation from HuggingFace.
We used Juman++ (version 2.0.0-rc3) as a morphological analyzer, and also applied WordPiece embedding (subword-nmt) to split each word into word pieces.
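The preprocessing therefore has two stages: morphological analysis with Juman++, followed by subword splitting with the released BPE rules. Below is a minimal sketch of that flow, assuming the `pyknp` wrapper for Juman++ and the `subword_nmt` package; the repository's own `TextProcessor` wraps the equivalent steps, and whether `bpe.txt` can be fed directly to `subword_nmt` as shown is an assumption.

```python
# Minimal preprocessing sketch (assumptions: pyknp + subword-nmt;
# the repository's TextProcessor encapsulates the real pipeline).
from pyknp import Juman
from subword_nmt.apply_bpe import BPE

jumanpp = Juman()  # requires the Juman++ binary to be installed and on PATH

text = "コンピュータ技術およびその関連技術に対する研究開発に努める。"

# 1. Morphological analysis: split the raw sentence into words.
words = [mrph.midasi for mrph in jumanpp.analysis(text).mrph_list()]

# 2. Subword splitting: apply the BPE merge rules to the word sequence.
with open("bpe.txt", encoding="utf-8") as f:  # bpe.txt: released rules file
    bpe = BPE(f)
word_pieces = bpe.segment(" ".join(words)).split()

print(word_pieces)
```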
The configuration of our model is as follows. Please refer to the HuggingFace documentation for the definition of each parameter.
{ "attention_probs_dropout_prob": 0.1, "bos_token_id": 3, "classifier_dropout": null, "eos_token_id": 0, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 515, "max_seq_length": 512, "model_type": "roberta", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 2, "position_embedding_type": "absolute", "transformers_version": "4.16.2", "type_vocab_size": 2, "use_cache": true, "vocab_size": 32005}
We trained our model in the same way as RoBERTa, optimizing it with the masked language modeling (MLM) objective. The MLM accuracy is 72.0%.
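For reference, MLM accuracy of the kind reported above is simply the fraction of masked positions where the model's arg-max prediction recovers the original token. The helper below illustrates that metric; it is a sketch of the evaluation idea, not the training code used for this model, and the `ignore_index` convention is an assumption borrowed from common HuggingFace practice.

```python
import torch

def mlm_accuracy(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> float:
    """Fraction of masked positions where the arg-max prediction equals the
    original token. `labels` holds the original ids at masked positions and
    `ignore_index` everywhere else."""
    preds = logits.argmax(dim=-1)        # (batch, seq_len)
    masked = labels.ne(ignore_index)     # only masked positions are scored
    correct = (preds.eq(labels) & masked).sum()
    return (correct.float() / masked.sum().clamp(min=1)).item()
```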
- Install Juman++ following the instructions in the Juman++ repository.
- Install PyTorch following the instructions on the PyTorch installation page.
- Install all of the necessary Python packages.

  ```sh
  pip install -r requirements.txt
  ```
- Download weight files and configuration files.
| Files | Description |
|---|---|
| roberta.pth | Trained weights of RobertaModel from HuggingFace |
| linear_word.pth | Trained weights of the linear layer for MLM |
| roberta_config.json | Configuration for RobertaModel |
| bpe.txt | Rules for splitting each word into word pieces |
| bpe_dict.csv | Dictionary of word pieces |
- Load weights and configurations.
```python
import json

import numpy as np
import torch
from transformers import RobertaConfig, RobertaModel

# TextProcessor and IFXRoberta are defined in this repository (see main.py).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Paths to each file
bpe_file = <Path to the file (bpe.txt) which defines word pieces>
count_file = <Path to the file (bpe_dict.csv) which defines ids for word pieces>
roberta_config_path = <Path to the file (roberta_config.json) which defines configurations of RobertaModel>
juman_config_path = <Path to the config file for Juman++>
roberta_weight_path = <Path to the weight file (roberta.pth) of RobertaModel>
linear_weight_path = <Path to the weight file (linear_word.pth) of the final linear layer for MLM>

# load tokenizer
processor = TextProcessor(bpe_file=bpe_file, count_file=count_file)

# load pretrained roberta model
with open(roberta_config_path, "r") as f:
    config_dict = json.load(f)
config_roberta = RobertaConfig.from_dict(config_dict)
roberta = RobertaModel(config=config_roberta)
roberta.load_state_dict(torch.load(roberta_weight_path, map_location=device))

# load pretrained decoder
ifxroberta = IFXRoberta(roberta)
ifxroberta.linear_word.load_state_dict(torch.load(linear_weight_path, map_location=device))
```
- Encode inputs.
```python
# infer
inp_text = "コンピュータ技術およびその関連技術に対する研究開発に努め、自己研鑽に励み、専門的な技術集団を形成する。"
bpe_text = processor.encode(inp_text)
with torch.no_grad():
    inp_tensor = torch.LongTensor([bpe_text])
    inp_tensor = inp_tensor.to(device)  # .to() is not in-place, so reassign
    out_tensor = ifxroberta(inp_tensor)
```
- Decode model outputs.
```python
# decoding output
out_text_code = torch.max(out_tensor, dim=-1, keepdim=True)[1][0]
ids = out_text_code.squeeze(-1).cpu().numpy().tolist()
# Ignore positions whose input ids are below 5 (presumably the special tokens).
ignore_idx = np.where(np.array(bpe_text) < 5)[0]
out_text = processor.decode(ids, ignore_idx=ignore_idx)
```
`main.py` helps you understand how to use our model.