informatix-inc/bert
This repository provides snippets for using RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25 GB in total. The released model is built on top of the RoBERTa implementation from HuggingFace.
We used Juman++ (version 2.0.0-rc3) as a morphological analyzer, and also applied WordPiece embedding (subword-nmt) to split each word into word pieces.
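The preprocessing therefore has two stages: morphological analysis with Juman++, followed by subword splitting with the released BPE rules. Below is a minimal sketch of that flow, assuming the `pyknp` wrapper for Juman++ and the `subword_nmt` package; the repository's own `TextProcessor` wraps the equivalent steps, and whether `bpe.txt` can be fed directly to `subword_nmt` as shown is an assumption.

```python
# Minimal preprocessing sketch (assumptions: pyknp + subword-nmt;
# the repository's TextProcessor encapsulates the real pipeline).
from pyknp import Juman
from subword_nmt.apply_bpe import BPE

jumanpp = Juman()  # requires the Juman++ binary to be installed and on PATH

text = "コンピュータ技術およびその関連技術に対する研究開発に努める。"

# 1. Morphological analysis: split the raw sentence into words.
words = [mrph.midasi for mrph in jumanpp.analysis(text).mrph_list()]

# 2. Subword splitting: apply the BPE merge rules to the word sequence.
with open("bpe.txt", encoding="utf-8") as f:  # bpe.txt: released rules file
    bpe = BPE(f)
word_pieces = bpe.segment(" ".join(words)).split()

print(word_pieces)
```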
The configuration of our model is as follows. Please refer to the HuggingFace documentation for the definition of each parameter.
{ "attention_probs_dropout_prob": 0.1, "bos_token_id": 3, "classifier_dropout": null, "eos_token_id": 0, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 515, "max_seq_length": 512, "model_type": "roberta", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 2, "position_embedding_type": "absolute", "transformers_version": "4.16.2", "type_vocab_size": 2, "use_cache": true, "vocab_size": 32005}
We trained our model in the same way as RoBERTa, optimizing it with the masked language modeling (MLM) objective. The MLM accuracy is 72.0%.
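For reference, MLM accuracy of the kind reported above is simply the fraction of masked positions where the model's arg-max prediction recovers the original token. The helper below illustrates that metric; it is a sketch of the evaluation idea, not the training code used for this model, and the `ignore_index` convention is an assumption borrowed from common HuggingFace practice.

```python
import torch

def mlm_accuracy(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> float:
    """Fraction of masked positions where the arg-max prediction equals the
    original token. `labels` holds the original ids at masked positions and
    `ignore_index` everywhere else."""
    preds = logits.argmax(dim=-1)        # (batch, seq_len)
    masked = labels.ne(ignore_index)     # only masked positions are scored
    correct = (preds.eq(labels) & masked).sum()
    return (correct.float() / masked.sum().clamp(min=1)).item()
```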
- Install Juman++ following the instructions in the Juman++ repository.
- Install PyTorch following the instructions on the PyTorch installation page.
- Install all of the necessary Python packages.

  ```sh
  pip install -r requirements.txt
  ```
- Download weight files and configuration files.
| Files | Description |
|---|---|
| roberta.pth | Trained weights of RobertaModel from HuggingFace |
| linear_word.pth | Trained weights of the linear layer for MLM |
| roberta_config.json | Configuration for RobertaModel |
| bpe.txt | Rules for splitting each word into word pieces |
| bpe_dict.csv | Dictionary of word pieces |
- Load weights and configurations.
```python
import json

import numpy as np
import torch
from transformers import RobertaConfig, RobertaModel

# TextProcessor and IFXRoberta are defined in this repository (see main.py).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Paths to each file
bpe_file = <Path to the file (bpe.txt) which defines word pieces>
count_file = <Path to the file (bpe_dict.csv) which defines ids for word pieces>
roberta_config_path = <Path to the file (roberta_config.json) which defines configurations of RobertaModel>
juman_config_path = <Path to the config file for Juman++>
roberta_weight_path = <Path to the weight file (roberta.pth) of RobertaModel>
linear_weight_path = <Path to the weight file (linear_word.pth) of the final linear layer for MLM>

# load tokenizer
processor = TextProcessor(bpe_file=bpe_file, count_file=count_file)

# load pretrained roberta model
with open(roberta_config_path, "r") as f:
    config_dict = json.load(f)
config_roberta = RobertaConfig.from_dict(config_dict)
roberta = RobertaModel(config=config_roberta)
roberta.load_state_dict(torch.load(roberta_weight_path, map_location=device))

# load pretrained decoder
ifxroberta = IFXRoberta(roberta)
ifxroberta.linear_word.load_state_dict(torch.load(linear_weight_path, map_location=device))
```
- Encode inputs.
```python
# infer
inp_text = "コンピュータ技術およびその関連技術に対する研究開発に努め、自己研鑽に励み、専門的な技術集団を形成する。"
bpe_text = processor.encode(inp_text)
with torch.no_grad():
    inp_tensor = torch.LongTensor([bpe_text])
    inp_tensor = inp_tensor.to(device)  # .to() is not in-place, so reassign
    out_tensor = ifxroberta(inp_tensor)
```
- Decode model outputs.
```python
# decoding output
out_text_code = torch.max(out_tensor, dim=-1, keepdim=True)[1][0]
ids = out_text_code.squeeze(-1).cpu().numpy().tolist()
# Ignore positions whose input ids are below 5 (presumably the special tokens).
ignore_idx = np.where(np.array(bpe_text) < 5)[0]
out_text = processor.decode(ids, ignore_idx=ignore_idx)
```
`main.py` helps you understand how to use our model.