# MatBERT
A pretrained BERT model on materials science literature. MatBERT specializes in understanding materials science terminologies and paragraph-level scientific reasoning.
To use MatBERT, download these files into a folder:
```bash
export MODEL_PATH="Your path"
mkdir $MODEL_PATH/matbert-base-cased $MODEL_PATH/matbert-base-uncased
curl -# -o $MODEL_PATH/matbert-base-cased/config.json https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/config.json
curl -# -o $MODEL_PATH/matbert-base-cased/vocab.txt https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/vocab.txt
curl -# -o $MODEL_PATH/matbert-base-cased/pytorch_model.bin https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/pytorch_model.bin
curl -# -o $MODEL_PATH/matbert-base-uncased/config.json https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_uncased_30522_wd/config.json
curl -# -o $MODEL_PATH/matbert-base-uncased/vocab.txt https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_uncased_30522_wd/vocab.txt
curl -# -o $MODEL_PATH/matbert-base-uncased/pytorch_model.bin https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_uncased_30522_wd/pytorch_model.bin
```
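If the downloads succeed, each folder can be loaded directly with the standard `from_pretrained` API. Below is a minimal sanity-check sketch (assuming the `MODEL_PATH` environment variable set above; the 30,522-entry vocabulary matches the `30522` in the download URLs):

```python
# Sanity-check sketch: load the downloaded cased model from its local folder.
import os
from transformers import BertForMaskedLM, BertTokenizerFast

model_dir = os.path.join(os.environ["MODEL_PATH"], "matbert-base-cased")

# Both calls read config.json, vocab.txt and pytorch_model.bin from the local folder.
tokenizer = BertTokenizerFast.from_pretrained(model_dir, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_dir)

print(model.config.vocab_size)  # expected: 30522
print(len(tokenizer))           # tokenizer vocabulary size should match the model config
```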
The tokenizer is specifically trained to handle materials science terminologies:
```python
>>> from transformers import BertTokenizerFast
>>> tokenizer = BertTokenizerFast.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', do_lower_case=False)
>>> tokenizer_bert = BertTokenizerFast.from_pretrained('bert-base-cased', do_lower_case=False)
>>> for i in ['Fe(NO3)3• 9H2O', 'La0.85Ag0.15Mn1−yAlyO3']:
...     print(i)
...     print('=' * 100)
...     print('MatBERT tokenizer:', tokenizer.tokenize(i))
...     print('BERT tokenizer:', tokenizer_bert.tokenize(i))
...
Fe(NO3)3• 9H2O
====================================================================================================
MatBERT tokenizer: ['Fe', '(', 'NO3', ')', '3', '•', '9H2O']
BERT tokenizer: ['Fe', '(', 'NO', '##3', ')', '3', '•', '9', '##H', '##2', '##O']
La0.85Ag0.15Mn1−yAlyO3
====================================================================================================
MatBERT tokenizer: ['La0', '.', '85', '##Ag', '##0', '.', '15', '##Mn1', '##−y', '##Al', '##y', '##O3']
BERT tokenizer: ['La', '##0', '.', '85', '##A', '##g', '##0', '.', '15', '##M', '##n', '##1', '##−', '##y', '##A', '##ly', '##O', '##3']
```
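Beyond `tokenize`, the same tokenizer produces padded and truncated model inputs for whole batches of paragraphs. A short sketch (the paragraphs below are made up purely for illustration):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', do_lower_case=False)

# Illustrative paragraphs only; any materials-science text works the same way.
paragraphs = [
    "LiMn2O4 was prepared by a solid-state reaction of Li2CO3 and MnO2.",
    "Fe(NO3)3• 9H2O was dissolved in deionized water under stirring.",
]

# Returns input_ids, token_type_ids and attention_mask, ready to feed into the model.
batch = tokenizer(paragraphs, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch["input_ids"].shape)
```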
The model can be loaded using Transformers' universal loading API. Here, we demonstrate how MatBERT performs scientific reasoning for the synthesis of Li-ion battery materials.
```python
>>> from transformers import BertForMaskedLM, BertTokenizerFast, pipeline
>>> from pprint import pprint
>>> model = BertForMaskedLM.from_pretrained('PATH-TO-MATBERT/matbert-base-cased')
>>> tokenizer = BertTokenizerFast.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', do_lower_case=False)
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> pprint(unmasker("Conventional [MASK] synthesis is used to fabricate material LiMn2O4."))
[{'sequence': '[CLS] Conventional combustion synthesis is used to fabricate material LiMn2O4. [SEP]',
  'score': 0.4971400499343872,
  'token': 5444,
  'token_str': 'combustion'},
 {'sequence': '[CLS] Conventional hydrothermal synthesis is used to fabricate material LiMn2O4. [SEP]',
  'score': 0.2478722780942917,
  'token': 7524,
  'token_str': 'hydrothermal'},
 {'sequence': '[CLS] Conventional chemical synthesis is used to fabricate material LiMn2O4. [SEP]',
  'score': 0.060953784734010696,
  'token': 2868,
  'token_str': 'chemical'},
 {'sequence': '[CLS] Conventional gel synthesis is used to fabricate material LiMn2O4. [SEP]',
  'score': 0.03871171176433563,
  'token': 4003,
  'token_str': 'gel'},
 {'sequence': '[CLS] Conventional solution synthesis is used to fabricate material LiMn2O4. [SEP]',
  'score': 0.019403140991926193,
  'token': 2291,
  'token_str': 'solution'}]
```
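The same checkpoint also works with `BertModel` when only contextual embeddings are needed, for example as paragraph-level features in a downstream pipeline. This is a sketch rather than an example from the original repository:

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', do_lower_case=False)
model = BertModel.from_pretrained('PATH-TO-MATBERT/matbert-base-cased')
model.eval()

inputs = tokenizer("LiMn2O4 is fabricated by conventional combustion synthesis.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs[0] holds the last hidden states with shape (batch, seq_len, hidden_size).
cls_embedding = outputs[0][:, 0, :]  # the [CLS] vector, usable as a paragraph-level feature
print(cls_embedding.shape)           # (1, 768) for a BERT-base configuration
```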
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Note that GLUE evaluates a language model's capability for general-purpose language understanding, which may not align with the capabilities of language models trained on specialized domains such as MatBERT.
Task | Metric | Score (MatBERT-base-cased) | Score (MatBERT-base-uncased) |
---|---|---|---|
The Corpus of Linguistic Acceptability | Matthew's Corr | 26.1 | 26.6 |
The Stanford Sentiment Treebank | Accuracy | 89.5 | 90.2 |
Microsoft Research Paraphrase Corpus | F1/Accuracy | 87.0/83.1 | 87.4/82.7 |
Semantic Textual Similarity Benchmark | Pearson-Spearman Corr | 80.3/79.3 | 81.5/80.2 |
Quora Question Pairs | F1/Accuracy | 69.8/88.4 | 69.7/88.6 |
MultiNLI Matched | Accuracy | 79.6 | 80.7 |
MultiNLI Mismatched | Accuracy | 79.3 | 80.1 |
Question NLI | Accuracy | 88.4 | 88.5 |
Recognizing Textual Entailment | Accuracy | 63.0 | 60.2 |
Winograd NLI | Accuracy | 61.6 | 65.1 |
Diagnostics Main | Matthew's Corr | 32.6 | 31.9 |
Average Score | ---- | 72.4 | 72.9 |
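Scores such as those above come from fine-tuning the pretrained checkpoint separately on each GLUE task. As an illustration only (not the exact evaluation setup used for this table), a classification head can be attached with the standard Transformers API:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', do_lower_case=False)
# A fresh 2-way classification head on top of the pretrained encoder (e.g. entailment vs. not).
model = BertForSequenceClassification.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', num_labels=2)

# Sentence-pair tasks (MNLI, QNLI, RTE, ...) encode both sentences as a single input.
inputs = tokenizer("The sample was annealed at 800 °C.",
                   "The sample was heat treated.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]  # logits are meaningless until the head is fine-tuned
print(logits.shape)              # (1, 2)
```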
Training of all MatBERT models was done using `transformers==3.3.1`. The training corpus contains 2 million papers collected by the text-mining efforts at the CEDER group. In total, we collected 61,253,938 paragraphs, from which around 50 million paragraphs with 20-510 tokens were retained and used for training. Two WordPiece tokenizers (cased and uncased) optimized for materials science literature were trained on these paragraphs.
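The 20-510 token window likely reflects BERT's 512-token input limit once the [CLS] and [SEP] tokens are added. A rough sketch of such a length filter (an illustration, not the actual CEDER preprocessing code):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', do_lower_case=False)

def keep_paragraph(text, min_tokens=20, max_tokens=510):
    """Keep paragraphs whose WordPiece length fits BERT's 512-token input with [CLS]/[SEP]."""
    n_tokens = len(tokenizer.tokenize(text))
    return min_tokens <= n_tokens <= max_tokens

paragraphs = [
    "Too short.",
    "LiMn2O4 powders were prepared by ball milling Li2CO3 and MnO2 for 12 h, "
    "followed by calcination at 800 °C in air for 10 h and slow furnace cooling.",
]
training_paragraphs = [p for p in paragraphs if keep_paragraph(p)]
print(len(training_paragraphs))
```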
The DOIs and titles of the 2 million papers can be found in this CSV file. For an overview of this corpus, we created a word cloud image here.
For training MatBERT, the config files we used were matbert-base-uncased and matbert-base-cased. Only the masked language modeling (MLM) task was used to pretrain the MatBERT models. The batch size is roughly 192 paragraphs per gradient-update step, and training runs for 5 epochs in total. The optimizer is AdamW with beta1=0.9 and beta2=0.999. The learning rate starts at 5e-5 and decays linearly to zero as training finishes. A weight decay of 0.01 was used. All models were trained using FP16 mode and O2 optimization on 8 NVIDIA V100 cards. The loss values during training can be found at matbert-base-uncased and matbert-base-cased.
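For reference, these hyperparameters map roughly onto the standard Transformers `Trainer` setup sketched below. This is an illustration of the stated settings, not the actual training script; the real config files are the matbert-base-* configs linked above, and `train_dataset` is assumed to be a tokenized paragraph dataset:

```python
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('PATH-TO-MATBERT/matbert-base-cased', do_lower_case=False)
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))  # pretraining starts from scratch

# MLM objective only; the default collator masks 15% of tokens.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

training_args = TrainingArguments(
    output_dir="matbert-pretraining",
    num_train_epochs=5,                 # 5 epochs in total
    per_device_train_batch_size=24,     # 24 paragraphs x 8 V100s = 192 per gradient update
    learning_rate=5e-5,                 # decays linearly to zero (Trainer default schedule)
    weight_decay=0.01,
    fp16=True,                          # FP16 training
    fp16_opt_level="O2",                # Apex O2 optimization
)
# AdamW with betas (0.9, 0.999) is the Trainer's default optimizer.

# trainer = Trainer(model=model, args=training_args,
#                   data_collator=data_collator, train_dataset=train_dataset)
# trainer.train()
```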
If you use this work in your projects, please consider citing this paper:
```
@article{walker2021impact,
  title={The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science},
  author={Walker, Nicholas and Trewartha, Amalie and Huo, Haoyan and Lee, Sanghoon and Cruse, Kevin and Dagdelen, John and Dunn, Alexander and Persson, Kristin and Ceder, Gerbrand and Jain, Anubhav},
  journal={Available at SSRN 3950755},
  year={2021}
}
```
This work used the Extreme Science and Engineering Discovery Environment (XSEDE) GPU resources, specifically the Bridges-2 supercomputer at the Pittsburgh Supercomputing Center, through allocation TG-DMR970008S.