Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP'23)
This repository contains code and instructions for reproducing the experiments in the paper Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP 2023).
We use one NVIDIA RTX A6000 GPU to run the evaluation code in our experiments. The code is written in Python 3.8. You can install the dependencies as follows.
```bash
git clone --recurse-submodules https://github.com/yuzhimanhua/SciMult
cd SciMult

# get the DPR codebase
mkdir third_party
cd third_party
git clone https://github.com/facebookresearch/DPR.git
cd ../

# create the sandbox
conda env create --file=environment.yml --name=scimult
conda activate scimult

# add `src/` and `third_party/DPR` to the list of places Python searches for packages
conda develop src/ third_party/DPR/

# download spacy models
python -m spacy download en_core_web_sm
```
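Before moving on, you can optionally run a quick import check to confirm that the sandbox was set up correctly. This snippet is not part of the repository; it only assumes that PyTorch is included in `environment.yml` and that `conda develop` made the DPR codebase importable:

```python
# Hypothetical sanity check, not shipped with the repository.
import importlib

# `dpr` should resolve because `conda develop third_party/DPR/` put it on the path.
for name in ["torch", "spacy", "dpr"]:
    importlib.import_module(name)
    print(f"import {name}: OK")

# the spaCy model downloaded in the last setup step
import spacy
spacy.load("en_core_web_sm")
print("spacy model en_core_web_sm: OK")
```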
You need to first download the evaluation datasets and the pre-trained models. After you unzip the dataset file, put the folder (i.e., `data/`) under the repository main folder `./`. After you download the four model checkpoints (i.e., `scimult_vanilla.ckpt`, `scimult_moe.ckpt`, `scimult_moe_pmcpatients_par.ckpt`, and `scimult_moe_pmcpatients_ppr.ckpt`), put them under the model folder `./model/`.
Then, you can run the evaluation code for each task:
```bash
cd src

# evaluate fine-grained classification (MAPLE [CS-Conference, Chemistry-MeSH, Geography, Psychology])
./eval_classification_fine.sh

# evaluate coarse-grained classification (SciDocs [MAG, MeSH])
./eval_classification_coarse.sh

# evaluate link prediction under the retrieval setting (SciDocs [Cite, Co-cite], PMC-Patients [PPR])
./eval_link_prediction_retrieval.sh

# evaluate link prediction under the reranking setting (Recommendation)
./eval_link_prediction_reranking.sh

# evaluate search (SciRepEval [Search, TREC-COVID], BEIR [TREC-COVID, SciFact, NFCorpus])
./eval_search.sh
```
The metrics will be shown at the end of the terminal output as well as in `scores.txt`.
If you have some documents (e.g., scientific papers) and want to get the embedding of each document using SciMult, we provide the following sample code for your reference:
```bash
cd src
python3.8 get_embedding.py
```
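For orientation, the sketch below shows the general shape of such an embedding step with a BERT-style Hugging Face encoder. It is not the actual `get_embedding.py`: the model name and the [CLS]-pooling choice are placeholders for illustration, so please follow `get_embedding.py` and the model README for how the SciMult checkpoints are really loaded and pooled.

```python
# Illustrative only: embed documents with a generic transformer encoder.
# The model name and [CLS] pooling are assumptions, not SciMult's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder; not a SciMult checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

documents = [
    "Paper title one. Abstract of the first paper.",
    "Paper title two. Abstract of the second paper.",
]

with torch.no_grad():
    batch = tokenizer(documents, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    outputs = model(**batch)
    embeddings = outputs.last_hidden_state[:, 0, :]  # take the [CLS] vector

print(embeddings.shape)  # (num_documents, hidden_size)
```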
NOTE: The performance of SciMult on PMC-Patients reported in our paper is based on the old version of PMC-Patients (i.e., the version available when we wrote the SciMult paper). The PMC-Patients Leaderboard at that time can be found here.
To reproduce our reported performance on the "old" PMC-Patients Leaderboard:
```bash
cd src
./eval_pmc_patients.sh
```
The metrics will be shown at the end of the terminal output as well as in `scores.txt`. The similarity scores that we submitted to the leaderboard can be found at `../output/PMCPatientsPAR_test_out.json` and `../output/PMCPatientsPPR_test_out.json`.
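If you want to peek at those submission files, a small generic snippet (hypothetical; it makes no assumption about their exact field layout) is:

```python
# Print the container type and size of each submitted score file.
import json

for path in ["../output/PMCPatientsPAR_test_out.json",
             "../output/PMCPatientsPPR_test_out.json"]:
    with open(path) as f:
        scores = json.load(f)
    print(f"{path}: {type(scores).__name__} with {len(scores)} entries")
```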
For the performance of SciMult on the new version of PMC-Patients, please refer to the up-to-date PMC-Patients Leaderboard.
To reproduce our performance on the SciDocs benchmark:
```bash
cd src
./eval_scidocs.sh
```
The output embedding files can be found at `../output/cls.jsonl` and `../output/user-citation.jsonl`. Then, run the adapted SciDocs evaluation code:
```bash
cd ../
git clone https://github.com/yuzhimanhua/SciDocs.git
cd scidocs

# install dependencies
conda deactivate
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install

# run evaluation
python eval.py
```
The metrics will be shown at the end of the terminal output.
The preprocessed evaluation datasets can be downloaded from here. The aggregate version is released under the ODC-By v1.0 License. By downloading this version, you acknowledge that you have read and agreed to all the terms in this license.
Similar to TensorFlow Datasets and Hugging Face's datasets library, we only downloaded and prepared publicly available datasets and distribute them in a specific format; we do not vouch for their quality or fairness, nor claim that you are licensed to use them. It remains your responsibility as the user to determine whether you have permission to use each dataset under its license and to cite the dataset's rightful owner.
More details about each constituent dataset are as follows.
Dataset | Folder | #Queries | #Candidates | Source | License |
---|---|---|---|---|---|
MAPLE (CS-Conference) | classification_fine/ | 261,781 | 15,808 | Link | ODC-By v1.0 |
MAPLE (Chemistry-MeSH) | classification_fine/ | 762,129 | 30,194 | Link | ODC-By v1.0 |
MAPLE (Geography) | classification_fine/ | 73,883 | 3,285 | Link | ODC-By v1.0 |
MAPLE (Psychology) | classification_fine/ | 372,954 | 7,641 | Link | ODC-By v1.0 |
SciDocs (MAG Fields) | classification_coarse/ | 25,001 | 19 | Link | CC BY 4.0 |
SciDocs (MeSH Diseases) | classification_coarse/ | 23,473 | 11 | Link | CC BY 4.0 |
SciDocs (Cite) | link_prediction_retrieval/ | 92,214 | 142,009 | Link | CC BY 4.0 |
SciDocs (Co-cite) | link_prediction_retrieval/ | 54,543 | 142,009 | Link | CC BY 4.0 |
PMC-Patients (PPR, Zero-shot) | link_prediction_retrieval/ | 100,327 | 155,151 | Link | CC BY-NC-SA 4.0 |
PMC-Patients (PAR, Supervised) | pmc_patients/ | 5,959 | 1,413,087 | Link | CC BY-NC-SA 4.0 |
PMC-Patients (PPR, Supervised) | pmc_patients/ | 2,812 | 155,151 | Link | CC BY-NC-SA 4.0 |
SciDocs (Co-view) | scidocs/ | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
SciDocs (Co-read) | scidocs/ | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
SciDocs (Cite) | scidocs/ | 1,000 | reranking, 29.93 for each query on average | Link | CC BY 4.0 |
SciDocs (Co-cite) | scidocs/ | 1,000 | reranking, 29.95 for each query on average | Link | CC BY 4.0 |
Recommendation | link_prediction_reranking/ | 137 | reranking, 16.28 for each query on average | Link | N/A |
SciRepEval-Search | search/ | 2,637 | reranking, 10.00 for each query on average | Link | ODC-By v1.0 |
TREC-COVID in SciRepEval | search/ | 50 | reranking, 1386.36 for each query on average | Link | ODC-By v1.0 |
TREC-COVID in BEIR | search/ | 50 | 171,332 | Link | Apache License 2.0 |
SciFact | search/ | 1,109 | 5,183 | Link | Apache License 2.0,CC BY-NC 2.0 |
NFCorpus | search/ | 3,237 | 3,633 | Link | Apache License 2.0 |
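Since these datasets live in different subfolders of `data/` (see the Folder column above) and use different formats, a quick way to see what was unpacked is the following hypothetical helper (not part of the repository):

```python
# Count the files under each evaluation-data subfolder listed in the table above.
from pathlib import Path

data_root = Path("data")  # the unzipped dataset folder
for folder in sorted(p for p in data_root.iterdir() if p.is_dir()):
    n_files = sum(1 for f in folder.rglob("*") if f.is_file())
    print(f"{folder.name}/: {n_files} files")
```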
Our pre-trained models can be downloaded from here. Please refer to the Hugging Face README for more details about the models.
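If you just want to inspect a downloaded checkpoint before running any script, a minimal sketch, assuming the `.ckpt` files are ordinary PyTorch-serialized objects, is:

```python
# Load a checkpoint on CPU and show its top-level structure.
# Assumes the .ckpt file is a standard torch.save() artifact.
import torch

ckpt = torch.load("model/scimult_moe.ckpt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g., model weights and training metadata
else:
    print(type(ckpt))
```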
If you find SciMult useful in your research, please cite the following paper:
```bibtex
@inproceedings{zhang2023pre,
  title={Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding},
  author={Zhang, Yu and Cheng, Hao and Shen, Zhihong and Liu, Xiaodong and Wang, Ye-Yi and Gao, Jianfeng},
  booktitle={Findings of EMNLP'23},
  pages={12259--12275},
  year={2023}
}
```