Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP'23)
This repository contains code and instructions for reproducing the experiments in the paper Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP 2023).
We use one NVIDIA RTX A6000 GPU to run the evaluation code in our experiments. The code is written in Python 3.8. You can install the dependencies as follows.
```bash
git clone --recurse-submodules https://github.com/yuzhimanhua/SciMult
cd SciMult

# get the DPR codebase
mkdir third_party
cd third_party
git clone https://github.com/facebookresearch/DPR.git
cd ../

# create the sandbox
conda env create --file=environment.yml --name=scimult
conda activate scimult

# add `src/` and `third_party/DPR` to the list of places Python searches for packages
conda develop src/ third_party/DPR/

# download spacy models
python -m spacy download en_core_web_sm
```
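Before moving on, you can optionally run a quick import check to confirm that the sandbox was set up correctly. This snippet is not part of the repository; it only assumes that PyTorch is included in `environment.yml` and that `conda develop` made the DPR codebase importable:

```python
# Hypothetical sanity check, not shipped with the repository.
import importlib

# `dpr` should resolve because `conda develop third_party/DPR/` put it on the path.
for name in ["torch", "spacy", "dpr"]:
    importlib.import_module(name)
    print(f"import {name}: OK")

# the spaCy model downloaded in the last setup step
import spacy
spacy.load("en_core_web_sm")
print("spacy model en_core_web_sm: OK")
```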
You need to first download the evaluation datasets and the pre-trained models. After you unzip the dataset file, put the folder (i.e., `data/`) under the repository main folder `./`. After you download the four model checkpoints (i.e., `scimult_vanilla.ckpt`, `scimult_moe.ckpt`, `scimult_moe_pmcpatients_par.ckpt`, and `scimult_moe_pmcpatients_ppr.ckpt`), put them under the model folder `./model/`.
Then, you can run the evaluation code for each task:
```bash
cd src

# evaluate fine-grained classification (MAPLE [CS-Conference, Chemistry-MeSH, Geography, Psychology])
./eval_classification_fine.sh

# evaluate coarse-grained classification (SciDocs [MAG, MeSH])
./eval_classification_coarse.sh

# evaluate link prediction under the retrieval setting (SciDocs [Cite, Co-cite], PMC-Patients [PPR])
./eval_link_prediction_retrieval.sh

# evaluate link prediction under the reranking setting (Recommendation)
./eval_link_prediction_reranking.sh

# evaluate search (SciRepEval [Search, TREC-COVID], BEIR [TREC-COVID, SciFact, NFCorpus])
./eval_search.sh
```
The metrics will be shown at the end of the terminal output as well as in `scores.txt`.
If you have some documents (e.g., scientific papers) and want to get the embedding of each document using SciMult, we provide the following sample code for your reference:
```bash
cd src
python3.8 get_embedding.py
```
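For orientation, the sketch below shows the general shape of such an embedding step with a BERT-style Hugging Face encoder. It is not the actual `get_embedding.py`: the model name and the [CLS]-pooling choice are placeholders for illustration, so please follow `get_embedding.py` and the model README for how the SciMult checkpoints are really loaded and pooled.

```python
# Illustrative only: embed documents with a generic transformer encoder.
# The model name and [CLS] pooling are assumptions, not SciMult's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder; not a SciMult checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

documents = [
    "Paper title one. Abstract of the first paper.",
    "Paper title two. Abstract of the second paper.",
]

with torch.no_grad():
    batch = tokenizer(documents, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    outputs = model(**batch)
    embeddings = outputs.last_hidden_state[:, 0, :]  # take the [CLS] vector

print(embeddings.shape)  # (num_documents, hidden_size)
```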
NOTE: The performance of SciMult on PMC-Patients reported in our paper is based on the old version of PMC-Patients (i.e., the version available when we wrote the SciMult paper). The PMC-Patients Leaderboard at that time can be found here.
To reproduce our reported performance on the "old" PMC-Patients Leaderboard:
```bash
cd src
./eval_pmc_patients.sh
```
The metrics will be shown at the end of the terminal output as well as in `scores.txt`. The similarity scores that we submitted to the leaderboard can be found at `../output/PMCPatientsPAR_test_out.json` and `../output/PMCPatientsPPR_test_out.json`.
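If you want to peek at those submission files, a small generic snippet (hypothetical; it makes no assumption about their exact field layout) is:

```python
# Print the container type and size of each submitted score file.
import json

for path in ["../output/PMCPatientsPAR_test_out.json",
             "../output/PMCPatientsPPR_test_out.json"]:
    with open(path) as f:
        scores = json.load(f)
    print(f"{path}: {type(scores).__name__} with {len(scores)} entries")
```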
For the performance of SciMult on the new version of PMC-Patients, please refer to the up-to-date PMC-Patients Leaderboard.
To reproduce our performance on the SciDocs benchmark:
```bash
cd src
./eval_scidocs.sh
```
The output embedding files can be found at `../output/cls.jsonl` and `../output/user-citation.jsonl`. Then, run the adapted SciDocs evaluation code:
```bash
cd ../
git clone https://github.com/yuzhimanhua/SciDocs.git
cd scidocs

# install dependencies
conda deactivate
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install

# run evaluation
python eval.py
```
The metrics will be shown at the end of the terminal output.
The preprocessed evaluation datasets can be downloaded from here. The aggregate version is released under the ODC-By v1.0 License. By downloading this version, you acknowledge that you have read and agreed to all the terms in this license.
Similar to TensorFlow Datasets and Hugging Face's datasets library, we only downloaded and prepared publicly available datasets and distribute them in a specific format; we do not vouch for their quality or fairness, nor claim that you are licensed to use them. It remains your responsibility as the user to determine whether you have permission to use each dataset under its license and to cite the dataset's rightful owner.
More details about each constituent dataset are as follows.
Dataset | Folder | #Queries | #Candidates | Source | License |
---|---|---|---|---|---|
MAPLE (CS-Conference) | classification_fine/ | 261,781 | 15,808 | Link | ODC-By v1.0 |
MAPLE (Chemistry-MeSH) | classification_fine/ | 762,129 | 30,194 | Link | ODC-By v1.0 |
MAPLE (Geography) | classification_fine/ | 73,883 | 3,285 | Link | ODC-By v1.0 |
MAPLE (Psychology) | classification_fine/ | 372,954 | 7,641 | Link | ODC-By v1.0 |
SciDocs (MAG Fields) | classification_coarse/ | 25,001 | 19 | Link | CC BY 4.0 |
SciDocs (MeSH Diseases) | classification_coarse/ | 23,473 | 11 | Link | CC BY 4.0 |
SciDocs (Cite) | link_prediction_retrieval/ | 92,214 | 142,009 | Link | CC BY 4.0 |
SciDocs (Co-cite) | link_prediction_retrieval/ | 54,543 | 142,009 | Link | CC BY 4.0 |
PMC-Patients (PPR, Zero-shot) | link_prediction_retrieval/ | 100,327 | 155,151 | Link | CC BY-NC-SA 4.0 |
PMC-Patients (PAR, Supervised) | pmc_patients/ | 5,959 | 1,413,087 | Link | CC BY-NC-SA 4.0 |
PMC-Patients (PPR, Supervised) | pmc_patients/ | 2,812 | 155,151 | Link | CC BY-NC-SA 4.0 |
SciDocs (Co-view) | scidocs/ | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
SciDocs (Co-read) | scidocs/ | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
SciDocs (Cite) | scidocs/ | 1,000 | reranking, 29.93 for each query on average | Link | CC BY 4.0 |
SciDocs (Co-cite) | scidocs/ | 1,000 | reranking, 29.95 for each query on average | Link | CC BY 4.0 |
Recommendation | link_prediction_reranking/ | 137 | reranking, 16.28 for each query on average | Link | N/A |
SciRepEval-Search | search/ | 2,637 | reranking, 10.00 for each query on average | Link | ODC-By v1.0 |
TREC-COVID in SciRepEval | search/ | 50 | reranking, 1386.36 for each query on average | Link | ODC-By v1.0 |
TREC-COVID in BEIR | search/ | 50 | 171,332 | Link | Apache License 2.0 |
SciFact | search/ | 1,109 | 5,183 | Link | Apache License 2.0,CC BY-NC 2.0 |
NFCorpus | search/ | 3,237 | 3,633 | Link | Apache License 2.0 |
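Since these datasets live in different subfolders of `data/` (see the Folder column above) and use different formats, a quick way to see what was unpacked is the following hypothetical helper (not part of the repository):

```python
# Count the files under each evaluation-data subfolder listed in the table above.
from pathlib import Path

data_root = Path("data")  # the unzipped dataset folder
for folder in sorted(p for p in data_root.iterdir() if p.is_dir()):
    n_files = sum(1 for f in folder.rglob("*") if f.is_file())
    print(f"{folder.name}/: {n_files} files")
```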
Our pre-trained models can be downloaded from here. Please refer to the Hugging Face README for more details about the models.
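If you just want to inspect a downloaded checkpoint before running any script, a minimal sketch, assuming the `.ckpt` files are ordinary PyTorch-serialized objects, is:

```python
# Load a checkpoint on CPU and show its top-level structure.
# Assumes the .ckpt file is a standard torch.save() artifact.
import torch

ckpt = torch.load("model/scimult_moe.ckpt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g., model weights and training metadata
else:
    print(type(ckpt))
```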
If you find SciMult useful in your research, please cite the following paper:
```bibtex
@inproceedings{zhang2023pre,
  title={Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding},
  author={Zhang, Yu and Cheng, Hao and Shen, Zhihong and Liu, Xiaodong and Wang, Ye-Yi and Gao, Jianfeng},
  booktitle={Findings of EMNLP'23},
  pages={12259--12275},
  year={2023}
}
```