SciMult

Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP'23)
This repository contains code and instructions for reproducing the experiments in the paper Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP 2023).


Installation

We use one NVIDIA RTX A6000 GPU to run the evaluation code in our experiments. The code is written in Python 3.8. You can install the dependencies as follows.

```bash
git clone --recurse-submodules https://github.com/yuzhimanhua/SciMult
cd SciMult

# get the DPR codebase
mkdir third_party
cd third_party
git clone https://github.com/facebookresearch/DPR.git
cd ../

# create the sandbox
conda env create --file=environment.yml --name=scimult
conda activate scimult

# add `src/` and `third_party/DPR` to the list of places python searches for packages
conda develop src/ third_party/DPR/

# download spacy models
python -m spacy download en_core_web_sm
```
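As an optional sanity check (a minimal sketch; it assumes the DPR codebase exposes a top-level `dpr` package and that `environment.yml` installs PyTorch and spaCy), the following should run inside the activated `scimult` environment without import errors:

```python
# Hypothetical sanity check: confirm the conda-develop paths and core dependencies resolve.
import dpr      # from third_party/DPR/, added via `conda develop`
import spacy
import torch

print("CUDA available:", torch.cuda.is_available())
print("spaCy model loads:", spacy.load("en_core_web_sm") is not None)
```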

Quick Start

You need to first download the evaluation datasets and the pre-trained models. After you unzip the dataset file, put the folder (i.e., `data/`) under the repository main folder `./`. After you download the four model checkpoints (i.e., `scimult_vanilla.ckpt`, `scimult_moe.ckpt`, `scimult_moe_pmcpatients_par.ckpt`, and `scimult_moe_pmcpatients_ppr.ckpt`), put them under the model folder `./model/`.
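For reference, the repository should look roughly like the following after these steps (a sketch based on the folder names used in this README; the dataset subfolders are abbreviated, see the Datasets table below for the full list):

```text
SciMult/
├── data/                          # unzipped evaluation datasets
│   ├── classification_fine/
│   ├── classification_coarse/
│   ├── link_prediction_retrieval/
│   └── ...
├── model/
│   ├── scimult_vanilla.ckpt
│   ├── scimult_moe.ckpt
│   ├── scimult_moe_pmcpatients_par.ckpt
│   └── scimult_moe_pmcpatients_ppr.ckpt
├── src/
└── third_party/DPR/
```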

Then, you can run the evaluation code for each task:

```bash
cd src

# evaluate fine-grained classification (MAPLE [CS-Conference, Chemistry-MeSH, Geography, Psychology])
./eval_classification_fine.sh

# evaluate coarse-grained classification (SciDocs [MAG, MeSH])
./eval_classification_coarse.sh

# evaluate link prediction under the retrieval setting (SciDocs [Cite, Co-cite], PMC-Patients [PPR])
./eval_link_prediction_retrieval.sh

# evaluate link prediction under the reranking setting (Recommendation)
./eval_link_prediction_reranking.sh

# evaluate search (SciRepEval [Search, TREC-COVID], BEIR [TREC-COVID, SciFact, NFCorpus])
./eval_search.sh
```

The metrics will be shown at the end of the terminal output as well as in `scores.txt`.

Getting embeddings of your own data

If you have some documents (e.g., scientific papers) and want to get the embedding of each document using SciMult, we provide the following sample code for your reference:

```bash
cd src
python3.8 get_embedding.py
```
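If you prefer to embed documents outside of `get_embedding.py`, the following is a minimal sketch of the general idea only: it assumes the encoder can be loaded as a BERT-style Hugging Face model under a placeholder name (`MODEL_NAME`) and uses the [CLS] vector as the document embedding, as DPR-style bi-encoders do. For how the SciMult checkpoints are actually loaded, `get_embedding.py` remains the reference.

```python
# Minimal sketch: encode documents with a BERT-style encoder and [CLS] pooling.
# MODEL_NAME is a placeholder, not the SciMult checkpoint; see get_embedding.py for the real loading logic.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

docs = [
    "Title of paper 1. Abstract of paper 1 ...",
    "Title of paper 2. Abstract of paper 2 ...",
]

with torch.no_grad():
    batch = tokenizer(docs, padding=True, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**batch)
    embeddings = outputs.last_hidden_state[:, 0, :]  # one [CLS] vector per document

print(embeddings.shape)  # (num_docs, hidden_size)
```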

PMC-Patients

NOTE: The performance of SciMult on PMC-Patients reported in our paper is based on the old version of PMC-Patients (i.e., the version available when we wrote the SciMult paper). The PMC-Patients Leaderboard at that time can be found here.

To reproduce our reported performance on the "old" PMC-Patients Leaderboard:

```bash
cd src
./eval_pmc_patients.sh
```

The metrics will be shown at the end of the terminal output as well as in `scores.txt`. The similarity scores that we submitted to the leaderboard can be found at `../output/PMCPatientsPAR_test_out.json` and `../output/PMCPatientsPPR_test_out.json`.

For the performance of SciMult on the new version of PMC-Patients, please refer to the up-to-date PMC-Patients Leaderboard.

SciDocs

To reproduce our performance on the SciDocs benchmark:

```bash
cd src
./eval_scidocs.sh
```

The output embedding files can be found at `../output/cls.jsonl` and `../output/user-citation.jsonl`. Then, run the adapted SciDocs evaluation code:

```bash
cd ../
git clone https://github.com/yuzhimanhua/SciDocs.git
cd scidocs

# install dependencies
conda deactivate
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install

# run evaluation
python eval.py
```

The metrics will be shown at the end of the terminal output.

Datasets

The preprocessed evaluation datasets can be downloaded from here. The aggregate version is released under the ODC-By v1.0 License. By downloading this version you acknowledge that you have read and agreed to all the terms in this license.

Similar to TensorFlow Datasets or Hugging Face's datasets library, we only downloaded and prepared these public datasets. We distribute them in a specific format, but we do not vouch for their quality or fairness, nor do we claim that you have a license to use them. It remains your responsibility as a user to determine whether you have permission to use each dataset under its license and to cite its rightful owner.

More details about each constituent dataset are as follows.

| Dataset | Folder | #Queries | #Candidates | Source | License |
| --- | --- | --- | --- | --- | --- |
| MAPLE (CS-Conference) | `classification_fine/` | 261,781 | 15,808 | Link | ODC-By v1.0 |
| MAPLE (Chemistry-MeSH) | `classification_fine/` | 762,129 | 30,194 | Link | ODC-By v1.0 |
| MAPLE (Geography) | `classification_fine/` | 73,883 | 3,285 | Link | ODC-By v1.0 |
| MAPLE (Psychology) | `classification_fine/` | 372,954 | 7,641 | Link | ODC-By v1.0 |
| SciDocs (MAG Fields) | `classification_coarse/` | 25,001 | 19 | Link | CC BY 4.0 |
| SciDocs (MeSH Diseases) | `classification_coarse/` | 23,473 | 11 | Link | CC BY 4.0 |
| SciDocs (Cite) | `link_prediction_retrieval/` | 92,214 | 142,009 | Link | CC BY 4.0 |
| SciDocs (Co-cite) | `link_prediction_retrieval/` | 54,543 | 142,009 | Link | CC BY 4.0 |
| PMC-Patients (PPR, Zero-shot) | `link_prediction_retrieval/` | 100,327 | 155,151 | Link | CC BY-NC-SA 4.0 |
| PMC-Patients (PAR, Supervised) | `pmc_patients/` | 5,959 | 1,413,087 | Link | CC BY-NC-SA 4.0 |
| PMC-Patients (PPR, Supervised) | `pmc_patients/` | 2,812 | 155,151 | Link | CC BY-NC-SA 4.0 |
| SciDocs (Co-view) | `scidocs/` | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
| SciDocs (Co-read) | `scidocs/` | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
| SciDocs (Cite) | `scidocs/` | 1,000 | reranking, 29.93 for each query on average | Link | CC BY 4.0 |
| SciDocs (Co-cite) | `scidocs/` | 1,000 | reranking, 29.95 for each query on average | Link | CC BY 4.0 |
| Recommendation | `link_prediction_reranking/` | 137 | reranking, 16.28 for each query on average | Link | N/A |
| SciRepEval-Search | `search/` | 2,637 | reranking, 10.00 for each query on average | Link | ODC-By v1.0 |
| TREC-COVID in SciRepEval | `search/` | 50 | reranking, 1386.36 for each query on average | Link | ODC-By v1.0 |
| TREC-COVID in BEIR | `search/` | 50 | 171,332 | Link | Apache License 2.0 |
| SciFact | `search/` | 1,109 | 5,183 | Link | Apache License 2.0, CC BY-NC 2.0 |
| NFCorpus | `search/` | 3,237 | 3,633 | Link | Apache License 2.0 |

Models

Our pre-trained models can be downloaded from here. Please refer to the Hugging Face README for more details about the models.
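As an alternative to downloading the checkpoints manually, here is a minimal sketch using the `huggingface_hub` client; it assumes the checkpoints are hosted in a Hugging Face model repository named `yuzhimanhua/SciMult` (please verify the repository name and file list against the Hugging Face README).

```python
# Minimal sketch: fetch the four checkpoints into ./model/ with huggingface_hub.
# The repo_id below is an assumption; check the Hugging Face README for the actual repository.
from huggingface_hub import hf_hub_download

for ckpt in [
    "scimult_vanilla.ckpt",
    "scimult_moe.ckpt",
    "scimult_moe_pmcpatients_par.ckpt",
    "scimult_moe_pmcpatients_ppr.ckpt",
]:
    path = hf_hub_download(repo_id="yuzhimanhua/SciMult", filename=ckpt, local_dir="model")
    print("downloaded to", path)
```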

Citation

If you find SciMult useful in your research, please cite the following paper:

```bibtex
@inproceedings{zhang2023pre,
  title={Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding},
  author={Zhang, Yu and Cheng, Hao and Shen, Zhihong and Liu, Xiaodong and Wang, Ye-Yi and Gao, Jianfeng},
  booktitle={Findings of EMNLP'23},
  pages={12259--12275},
  year={2023}
}
```

License

This repository is released under two licenses: CC-BY-4.0 (see `LICENSE`) and MIT (see `LICENSE-CODE`).
