princeton-nlp/c-sts
This repository contains the dataset and code for the paper C-STS: Conditional Semantic Textual Similarity. [ArXiv]
To avoid the intentional/unintentional scraping of the C-STS dataset for pre-training LLMs, which could cause training data contamination and impact their evaluation, we adopt the following approach for our dataset release.
The dataset for C-STS is stored in an encrypted file named `csts.tar.enc`. To access the dataset, follow these steps:
1. Request Access: Submit a request to obtain the decryption password by clicking here. You will receive an email response with the password immediately.
2. Decrypt the Dataset: Once you have received the password via email, you can decrypt the `csts.tar.enc` file using the provided `extract.sh` script. Follow the instructions below:
   - Open a terminal and navigate to the `data` directory.
   - Run the following command, replacing `<password>` with the decryption password obtained via email: `bash extract.sh csts.tar.enc <password>`
   - Provided the correct password, this step will generate three files, `csts_train.csv`, `csts_validation.csv`, and `csts_test.csv`, the unencrypted dataset splits.
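A wrong password may leave the `data` directory without usable splits, so it can help to confirm that the three files named above exist before moving on. The following is a minimal sketch of such a check, not part of the repository; the helper name `missing_splits` is hypothetical, and it assumes the files land directly in `data/`.

```python
from pathlib import Path

# File names taken from the extraction step described above.
EXPECTED_SPLITS = ("csts_train.csv", "csts_validation.csv", "csts_test.csv")


def missing_splits(data_dir="data"):
    """Return the expected split files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_SPLITS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_splits()
    if missing:
        print("Missing files (wrong password or extraction not run?):", missing)
    else:
        print("All three dataset splits are present.")
```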
You can load the data using `datasets` with the following lines:

    from datasets import load_dataset

    dataset = load_dataset(
        'csv',
        data_files={
            'train': 'data/csts_train.csv',
            'validation': 'data/csts_validation.csv',
            'test': 'data/csts_test.csv',
        },
    )
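If you want to see what the decrypted CSV files contain before loading them with `datasets`, a standard-library sketch like the one below may be useful. It makes no assumption about the column names and simply reports whatever header each file has; the helper name `describe_csv` is hypothetical, and the paths assume the files sit in `data/` as in the loading example.

```python
import csv
from pathlib import Path


def describe_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])      # first row is treated as the header
        rows = sum(1 for _ in reader)  # remaining rows are data rows
    return header, rows


if __name__ == "__main__":
    for split in ("train", "validation", "test"):
        path = Path("data") / f"csts_{split}.csv"
        if path.is_file():
            columns, count = describe_csv(path)
            print(f"{split}: {count} rows, columns = {columns}")
        else:
            print(f"{split}: {path} not found")
```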
Important: By using this dataset, you agree to not publicly share its unencrypted contents or decryption password.
We provide the basic training scripts and utilities for finetuning and evaluating the models in the paper. The code is adapted from the HuggingFace Transformers library. Refer to the documentation for more details.
The current code supports finetuning any encoder-only model, using the `cross_encoder`, `bi_encoder`, or `tri_encoder` settings described in the paper.

You can finetune the models described in the paper using the `run_sts.sh` script. For example, to finetune the `princeton-nlp/sup-simcse-roberta-base` model on the C-STS dataset, run the following command:

    MODEL=princeton-nlp/sup-simcse-roberta-base \
    ENCODER_TYPE=bi_encoder \
    LR=1e-5 \
    WD=0.1 \
    TRANSFORM=False \
    OBJECTIVE=mse \
    OUTPUT_DIR=output \
    TRAIN_FILE=data/csts_train.csv \
    EVAL_FILE=data/csts_validation.csv \
    TEST_FILE=data/csts_test.csv \
    bash run_sts.sh
See `run_sts.sh` for a full description of the available options and default values.
The script `run_sts_fewshot.sh` can be used to evaluate large language models in a few-shot setting with or without instructions. For example, to evaluate the `google/flan-t5-xxl` model on the C-STS dataset, run the following command:

    python run_sts_fewshot.py \
      --model_name_or_path google/flan-t5-xxl \
      --k_shot 2 \
      --prompt_name long \
      --train_file data/csts_train.csv \
      --validation_file data/csts_validation.csv \
      --test_file data/csts_test.csv \
      --output_dir output/flan-t5-xxl/k2_long \
      --dtype tf32 \
      --batch_size 4
To accommodate large model types, `run_sts_fewshot.sh` will use all visible GPUs to load the model in model parallel. For smaller models, set `CUDA_VISIBLE_DEVICES` to the desired GPU ids.
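One way to restrict the visible GPUs is to set `CUDA_VISIBLE_DEVICES` only for the evaluation process. The sketch below builds the command from the flags shown in the example above and prepares a restricted environment. The helper names `fewshot_command` and `env_with_gpus` are hypothetical, and `google/flan-t5-small` with its output path is used purely as an illustrative smaller model. The launch itself is left commented out.

```python
import os
import subprocess


def fewshot_command(model_name, output_dir):
    """Build the few-shot evaluation command using flags from the example above."""
    return [
        "python", "run_sts_fewshot.py",
        "--model_name_or_path", model_name,
        "--k_shot", "2",
        "--prompt_name", "long",
        "--train_file", "data/csts_train.csv",
        "--validation_file", "data/csts_validation.csv",
        "--test_file", "data/csts_test.csv",
        "--output_dir", output_dir,
    ]


def env_with_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set to gpu_ids."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    # Illustrative model name and output path; substitute your own.
    cmd = fewshot_command("google/flan-t5-small", "output/flan-t5-small/k2_long")
    env = env_with_gpus([0])
    print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
    print(" ".join(cmd))
    # Uncomment to launch the evaluation from the repository root:
    # subprocess.run(cmd, env=env, check=True)
```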
Run `python run_sts_fewshot.py --help` for a full description of additional options and default values.
You can obtain scores for your model on the test set by submitting your predictions using the `make_test_submission.py` script as follows:

    python make_test_submission.py your_email@email.com /path/to/your/predictions.json
This script expects the test predictions file to be in the format generated automatically by the scripts above; i.e.

    {"0": 1.0, "1": 0.0, ..., "4731": 0.5}
After submission your results will be emailed to the submitted email address with the relevant filename in the subject.
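If your predictions come from your own code rather than the provided scripts, the file format described above can be produced with a few lines of Python. This is a minimal sketch under the assumption that the keys are zero-based test-set row indices written as strings, which is what the example format suggests; the helper names `to_submission` and `save_submission` are hypothetical.

```python
import json


def to_submission(scores):
    """Map predicted scores, in test-set order, to {"<index>": score}."""
    return {str(i): float(s) for i, s in enumerate(scores)}


def save_submission(scores, path):
    """Write the predictions to path as a single JSON object."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(to_submission(scores), f)


if __name__ == "__main__":
    # Three made-up scores, only to show the shape of the output.
    print(json.dumps(to_submission([1.0, 0.0, 0.5])))
```

The resulting file can then be passed as the second argument to `make_test_submission.py`.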

    @misc{deshpande2023csts,
        title={C-STS: Conditional Semantic Textual Similarity},
        author={Ameet Deshpande and Carlos E. Jimenez and Howard Chen and Vishvak Murahari and Victoria Graf and Tanmay Rajpurohit and Ashwin Kalyan and Danqi Chen and Karthik Narasimhan},
        year={2023},
        eprint={2305.15093},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }