
[EMNLP 2023] C-STS: Conditional Semantic Textual Similarity



C-STS

This repository contains the dataset and code for the paper C-STS: Conditional Semantic Textual Similarity. [ArXiv]

Data

To avoid the intentional/unintentional scraping of the C-STS dataset for pre-training LLMs, which could cause training data contamination and impact their evaluation, we adopt the following approach for our dataset release.

The dataset for C-STS is stored in an encrypted file named `csts.tar.enc`. To access the dataset, follow these steps:

1. Request Access: Submit a request to obtain the decryption password by clicking here. You will receive an email response with the password immediately.
2. Decrypt the Dataset: Once you have received the password via email, you can decrypt the `csts.tar.enc` file using the provided `extract.sh` script. Follow the instructions below:
   - Open a terminal and navigate to the `data` directory.
   - Run the following command, replacing `<password>` with the decryption password obtained via email:
```
bash extract.sh csts.tar.enc <password>
```
Provided the correct password, this step will generate three files, `csts_train.csv`, `csts_validation.csv`, and `csts_test.csv`, which are the unencrypted dataset splits.
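Before loading the data, it can help to confirm that decryption produced all three splits. The following is a minimal sketch, not part of the repository: the helper name `missing_splits` is hypothetical, and it assumes the CSV files sit in the `data` directory relative to where the check is run.

```python
from pathlib import Path

# File names as listed in this README; the directory location is an assumption.
EXPECTED_SPLITS = ("csts_train.csv", "csts_validation.csv", "csts_test.csv")


def missing_splits(data_dir="data"):
    """Return the expected split files that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_SPLITS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_splits("data")
    if missing:
        print("Missing split files:", ", ".join(missing))
    else:
        print("All three C-STS splits are present.")
```

An empty list means every split is in place; a non-empty list usually points to a wrong password or a different working directory.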

You can load the data using `datasets` with the following lines:

    from datasets import load_dataset
    dataset = load_dataset(
        'csv',
        data_files={
            'train': 'data/csts_train.csv',
            'validation': 'data/csts_validation.csv',
            'test': 'data/csts_test.csv',
        },
    )

Important: By using this dataset, you agree to not publicly share its unencrypted contents or decryption password.

Code

We provide the basic training scripts and utilities for finetuning and evaluating the models in the paper. The code is adapted from the HuggingFace Transformers library. Refer to the documentation for more details.

Fine-tuning

The current code supports finetuning any encoder-only model, using the `cross_encoder`, `bi_encoder`, or `tri_encoder` settings described in the paper. You can finetune the models described in the paper using the `run_sts.sh` script. For example, to finetune the `princeton-nlp/sup-simcse-roberta-base` model on the C-STS dataset, run the following command:

    MODEL=princeton-nlp/sup-simcse-roberta-base \
    ENCODER_TYPE=bi_encoder \
    LR=1e-5 \
    WD=0.1 \
    TRANSFORM=False \
    OBJECTIVE=mse \
    OUTPUT_DIR=output \
    TRAIN_FILE=data/csts_train.csv \
    EVAL_FILE=data/csts_validation.csv \
    TEST_FILE=data/csts_test.csv \
    bash run_sts.sh

See `run_sts.sh` for a full description of the available options and default values.
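Because `run_sts.sh` is configured through environment variables, several runs can be prepared from one base configuration. The sketch below is illustrative only: `build_env`, `sweep_settings`, and `launch` are hypothetical helpers, the per-setting `OUTPUT_DIR` layout is an assumption, and reusing the same hyperparameters for every encoder setting is done purely for brevity, not as a recommendation from the paper.

```python
import os
import subprocess

# Values copied from the example command in this README.
BASE_SETTINGS = {
    "MODEL": "princeton-nlp/sup-simcse-roberta-base",
    "ENCODER_TYPE": "bi_encoder",
    "LR": "1e-5",
    "WD": "0.1",
    "TRANSFORM": "False",
    "OBJECTIVE": "mse",
    "OUTPUT_DIR": "output",
    "TRAIN_FILE": "data/csts_train.csv",
    "EVAL_FILE": "data/csts_validation.csv",
    "TEST_FILE": "data/csts_test.csv",
}

ENCODER_TYPES = ("cross_encoder", "bi_encoder", "tri_encoder")


def build_env(settings, base=None):
    """Copy the environment and overlay the run_sts.sh settings as strings."""
    env = dict(os.environ if base is None else base)
    env.update({key: str(value) for key, value in settings.items()})
    return env


def sweep_settings(base_settings):
    """Return one settings dict per encoder type, each with its own output directory."""
    jobs = []
    for encoder_type in ENCODER_TYPES:
        job = dict(base_settings)
        job["ENCODER_TYPE"] = encoder_type
        # Assumed layout: one subdirectory per encoder setting.
        job["OUTPUT_DIR"] = f"{base_settings['OUTPUT_DIR']}/{encoder_type}"
        jobs.append(job)
    return jobs


def launch(settings):
    """Run run_sts.sh with the given settings; raises if the script exits non-zero."""
    subprocess.run(["bash", "run_sts.sh"], env=build_env(settings), check=True)


# To start the runs from the repository root:
# for job in sweep_settings(BASE_SETTINGS):
#     launch(job)
```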

Few-shot Evaluation

The script `run_sts_fewshot.py` can be used to evaluate large language models in a few-shot setting, with or without instructions. For example, to evaluate the `google/flan-t5-xxl` model on the C-STS dataset, run the following command:

    python run_sts_fewshot.py \
      --model_name_or_path google/flan-t5-xxl \
      --k_shot 2 \
      --prompt_name long \
      --train_file data/csts_train.csv \
      --validation_file data/csts_validation.csv \
      --test_file data/csts_test.csv \
      --output_dir output/flan-t5-xxl/k2_long \
      --dtype tf32 \
      --batch_size 4

To accommodate large models, `run_sts_fewshot.py` will use all visible GPUs to load the model in model parallel. For smaller models, set `CUDA_VISIBLE_DEVICES` to the desired GPU ids.

Run `python run_sts_fewshot.py --help` for a full description of additional options and default values.
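One way to restrict a few-shot run to specific GPUs is to set `CUDA_VISIBLE_DEVICES` only for the child process. The following is a rough sketch under stated assumptions: `fewshot_command` and `restrict_gpus` are hypothetical helpers, the flag names are the ones shown in the example command above, and GPU id `0` is an arbitrary choice.

```python
import os
import subprocess


def fewshot_command(options, script="run_sts_fewshot.py"):
    """Turn an options dict into an argv list of the form --name value."""
    argv = ["python", script]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


def restrict_gpus(gpu_ids, base=None):
    """Copy the environment and expose only the given GPU ids to the child process."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


# Flags copied from the README example; swap in a smaller model as needed.
OPTIONS = {
    "model_name_or_path": "google/flan-t5-xxl",
    "k_shot": 2,
    "prompt_name": "long",
    "train_file": "data/csts_train.csv",
    "validation_file": "data/csts_validation.csv",
    "test_file": "data/csts_test.csv",
    "output_dir": "output/flan-t5-xxl/k2_long",
    "dtype": "tf32",
    "batch_size": 4,
}

# To run on GPU 0 only, from the repository root:
# subprocess.run(fewshot_command(OPTIONS), env=restrict_gpus([0]), check=True)
```

Setting the variable in the child environment leaves the parent shell untouched, so other jobs started from the same session still see every GPU.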

Submitting Test Results

You can obtain scores for your model on the test set by submitting your predictions using the `make_test_submission.py` script as follows:

    python make_test_submission.py your_email@email.com /path/to/your/predictions.json

This script expects the test predictions file to be in the format generated automatically by the scripts above; i.e.

    {"0": 1.0, "1": 0.0, ..., "4731": 0.5}

After submission your results will be emailed to the submitted email address with the relevant filename in the subject.
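If predictions come from code other than the provided scripts, the format described above (string indices mapped to numeric scores) can be produced and sanity-checked along these lines. This is a minimal sketch, not the repository's own code: `to_submission`, `save_submission`, and `check_submission` are hypothetical helpers, and the optional `expected_size` check is an assumption rather than something the submission script is known to enforce.

```python
import json


def to_submission(scores):
    """Map an ordered sequence of scores to {"0": s0, "1": s1, ...}."""
    return {str(index): float(score) for index, score in enumerate(scores)}


def save_submission(scores, path):
    """Write the scores to a JSON file in the index-to-score format."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(to_submission(scores), handle)


def check_submission(path, expected_size=None):
    """Return True if keys are "0".."n-1" in order and every value is a number."""
    with open(path, encoding="utf-8") as handle:
        predictions = json.load(handle)
    keys_ok = list(predictions) == [str(i) for i in range(len(predictions))]
    values_ok = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in predictions.values()
    )
    size_ok = expected_size is None or len(predictions) == expected_size
    return keys_ok and values_ok and size_ok
```

The scores must be in the same order as the rows of the test split, since the keys are positional indices.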

Citation

    @misc{deshpande2023csts,
          title={CSTS: Conditional Semantic Textual Similarity},
          author={Ameet Deshpande and Carlos E. Jimenez and Howard Chen and Vishvak Murahari and Victoria Graf and Tanmay Rajpurohit and Ashwin Kalyan and Danqi Chen and Karthik Narasimhan},
          year={2023},
          eprint={2305.15093},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
