contrastors

Train Models Contrastively in Pytorch

contrastors is a contrastive learning toolkit that enables researchers and engineers to train and evaluate contrastive models efficiently.

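As a rough illustration of the kind of objective such a toolkit optimizes, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss in PyTorch. The function name and temperature are illustrative and not part of the contrastors API.

# Minimal sketch of a symmetric InfoNCE-style contrastive loss
# (illustrative only, not the contrastors implementation).
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # Normalize embeddings so dot products are cosine similarities.
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Similarity matrix: entry (i, j) compares query i with document j.
    logits = query_emb @ doc_emb.T / temperature
    # The matching document for each query sits at the same batch index.
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over rows (query -> doc) and columns (doc -> query).
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Example with random embeddings: a batch of 8 pairs of 768-dimensional vectors.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))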

Features

  • Built on top of Flash Attention for fast and efficient training
  • Support for training on multiple GPUs
  • GradCache support for training with large batch sizes in constrained memory environments
  • Huggingface support for easy loading of common models (Pythia/GPTNeoX, BERT, etc.)
  • Masked Language Modeling (MLM) Pretraining
  • Matryoshka Representation Learning for flexible embedding sizes (see the sketch after this list)
  • CLIP and LiT style contrastive learning
  • Support for loading popular ViT (e.g. timm) models
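
As referenced in the feature list above, Matryoshka Representation Learning trains embeddings whose leading dimensions remain useful on their own, so a vector can be truncated after training. A minimal sketch of that truncation step (the dimensions are illustrative, and this is not the contrastors API):

# Sketch of Matryoshka-style truncation: keep the first k dimensions of an
# embedding and re-normalize, trading some accuracy for a smaller vector.
import torch
import torch.nn.functional as F

full_emb = torch.randn(4, 768)   # batch of 768-dim embeddings (illustrative)
k = 256                          # target Matryoshka dimension
small_emb = F.normalize(full_emb[:, :k], dim=-1)
print(small_emb.shape)           # torch.Size([4, 256])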

Research

Getting Started and Requirements

The contrastors library relies on custom kernels from the Flash Attention repository. To set up your environment, you will need to follow the steps below.

Make sure that you have CUDA 11.8+. You can check this by running nvcc --version, or, if you already have torch installed, you can run python -c "import torch; print(torch.version.cuda)"

Create a python venv and activate it

python3 -m venv env
source env/bin/activate

Install torch. See the torch docs for specific instructions for your system (e.g. the default CUDA version torch supports is 12.1 as of 12/12/2023).

pip3 install torch torchvision torchaudio

Install wheel, packaging, ninja, and setuptools for Flash Attention (so the builds don't take too long)

pip install wheel packaging ninja setuptools

Install Flash Attention and the custom kernels

pip install --no-cache-dir flash-attn --no-build-isolation \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/layer_norm \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/fused_dense_lib \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/xentropy

Install the rest of the requirements and the package

pip install -e .

Data Access

We provide access to the nomic-embed-text-v1 dataset via the nomic package. To access the data, create an account at atlas.nomic.ai, install the nomic Python client, log in, and run the following commands:

pip install nomic
nomic login  # follow prompts to log in
python -c "from nomic import atlas; print(atlas._get_datastream_credentials(name='contrastors'))"

This will print out your access keys. You can then configure them by using aws configure or by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

If you do not have the AWS CLI installed, you can install it by following the AWS CLI installation instructions.

To verify your access, you can run the following command to list the contents of the bucket:

aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive
aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive-index-filtered

You should be able to see the contents of the bucket and download the data.
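
If you prefer to check access from Python instead of the AWS CLI, here is a small sketch using boto3, assuming boto3 is installed and the same credentials and R2 endpoint as above:

# Sketch: list a few objects in the R2 bucket with boto3 instead of the AWS CLI.
# Assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set as described above.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/",
)
resp = s3.list_objects_v2(Bucket="contrastive-index-filtered", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])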

If you intend to train using our data and the contrastors repo, you will need to set up fsspec support for Cloudflare R2. To do so, create a file ~/.config/fsspec/s3.json with the following contents:

{"s3": {"client_kwargs": {"endpoint_url":"https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/","aws_access_key_id":<ACCESS_KEY_ID>,"aws_secret_access_key":<SECRET_KEY_ID>    }  }}

Nomic Data Format

Our text data is stored in gzipped JSONL files, alongside which we also store a counts.json file and an offsets.json.gz file.

The counts.json file is a dictionary mapping each file name to the number of examples in that file. The offsets.json.gz file is a dictionary mapping each file name to a dictionary whose keys are example indices and whose values are the (start, end) byte offsets of that example in the file. We do this to allow streaming of data from R2, especially when the data is larger than the buffer size.
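
To make the layout concrete, here is a sketch of how counts and offsets metadata could be built for a single local shard. This is illustrative only; the exact index produced upstream (for example, whether offsets refer to compressed or uncompressed bytes) may differ.

# Sketch: build counts / offsets metadata for one local, already-decompressed
# JSONL shard. Each offset is (start_byte, end_byte) of a line in the file.
# Illustrative only; the upstream index format may differ in detail.
import json

shard = "shard-00000.jsonl"  # hypothetical local file
offsets = {}
with open(shard, "rb") as f:
    start = 0
    for i, line in enumerate(f):
        end = start + len(line)
        offsets[i] = (start, end)
        start = end

counts = {shard: len(offsets)}
print(counts, offsets.get(0))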

Here's a small example of what a dataset configuration might look like:

datasets:
  - name: "paq"
    bucket: "s3://contrastive-index-filtered/paq_full/shard-{00000..00538}.jsonl.gz"
    query_prefix: "search_query"
    document_prefix: "search_document"
    objective:
      type: "paired"
      columns: ["query", "document"]

objective defines whether the dataset uses a paired or triplet objective. In both cases, the columns field defines the columns to use for each example.

Training nomic-embed-text-v1

Masked Language Modeling Pretraining

To train your own BERT from scratch (with all the optimizations) run

cd src/contrastors
deepspeed --num_gpus=8 train.py --config=configs/train/mlm.yaml --deepspeed_config=configs/deepspeed/ds_config.json --dtype=bf16

Contrastive Pretraining and Finetuning

To launch an experiment run

cd src/contrastors
torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16

This will train a BERT model on all ~200M examples. To change the dataset, you can modify data_args.input_shards.

To finetune nomic-bert-embed-v1-unsupervised, update the config to configs/train/contrastive_finetune.yaml.

Generating Your Own Data

To generate your own data for any step of the pipeline, you can use the provided scripts in scripts/text.

See the README in scripts/text for more information.

Training nomic-embed-vision-v1.5

To align a vision model, you will need to curate a large image-text dataset. More details can be found here.

To align nomic-embed-vision-v1.5 with nomic-embed-text-v1.5, you can run the following command:

deepspeed train.py --deepspeed_config=configs/deepspeed/image_text.json --config=configs/train/nomic_embed_vision_v1.5.yaml --dtype=bf16

Pretrained Models

We provide pretrained models for Nomic Embed on the Hugging Face Hub under the nomic-ai organization.
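
As a quick way to try a published checkpoint, here is a minimal sketch using sentence-transformers, assuming the nomic-ai/nomic-embed-text-v1 checkpoint and a sentence-transformers release that supports trust_remote_code. Note the search_query / search_document prefixes, which match the dataset configuration above.

# Sketch: embed a query and a document with a published Nomic Embed checkpoint.
# Assumes sentence-transformers is installed and supports trust_remote_code.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
embeddings = model.encode([
    "search_query: what is contrastive learning?",
    "search_document: Contrastive learning trains encoders to pull matching pairs together.",
])
print(embeddings.shape)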

Join the Nomic Community

License

This code is licensed under the Apache 2.0 License. See the model cards for the individual license of each model.

Acknowledgements

We thank Tri Dao for his work on Flash Attention and the custom kernels that make this project possible, the OpenCLIP team for their great repository, on which much of this work is based, and the Huggingface team for their great work on the transformers library.

Citation

If you find the model, dataset, or training code useful, please cite our work:

@misc{nussbaum2024nomic,
  title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2402.01613},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{nussbaum2024nomicembedvisionexpanding,
  title={Nomic Embed Vision: Expanding the Latent Space},
  author={Zach Nussbaum and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2406.18587},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.18587}
}

@misc{nussbaum2025trainingsparsemixtureexperts,
  title={Training Sparse Mixture Of Experts Text Embedding Models},
  author={Zach Nussbaum and Brandon Duderstadt},
  year={2025},
  eprint={2502.07972},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.07972}
}
