PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI (FAIR)
Disclaimer: The goal of this repo is to make data2vec easier to understand, so it is not recommended for actual model pretraining. For that, use the official version in fairseq or the checkpoints provided on HuggingFace.
Data2Vec is the first high-performance self-supervised algorithm that learns the same way in multiple modalities, including speech, vision and text. Most machines learn exclusively from labeled data. However, through self-supervised learning, machines are able to learn about the world just by observing it and then figuring out the structure of images, speech or text. This is a more scalable and efficient approach for machines to tackle new complex tasks, such as understanding text for more spoken languages.
In summary, the method is as follows:
- The encoder extracts features from the masked inputs. These features are outputs of every transformer/linear layer.
- The teacher, which is an EMA instance of the encoder (in eval mode), extracts features from the unmasked inputs.
- Optional normalizations are applied to the layers/outputs of the teacher.
- Encoder outputs are regressed by a projection block/layer.
- The loss is calculated from encoder outputs and teacher outputs.
You can read the paper for more detail.
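To make the flow above concrete, here is a minimal, simplified sketch of a single data2vec update in plain PyTorch. The function and argument names, the smooth L1 loss, and the instance normalization over the top-k teacher layers follow the general recipe described above; they are illustrative and not this repo's exact API:

```python
import torch
import torch.nn.functional as F

def data2vec_step(student, teacher, regression_head, masked_src, src, mask,
                  top_k=8, ema_decay=0.999):
    """One simplified data2vec update. `student` and `teacher` are assumed to be
    encoders that return a list with every transformer layer's output."""
    # 1. Student sees the masked input
    student_layers = student(masked_src)          # list of (B, T, D) tensors

    # 2. Teacher (EMA copy of the student, in eval mode) sees the unmasked input
    with torch.no_grad():
        teacher.eval()
        teacher_layers = teacher(src)

        # 3. Build targets: instance-normalize and average the top-k teacher layers
        targets = [F.instance_norm(h.transpose(1, 2)).transpose(1, 2)
                   for h in teacher_layers[-top_k:]]
        targets = sum(targets) / len(targets)

    # 4. Project the student output and regress the targets over masked positions only
    preds = regression_head(student_layers[-1])
    loss = F.smooth_l1_loss(preds[mask], targets[mask])

    # 5. EMA update of the teacher weights (in practice done after the optimizer step)
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1 - ema_decay)

    return loss
```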
Data2Vec is already implemented in fairseq, where each modality (text, vision, audio) has a separate implementation. According to the paper:
Our primary goal is to design a single learning mechanism for different modalities. Despite the unified learning regime, we still use modality-specific feature extractors and masking strategies. This makes sense given the vastly different nature of the input data.
This implementation differs in that it provides a single Data2Vec model powered by a pluggable encoder (implemented using PyTorch + HuggingFace Transformers) and tries to unify the whole concept in a single module. The key point is that modality-specific feature extraction and masking strategies are still required:
Masking: For each modality, the Dataset instance must return the masked source, the target and the mask tensor.
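As an illustration, a hypothetical text Dataset honoring this contract might look like the following. The class and field names are made up for the example and are not the ones used in this repo's dataset classes:

```python
import torch
from torch.utils.data import Dataset

class MaskedTextDataset(Dataset):
    """Illustrative only: shows the (masked source, target, mask) contract."""
    def __init__(self, token_ids, mask_token_id, mask_prob=0.15):
        self.token_ids = token_ids          # list of pre-tokenized examples
        self.mask_token_id = mask_token_id
        self.mask_prob = mask_prob

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        target = torch.tensor(self.token_ids[idx])
        mask = torch.rand(target.shape) < self.mask_prob   # True where masked
        src = target.clone()
        src[mask] = self.mask_token_id                      # masked source for the student
        return src, target, mask                            # teacher sees `target` unmasked
```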
Feature Extraction: Features are the outputs of the transformer/attention layers, so the forward method must return the outputs of all encoder blocks of the transformer model. HuggingFace Transformers/fairseq models return the per-layer outputs out of the box.
This implementation uses HuggingFace Transformers models as encoders for Data2Vec, which you can inspect in the encoder.py files for each modality. You can also provide your own encoder model; just make sure it is Transformer-based, as required by the paper, and that it returns the outputs from every encoder layer.
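With HuggingFace models this typically just means requesting the hidden states, e.g. (a minimal sketch, assuming a RoBERTa encoder):

```python
from transformers import AutoModel, AutoTokenizer

# Any HuggingFace encoder can expose its per-layer outputs via `output_hidden_states`
encoder = AutoModel.from_pretrained('roberta-base', output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

inputs = tokenizer("an example sentence", return_tensors='pt')
outputs = encoder(**inputs)
# Tuple of (num_layers + 1) tensors: the embedding output plus every encoder block's output
all_layer_outputs = outputs.hidden_states
```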
Note: This implementation's goal is to provide the necessary building blocks of Data2Vec so anyone can adapt it to their own use case with ease. To keep it as clean and simple as possible, some functionalities like mixed precision, distributed training, etc. are not included. If you only need to train a standard large-scale Data2Vec model, use the official repo.
First things first, install the requirements:
```bash
pip install -r requirements.txt
```
Train a Language Model based on RoBERTa (HuggingFace) on WikiText103
Configure the related properties in text/configs/roberta-pretraining.yaml and run:
```bash
python train.py --config text/configs/roberta-pretraining.yaml
```
Run a masked image modeling training based on BEiT (HuggingFace)
Pass the path to the image dataset in the config file at vision/configs/beit-pretraining.yaml under dataset > path > train/test, modify other properties as you desire, and run the following:
```bash
python train.py --config vision/configs/beit-pretraining.yaml
```
Audio pretraining based on Wav2Vec2 (HuggingFace) on the timit dataset. If you want to use other datasets like librispeech, provide it in audio/dataset.py; some minor changes to the timit class would do the job because both are loaded from HuggingFace datasets (see the sketch after the command below).
Configure other properties as you desire and run the following:
```bash
python train.py --config audio/configs/wav2vec2-pretraining.yaml
```
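As a point of reference for that swap, librispeech is also available through HuggingFace datasets. A minimal sketch follows; the dataset/config names are the standard Hub identifiers, and the exact split and column handling will depend on your config:

```python
from datasets import load_dataset

# librispeech from the HuggingFace Hub ('clean' config, 100h training split)
librispeech = load_dataset('librispeech_asr', 'clean', split='train.100')

# each example exposes the raw waveform and sampling rate under the `audio` column
sample = librispeech[0]
waveform = sample['audio']['array']
sampling_rate = sample['audio']['sampling_rate']
```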
Note: The below models' weights were carefully ported from the original checkpoints in the fairseq version.
Data2Vec model trained with RoBERTa as the encoder (data2vec-roberta-base)
```python
from transformers import AutoModel, AutoConfig
from transformers import RobertaModel

checkpoint = 'arxyzan/data2vec-roberta-base'

# Option 1: load using AutoModel
data2vec_roberta = AutoModel.from_pretrained(checkpoint)

# Option 2: load directly by RobertaModel
data2vec_roberta = RobertaModel.from_pretrained(checkpoint)
```
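Once loaded, the model can be used like any other RoBERTa encoder to extract features. A minimal usage sketch, assuming the standard roberta-base tokenizer is compatible with this checkpoint:

```python
from transformers import RobertaTokenizer

# Assumption: the vanilla `roberta-base` tokenizer matches this checkpoint's vocabulary
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

inputs = tokenizer("data2vec learns the same way in speech, vision and text.", return_tensors='pt')
outputs = data2vec_roberta(**inputs)
features = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
```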
Data2Vec model trained with BEiT as the encoder (data2vec-beit-base)
```python
from transformers import AutoModel, AutoConfig
from transformers import BeitModel

checkpoint = 'arxyzan/data2vec-beit-base'

# Option 1: load using AutoModel
data2vec_beit = AutoModel.from_pretrained(checkpoint)

# Option 2: load directly by BeitModel
data2vec_beit = BeitModel.from_pretrained(checkpoint)
```
Data2Vec model trained with Wav2Vec2 as the encoder (data2vec-wav2vec2-base)
```python
from transformers import AutoModel, AutoConfig
from transformers import Wav2Vec2Model

checkpoint = 'arxyzan/data2vec-wav2vec2-base'

# Option 1: load using AutoModel
data2vec_wav2vec2 = AutoModel.from_pretrained(checkpoint)

# Option 2: load directly by Wav2Vec2Model
data2vec_wav2vec2 = Wav2Vec2Model.from_pretrained(checkpoint)
```
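For audio, raw waveforms are first normalized by a feature extractor. A usage sketch, assuming the standard facebook/wav2vec2-base feature extractor is compatible with this checkpoint:

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# Assumption: the vanilla wav2vec2-base feature extractor matches this checkpoint's preprocessing
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base')

waveform = np.random.randn(16000).astype(np.float32)  # 1 second of dummy 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors='pt')
outputs = data2vec_wav2vec2(**inputs)
features = outputs.last_hidden_state  # (batch, frames, hidden_size)
```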
- Fine-tune using the checkpoints mentioned above:
```python
# Text classification using Roberta model from HuggingFace
from transformers import RobertaModel, RobertaForSequenceClassification

checkpoint = 'arxyzan/data2vec-roberta-base'

# this is exactly a roberta model but trained with data2vec
data2vec_roberta = RobertaModel.from_pretrained(checkpoint)
text_classifier = RobertaForSequenceClassification(data2vec_roberta.config)

# assign `data2vec-roberta` weights to the roberta block of the classifier
text_classifier.roberta = data2vec_roberta
...
```
- In case you trained a model using this codebase, you can fine-tune it by taking the encoder's state dict out of the checkpoint. This gives you a regular HuggingFace model that you can fine-tune for any downstream task as you normally would.
```python
# load a checkpoint for finetuning
import torch
from transformers import RobertaModel, RobertaConfig

roberta = RobertaModel(RobertaConfig())
checkpoint = torch.load('path/to/data2vec.pt')

# load roberta weights from the encoder part of the data2vec model
roberta_state_dict = checkpoint['encoder']
roberta.load_state_dict(roberta_state_dict)

# Now fine-tune a regular HuggingFace RoBERTa model
...
```
Any contributions regarding training, development (for Data2Vec2) and issues are welcome!