Wav2vec resources and models for Brazilian Portuguese

lucasgris/wav2vec4bp


This repository aims at the development of audio technologies using Wav2vec 2.0, such as Automatic Speech Recognition (ASR), for the Brazilian Portuguese language.

Description

This repository contains code and fine-tuned Wav2vec checkpoints for Brazilian Portuguese, including some useful scripts to download and preprocess transcribed data.

Wav2vec 2.0 learns speech representations from unlabeled data, as described in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020). For more information about Wav2vec, please see the official repository.

Tasks

  • Add CORAA to the BP Dataset (BP Dataset Version 2);
  • Release BP Dataset V2 fine-tuned models;
  • Fine-tune using the XLS-R 300M, XLS-R 1B and XLS-R 2B models.

Checkpoints

ASR checkpoints

We provide several Wav2vec models fine-tuned for ASR. For a more detailed description of how we fine-tuned these models, please check the paper Brazilian Portuguese Speech Recognition Using Wav2vec 2.0.

Our most recent model is bp_400. It was fine-tuned on the 400h filtered version of the BP Dataset (see Brazilian Portuguese (BP) Dataset Version 1 below). Results on each gathered dataset are shown below.

Checkpoints of BP Dataset V1

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_400 | XLSR-53 | fairseq | dict | hugging face |
| bp_400_xls-r-300M | XLS-R-300M | fairseq | dict | hugging face |

Checkpoints of non-filtered BP Dataset (early version of the BP dataset)

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_500 | XLSR-53 | fairseq | dict | hugging face |
| bp_500_10k | VoxPopuli 10k BASE | fairseq | dict | hugging face |
| bp_500_100k | VoxPopuli 100k BASE | fairseq | dict | hugging face |

Checkpoints of each gathered dataset

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_cetuc_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_commonvoice_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_commonvoice_10 | XLSR-53 | fairseq | dict | hugging face |
| bp_lapsbm_1 | XLSR-53 | fairseq | dict | hugging face |
| bp_mls_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_sid_10 | XLSR-53 | fairseq | dict | hugging face |
| bp_tedx_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_voxforge_1 | XLSR-53 | fairseq | dict | hugging face |

Other checkpoints

We provide other Wav2vec checkpoints as well. These models were trained using all the data available at the time, including their dev and test subsets; only the Common Voice dev and test sets were held out to validate and test the models, respectively.

| Datasets used for training | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|
| CETUC + CV 6.1 (only train) + LaPS BM + MLS + VoxForge | fairseq | dict | hugging face |
| CETUC + CV 6.1 (all validated) + LaPS BM + MLS + VoxForge | | | hugging face |

ASR Results

Summary (WER)
| Model | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
|---|---|---|---|---|---|---|---|---|
| bp_400 | 0.052 | 0.140 | 0.074 | 0.117 | 0.121 | 0.245 | 0.118 | 0.124 |
| bp_400_xls-r-300M | 0.048 | 0.123 | 0.068 | 0.111 | 0.084 | 0.207 | 0.095 | 0.105 |
| bp_500 | 0.052 | 0.137 | 0.032 | 0.118 | 0.095 | 0.236 | 0.082* | 0.112 |
| bp_500-base10k_voxpopuli | 0.120 | 0.249 | 0.039 | 0.227 | 0.169 | 0.349 | 0.116* | 0.181 |
| bp_500-base100k_voxpopuli | 0.074 | 0.174 | 0.032 | 0.182 | 0.181 | 0.349 | 0.111* | 0.157 |
| bp_cetuc_100** | 0.446 | 0.856 | 0.089 | 0.967 | 1.172 | 0.929 | 0.902 | 0.765 |
| bp_commonvoice_100 | 0.088 | 0.126 | 0.121 | 0.173 | 0.177 | 0.424 | 0.145 | 0.179 |
| bp_commonvoice_10 | 0.133 | 0.189 | 0.165 | 0.189 | 0.247 | 0.474 | 0.251 | 0.235 |
| bp_lapsbm_1 | 0.111 | 0.418 | 0.145 | 0.299 | 0.562 | 0.580 | 0.469 | 0.369 |
| bp_mls_100 | 0.192 | 0.260 | 0.162 | 0.163 | 0.268 | 0.492 | 0.268 | 0.257 |
| bp_sid_10 | 0.186 | 0.327 | 0.207 | 0.505 | 0.124 | 0.835 | 0.472 | 0.379 |
| bp_tedx_100 | 0.138 | 0.369 | 0.169 | 0.165 | 0.794 | 0.222 | 0.395 | 0.321 |
| bp_voxforge_1 | 0.468 | 0.608 | 0.503 | 0.505 | 0.717 | 0.731 | 0.561 | 0.584 |

* We found a problem with the dataset used in these experiments regarding the VoxForge subset. In this test set, some speakers were also present in the training set (which explains the lower WER). The final version of the dataset does not have such contamination.

** We did not perform validation in the single-dataset experiments. CETUC has low transcription variety, so this model might be overfitted.

Transcription examples
| Text | Transcription |
|---|---|
| alguém sabe a que horas começa o jantar | alguém sabe a que horas começo jantar |
| lila covas ainda não sabe o que vai fazer no fundo | lila covas ainda não sabe o que vai fazer no fundo |
| que tal um pouco desse bom spaghetti | que tá um pouco deste bom ispaguete |
| hong kong em cantonês significa porto perfumado | rong kong en cantones significa porto perfumado |
| vamos hackear esse problema | vamos rackar esse problema |
| apenas a poucos metros há uma estação de ônibus | apenas ha poucos metros á uma estação de ônibus |
| relâmpago e trovão sempre andam juntos | relampago trevão sempre andam juntos |

Datasets

Datasets provided:

  • CETUC: contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
  • Common Voice 7.0: a project led by the Mozilla Foundation with the goal of creating open datasets in different languages. Volunteers donate and validate speech through the official site.
  • Lapsbm: "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, totaling 700 utterances. The audio was recorded at 22.05 kHz without environmental control.
  • Multilingual Librispeech (MLS): a massive dataset available in many languages, based on public-domain audiobook recordings such as LibriVox. It contains a total of 6k hours of transcribed data across languages. The Portuguese set used in this work (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
  • Multilingual TEDx: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian variant) contains 164 hours of transcribed speech.
  • Sidney (SID): contains 5,777 utterances recorded by 72 speakers (20 women) aged 17 to 59, with metadata fields such as place of birth, age, gender, education, and occupation.
  • VoxForge: a project with the goal of building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.

These datasets were combined to build a larger Brazilian Portuguese dataset (BP Dataset). All data was used for training except Common Voice dev/test sets, which were used for validation/test respectively. We also made test sets for all the gathered datasets.

| Dataset | Train | Valid | Test |
|---|---|---|---|
| CETUC | 93.9h | -- | 5.4h |
| Common Voice | 37.6h | 8.9h | 9.5h |
| LaPS BM | 0.8h | -- | 0.1h |
| MLS | 161.0h | -- | 3.7h |
| Multilingual TEDx (Portuguese) | 144.2h | -- | 1.8h |
| SID | 5.0h | -- | 1.0h |
| VoxForge | 2.8h | -- | 0.1h |
| Total | 437.2h | 8.9h | 21.6h |

You can download the datasets individually using the scripts in the scripts/ directory. The scripts create the respective dev and test sets automatically.

python scripts/mls.py

If you want to join several datasets, run the join_datasets script in scripts/:

python scripts/join_datasets.py /path/to/dataset1/train /path/to/dataset2/train ... --output-dir data/my_dataset --output-name train

After joining datasets, you might have some degree of transcription contamination. To remove all transcriptions present in a specific subset (for example, a test subset), use the filter_datasets script:

python scripts/filter_datasets.py /path/to/my_dataset/train /path/to/dataset1/test /path/to/dataset2/test --output-dir data/my_dataset --output-name my_filtered_train

Alternatively, download the raw datasets using the links below:

Brazilian Portuguese (BP) Dataset Version 1

The BP Dataset is an assembled dataset composed of many others in Brazilian Portuguese. We used the original test sets of each gathered dataset to make individual test sets. For the datasets without test sets, we created them by selecting 5% of unique male and female speakers. Additionally, we filtered the final training set by removing all transcriptions that appear in the test sets, and we discarded audio longer than 30 seconds.
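The held-out split described above can be sketched as follows. This is a minimal sketch over hypothetical sample records (dicts with "speaker", "gender" and "duration" keys); the released metadata is the authoritative split, and the real selection was presumably randomized rather than alphabetical.

```python
# Sketch: hold out ~5% of unique speakers per gender as a test set,
# then keep only training samples up to 30 seconds long.
def split_by_speaker(samples, test_fraction=0.05, max_train_seconds=30.0):
    test_speakers = set()
    for gender in {s["gender"] for s in samples}:
        speakers = sorted({s["speaker"] for s in samples if s["gender"] == gender})
        n_test = max(1, int(len(speakers) * test_fraction))
        test_speakers.update(speakers[:n_test])  # deterministic pick for the sketch
    train = [s for s in samples
             if s["speaker"] not in test_speakers
             and s["duration"] <= max_train_seconds]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test
```

Splitting on speakers (rather than utterances) keeps every test voice unseen during training, which avoids the speaker contamination found in the earlier 500h version.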

If you run the provided scripts, you might generate a slightly different version of the BP dataset. If you want to use the same files we used to train, validate and test our models, you can download the metadata here.

Other versions

Our first attempt to build a larger dataset for BP produced a 500-hour dataset. However, we found problems with the VoxForge subset, and some transcriptions from the test sets were present in the training set. The models trained with this version of the dataset are available as bp_500.

Language models

Language models can improve ASR output. To use one with fairseq, you will need to install the flashlight python bindings. You will also need a lexicon containing the possible words.

Ken LM models

You can download some KenLM models here. They are compatible with the flashlight decoder.

Transformer LM (fairseq) models

| Model name | Fairseq model | Dict |
|---|---|---|
| BP Transformer LM | fairseq model | dict |
| Wikipedia Transformer LM | fairseq model | dict |
| Wikipedia Pruned Transformer LM | fairseq model | dict |

Lexicon
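A flashlight-style lexicon is a plain-text file mapping each word to its space-separated letters, terminated by the word-boundary token `|`. A minimal sketch to build one from a word list (assuming the letter-based vocabulary commonly used with wav2vec ASR models; check the model's dict for the exact token set):

```python
def make_lexicon(words):
    """Map each word to its space-separated letters plus the '|' boundary,
    one 'word<TAB>l e t t e r s |' entry per line (flashlight lexicon format)."""
    lines = []
    for word in sorted(set(words)):
        letters = " ".join(word) + " |"
        lines.append(f"{word}\t{letters}")
    return "\n".join(lines) + "\n"
```

The word list should cover the LM vocabulary; words absent from the lexicon cannot be emitted by the lexicon-constrained decoder.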

🤗 Hugging Face Transformers + Wav2Vec2_PyCTCDecode

If you want to use Wav2Vec2_PyCTCDecode with Transformers to decode the Hugging Face models, the KenLM models provided above might not work. In this case, you should train your own following the instructions here, or use one of the two models trained with the BP Dataset and Wikipedia below:

ASR finetune

  1. To fine-tune the model, first install fairseq and its dependencies:

     cd fairseq
     pip install -e .

  2. Download a pre-trained model (see pretrained models).

  3. Create or use a configuration file (see the configs/ directory).

  4. Fine-tune the model by running fairseq-hydra-train:

     root=/path/to/wav2vec4bp
     fairseq-hydra-train \
       task.data=$root/data/my_dataset \
       checkpoint.save_dir=$root/checkpoints/stt/my_model_name \
       model.w2v_path=$root/xlsr_53_56k.pt \
       common.tensorboard_logdir=$root/logs/stt/my_model_name \
       --config-dir $root/configs \
       --config-name my_configuration_file_name

Pretrained models

To fine-tune Wav2vec, you will need to download a pre-trained model first.

🤗 ASR finetune with HuggingFace

To easily fine-tune the model using Hugging Face, you can use the Wav2vec-wrapper repository.

Language model training

To train a language model, one can use a Transformer LM or KenLM.

Ken LM

First, install KenLM:

git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

Then create a text file and run the following command:

./kenlm/build/bin/lmplz -o 5 <text.txt > path_to_lm.arpa
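KenLM trains on whatever text you feed it, so the input is usually normalized to match the ASR output vocabulary (lowercase letters, no punctuation or digits). A minimal sketch of such normalization for Portuguese text; the exact normalization used for the released models is an assumption here, as is the allowed-letter set.

```python
import re

# Letters assumed to be produced by a character-level Portuguese ASR model.
_ALLOWED = "abcdefghijklmnopqrstuvwxyzáâãàçéêíóôõúü"

def normalize_line(line):
    """Lowercase and keep only letters and single spaces, so the language
    model vocabulary matches character-level ASR output."""
    line = line.lower()
    line = re.sub(f"[^{_ALLOWED} ]+", " ", line)
    return re.sub(r"\s+", " ", line).strip()
```

Write one normalized sentence per line to text.txt before running lmplz, so the n-gram counts are taken over the same token inventory the decoder will score.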

Transformer LM

To train a Transformer LM, first prepare and preprocess the train, valid and test text files:

TEXT=path/to/dataset
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/train.tokens \
    --validpref $TEXT/valid.tokens \
    --testpref $TEXT/test.tokens \
    --destdir data/text/$dataset \
    --workers 20

Then train the model:

fairseq-train --task language_modeling \
  data/text/$dataset \
  --save-dir checkpoints/transformer_lms/$name \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 1024 --update-freq 32 \
  --fp16 \
  --max-update 50000

Docker

We recommend using a Docker container, such as flml/flashlight, to easily fine-tune and test your models.
