Wav2vec resources and models for Brazilian Portuguese

lucasgris/wav2vec4bp


This repository aims at the development of audio technologies using Wav2vec 2.0, such as Automatic Speech Recognition (ASR), for the Brazilian Portuguese language.

Description

This repository contains code and fine-tuned Wav2vec checkpoints for Brazilian Portuguese, including some useful scripts to download and preprocess transcribed data.

Wav2vec 2.0 learns speech representations from unlabeled data, as described in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020). For more information about Wav2vec, please see the official repository.

Tasks

  • Add CORAA to the BP Dataset (BP Dataset Version 2);
  • Release BP Dataset V2 fine-tuned models;
  • Fine-tune using the XLS-R 300M, XLS-R 1B and XLS-R 2B models.

Checkpoints

ASR checkpoints

We provide several Wav2vec models fine-tuned for ASR. For a more detailed description of how we fine-tuned these models, please check the paper Brazilian Portuguese Speech Recognition Using Wav2vec 2.0.

Our most recent model is bp_400. It was fine-tuned on the 400h filtered version of the BP Dataset (see Brazilian Portuguese (BP) Dataset Version 1 below). Results on each gathered dataset are shown below.

Checkpoints of BP Dataset V1

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_400 | XLSR-53 | fairseq | dict | hugging face |
| bp_400_xls-r-300M | XLS-R-300M | fairseq | dict | hugging face |

Checkpoints of non-filtered BP Dataset (early version of the BP dataset)

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_500 | XLSR-53 | fairseq | dict | hugging face |
| bp_500_10k | VoxPopuli 10k BASE | fairseq | dict | hugging face |
| bp_500_100k | VoxPopuli 100k BASE | fairseq | dict | hugging face |

Checkpoints of each gathered dataset

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_cetuc_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_commonvoice_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_commonvoice_10 | XLSR-53 | fairseq | dict | hugging face |
| bp_lapsbm_1 | XLSR-53 | fairseq | dict | hugging face |
| bp_mls_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_sid_10 | XLSR-53 | fairseq | dict | hugging face |
| bp_tedx_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_voxforge_1 | XLSR-53 | fairseq | dict | hugging face |

Other checkpoints

We provide other Wav2vec checkpoints as well. These models were trained using all the data available at the time, including their dev and test subsets; only the Common Voice dev and test sets were held out to validate and test the models, respectively.

| Datasets used for training | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|
| CETUC + CV 6.1 (only train) + LaPS BM + MLS + VoxForge | fairseq | dict | hugging face |
| CETUC + CV 6.1 (all validated) + LaPS BM + MLS + VoxForge | | | hugging face |

ASR Results

Summary (WER)
| Model | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
|---|---|---|---|---|---|---|---|---|
| bp_400 | 0.052 | 0.140 | 0.074 | 0.117 | 0.121 | 0.245 | 0.118 | 0.124 |
| bp_400_xls-r-300M | 0.048 | 0.123 | 0.068 | 0.111 | 0.084 | 0.207 | 0.095 | 0.105 |
| bp_500 | 0.052 | 0.137 | 0.032 | 0.118 | 0.095 | 0.236 | 0.082* | 0.112 |
| bp_500-base10k_voxpopuli | 0.120 | 0.249 | 0.039 | 0.227 | 0.169 | 0.349 | 0.116* | 0.181 |
| bp_500-base100k_voxpopuli | 0.074 | 0.174 | 0.032 | 0.182 | 0.181 | 0.349 | 0.111* | 0.157 |
| bp_cetuc_100** | 0.446 | 0.856 | 0.089 | 0.967 | 1.172 | 0.929 | 0.902 | 0.765 |
| bp_commonvoice_100 | 0.088 | 0.126 | 0.121 | 0.173 | 0.177 | 0.424 | 0.145 | 0.179 |
| bp_commonvoice_10 | 0.133 | 0.189 | 0.165 | 0.189 | 0.247 | 0.474 | 0.251 | 0.235 |
| bp_lapsbm_1 | 0.111 | 0.418 | 0.145 | 0.299 | 0.562 | 0.580 | 0.469 | 0.369 |
| bp_mls_100 | 0.192 | 0.260 | 0.162 | 0.163 | 0.268 | 0.492 | 0.268 | 0.257 |
| bp_sid_10 | 0.186 | 0.327 | 0.207 | 0.505 | 0.124 | 0.835 | 0.472 | 0.379 |
| bp_tedx_100 | 0.138 | 0.369 | 0.169 | 0.165 | 0.794 | 0.222 | 0.395 | 0.321 |
| bp_voxforge_1 | 0.468 | 0.608 | 0.503 | 0.505 | 0.717 | 0.731 | 0.561 | 0.584 |

* We found a problem with the dataset used in these experiments regarding the VoxForge subset. In this test set, some speakers were also present in the training set (which explains the lower WER). The final version of the dataset does not have such contamination.

** We did not perform validation in the single-dataset experiments. CETUC has low transcription variety, so this model might be overfitted.

Transcription examples
| Text | Transcription |
|---|---|
| alguém sabe a que horas começa o jantar | alguém sabe a que horas começo jantar |
| lila covas ainda não sabe o que vai fazer no fundo | lila covas ainda não sabe o que vai fazer no fundo |
| que tal um pouco desse bom spaghetti | que tá um pouco deste bom ispaguete |
| hong kong em cantonês significa porto perfumado | rong kong en cantones significa porto perfumado |
| vamos hackear esse problema | vamos rackar esse problema |
| apenas a poucos metros há uma estação de ônibus | apenas ha poucos metros á uma estação de ônibus |
| relâmpago e trovão sempre andam juntos | relampago trevão sempre andam juntos |

Datasets

Datasets provided:

  • CETUC: contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
  • Common Voice 7.0: a project led by the Mozilla Foundation with the goal of creating open datasets in different languages. Volunteers donate and validate speech through the official site.
  • Lapsbm: "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, totaling 700 utterances. The audio was recorded at 22.05 kHz without environmental control.
  • Multilingual Librispeech (MLS): a massive dataset available in many languages, based on public-domain audiobook recordings such as LibriVox. It contains a total of 6k hours of transcribed data across languages. The Portuguese set used in this work (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
  • Multilingual TEDx: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian variant) contains 164 hours of transcribed speech.
  • Sidney (SID): contains 5,777 utterances recorded by 72 speakers (20 women) aged 17 to 59, with metadata fields such as place of birth, age, gender, education, and occupation.
  • VoxForge: a project with the goal of building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.

These datasets were combined to build a larger Brazilian Portuguese dataset (BP Dataset). All data was used for training except Common Voice dev/test sets, which were used for validation/test respectively. We also made test sets for all the gathered datasets.

| Dataset | Train | Valid | Test |
|---|---|---|---|
| CETUC | 93.9h | -- | 5.4h |
| Common Voice | 37.6h | 8.9h | 9.5h |
| LaPS BM | 0.8h | -- | 0.1h |
| MLS | 161.0h | -- | 3.7h |
| Multilingual TEDx (Portuguese) | 144.2h | -- | 1.8h |
| SID | 5.0h | -- | 1.0h |
| VoxForge | 2.8h | -- | 0.1h |
| Total | 437.2h | 8.9h | 21.6h |

You can download the datasets individually using the scripts in the scripts/ directory. The scripts create the respective dev and test sets automatically.

python scripts/mls.py

If you want to join several datasets, run the join_datasets script in scripts/:

python scripts/join_datasets.py /path/to/dataset1/train /path/to/dataset2/train ... --output-dir data/my_dataset --output-name train

After joining datasets, you might have some degree of transcription contamination. To remove all transcriptions present in a specific subset (for example, a test subset), use the filter_datasets script:

python scripts/filter_datasets.py /path/to/my_dataset/train /path/to/dataset1/test /path/to/dataset2/test --output-dir data/my_dataset --output-name my_filtered_train

Alternatively, download the raw datasets using the links below:

Brazilian Portuguese (BP) Dataset Version 1

The BP Dataset is an assembled dataset composed of many others in Brazilian Portuguese. We used the original test sets of each gathered dataset to make individual test sets. For the datasets without test sets, we created them by selecting 5% of unique male and female speakers. Additionally, we filtered the final training set by removing all transcriptions that appear in the test sets, and we discarded audio longer than 30 seconds.
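The held-out split described above can be sketched as follows. This is a minimal sketch over hypothetical sample records (dicts with "speaker", "gender" and "duration" keys); the released metadata is the authoritative split, and the real selection was presumably randomized rather than alphabetical.

```python
# Sketch: hold out ~5% of unique speakers per gender as a test set,
# then keep only training samples up to 30 seconds long.
def split_by_speaker(samples, test_fraction=0.05, max_train_seconds=30.0):
    test_speakers = set()
    for gender in {s["gender"] for s in samples}:
        speakers = sorted({s["speaker"] for s in samples if s["gender"] == gender})
        n_test = max(1, int(len(speakers) * test_fraction))
        test_speakers.update(speakers[:n_test])  # deterministic pick for the sketch
    train = [s for s in samples
             if s["speaker"] not in test_speakers
             and s["duration"] <= max_train_seconds]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test
```

Splitting on speakers (rather than utterances) keeps every test voice unseen during training, which avoids the speaker contamination found in the earlier 500h version.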

If you run the provided scripts, you might generate a slightly different version of the BP dataset. If you want to use the same files we used to train, validate and test our models, you can download the metadata here.

Other versions

Our first attempt to build a larger dataset for BP produced a 500-hour dataset. However, we found problems with the VoxForge subset, and some transcriptions from the test sets were present in the training set. The models trained with this version of the dataset are available as bp_500.

Language models

Language models can improve ASR output. To use one with fairseq, you will need to install the flashlight python bindings. You will also need a lexicon containing the possible words.

Ken LM models

You can download some KenLM models here. They are compatible with the flashlight decoder.

Transformer LM (fairseq) models

| Model name | Fairseq model | Dict |
|---|---|---|
| BP Transformer LM | fairseq model | dict |
| Wikipedia Transformer LM | fairseq model | dict |
| Wikipedia Pruned Transformer LM | fairseq model | dict |

Lexicon
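A flashlight-style lexicon is a plain-text file mapping each word to its space-separated letters, terminated by the word-boundary token `|`. A minimal sketch to build one from a word list (assuming the letter-based vocabulary commonly used with wav2vec ASR models; check the model's dict for the exact token set):

```python
def make_lexicon(words):
    """Map each word to its space-separated letters plus the '|' boundary,
    one 'word<TAB>l e t t e r s |' entry per line (flashlight lexicon format)."""
    lines = []
    for word in sorted(set(words)):
        letters = " ".join(word) + " |"
        lines.append(f"{word}\t{letters}")
    return "\n".join(lines) + "\n"
```

The word list should cover the LM vocabulary; words absent from the lexicon cannot be emitted by the lexicon-constrained decoder.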

🤗 Hugging Face Transformers + Wav2Vec2_PyCTCDecode

If you want to use Wav2Vec2_PyCTCDecode with Transformers to decode the Hugging Face models, the KenLM models provided above might not work. In this case, you should train your own following the instructions here, or use one of the two models trained with the BP Dataset and Wikipedia below:

ASR finetune

  1. To fine-tune the model, first install fairseq and its dependencies:

     cd fairseq
     pip install -e .

  2. Download a pre-trained model (see pretrained models).

  3. Create or use a configuration file (see the configs/ directory).

  4. Fine-tune the model by running fairseq-hydra-train:

     root=/path/to/wav2vec4bp
     fairseq-hydra-train \
       task.data=$root/data/my_dataset \
       checkpoint.save_dir=$root/checkpoints/stt/my_model_name \
       model.w2v_path=$root/xlsr_53_56k.pt \
       common.tensorboard_logdir=$root/logs/stt/my_model_name \
       --config-dir $root/configs \
       --config-name my_configuration_file_name

Pretrained models

To fine-tune Wav2vec, you will need to download a pre-trained model first.

🤗 ASR finetune with HuggingFace

To easily fine-tune the model using Hugging Face, you can use the Wav2vec-wrapper repository.

Language model training

To train a language model, one can use a Transformer LM or KenLM.

Ken LM

First, install KenLM:

git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

Then create a text file and run the following command:

./kenlm/build/bin/lmplz -o 5 <text.txt > path_to_lm.arpa
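KenLM trains on whatever text you feed it, so the input is usually normalized to match the ASR output vocabulary (lowercase letters, no punctuation or digits). A minimal sketch of such normalization for Portuguese text; the exact normalization used for the released models is an assumption here, as is the allowed-letter set.

```python
import re

# Letters assumed to be produced by a character-level Portuguese ASR model.
_ALLOWED = "abcdefghijklmnopqrstuvwxyzáâãàçéêíóôõúü"

def normalize_line(line):
    """Lowercase and keep only letters and single spaces, so the language
    model vocabulary matches character-level ASR output."""
    line = line.lower()
    line = re.sub(f"[^{_ALLOWED} ]+", " ", line)
    return re.sub(r"\s+", " ", line).strip()
```

Write one normalized sentence per line to text.txt before running lmplz, so the n-gram counts are taken over the same token inventory the decoder will score.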

Transformer LM

To train a Transformer LM, first prepare and preprocess the train, valid and test text files:

TEXT=path/to/dataset
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/train.tokens \
    --validpref $TEXT/valid.tokens \
    --testpref $TEXT/test.tokens \
    --destdir data/text/$dataset \
    --workers 20

Then train the model:

fairseq-train --task language_modeling \
  data/text/$dataset \
  --save-dir checkpoints/transformer_lms/$name \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 1024 --update-freq 32 \
  --fp16 \
  --max-update 50000

Docker

We recommend using a Docker container, such as flml/flashlight, to easily fine-tune and test your models.
