
Open tools and data for cloudless automatic speech recognition


Python scripts to compute audio and language models from voxforge.org speech data and many other sources. Models that can be built include:

  • Kaldi nnet3 chain audio models
  • KenLM language models in ARPA format
  • sequitur g2p models
  • wav2letter++ models

Important: Please note that these scripts form in no way a complete application ready for end-user consumption. However, if you are a developer interested in natural language processing you may find some of them useful. Contributions, patches and pull requests are very welcome.

At the time of this writing, the scripts here are focused on building the English and German VoxForge models. However, there is no reason why they couldn't be used to build models for other languages as well; feel free to contribute support for those.

Table of Contents

Created by gh-md-toc

Download

We have various models plus source code and binaries for the tools used to build these models available for download. Everything is free and open source.

All our model and data downloads can be found here: Downloads

ASR Models

Our pre-built ASR models can be downloaded here: ASR Models

  • Kaldi ASR, English:
    • kaldi-generic-en-tdnn_f: Large nnet3-chain factorized TDNN model, trained on ~1200 hours of audio. Has decent background noise resistance and can also be used on phone recordings. Should provide the best accuracy but is a bit more resource intensive than the other models.
    • kaldi-generic-en-tdnn_sp: Large nnet3-chain model, trained on ~1200 hours of audio. Has decent background noise resistance and can also be used on phone recordings. Less accurate but also slightly less resource intensive than the tdnn_f model.
    • kaldi-generic-en-tdnn_250: Same as the larger models but less resource intensive, suitable for use in embedded applications (e.g. a Raspberry Pi 3).
    • kaldi-generic-en-tri2b_chain: GMM model, trained on the same data as the above models - meant for auto segmentation tasks.
  • Kaldi ASR, German:
    • kaldi-generic-de-tdnn_f: Large nnet3-chain model, trained on ~400 hours of audio. Has decent background noise resistance and can also be used on phone recordings.
    • kaldi-generic-de-tdnn_250: Same as the large model but less resource intensive, suitable for use in embedded applications (e.g. a Raspberry Pi 3).
    • kaldi-generic-de-tri2b_chain: GMM model, trained on the same data as the above two models - meant for auto segmentation tasks.
  • wav2letter++, German:
    • w2l-generic-de: Large model, trained on ~400 hours of audio. Has decent background noise resistance and can also be used on phone recordings.

NOTE: It is important to realize that these models can and should be adapted to your application domain. See Model Adaptation for details.

IPA Dictionaries (Lexicons)

Our dictionaries can be downloaded here: Dictionaries

  • IPA UTF-8, English:
    • dict-en.ipa: Based on CMUDict with many additional entries generated via Sequitur G2P.
  • IPA UTF-8, German:
    • dict-de.ipa: Created manually from scratch with many additional auto-reviewed entries extracted from Wiktionary.

G2P Models

Our pre-built G2P models can be downloaded here: G2P Models

  • Sequitur, English:
    • sequitur-dict-en.ipa: Sequitur G2P model trained on our English IPA dictionary (UTF8).
  • Sequitur, German:
    • sequitur-dict-de.ipa: Sequitur G2P model trained on our German IPA dictionary (UTF8).

Language Models

Our pre-built ARPA language models can be downloaded here: Language Models

  • KenLM, order 4, English, ARPA:
    • generic_en_lang_model_small
  • KenLM, order 6, English, ARPA:
    • generic_en_lang_model_large
  • KenLM, order 4, German, ARPA:
    • generic_de_lang_model_small
  • KenLM, order 6, German, ARPA:
    • generic_de_lang_model_large

Code

Get Started with our Pre-Trained Models

Run Example Applications

Wave File Decoding Demo

Download a few sample wave files

```
$ wget http://goofy.zamia.org/zamia-speech/misc/demo_wavs.tgz
--2018-06-23 16:46:28--  http://goofy.zamia.org/zamia-speech/misc/demo_wavs.tgz
Resolving goofy.zamia.org (goofy.zamia.org)... 78.47.65.20
Connecting to goofy.zamia.org (goofy.zamia.org)|78.47.65.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 619852 (605K) [application/x-gzip]
Saving to: ‘demo_wavs.tgz’

demo_wavs.tgz                     100%[==========================================================>] 605.32K  2.01MB/s    in 0.3s

2018-06-23 16:46:28 (2.01 MB/s) - ‘demo_wavs.tgz’ saved [619852/619852]
```

unpack them:

```
$ tar xfvz demo_wavs.tgz
demo1.wav
demo2.wav
demo3.wav
demo4.wav
```

download the demo program

```
$ wget http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_wav.py
--2018-06-23 16:47:53--  http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_wav.py
Resolving goofy.zamia.org (goofy.zamia.org)... 78.47.65.20
Connecting to goofy.zamia.org (goofy.zamia.org)|78.47.65.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2469 (2.4K) [text/plain]
Saving to: ‘kaldi_decode_wav.py’

kaldi_decode_wav.py               100%[==========================================================>]   2.41K  --.-KB/s    in 0s

2018-06-23 16:47:53 (311 MB/s) - ‘kaldi_decode_wav.py’ saved [2469/2469]
```

now run kaldi automatic speech recognition on the demo wav files:

```
$ python kaldi_decode_wav.py -v demo?.wav
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp loading model...
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp loading model... done, took 1.473226s.
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp creating decoder...
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp creating decoder... done, took 0.143928s.
DEBUG:root:demo1.wav decoding took     0.37s, likelyhood: 1.863645
i cannot follow you she said
DEBUG:root:demo2.wav decoding took     0.54s, likelyhood: 1.572326
i should like to engage just for one whole life in that
DEBUG:root:demo3.wav decoding took     0.42s, likelyhood: 1.709773
philip knew that she was not an indian
DEBUG:root:demo4.wav decoding took     1.06s, likelyhood: 1.715135
he also contented that better confidence was established by carrying no weapons
```
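Under the hood, kaldi_decode_wav.py builds on the py-kaldi-asr package. Here is a minimal sketch of the same decoding loop, assuming the model has been unpacked to /opt/kaldi/model as in the demo output above:

```python
# Minimal sketch of a decoding loop like kaldi_decode_wav.py's, using the
# py-kaldi-asr API; the model path is an assumption matching the demo output.
from kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder

MODELDIR = '/opt/kaldi/model/kaldi-generic-en-tdnn_sp'

model   = KaldiNNet3OnlineModel(MODELDIR)
decoder = KaldiNNet3OnlineDecoder(model)

for wavfn in ['demo1.wav', 'demo2.wav', 'demo3.wav', 'demo4.wav']:
    if decoder.decode_wav_file(wavfn):
        hstr, likelihood = decoder.get_decoded_string()
        print('%s: %s (likelihood %f)' % (wavfn, hstr, likelihood))
```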

Live Mic Demo

Determine the name of your pulseaudio mic source:

```
$ pactl list sources
Source #0
    State: SUSPENDED
    Name: alsa_input.usb-C-Media_Electronics_Inc._USB_PnP_Sound_Device-00.analog-mono
    Description: CM108 Audio Controller Analog Mono
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

download and run demo:

```
$ wget 'http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_live.py'
$ python kaldi_decode_live.py -s 'CM108'
Kaldi live demo V0.2
Loading model from /opt/kaldi/model/kaldi-generic-en-tdnn_250 ...
Please speak.
hallo computer
switch on the radio please
please switch on the light
what about the weather in stuttgart
how are you
thank you
good bye
```

Get Started with a Demo STT Service Packaged in Docker

To start the STT service on your local machine, execute:

```
$ docker pull quay.io/mpuels/docker-py-kaldi-asr-and-model:kaldi-generic-en-tdnn_sp-r20180611
$ docker run --rm -p 127.0.0.1:8080:80/tcp quay.io/mpuels/docker-py-kaldi-asr-and-model:kaldi-generic-en-tdnn_sp-r20180611
```

To transfer an audio file for transcription to the service, in a second terminal, execute:

```
$ git clone https://github.com/mpuels/docker-py-kaldi-asr-and-model.git
$ conda env create -f environment.yml
$ source activate py-kaldi-asr-client
$ ./asr_client.py asr.wav
INFO:root: 0.005s:  4000 frames ( 0.250s) decoded, status=200.
...
INFO:root:19.146s: 152000 frames ( 9.500s) decoded, status=200.
INFO:root:27.136s: 153003 frames ( 9.563s) decoded, status=200.
INFO:root:*****************************************************************
INFO:root:** wavfn: asr.wav
INFO:root:** hstr: speech recognition system requires training where individuals to exercise political system
INFO:root:** confidence: -0.578844
INFO:root:** decoding time:    27.14s
INFO:root:*****************************************************************
```

The Docker image in the example above is the result of stacking 4 images on top of each other.

Requirements

Note: probably incomplete.

  • Python 2.7 with nltk, numpy, ...
  • KenLM
  • kaldi
  • wav2letter++
  • py-nltools
  • sox
  • ffmpeg

Dependencies installation example for Debian:

```bash
apt-get install build-essential pkg-config python-pip python-dev python-setuptools python-wheel ffmpeg sox libatlas-base-dev

# Create a symbolic link because one of the pip packages expects atlas in this location:
ln -s /usr/include/x86_64-linux-gnu/atlas /usr/include/atlas

pip install numpy nltk cython
pip install py-kaldi-asr py-nltools
```

Setup Notes

Just some rough notes on the environment needed to get these scripts to run. This is in no way a complete set of instructions, just some hints to get you started.

~/.speechrc

```ini
[speech]
vf_login              = <your voxforge login>
speech_arc            = /home/bofh/projects/ai/data/speech/arc
speech_corpora        = /home/bofh/projects/ai/data/speech/corpora
kaldi_root            = /apps/kaldi-cuda

; facebook's wav2letter++
w2l_env_activate      = /home/bofh/projects/ai/w2l/bin/activate
w2l_train             = /home/bofh/projects/ai/w2l/src/wav2letter/build/Train
w2l_decoder           = /home/bofh/projects/ai/w2l/src/wav2letter/build/Decoder

wav16                 = /home/bofh/projects/ai/data/speech/16kHz
noise_dir             = /home/bofh/projects/ai/data/speech/corpora/noise
europarl_de           = /home/bofh/projects/ai/data/corpora/de/europarl-v7.de-en.de
parole_de             = /home/bofh/projects/ai/data/corpora/de/German Parole Corpus/DE_Parole/
europarl_en           = /home/bofh/projects/ai/data/corpora/en/europarl-v7.de-en.en
cornell_movie_dialogs = /home/bofh/projects/ai/data/corpora/en/cornell_movie_dialogs_corpus
web_questions         = /home/bofh/projects/ai/data/corpora/en/WebQuestions
yahoo_answers         = /home/bofh/projects/ai/data/corpora/en/YahooAnswers
europarl_fr           = /home/bofh/projects/ai/data/corpora/fr/europarl-v7.fr-en.fr
est_republicain       = /home/bofh/projects/ai/data/corpora/fr/est_republicain.txt
wiktionary_de         = /home/bofh/projects/ai/data/corpora/de/dewiktionary-20180320-pages-meta-current.xml

[tts]
host                  = localhost
port                  = 8300
```
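The scripts load this file via py-nltools' config helper; a minimal sketch of how a script picks up these settings (section and option names as shown above):

```python
# Minimal sketch: read ~/.speechrc the way the zamia-speech scripts do,
# using py-nltools' config loader (returns a ConfigParser instance).
from nltools import misc

config = misc.load_config('.speechrc')

kaldi_root     = config.get('speech', 'kaldi_root')
speech_corpora = config.get('speech', 'speech_corpora')
wav16          = config.get('speech', 'wav16')
```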

tmp directory

Some scripts expect a local tmp directory to be present, located in the same directory where all the scripts live, i.e.

mkdir tmp

Speech Corpora

The following list contains speech corpora supported by this script collection.

  • Forschergeist (German, 2 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/forschergeist
    • unpack them into the directory <~/.speechrc:speech_corpora>/forschergeist
  • German Speechdata Package Version 2 (German, 148 hours):

    • Unpack the archive such that the directories dev, test, and train are direct subdirectories of <~/.speechrc:speech_arc>/gspv2.
    • Then run the script ./import_gspv2.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/gspv2.
  • Noise:

    • Download the tarball
    • unpack it into the directory <~/.speechrc:speech_corpora>/ (it will generate a noise subdirectory there)
  • LibriSpeech ASR (English, 475 hours):

    • Download the 360 hours "clean" speech tarball
    • Unpack the archive such that the directory LibriSpeech is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_librispeech.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/librispeech.
  • The LJ Speech Dataset (English, 24 hours):

    • Download the tarball
    • Unpack the archive such that the directory LJSpeech-1.1 is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script import_ljspeech.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/lindajohnson-11.
  • Mozilla Common Voice German (German, 140 hours):

    • Download de.tar.gz
    • Unpack the archive such that the directory cv_de is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_mozde.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/cv_de.
  • Mozilla Common Voice V1 (English, 252 hours):

    • Download cv_corpus_v1.tar.gz
    • Unpack the archive such that the directory cv_corpus_v1 is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_mozcv1.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/cv_corpus_v1.
  • Munich Artificial Intelligence Laboratories GmbH (M-AILABS) Speech Dataset (English, 147 hours; German, 237 hours; French, 190 hours):

    • Download de_DE.tgz, en_UK.tgz, en_US.tgz, fr_FR.tgz (Mirror)
    • Create a subdirectory m_ailabs in <~/.speechrc:speech_arc>
    • Unpack the downloaded tarballs inside the m_ailabs subdirectory
    • For French, create a directory by_book and move the male and female directories into it, as the archive does not exactly follow the English and German structure
    • Then run the script ./import_mailabs.py to convert the corpus to the VoxForge format. The resulting corpora will be written to <~/.speechrc:speech_corpora>/m_ailabs_en, <~/.speechrc:speech_corpora>/m_ailabs_de and <~/.speechrc:speech_corpora>/m_ailabs_fr.
  • TED-LIUM Release 3 (English, 210 hours):

    • Download TEDLIUM_release-3.tgz
    • Unpack the archive such that the directory TEDLIUM_release-3 is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_tedlium3.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/tedlium3.
  • VoxForge (English, 75 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/voxforge_en
    • unpack them into the directory <~/.speechrc:speech_corpora>/voxforge_en
  • VoxForge (German, 56 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/voxforge_de
    • unpack them into the directory <~/.speechrc:speech_corpora>/voxforge_de
  • VoxForge (French, 140 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/voxforge_fr
    • unpack them into the directory <~/.speechrc:speech_corpora>/voxforge_fr
  • Zamia (English, 5 minutes):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/zamia_en
    • unpack them into the directory <~/.speechrc:speech_corpora>/zamia_en
  • Zamia (German, 18 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/zamia_de
    • unpack them into the directory <~/.speechrc:speech_corpora>/zamia_de

Technical note: For most corpora we have corrected transcripts in our databases, which can be found in data/src/speech/<corpus_name>/transcripts_*.csv. As these have been created by many hours of (semi-)manual review they should be of higher quality than the original prompts, so they will be used during training of our ASR models.

Once you have downloaded and, if necessary, converted a corpus you need to run

./speech_audio_scan.py <corpus name>

on it. This will add missing prompts to the CSV databases and convert audio files to 16kHz mono WAVE format.
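Conceptually, the conversion step amounts to a sox call per audio file; here is a hedged sketch of that step in isolation (filenames are hypothetical, and speech_audio_scan.py performs this for you):

```python
# Toy illustration of the 16kHz mono WAVE conversion that
# speech_audio_scan.py performs; assumes sox is installed,
# filenames are hypothetical.
import subprocess

subprocess.check_call(['sox', 'recording.flac',
                       '-r', '16000', '-c', '1', '-b', '16',
                       'recording_16m.wav'])
```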

Adding Artificial Noise or Other Effects

To improve noise resistance it is possible to derive corpora from existing ones with noise added:

```bash
./speech_gen_noisy.py zamia_de
./speech_audio_scan.py zamia_de_noisy
cp data/src/speech/zamia_de/spk2gender data/src/speech/zamia_de_noisy/
cp data/src/speech/zamia_de/spk_test.txt data/src/speech/zamia_de_noisy/
./auto_review.py -a zamia_de_noisy
./apply_review.py -l de zamia_de_noisy review-result.csv
```

This script will run recordings through typical telephone codecs. Such a corpus can be used to train models that support 8kHz phone recordings:

```bash
./speech_gen_phone.py zamia_de
./speech_audio_scan.py zamia_de_phone
cp data/src/speech/zamia_de/spk2gender data/src/speech/zamia_de_phone/
cp data/src/speech/zamia_de/spk_test.txt data/src/speech/zamia_de_phone/
./auto_review.py -a zamia_de_phone
./apply_review.py -l de zamia_de_phone review-result.csv
```

Text Corpora

The following list contains text corpora that can be used to train language models with the scripts contained in this repository:

  • Europarl, specifically the parallel corpus German-English and the parallel corpus French-English:

    • corresponding variables in .speechrc: europarl_de, europarl_en, europarl_fr
    • sentence extraction: run ./speech_sentences.py europarl_de, ./speech_sentences.py europarl_en and ./speech_sentences.py europarl_fr
  • Cornell Movie-Dialogs Corpus:

    • corresponding variable in .speechrc: cornell_movie_dialogs
    • sentence extraction: run ./speech_sentences.py cornell_movie_dialogs
  • German Parole Corpus:

    • corresponding variable in .speechrc: parole_de
    • sentence extraction: train the punkt tokenizer using ./speech_train_punkt_tokenizer.py, then run ./speech_sentences.py parole_de
  • WebQuestions:

    • corresponding variable in .speechrc: web_questions
    • sentence extraction: run ./speech_sentences.py web_questions
  • Yahoo! Answers dataset:

    • corresponding variable in .speechrc: yahoo_answers
    • sentence extraction: run ./speech_sentences.py yahoo_answers
  • CNRTL Est Républicain Corpus, a large corpus of news articles (4.3M headlines/paragraphs) available under a CC BY-NC-SA license. Download the XML files and extract headlines and paragraphs to a text file with the following command:

    xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt

    • corresponding variable in .speechrc: est_republicain
    • sentence extraction: run ./speech_sentences.py est_republicain

Sentences can also be extracted from our speech corpora. To do that, run:

  • English Speech Corpora

    • ./speech_sentences.py voxforge_en
    • ./speech_sentences.py librispeech
    • ./speech_sentences.py zamia_en
    • ./speech_sentences.py cv_corpus_v1
    • ./speech_sentences.py ljspeech
    • ./speech_sentences.py m_ailabs_en
    • ./speech_sentences.py tedlium3
  • German Speech Corpora

    • ./speech_sentences.py forschergeist
    • ./speech_sentences.py gspv2
    • ./speech_sentences.py voxforge_de
    • ./speech_sentences.py zamia_de
    • ./speech_sentences.py m_ailabs_de
    • ./speech_sentences.py cv_de

Language Model

English

Prerequisites:

  • text corpora europarl_en, cornell_movie_dialogs, web_questions, and yahoo_answers are installed, sentences extracted (see instructions above)
  • sentences are extracted from the speech corpora librispeech, voxforge_en, zamia_en, cv_corpus_v1, ljspeech, m_ailabs_en, tedlium3

To train a small, pruned English language model of order 4 using KenLM for use in both kaldi and wav2letter builds run:

./speech_build_lm.py generic_en_lang_model_small europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3

to train a larger model of order 6 with less pruning:

./speech_build_lm.py -o 6 -p "0 0 0 0 1" generic_en_lang_model_large europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3

to train a medium size model of order 5:

./speech_build_lm.py -o 5 -p "0 0 1 2" generic_en_lang_model_medium europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3
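The -o and -p options correspond to KenLM's model order and pruning thresholds. As a rough sketch, the order-6 build above boils down to an lmplz invocation along these lines (whether speech_build_lm.py shells out exactly like this is an assumption; sentences.txt stands in for the extracted sentences):

```python
# Rough sketch of the underlying KenLM call: lmplz reads one sentence per
# line from stdin and writes an ARPA model to stdout.
import subprocess

with open('sentences.txt') as fin, open('lm.arpa', 'w') as fout:
    subprocess.check_call(['lmplz', '-o', '6',
                           '--prune', '0', '0', '0', '0', '1'],
                          stdin=fin, stdout=fout)
```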

German

Prerequisites:

  • text corpora europarl_de and parole_de are installed, sentences extracted (see instructions above)
  • sentences are extracted from the speech corpora forschergeist, gspv2, voxforge_de, zamia_de, m_ailabs_de, cv_de

To train a small, pruned German language model of order 4 using KenLM for use in both kaldi and wav2letter builds run:

./speech_build_lm.py generic_de_lang_model_small europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de

to train a larger model of order 6 with less pruning:

./speech_build_lm.py -o 6 -p "0 0 0 0 1" generic_de_lang_model_large europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de

to train a medium size model of order 5:

./speech_build_lm.py -o 5 -p "0 0 1 2" generic_de_lang_model_medium europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de

French

Prerequisites:

  • text corpora europarl_fr and est_republicain are installed, sentences extracted (see instructions above)
  • sentences are extracted from the speech corpora voxforge_fr and m_ailabs_fr

To train a French language model using KenLM run:

./speech_build_lm.py generic_fr_lang_model europarl_fr est_republicain voxforge_fr m_ailabs_fr

Submission Review and Transcription

The main tool used for submission review, transcription and lexicon expansion is:

./speech_editor.py

Lexica/Dictionaries

NOTE: We use the terms lexicon and dictionary interchangeably in this documentation and our scripts.

Currently, we have two lexica, one for English and one for German (in data/src/dicts):

  • dict-en.ipa

  • dict-de.ipa

    • started manually from scratch
    • once enough entries existed to train a reasonable Sequitur G2P model, many entries were converted from the German Wiktionary (see below)

The native format of our lexica is (UTF8) IPA with semicolons as separators. This format is then converted to whatever format is used by the target ASR engine by the corresponding export scripts.
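For illustration, here is a minimal sketch of reading such a lexicon, assuming one word;pronunciation pair per line (the exact column layout is an assumption; see the files in data/src/dicts):

```python
# Minimal sketch: load a semicolon-separated IPA lexicon into a dict.
# Assumed entry format: one "word;ipa-pronunciation" pair per line.
import codecs

lex = {}
with codecs.open('data/src/dicts/dict-en.ipa', 'r', 'utf8') as f:
    for line in f:
        parts = line.strip().split(';')
        if len(parts) >= 2:
            lex[parts[0]] = parts[1]
```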

Sequitur G2P

Many lexicon-related tools rely on Sequitur G2P to compute pronunciations for words missing from the dictionary. The necessary models can be downloaded from our file server: http://goofy.zamia.org/zamia-speech/g2p/. For installation, download and unpack them and then put links to them under data/models like so:

```
data/models/sequitur-dict-de.ipa-latest -> <your model dir>/sequitur-dict-de.ipa-r20180510
data/models/sequitur-dict-en.ipa-latest -> <your model dir>/sequitur-dict-en.ipa-r20180510
```

To train your own Sequitur G2P models, use the export and train scripts provided, e.g.:

```
[guenter@dagobert speech]$ ./speech_sequitur_export.py -d dict-de.ipa
INFO:root:loading lexicon...
INFO:root:loading lexicon...done.
INFO:root:sequitur workdir data/dst/dict-models/dict-de.ipa/sequitur done.
[guenter@dagobert speech]$ ./speech_sequitur_train.sh dict-de.ipa
training sample: 322760 + 16988 devel
iteration: 0
...
```

Manual Editing

./speech_lex_edit.py word [word2 ...]

is the main curses based, interactive lexicon editor. It will automatically produce candidate entries for new words using Sequitur G2P, MaryTTS and eSpeak NG. The user can then edit these entries manually if necessary and check them by listening to them being synthesized via MaryTTS in different voices.

The lexicon editor is also integrated into various other tools, speech_editor.py in particular, which allows you to transcribe, review and add missing words for new audio samples within one tool - which is recommended.

I also tend to review lexicon entries randomly from time to time. For that I have a small script which will pick 20 random entries where Sequitur G2P disagrees with the current transcription in the lexicon:

./speech_lex_edit.py `./speech_lex_review.py`

Also, I sometimes use this command to add missing words from transcripts in batch mode:

./speech_lex_edit.py `./speech_lex_missing.py`

Wiktionary

For the German lexicon, entries can be extracted from the German Wiktionary using a set of scripts. To do that, the first step is to extract a set of candidate entries from a Wiktionary XML dump:

./wiktionary_extract_ipa.py

this will output the extracted entries to data/dst/speech/de/dict_wiktionary_de.txt. We now need to train a Sequitur G2P model that translates these entries into our own IPA style and phoneme set:

```bash
./wiktionary_sequitur_export.py
./wiktionary_sequitur_train.sh
```

finally, we translate the entries and check them against the predictions from our regular Sequitur G2P model:

./wiktionary_sequitur_gen.py

this script produces two output files: data/dst/speech/de/dict_wiktionary_gen.txt contains acceptable entries, data/dst/speech/de/dict_wiktionary_rej.txt contains rejected entries.

Kaldi Models (recommended)

English NNet3 Chain Models

The following recipe trains Kaldi models for English.

Before running it, make sure all prerequisites are met (see above for instructions on these):

  • language model generic_en_lang_model_small built
  • some or all of the speech corpora voxforge_en, librispeech, cv_corpus_v1, ljspeech, m_ailabs_en, tedlium3 and zamia_en are installed, converted and scanned
  • optionally noise augmented corpora: voxforge_en_noisy, voxforge_en_phone, librispeech_en_noisy, librispeech_en_phone, cv_corpus_v1_noisy, cv_corpus_v1_phone, zamia_en_noisy and zamia_en_phone

```bash
./speech_kaldi_export.py generic-en-small dict-en.ipa generic_en_lang_model_small voxforge_en librispeech zamia_en
cd data/dst/asr-models/kaldi/generic-en-small
./run-chain.sh
```

export run with noise augmented corpora included:

./speech_kaldi_export.py generic-en dict-en.ipa generic_en_lang_model_small voxforge_en cv_corpus_v1 librispeech ljspeech m_ailabs_en tedlium3 zamia_en voxforge_en_noisy librispeech_noisy cv_corpus_v1_noisy cv_corpus_v1_phone zamia_en_noisy voxforge_en_phone librispeech_phone zamia_en_phone

German NNet3 Chain Models

The following recipe trains Kaldi models for German.

Before running it, make sure all prerequisites are met (see above for instructions on these):

  • language model generic_de_lang_model_small built
  • some or all of the speech corpora voxforge_de, gspv2, forschergeist, zamia_de, m_ailabs_de, cv_de are installed, converted and scanned
  • optionally noise augmented corpora: voxforge_de_noisy, voxforge_de_phone, zamia_de_noisy and zamia_de_phone

```bash
./speech_kaldi_export.py generic-de-small dict-de.ipa generic_de_lang_model_small voxforge_de gspv2 [ forschergeist zamia_de ...]
cd data/dst/asr-models/kaldi/generic-de-small
./run-chain.sh
```

export run with noise augmented corpora included:

./speech_kaldi_export.py generic-de dict-de.ipa generic_de_lang_model_small voxforge_de gspv2 forschergeist zamia_de voxforge_de_noisy voxforge_de_phone zamia_de_noisy zamia_de_phone m_ailabs_de cv_de

Model Adaptation

For a standalone kaldi model adaptation tool that does not require a complete zamia-speech setup, see

kaldi-adapt-lm

Existing kaldi models (such as the ones we provide for download but also those you may train from scratch using our scripts) can be adapted to (typically domain specific) language models, JSGF grammars and grammar FSTs.

Here is an example of how to adapt our English model to a simple command and control JSGF grammar. Please note that this is just a toy example - for real world usage you will probably want to add garbage phoneme loops to the grammar or produce a language model that has some noise resistance built in right away.

Here is the grammar we will use:

```
#JSGF V1.0;

grammar org.zamia.control;

public <control> = <wake> | <politeCommand> ;

<wake> = ( good morning | hello | ok | activate ) computer;

<politeCommand> = [ please | kindly | could you ] <command> [ please | thanks | thank you ];

<command> = <onOffCommand> | <muteCommand> | <volumeCommand> | <weatherCommand>;

<onOffCommand> = [ turn | switch ] [the] ( light | fan | music | radio ) (on | off) ;

<volumeCommand> = turn ( up | down ) the ( volume | music | radio ) ;

<muteCommand> = mute the ( music | radio ) ;

<weatherCommand> = (what's | what) is the ( temperature | weather ) ;
```

the next step is to set up a kaldi model adaptation experiment using this script:

./speech_kaldi_adapt.py data/models/kaldi-generic-en-tdnn_250-latest dict-en.ipa control.jsgf control-en

here, data/models/kaldi-generic-en-tdnn_250-latest is the model to be adapted, dict-en.ipa is the dictionary which will be used by the new model, control.jsgf is the JSGF grammar we want the model to be adapted to (you could specify an FST source file or a language model here instead) and control-en is the name of the new model that will be created.

To run the actual adaptation, change into the model directory and run the adaptation script there:

```bash
cd data/dst/asr-models/kaldi/control-en
./run-adaptation.sh
```

finally, you can create a tarball from the newly created model:

```bash
cd ../../../../..
./speech_dist.sh control-en kaldi adapt
```

wav2letter++ models

English Wav2letter Models

```bash
./wav2letter_export.py -l en -v generic-en dict-en.ipa generic_en_lang_model_large voxforge_en cv_corpus_v1 librispeech ljspeech m_ailabs_en tedlium3 zamia_en voxforge_en_noisy librispeech_noisy cv_corpus_v1_noisy cv_corpus_v1_phone zamia_en_noisy voxforge_en_phone librispeech_phone zamia_en_phone
pushd data/dst/asr-models/wav2letter/generic-en/
bash run_train.sh
```

German Wav2letter Models

```bash
./wav2letter_export.py -l de -v generic-de dict-de.ipa generic_de_lang_model_large voxforge_de gspv2 forschergeist zamia_de voxforge_de_noisy voxforge_de_phone zamia_de_noisy zamia_de_phone m_ailabs_de cv_de
pushd data/dst/asr-models/wav2letter/generic-de/
bash run_train.sh
```

auto-reviews using wav2letter

create auto-review case:

./wav2letter_auto_review.py -l de w2l-generic-de-latest gspv2

run it:

```bash
pushd tmp/w2letter_auto_review
bash run_auto_review.sh
popd
```

apply the results:

./wav2letter_apply_review.py

Audiobook Segmentation and Transcription (Manual)

Some notes on how to segment and transcribe audiobooks or other audio sources (e.g. from librivox) using the abook scripts provided:

(0/3) Convert Audio to WAVE Format

MP3

```bash
ffmpeg -i foo.mp3 foo.wav
```

MKV

```bash
mkvextract tracks foo.mkv 0:foo.ogg
opusdec foo.ogg foo.wav
```

(1/3) Convert Audio to 16kHz mono

sox foo.wav -r 16000 -c 1 -b 16 foo_16m.wav

(2/3) Split Audio into Segments

This tool will use silence detection to find good cut-points. You may want to adjust its settings to achieve a good balance of short segments and few words split in half.

./abook-segment.py foo_16m.wav

settings:

```
[guenter@dagobert speech]$ ./abook-segment.py -h
Usage: abook-segment.py [options] foo.wav

Options:
  -h, --help            show this help message and exit
  -s SILENCE_LEVEL, --silence-level=SILENCE_LEVEL
                        silence level (default: 2048 / 65536)
  -l MIN_SIL_LENGTH, --min-sil-length=MIN_SIL_LENGTH
                        minimum silence length (default:  0.07s)
  -m MIN_UTT_LENGTH, --min-utt-length=MIN_UTT_LENGTH
                        minimum utterance length (default:  2.00s)
  -M MAX_UTT_LENGTH, --max-utt-length=MAX_UTT_LENGTH
                        maximum utterance length (default:  9.00s)
  -o OUTDIRFN, --out-dir=OUTDIRFN
                        output directory (default: abook/segments)
  -v, --verbose         enable debug output
```

by default, the resulting segments will end up in abook/segments
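For intuition, silence detection of this kind boils down to flagging windows whose energy stays below a threshold; here is a toy illustration (not abook-segment.py's actual implementation) using the --silence-level default from above:

```python
# Toy illustration of threshold-based silence detection, NOT the actual
# implementation of abook-segment.py: flag 10ms windows whose RMS energy
# falls below the silence level.
import wave, audioop

SILENCE_LEVEL = 2048                      # script default (out of 65536)

wf    = wave.open('foo_16m.wav', 'rb')
width = wf.getsampwidth()
win   = wf.getframerate() // 100          # 10ms windows

silent = []
while True:
    chunk = wf.readframes(win)
    if not chunk:
        break
    silent.append(audioop.rms(chunk, width) < SILENCE_LEVEL)

# sufficiently long runs of True in `silent` are candidate cut points
```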

(3/3) Transcribe Audio

The transcription tool supports up to two speakers, which you can specify on the command line. The resulting voxforge-packages will end up in abook/out by default.

./abook-transcribe.py -s speaker1 -S speaker2 abook/segments/

Audiobook Segmentation and Transcription (kaldi)

Some notes on how to semi-automatically segment and transcribe audiobooks or other audio sources (e.g. from librivox) using kaldi:

Directory Layout

Our scripts rely on a fixed directory layout. As segmentation of librivox recordings is one of the main applications of these scripts, their terminology of books and sections is used here. For each section of a book, two source files are needed: a wave file containing the audio and a text file containing the transcript.

A fixed naming scheme is used for those, illustrated by this example:

```
abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.txt
abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.wav
abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-2.txt
abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-2.wav
...
```

The abook-librivox.py script is provided to help with retrieval of librivox recordings and setting up the directory structure. Please note that for now, the tool will not retrieve transcripts automatically but will create empty .txt files (according to the naming scheme) which you will have to fill in manually.

The tool will, however, convert the retrieved audio to the 16kHz mono WAV format required by the segmentation scripts. If you intend to segment material from other sources, make sure to convert it to that format. For suggestions on what tools to use for this step, please refer to the manual segmentation instructions in the previous section.

NOTE: As the kaldi process is parallelized for mass-segmentation, at least 4audio and prompt files are needed for the process to work.

(1/4) Preprocess the Transcript

This tool will tokenize the transcript and detect OOV tokens. Those can then be either replaced or added to the dictionary:

./abook-preprocess-transcript.py abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.txt

(2/4) Model adaptation

For the automatic segmentation to work, we need a GMM model that is adapted to the current dictionary (which likely had to be expanded during transcript preprocessing) and uses a language model that covers the prompts.

First, we create a language model tuned for our purpose:

```bash
./abook-sentences.py abook/in/librivox/11442-toten-Seelen/*.prompt
./speech_build_lm.py abook_lang_model abook abook abook parole_de
```

Now we can create an adapted model using this language model and our current dict:

```bash
./speech_kaldi_adapt.py data/models/kaldi-generic-de-tri2b_chain-latest dict-de.ipa data/dst/lm/abook_lang_model/lm.arpa abook-de
pushd data/dst/asr-models/kaldi/abook-de
./run-adaptation.sh
popd
./speech_dist.sh -c abook-de kaldi adapt
tar xfvJ data/dist/asr-models/kaldi-abook-de-adapt-current.tar.xz -C data/models/
```

(3/4) Auto-Segment using Kaldi

Next, we need to create the kaldi directory structure and files for auto-segmentation:

./abook-kaldi-segment.py data/models/kaldi-abook-de-adapt-current abook/in/librivox/11442-toten-Seelen

now we can run the segmentation:

```bash
pushd data/dst/speech/asr-models/kaldi/segmentation
./run-segmentation.sh
popd
```

(4/4) Retrieve Segmentation Result

Finally, we can retrieve the segmentation result in voxforge format:

./abook-kaldi-retrieve.py abook/in/librivox/11442-toten-Seelen/

Training Voices for Zamia-TTS

Zamia-TTS is an experimental project that tries to train TTS voices based on (reviewed) Zamia-Speech data. Downloads here:

https://goofy.zamia.org/zamia-speech/tts/

Tacotron 2

This section describes how to train voices for NVIDIA's Tacotron 2 implementation. The resulting voices will have a sample rate of 16kHz as that is the default sample rate used for Zamia Speech ASR model training. This means that you will have to use a 16kHz waveglow model, which you can find, along with pretrained voices and sample wavs, here:

https://goofy.zamia.org/zamia-speech/tts/tacotron2/

now with that out of the way, Tacotron 2 model training is pretty straightforward. The first step is to export filelists for the voice you'd like to train, e.g.:

./speech_tacotron2_export.py -l en -o ../torch/tacotron2/filelists m_ailabs_en mailabselliotmiller

next, change into your Tacotron 2 training directory

cd ../torch/tacotron2

and specify file lists, sampling rate and batch size in hparams.py:

```diff
diff --git a/hparams.py b/hparams.py
index 8886f18..75e89c9 100644
--- a/hparams.py
+++ b/hparams.py
@@ -25,15 +25,19 @@ def create_hparams(hparams_string=None, verbose=False):
         # Data Parameters             #
         ################################
         load_mel_from_disk=False,
-        training_files='filelists/ljs_audio_text_train_filelist.txt',
-        validation_files='filelists/ljs_audio_text_val_filelist.txt',
-        text_cleaners=['english_cleaners'],
+        training_files='filelists/mailabselliotmiller_train_filelist.txt',
+        validation_files='filelists/mailabselliotmiller_val_filelist.txt',
+        text_cleaners=['basic_cleaners'],
 
         ################################
         # Audio Parameters             #
         ################################
         max_wav_value=32768.0,
-        sampling_rate=22050,
+        #sampling_rate=22050,
+        sampling_rate=16000,
         filter_length=1024,
         hop_length=256,
         win_length=1024,
@@ -81,7 +85,8 @@ def create_hparams(hparams_string=None, verbose=False):
         learning_rate=1e-3,
         weight_decay=1e-6,
         grad_clip_thresh=1.0,
-        batch_size=64,
+        # batch_size=64,
+        batch_size=16,
         mask_padding=True  # set model's padded outputs to padded values
     )
```

and start the training:

python train.py --output_directory=elliot --log_directory=elliot/logs

Tacotron

  • (1/2) Prepare a training data set:

    ./ztts_prepare.py -l en m_ailabs_en mailabselliotmiller elliot

  • (2/2) Run the training:

    ./ztts_train.py -v elliot 2>&1 | tee train_elliot.log

Model Distribution

To build tarballs from models, use the speech_dist.sh script, e.g.:

./speech_dist.sh generic-en kaldi tdnn_sp

License

My own scripts as well as the data I create (i.e. lexicon and transcripts) are LGPLv3 licensed unless otherwise noted in the script's copyright headers.

Some scripts and files are based on works of others; in those cases it is my intention to keep the original license intact. Please make sure to check the copyright headers inside for more information.

Authors

