This repository was archived by the owner on Nov 1, 2024. It is now read-only.

facebookresearch/textlesslib (public archive)

Library for Textless Spoken Language Processing

textlesslib

Textless NLP is an active area of research that aims to extend NLP techniques (and tools!) to work directly on spoken language. By using self-supervisedly learnt discrete speech representations, the area promises to unlock interesting NLP applications on languages without a written form or on facets of spoken language that are inaccessible to text-based approaches, e.g. prosody. To learn more, please check some of the papers.

textlesslib is a library that aims to facilitate research in Textless NLP. The goal of the library is to speed up the research cycle and lower the learning curve for those who want to start. We provide highly configurable, off-the-shelf tools to encode speech as sequences of discrete values and tools to decode such streams back into the audio domain. A high-level description of the library can also be found in our paper (arXiv:2202.07359).

Installation

    git clone git@github.com:facebookresearch/textlesslib.git
    cd textlesslib
    pip install -e .
    pip install git+https://github.com/pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8

Usage examples

We include a set of examples in the examples folder.

There is also a [Jupyter notebook] and a [Google Colab] that combine the discrete resynthesis and speech continuation examples in a step-by-step mini-tutorial.

We believe those examples can serve both as illustrations of the provided components and as a starting point for tinkering in interesting directions.

Encoding speech

Below is an example of loading an audio file and encoding it as a sequence of HuBERT-based discrete tokens (aka pseudo-units). Downloading of the required checkpoints is handled by textlesslib itself (by default they are stored in ~/.textless):

    import torchaudio
    from textless.data.speech_encoder import SpeechEncoder

    dense_model_name = "hubert-base-ls960"
    quantizer_name, vocab_size = "kmeans", 100
    input_file = "input.wav"

    # now let's load an audio example
    waveform, sample_rate = torchaudio.load(input_file)

    # We can build a speech encoder module using names of pre-trained
    # dense and quantizer models.  The call below will download
    # appropriate checkpoints as needed behind the scenes. We can
    # also construct an encoder by directly passing model instances
    encoder = SpeechEncoder.by_name(
        dense_model_name=dense_model_name,
        quantizer_model_name=quantizer_name,
        vocab_size=vocab_size,
        deduplicate=True,
    ).cuda()

    # now convert it in a stream of deduplicated units (as in GSLM)
    encoded = encoder(waveform.cuda())
    # encoded is a dict with keys ('dense', 'units', 'durations').
    # It can also contain 'f0' if SpeechEncoder was initialized
    # with need_f0=True flag.
    units = encoded["units"]  # tensor([71, 12, 57, ...], ...)
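With deduplicate=True, consecutive repeats of the same unit are collapsed, and the encoder output also carries a 'durations' entry. The pure-Python sketch below illustrates that idea on a plain list, assuming durations are simply run lengths. It is not textlesslib's implementation, and the helper names deduplicate and expand are hypothetical.

```python
from itertools import groupby


def deduplicate(units):
    """Collapse runs of identical units; return (units, durations)."""
    collapsed, durations = [], []
    for unit, run in groupby(units):
        collapsed.append(unit)
        durations.append(sum(1 for _ in run))
    return collapsed, durations


def expand(units, durations):
    """Inverse of deduplicate: repeat each unit by its duration."""
    return [u for u, d in zip(units, durations) for _ in range(d)]


# Made-up frame-level units, for illustration only.
frame_units = [71, 71, 71, 12, 12, 57, 71]
units, durations = deduplicate(frame_units)
print(units)      # [71, 12, 57, 71]
print(durations)  # [3, 2, 1, 1]
print(expand(units, durations) == frame_units)  # True
```

Note that a unit can reappear later in the stream (71 above); only adjacent repeats are merged.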

Now the units can be converted back into the audio domain:

    # import path assumed from the repository layout
    from textless.vocoders.tacotron2.vocoder import TacotronVocoder

    output_file = "output.wav"  # hypothetical output path

    # as with encoder, we can setup vocoder by passing checkpoints
    # directly or by specifying the expected format by the names
    # of dense and quantizer models (these models themselves
    # won't be loaded)
    vocoder = TacotronVocoder.by_name(
        dense_model_name,
        quantizer_name,
        vocab_size,
    ).cuda()

    # now we turn those units back into the audio.
    audio = vocoder(units)

    # save the audio
    torchaudio.save(
        output_file, audio.cpu().float().unsqueeze(0), vocoder.output_sample_rate
    )

Dataset helpers

Below is an example of using a textless view of the LibriSpeech dataset:

    # import path assumed from the repository layout
    from textless.data.quantized_datasets import QuantizedLibriSpeech

    encoder = SpeechEncoder.by_name(
        dense_model_name=dense_model_name,
        quantizer_model_name=quantizer_name,
        vocab_size=vocab_size,
        deduplicate=True,
    ).cuda()

    # existing_root and url are placeholders for the dataset location
    # and the LibriSpeech split to use
    quantized_dataset = QuantizedLibriSpeech(
        root=existing_root, speech_encoder=encoder, url=url
    )

    datum = quantized_dataset[0]
    sample_rate, utterance, speaker_id, chapter_id, utterance_id = datum['rest']
    # datum['units'] = tensor([71, 12, 63, ...])

In the probing example we illustrate how such a dataset can be used with a standard PyTorch dataloader in a scalable manner.
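Utterances quantize into unit sequences of different lengths, so batching them usually requires padding. The following is a minimal pure-Python sketch of a padding collate step. The function name collate_units and the pad value -1 are assumptions chosen for illustration, not part of textlesslib, and the probing example may handle batching differently.

```python
def collate_units(batch, pad_value=-1):
    """Pad variable-length unit sequences to the longest one in the batch."""
    lengths = [len(item["units"]) for item in batch]
    max_len = max(lengths)
    padded = [
        list(item["units"]) + [pad_value] * (max_len - len(item["units"]))
        for item in batch
    ]
    return {"units": padded, "lengths": lengths}


# Two made-up items shaped like datum['units'] above.
batch = [{"units": [71, 12, 63]}, {"units": [5, 9]}]
out = collate_units(batch)
print(out["units"])    # [[71, 12, 63], [5, 9, -1]]
print(out["lengths"])  # [3, 2]
```

In a PyTorch setting, a function of this shape would typically convert the padded lists to tensors and be passed to the dataloader as its collate_fn argument.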

Data preprocessing

We also provide a multi-GPU/multi-node preprocessing tool for the cases where on-the-fly processing of audio should be avoided.
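Offline preprocessing across several workers generally relies on splitting the list of audio files so that each worker handles a disjoint part. The sketch below shows one common way to do that (round-robin sharding by rank). It is an illustration under assumed names (shard, manifest), not a description of how the provided tool is implemented.

```python
def shard(items, rank, world_size):
    """Return the items assigned to worker `rank` out of `world_size` workers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return items[rank::world_size]


# Hypothetical manifest of audio files.
manifest = [f"utt_{i:03d}.wav" for i in range(10)]
for rank in range(4):
    print(rank, shard(manifest, rank, 4))
```

Each file lands in exactly one shard, and shard sizes differ by at most one.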

Provided models

We provide implementations and pre-trained checkpoints for the following models:

Dense representations: HuBERT-base (trained on LibriSpeech 960h) and CPC (trained on a 6K-hour subset of LibriLight);
Quantizers: k-means quantizers with vocabulary sizes of 50, 100, and 200 for both dense models (trained on LibriSpeech 960h);
Decoders: Tacotron2 models for all (dense model × quantizer) combinations (trained on LJSpeech).

Finally, the pitch extraction is done via YAAPT.
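Two dense models and three vocabulary sizes give six encoder configurations, each with a matching decoder. The sketch below enumerates them. The name "hubert-base-ls960" comes from the encoding example above, whereas "cpc" is only a placeholder label here, since the exact model name expected by the library is not stated in this document.

```python
from itertools import product

# "cpc" is a placeholder label, not a confirmed model name.
dense_models = ["hubert-base-ls960", "cpc"]
quantizer_name = "kmeans"
vocab_sizes = [50, 100, 200]

combinations = [
    (dense, quantizer_name, vocab)
    for dense, vocab in product(dense_models, vocab_sizes)
]

for combo in combinations:
    print(combo)
print(len(combinations))  # 6
```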

Testing

We use pytest (pip install pytest pytest-xdist). Our unit tests are located in the tests directory:

    cd tests && pytest -n 8

Citing textless-lib

If you find textless-lib useful in your research, please consider citing our work:

    @article{Kharitonov2022,
          title={textless-lib: a Library for Textless Spoken Language Processing},
          author={Eugene Kharitonov and Jade Copet and Kushal Lakhotia and Tu Anh Nguyen and Paden Tomasello and Ann Lee and Ali Elkahky and Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux and Yossi Adi},
          year={2022},
          eprint={2202.07359},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

Licence

textlesslib is licensed under the MIT license; the full text can be found in the LICENSE file. Internally, it uses:

WaveGlow - licensed under BSD-3-Clause license;
tacotron implementation - licensed under MIT license;
tacotron2 implementation - licensed under BSD-3-Clause license;
STFT implementation - licensed under BSD-3-Clause license.
