License

MIT license


DiffSinger - PyTorch Implementation

PyTorch implementation of DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (focused on DiffSpeech).

Repository Status

Naive Version of DiffSpeech (not DiffSinger)
Auxiliary Decoder (from FastSpeech2)
An Easier Trick for Boundary Prediction of K
Shallow Version of DiffSpeech (Shallow Diffusion Mechanism): Leveraging pre-trained auxiliary decoder + Training denoiser using K as a maximum time step
Multi-Speaker Training

Quickstart

DATASET refers to the name of a dataset, such as LJSpeech, in the following documents.

MODEL refers to the model type (choose from 'naive', 'aux', 'shallow').

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in

output/ckpt/LJSpeech_naive/ for 'naive' model.
output/ckpt/LJSpeech_shallow/ for 'shallow' model. Please note that the checkpoint of the 'shallow' model contains both 'shallow' and 'aux' models, and these two models will share all directories except results throughout the whole process.
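
The directory rule above can be sketched as a small helper. This is only an illustration: `checkpoint_dir` is a hypothetical function that is not part of the repository, and the `output/ckpt/<DATASET>_<MODEL>/` pattern is inferred from the two paths listed above.

```python
from pathlib import Path


def checkpoint_dir(dataset: str, model: str, root: str = "output/ckpt") -> Path:
    """Return the directory where a checkpoint is expected (hypothetical helper).

    Assumption: the layout is <root>/<DATASET>_<MODEL>/, and 'aux' resolves to
    the 'shallow' directory because the two models share their directories.
    """
    if model not in ("naive", "aux", "shallow"):
        raise ValueError(f"unknown model type: {model!r}")
    shared = "shallow" if model == "aux" else model
    return Path(root) / f"{dataset}_{shared}"


if __name__ == "__main__":
    for model in ("naive", "aux", "shallow"):
        print(model, "->", checkpoint_dir("LJSpeech", model).as_posix())
```

Resolving 'aux' to the 'shallow' directory reflects the note that the two models share all directories except results.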

For English single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.
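
The single and batch invocations differ only in whether `--text` or `--source` is passed. A minimal sketch of a wrapper that builds either command line is shown here; `synthesize_argv` is a hypothetical helper, and only the flags that appear in the commands above are used.

```python
import shlex
from typing import List, Optional


def synthesize_argv(
    model: str,
    restore_step: int,
    dataset: str,
    text: Optional[str] = None,
    source: Optional[str] = None,
) -> List[str]:
    """Build the synthesize.py argument list for single or batch mode."""
    # Exactly one input must be given: text (single) or source (batch).
    if (text is None) == (source is None):
        raise ValueError("pass exactly one of text (single) or source (batch)")
    argv = ["python3", "synthesize.py"]
    if text is not None:
        argv += ["--text", text, "--mode", "single"]
    else:
        argv += ["--source", source, "--mode", "batch"]
    argv += ["--model", model, "--restore_step", str(restore_step)]
    argv += ["--dataset", dataset]
    return argv


if __name__ == "__main__":
    # 160000 is only a placeholder for RESTORE_STEP.
    cmd = synthesize_argv(
        "naive", 160000, "LJSpeech",
        source="preprocessed_data/LJSpeech/val.txt",
    )
    print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` directly, which avoids shell quoting problems when the text contains spaces or quotes.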

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Please note that the controllability originates from FastSpeech2 and is not a core concern of DiffSpeech.
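
To illustrate what a duration ratio does, here is a minimal sketch. It assumes, as in FastSpeech2-style length regulation, that the control value multiplies each predicted phoneme duration (in frames) before rounding. The function name and the frame counts are hypothetical and not taken from this repository.

```python
from typing import List


def scale_durations(durations: List[int], ratio: float) -> List[int]:
    """Scale per-phoneme frame counts by a control ratio (assumed behavior)."""
    # Round to whole frames and never go below zero frames.
    return [max(0, round(d * ratio)) for d in durations]


if __name__ == "__main__":
    frames = [5, 12, 7, 9, 10]  # hypothetical predicted durations
    shorter = scale_durations(frames, 0.8)
    print("original total frames:", sum(frames))
    print("scaled total frames:  ", sum(shorter))
```

Under this assumption, a ratio below 1 shortens every phoneme and therefore the whole utterance, while a ratio above 1 slows the speech down.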

Training

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

Preprocessing

First, run

python3 prepare_align.py --dataset DATASET

for some preparations.

For forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.

After that, run the preprocessing script by

python3 preprocess.py --dataset DATASET
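
Before running the preprocessing script, it can help to confirm that the alignments were unzipped into the expected place. The sketch below assumes only the `preprocessed_data/DATASET/TextGrid/` location named above and the `.TextGrid` extension produced by MFA. `count_textgrids` is a hypothetical helper, and the search is recursive because the layout inside the archive is not specified here.

```python
from pathlib import Path


def count_textgrids(dataset: str, root: str = "preprocessed_data") -> int:
    """Count .TextGrid files under <root>/<dataset>/TextGrid/ (recursively)."""
    textgrid_dir = Path(root) / dataset / "TextGrid"
    if not textgrid_dir.is_dir():
        return 0
    return sum(1 for _ in textgrid_dir.rglob("*.TextGrid"))


if __name__ == "__main__":
    found = count_textgrids("LJSpeech")
    if found == 0:
        print("No alignments found; unzip them before running preprocess.py")
    else:
        print(f"Found {found} TextGrid files")
```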

Training

You can train three types of model: 'naive', 'aux', and 'shallow'.

Training Naive Version ('naive'):
Train the naive version with
```
python3 train.py --model naive --dataset DATASET
```
Training Auxiliary Decoder for Shallow Version ('aux'):
To train the shallow version, we need a pre-trained FastSpeech2. The command below trains the FastSpeech2 modules, including the auxiliary decoder.
```
python3 train.py --model aux --dataset DATASET
```
An Easier Trick for Boundary Prediction:
To get the boundary K from our validation dataset, you can run the boundary predictor using the pre-trained auxiliary FastSpeech2 with the following command.
```
python3 boundary_predictor.py --restore_step RESTORE_STEP --dataset DATASET
```
It will print the predicted value (say, K_STEP) to the command log.
Then, set the config with the predicted value as follows:
```
# In the model.yaml
denoiser:
  K_step: K_STEP
```
Please note that this is based on the trick introduced in Appendix B of the DiffSinger paper.
Training Shallow Version ('shallow'):
To leverage the pre-trained FastSpeech2, including the auxiliary decoder, you must set restore_step to the final step of the auxiliary FastSpeech2 training, as in the following command.
```
python3 train.py --model shallow --restore_step RESTORE_STEP --dataset DATASET
```
For example, if the last checkpoint was saved at 160000 steps during the auxiliary training, you have to set restore_step to 160000. It will then load the aux model and continue training under the shallow training mechanism.
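
The order of the shallow-version steps can be summarized as a small plan generator. This is a sketch only: `shallow_training_plan` is a hypothetical helper, and it assumes that the boundary predictor is restored from the same final step of the auxiliary training that the shallow training uses.

```python
from typing import List


def shallow_training_plan(dataset: str, aux_final_step: int) -> List[str]:
    """Return the shallow-version workflow as an ordered list of steps."""
    return [
        # 1. Train the FastSpeech2 modules, including the auxiliary decoder.
        f"python3 train.py --model aux --dataset {dataset}",
        # 2. Predict the boundary K with the trained auxiliary model
        #    (assumption: restored from the final aux step).
        f"python3 boundary_predictor.py --restore_step {aux_final_step}"
        f" --dataset {dataset}",
        # 3. Manual step: copy the printed value into the config.
        "# set denoiser.K_step in model.yaml to the printed value",
        # 4. Continue from the aux checkpoint under shallow training.
        f"python3 train.py --model shallow --restore_step {aux_final_step}"
        f" --dataset {dataset}",
    ]


if __name__ == "__main__":
    for step in shallow_training_plan("LJSpeech", 160000):
        print(step)
```

The third entry is a comment rather than a command because the config edit is done by hand after reading the predictor's output.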

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

Naive Diffusion

Shallow Diffusion

Loss Comparison

Notes

(Naive version of DiffSpeech) The number of learnable parameters is 27.767M, which is close to the 27.722M reported in the original paper.
Unfortunately, the predicted boundary for LJSpeech in the current implementation is 100, which equals the full number of timesteps of the naive diffusion, so the shallow version brings no reduction in diffusion steps.
HiFi-GAN is used for vocoding instead of Parallel WaveGAN (PWG).
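
The first two notes can be checked with simple arithmetic. The sketch below uses only the numbers quoted above, and the helper names are hypothetical.

```python
def relative_gap(ours: float, reference: float) -> float:
    """Relative difference of `ours` with respect to `reference`."""
    return (ours - reference) / reference


def steps_saved(total_steps: int, k_step: int) -> int:
    """Denoising steps skipped when the reverse process starts at k_step."""
    return total_steps - k_step


if __name__ == "__main__":
    # Parameter counts in millions, as quoted in the notes.
    gap = relative_gap(27.767, 27.722)
    print(f"parameter gap: {gap:.2%}")
    # Predicted boundary equals the full number of timesteps.
    print("steps saved:", steps_saved(100, 100))
```

A boundary equal to the total number of timesteps leaves nothing to skip, which is why the shallow version currently shows no advantage in diffusion steps on LJSpeech.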

Citation

@misc{lee2021diffsinger,
  author = {Lee, Keon},
  title = {DiffSinger},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/DiffSinger}}
}

References

MoonInTheRiver's DiffSinger (Authors' codebase)
ming024's FastSpeech2 (Later than 2021.02.26 ver.)
hojonathanho's diffusion
lmnt-com's diffwave
