Convenient default behavior for pipeline TTS usage. #42473
eustlb left a comment
LGTM, thanks @ebezzam! 🤗
    # Add speaker ID if needed and user didn't insert at start of text
    if self.model.config.model_type == "csm":
        text = [f"[0]{t}" if not t.startswith("[") else t for t in text]
    if self.model.config.model_type == "dia":
        text = [f"[S1]{t}" if not t.startswith("[") else t for t in text]
Hum, really really not a fan of such hidden processing. This is where the abstraction of the pipeline (which does make sense if you want to swap models by simply changing the model id) complicates things more than it simplifies them ... but okay to keep here since there is already so much custom processing in the audio pipeline code anyway.
Note we might remove it in the future though, if we find a good API to have specific kwargs for each TTS model and a convenient way to default them.
Definitely, for example preset, as we discussed here
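For readers following the thread, the default-prefix behavior in the diff above can be tried in isolation. This is only a minimal sketch: `add_default_speaker_id` and `DEFAULT_SPEAKER_PREFIX` are hypothetical names introduced for illustration and are not part of the pipeline code; only the `[0]` / `[S1]` prefixes and the `startswith("[")` check come from the diff.

```python
# Hypothetical standalone helper mirroring the diff above; not a transformers API.
# The prefixes per model type are taken from the diff ("csm" -> "[0]", "dia" -> "[S1]").
DEFAULT_SPEAKER_PREFIX = {"csm": "[0]", "dia": "[S1]"}


def add_default_speaker_id(text, model_type):
    """Prepend a default speaker tag to every input that does not start with '['."""
    # Accept a single string or a list of strings, and always return a list.
    texts = [text] if isinstance(text, str) else list(text)
    prefix = DEFAULT_SPEAKER_PREFIX.get(model_type)
    if prefix is None:
        # Other model types are left untouched.
        return texts
    # Same heuristic as the diff: any leading "[" is treated as a user-supplied tag.
    return [t if t.startswith("[") else f"{prefix}{t}" for t in texts]


print(add_default_speaker_id("Hello there", "csm"))
print(add_default_speaker_id(["[S2] Hi", "How are you?"], "dia"))
```

Note that the check is deliberately loose: any text that starts with `[` is passed through unchanged, whether or not the bracketed content is actually a speaker ID.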
HuggingFaceDocBuilderDev commented Nov 28, 2025
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
Related to offline discussion with @eustlb and @Deep-unlearning, let's change the default pipeline TTS behavior to make it easier for users.
I pinned output_audio=True for CSM but also did manual insertion of speaker IDs (for CSM and Dia) to make simple TTS usage more intuitive. See below some CSM and Dia examples.
@Deep-unlearning what do you think about adding such examples to the TTS page (while pruning the verbose comments)?
At least the CSM voice cloning example (and pointing to this dataset so they know what the original voice sounds like).