Convenient default behavior for pipeline TTS usage #42473


Open

ebezzam wants to merge 2 commits into huggingface:main from ebezzam:csm_pipeline

Conversation

ebezzam (Contributor) commented Nov 28, 2025 (edited)

What does this PR do?

Related to an offline discussion with @eustlb and @Deep-unlearning, let's change the default pipeline TTS behavior to make it easier for users.

I pinned `output_audio=True` for CSM and also added automatic insertion of speaker IDs (for CSM and Dia) to make simple TTS usage more intuitive.

See some CSM and Dia examples below.

```python
import soundfile as sf
import torch
from datasets import Audio, load_dataset
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# === CSM ===
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# -- minimal TTS example
torch.manual_seed(0)
outputs = pipe("Hello from Sesame.")  # instead of pipe("[0]Hello from Sesame.")
fn = "csm_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal TTS example with voice cloning
torch.manual_seed(0)
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# === Dia ===
pipe = pipeline("text-to-audio", model="nari-labs/Dia-1.6B-0626", device=device)

# -- minimal TTS example
torch.manual_seed(42)
outputs = pipe(
    "Dia is an open weights text to dialogue model.",
    # instead of pipe("[S1] Dia is an open weights text to dialogue model.")
    generate_kwargs={"max_new_tokens": 256},
)
fn = "dia_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal conversation example
# note: Dia doesn't support chat template for voice cloning
# explicit model loading should be used instead:
# https://huggingface.co/nari-labs/Dia-1.6B-0626#generation-with-text-and-audio-voice-cloning
torch.manual_seed(0)
outputs = pipe(
    "[S1] Dia is an open weights text to dialogue model. \n[S2] That's cool, tell me how it works.",
)
fn = "dia_pipeline_conversation.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")
```

@Deep-unlearning what do you think about adding such examples to the TTS page (while pruning the verbose comments)?

At least the CSM voice cloning example (and pointing to this dataset so they know what the original voice sounds like).

```python
import soundfile as sf
import torch
from datasets import Audio, load_dataset
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# prepare input
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # -- audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # -- desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")
```

eustlb (Contributor) left a comment:
LGTM, thanks @ebezzam! 🤗

Comment on lines +175 to +179
```python
# Add speaker ID if needed and user didn't insert at start of text
if self.model.config.model_type == "csm":
    text = [f"[0]{t}" if not t.startswith("[") else t for t in text]
if self.model.config.model_type == "dia":
    text = [f"[S1]{t}" if not t.startswith("[") else t for t in text]
```
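The prefixing logic in the snippet above can be exercised in isolation. Below is a minimal sketch with a hypothetical helper name (`add_default_speaker_ids` is not part of the transformers API); it assumes `text` is a list of strings, mirroring the PR's behavior of defaulting CSM to speaker `[0]` and Dia to `[S1]` while leaving user-supplied speaker tags untouched:

```python
def add_default_speaker_ids(text, model_type):
    """Prepend a default speaker ID unless the user already supplied one.

    Strings that already start with "[" are assumed to carry an explicit
    speaker tag and are left as-is. (Hypothetical helper for illustration.)
    """
    if model_type == "csm":
        return [f"[0]{t}" if not t.startswith("[") else t for t in text]
    if model_type == "dia":
        return [f"[S1]{t}" if not t.startswith("[") else t for t in text]
    # other model types: no implicit speaker handling
    return text


print(add_default_speaker_ids(["Hello from Sesame."], "csm"))
print(add_default_speaker_ids(["[1]Hello from Sesame."], "csm"))
print(add_default_speaker_ids(["Dia is an open weights model."], "dia"))
```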
eustlb (Contributor) commented:
Hmm, really not a fan of such hidden processing. This is where the abstraction of the pipeline (which does make sense if you want to swap models by simply changing the model id) complicates things more than it simplifies them... but okay to keep here, since there is already so much custom processing in the audio pipeline code anyway.

Note we might remove this in the future though, if we find a good API for model-specific kwargs for each TTS model and a convenient way to default them.

ebezzam (Contributor, Author) replied:

Definitely, for example `preset` as we discussed here.

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Reviewers

eustlb approved these changes


3 participants

@ebezzam, @HuggingFaceDocBuilderDev, @eustlb
