Convenient default behavior for pipeline TTS usage #42473


Open

ebezzam wants to merge 2 commits into huggingface:main from ebezzam:csm_pipeline

Conversation

ebezzam (Contributor) commented Nov 28, 2025 (edited)

What does this PR do?

Related to an offline discussion with @eustlb and @Deep-unlearning, let's change the default pipeline TTS behavior to make it easier for users.

I pinned `output_audio=True` for CSM and also added automatic insertion of speaker IDs (for CSM and Dia) to make simple TTS usage more intuitive.

See some CSM and Dia examples below.

```python
import soundfile as sf
import torch
from datasets import Audio, load_dataset
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# === CSM ===
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# -- minimal TTS example
torch.manual_seed(0)
outputs = pipe("Hello from Sesame.")  # instead of pipe("[0]Hello from Sesame.")
fn = "csm_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal TTS example with voice cloning
torch.manual_seed(0)
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# === Dia ===
pipe = pipeline("text-to-audio", model="nari-labs/Dia-1.6B-0626", device=device)

# -- minimal TTS example
torch.manual_seed(42)
outputs = pipe(
    "Dia is an open weights text to dialogue model.",
    # instead of pipe("[S1] Dia is an open weights text to dialogue model.")
    generate_kwargs={"max_new_tokens": 256},
)
fn = "dia_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal conversation example
# note: Dia doesn't support chat template for voice cloning
# explicit model loading should be used instead:
# https://huggingface.co/nari-labs/Dia-1.6B-0626#generation-with-text-and-audio-voice-cloning
torch.manual_seed(0)
outputs = pipe(
    "[S1] Dia is an open weights text to dialogue model. \n[S2] That's cool, tell me how it works.",
)
fn = "dia_pipeline_conversation.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")
```

@Deep-unlearning what do you think about adding such examples to the TTS page (while pruning the verbose comments)?

At least the CSM voice cloning example (and pointing to this dataset so they know what the original voice sounds like).

```python
import soundfile as sf
import torch
from datasets import Audio, load_dataset
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# prepare input
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # -- audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # -- desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")
```

eustlb (Contributor) left a comment:
LGTM, thanks @ebezzam! 🤗

Comment on lines +175 to +179
```python
# Add speaker ID if needed and user didn't insert at start of text
if self.model.config.model_type == "csm":
    text = [f"[0]{t}" if not t.startswith("[") else t for t in text]
if self.model.config.model_type == "dia":
    text = [f"[S1]{t}" if not t.startswith("[") else t for t in text]
```
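The prefixing logic in the snippet above can be exercised in isolation. Below is a minimal sketch with a hypothetical helper name (`add_default_speaker_ids` is not part of the transformers API); it assumes `text` is a list of strings, mirroring the PR's behavior of defaulting CSM to speaker `[0]` and Dia to `[S1]` while leaving user-supplied speaker tags untouched:

```python
def add_default_speaker_ids(text, model_type):
    """Prepend a default speaker ID unless the user already supplied one.

    Strings that already start with "[" are assumed to carry an explicit
    speaker tag and are left as-is. (Hypothetical helper for illustration.)
    """
    if model_type == "csm":
        return [f"[0]{t}" if not t.startswith("[") else t for t in text]
    if model_type == "dia":
        return [f"[S1]{t}" if not t.startswith("[") else t for t in text]
    # other model types: no implicit speaker handling
    return text


print(add_default_speaker_ids(["Hello from Sesame."], "csm"))
print(add_default_speaker_ids(["[1]Hello from Sesame."], "csm"))
print(add_default_speaker_ids(["Dia is an open weights model."], "dia"))
```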
eustlb (Contributor) commented:
Hmm, really not a fan of such hidden processing. This is where the abstraction of the pipeline (which does make sense if you want to swap models by simply changing the model id) complicates things more than it simplifies them... but okay to keep here, since there is already so much custom processing in the audio pipeline code anyway.

Note we might remove this in the future though, if we find a good API for model-specific kwargs for each TTS model and a convenient way to default them.

ebezzam (Contributor, Author) replied:

Definitely, for example `preset` as we discussed here.

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Reviewers

eustlb approved these changes


3 participants

@ebezzam, @HuggingFaceDocBuilderDev, @eustlb
