- Directory structure for the task under `packages/tasks/src/tasks/audio-text-to-text/`.
- `about.md` with task description, use cases, and Python/JS examples.
- `data.ts` with metadata including example datasets, models, metrics, and demo definitions.
- `spec/input.json` and `spec/output.json` defining the task's input and output schema.

The main task registration file `packages/tasks/src/tasks/index.ts` has been updated to:

- Import and include the 'audio-text-to-text' task data.
- List relevant model libraries (`transformers`, `speechbrain`, `espnet`, `nemo`) for this task type.

This task covers functionalities like automatic speech recognition (ASR) and speech translation.

Add 'audio-text-to-text' task to Hugging Face Tasks

4184b12

This commit introduces the new 'audio-text-to-text' task.

The following has been added:

- Directory structure for the task under `packages/tasks/src/tasks/audio-text-to-text/`.
- `about.md` with task description, use cases, and Python/JS examples.
- `data.ts` with metadata including example datasets, models, metrics, and demo definitions.
- `spec/input.json` and `spec/output.json` defining the task's input and output schema.

The main task registration file `packages/tasks/src/tasks/index.ts` has been updated to:

- Import and include the 'audio-text-to-text' task data.
- List relevant model libraries (`transformers`, `speechbrain`, `espnet`, `nemo`) for this task type.

This task covers functionalities like automatic speech recognition (ASR) and speech translation.

ProCreations-Official requested review from SBrandeis, gary149, Wauplin, julien-c, pcuenca and ngxson as code owners

May 23, 2025 01:06

Merge branch 'main' into feat/add-audio-text-to-text-task

1962aeb

Vaibhavs10 requested review from merveenoyan and Deep-unlearning

May 23, 2025 09:07

Vaibhavs10 reviewed

May 23, 2025

Member

Vaibhavs10 left a comment

Thanks a lot for the PR @ProCreations-Official - in general it'd be good to add more relevant info to models that are capable of the Audio-text-to-text task, so more relevant models would be:

Ultravox, Phi4, Qwen Audio etc.

cc: @Deep-unlearning
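
As a rough sketch of what this suggestion could look like in `data.ts`, the `models` list might point at models that take both audio and a text prompt. The `ExampleModel` type below is a simplified stand-in, not the real type used by the tasks package, and the model ids and descriptions are assumptions that should be verified on the Hub before use.

```typescript
// Simplified stand-in for the model entries in data.ts (assumption, not the real type).
interface ExampleModel {
	description: string;
	id: string;
}

// Candidate audio-text-to-text models named in the review.
// The exact ids are assumptions and should be checked on the Hub.
const models: ExampleModel[] = [
	{
		description: "An instruction-following model that accepts audio and text prompts.",
		id: "Qwen/Qwen2-Audio-7B-Instruct",
	},
	{
		description: "A multimodal model that accepts speech alongside text instructions.",
		id: "microsoft/Phi-4-multimodal-instruct",
	},
	{
		description: "A speech-aware language model.",
		id: "fixie-ai/ultravox-v0_5-llama-3_2-1b",
	},
];

// Basic sanity check: every id has the "owner/name" shape.
const allIdsWellFormed = models.every((m) => /^[^/\s]+\/[^/\s]+$/.test(m.id));
console.log(allIdsWellFormed);
```

The shape check only catches malformed ids; it does not confirm that a repository exists or that it is tagged with this task.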

merveenoyan reviewed

May 23, 2025

Contributor

merveenoyan left a comment

Thanks a lot for working on this! left very general comments 🙂

packages/tasks/src/tasks/audio-text-to-text/about.md

		@@ -0,0 +1,123 @@
		## Audio Text to Text

Contributor

merveenoyan May 23, 2025

Suggested change

## Audio Text to Text

no need for this

packages/tasks/src/tasks/audio-text-to-text/about.md

		@@ -0,0 +1,123 @@
		## Audio Text to Text

		The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.

Contributor

merveenoyan May 23, 2025

this should be in data.ts summary part

packages/tasks/src/tasks/audio-text-to-text/about.md


		The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.

		### Use Cases

Contributor

merveenoyan May 23, 2025

Suggested change

	### Use Cases
	## Use Cases

packages/tasks/src/tasks/audio-text-to-text/about.md


		### Use Cases

		* Speech Recognition: Transcribing spoken language from an audio clip into text. This is foundational for voice assistants, dictation software, and transcribing meetings or interviews.

Contributor

merveenoyan May 23, 2025

these can be separate headers instead of bullet points

packages/tasks/src/tasks/audio-text-to-text/about.md

		* Voice Command Interfaces: Converting spoken commands into text that can then be interpreted by a system to perform actions (e.g., "Play music," "Set a timer").
		* Audio Event Description/Captioning: Generating textual descriptions of sounds or events occurring in an audio stream (though this might sometimes overlap with Audio Tagging).

		### Python Examples

Contributor

merveenoyan May 23, 2025

Suggested change

	### Python Examples
	## Inference

	Transformers

packages/tasks/src/tasks/audio-text-to-text/data.ts

		spaces: [
		{
		description: "A demonstration of the Whisper model for speech recognition.",
		id: "openai/whisper",

Contributor

merveenoyan May 23, 2025

This is ASR, not audio-text-to-text

packages/tasks/src/tasks/audio-text-to-text/data.ts

		},
		{
		description: "An ESPnet demo that can perform speech recognition and translation.",
		id: "espnet/espnet_asr_demo",

Contributor

merveenoyan May 23, 2025

This is ASR, not audio-text-to-text

packages/tasks/src/tasks/audio-text-to-text/data.ts

		},
		{
		description: "A model for translating speech from English to German (example of a speech translation model).",
		id: "facebook/s2t-medium-en-de-st",

Contributor

merveenoyan May 23, 2025

this is speech-to-text, not audio-text-to-text

packages/tasks/src/tasks/audio-text-to-text/data.ts

		models: [
		{
		description: "A popular multilingual model for automatic speech recognition.",
		id: "openai/whisper-base",

Contributor

merveenoyan May 23, 2025

This is ASR, not audio-text-to-text

packages/tasks/src/tasks/index.ts

		@@ -119,7 +120,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
		"audio-classification": ["speechbrain", "transformers", "transformers.js"],
		"audio-to-audio": ["asteroid", "fairseq", "speechbrain"],
		"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
-		"audio-text-to-text": [],
+		"audio-text-to-text": ["transformers", "speechbrain", "espnet", "nemo"],

Contributor

merveenoyan May 23, 2025

I think only transformers supports it
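
If that comment is adopted, the entry in `TASKS_MODEL_LIBRARIES` would shrink to a single library. A minimal sketch follows, assuming narrowed stand-ins for the `PipelineType` and `ModelLibraryKey` unions (the real unions in the package are much larger):

```typescript
// Narrowed stand-ins for the real PipelineType and ModelLibraryKey unions (assumption).
type PipelineType = "automatic-speech-recognition" | "audio-text-to-text";
type ModelLibraryKey = "espnet" | "nemo" | "speechbrain" | "transformers" | "transformers.js";

const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
	"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
	// Per the review comment, only transformers is listed for the new task.
	"audio-text-to-text": ["transformers"],
};

console.log(TASKS_MODEL_LIBRARIES["audio-text-to-text"]);
```

Because the map is typed as `Record<PipelineType, ModelLibraryKey[]>`, a misspelled library name or a missing task key would fail type checking rather than slip through silently.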

Merge branch 'main' into feat/add-audio-text-to-text-task

e86b270

Deep-unlearning reviewed

May 23, 2025

packages/tasks/src/tasks/audio-text-to-text/about.md


		# Initialize the ASR pipeline
		# Replace "openai/whisper-base" with any ASR model of your choice
		asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")

Contributor

Deep-unlearning May 23, 2025

For Automatic Speech Recognition you can use QwenAudio or Granite Speech, which are audio-text-to-text models

Deep-unlearning reviewed

May 23, 2025

packages/tasks/src/tasks/audio-text-to-text/about.md

		# Initialize the speech-to-text translation pipeline
		# Replace "facebook/s2t-small-librispeech-asr" with a speech translation model
		# For example, if you want to translate English audio to French text:
		translator_pipeline = pipeline("automatic-speech-recognition", model="facebook/s2t-small-en-fr-st") # Example model

Contributor

Deep-unlearning May 23, 2025

Same for translation: Granite Speech supports translation En -> X

Deep-unlearning reviewed

May 23, 2025

packages/tasks/src/tasks/audio-text-to-text/about.md (outdated, resolved)

Update packages/tasks/src/tasks/audio-text-to-text/about.md

b64ad84

Co-authored-by: Steven Zheng <58599908+Deep-unlearning@users.noreply.github.com>

4 participants

Add 'audio-text-to-text' task to Hugging Face Tasks #1479

ProCreations-Official commented May 23, 2025