Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Add 'audio-text-to-text' task to Hugging Face Tasks#1479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
ProCreations-Official wants to merge4 commits intohuggingface:main
base:main
Choose a base branch
Loading
fromProCreations-Official:feat/add-audio-text-to-text-task

Conversation

ProCreations-Official

This commit introduces the new 'audio-text-to-text' task.

The following has been added:

  • Directory structure for the task underpackages/tasks/src/tasks/audio-text-to-text/.
  • about.md with task description, use cases, and Python/JS examples.
  • data.ts with metadata including example datasets, models, metrics, and demo definitions.
  • spec/input.json andspec/output.json defining the task's input and output schema.

The main task registration filepackages/tasks/src/tasks/index.ts has been updated to:

  • Import and include the 'audio-text-to-text' task data.
  • List relevant model libraries (transformers,speechbrain,espnet,nemo) for this task type.

This task covers functionalities like automatic speech recognition (ASR) and speech translation.

This commit introduces the new 'audio-text-to-text' task.The following has been added:- Directory structure for the task under `packages/tasks/src/tasks/audio-text-to-text/`.- `about.md` with task description, use cases, and Python/JS examples.- `data.ts` with metadata including example datasets, models, metrics, and demo definitions.- `spec/input.json` and `spec/output.json` defining the task's input and output schema.The main task registration file `packages/tasks/src/tasks/index.ts` has been updated to:- Import and include the 'audio-text-to-text' task data.- List relevant model libraries (`transformers`, `speechbrain`, `espnet`, `nemo`) for this task type.This task covers functionalities like automatic speech recognition (ASR) and speech translation.
Copy link
Member

@Vaibhavs10Vaibhavs10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks a lot for the PR@ProCreations-Official - in general it'd be good to add more relevant info to models that are capable of theAudio-text-to-text task, so more relevant models would be:

Ultravox, Phi4, Qwen Audio etc.

cc:@Deep-unlearning

Copy link
Contributor

@merveenoyanmerveenoyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks a lot for working on this! left very general comments 🙂

@@ -0,0 +1,123 @@
## Audio Text to Text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
## Audio Text to Text

no need for this

@@ -0,0 +1,123 @@
## Audio Text to Text

The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this should be in data.ts summary part


The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.

### Use Cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
###Use Cases
##Use Cases


### Use Cases

* **Speech Recognition:** Transcribing spoken language from an audio clip into text. This is foundational for voice assistants, dictation software, and transcribing meetings or interviews.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

these can be separate headers instead of bullet points

* **Voice Command Interfaces:** Converting spoken commands into text that can then be interpreted by a system to perform actions (e.g., "Play music," "Set a timer").
* **Audio Event Description/Captioning:** Generating textual descriptions of sounds or events occurring in an audio stream (though this might sometimes overlap with Audio Tagging).

### Python Examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
###Python Examples
##Inference
**Transformers**

spaces: [
{
description: "A demonstration of the Whisper model for speech recognition.",
id: "openai/whisper",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is ASR, not audio-text-to-text

},
{
description: "An ESPnet demo that can perform speech recognition and translation.",
id: "espnet/espnet_asr_demo",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is ASR, not audio-text-to-text

},
{
description: "A model for translating speech from English to German (example of a speech translation model).",
id: "facebook/s2t-medium-en-de-st",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this is speech-to-text, not audio-text-to-text

models: [
{
description: "A popular multilingual model for automatic speech recognition.",
id: "openai/whisper-base",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is ASR, not audio-text-to-text

@@ -119,7 +120,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
"audio-classification": ["speechbrain", "transformers", "transformers.js"],
"audio-to-audio": ["asteroid", "fairseq", "speechbrain"],
"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
"audio-text-to-text": [],
"audio-text-to-text": ["transformers", "speechbrain", "espnet", "nemo"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think only transformers supports it


# Initialize the ASR pipeline
# Replace "openai/whisper-base" with any ASR model of your choice
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

For Automatic Speech Recognition you can useQwenAudio orGranite Speech which are audio-text-to-text models

# Initialize the speech-to-text translation pipeline
# Replace "facebook/s2t-small-librispeech-asr" with a speech translation model
# For example, if you want to translate English audio to French text:
translator_pipeline = pipeline("automatic-speech-recognition", model="facebook/s2t-small-en-fr-st") # Example model
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Same for translationGranite Speech support translation En -> X

Co-authored-by: Steven Zheng <58599908+Deep-unlearning@users.noreply.github.com>
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@Vaibhavs10Vaibhavs10Vaibhavs10 left review comments

@merveenoyanmerveenoyanmerveenoyan left review comments

@Deep-unlearningDeep-unlearningDeep-unlearning left review comments

@SBrandeisSBrandeisAwaiting requested review from SBrandeisSBrandeis is a code owner

@gary149gary149Awaiting requested review from gary149gary149 is a code owner

@WauplinWauplinAwaiting requested review from WauplinWauplin is a code owner

@julien-cjulien-cAwaiting requested review from julien-cjulien-c is a code owner

@pcuencapcuencaAwaiting requested review from pcuencapcuenca is a code owner

@ngxsonngxsonAwaiting requested review from ngxsonngxson is a code owner

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

4 participants
@ProCreations-Official@Vaibhavs10@merveenoyan@Deep-unlearning

[8]ページ先頭

©2009-2025 Movatter.jp