Whisper (speech recognition system)

From Wikipedia, the free encyclopedia
Machine learning model for speech
Whisper (speech recognition system)
Original author: OpenAI[1]
Initial release: September 21, 2022
Written in: Python
Type:
License: MIT License
Repository: github.com/openai/whisper

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.[2]

It is capable of transcribing speech in English and several other languages, and can also translate several non-English languages into English.[1] OpenAI claims that the combination of different training data used in its development has led to improved recognition of accents, background noise, and jargon compared to previous approaches.[3] Whisper is a weakly supervised deep learning acoustic model, built on an encoder-decoder transformer architecture.[1]

Background


Speech recognition has a long history in research; the first approaches made use of statistical methods, such as dynamic time warping and later hidden Markov models. Around the 2010s, deep neural network approaches became more common for speech recognition models, enabled by the availability of large datasets ("big data") and increased computational performance.[4] Early deep learning approaches to speech recognition included convolutional neural networks, which were limited by their inability to capture sequential data; this led to the development of Seq2seq approaches, such as recurrent neural networks making use of long short-term memory.[5]

Transformers, introduced in 2017 by Google, displaced many prior state-of-the-art approaches across a wide range of machine learning tasks and became the core neural architecture in fields such as language modeling and computer vision.[6] Weakly supervised approaches to training acoustic models were recognized in the early 2020s as promising for speech recognition using deep neural networks.[7]

According to a New York Times report, in 2021 OpenAI believed it had exhausted its sources of higher-quality data for training its large language models and decided to complement scraped web text with transcriptions of YouTube videos and podcasts, developing Whisper to solve this task.[8]

Whisper Large V2 was released on December 8, 2022,[9] followed by Whisper Large V3 in November 2023 at OpenAI's Dev Day.[10] In March 2025, OpenAI released new transcription models based on GPT-4o and GPT-4o mini, both of which have lower error rates than Whisper.[11]

Architecture

[Figure: OpenAI Whisper architecture. A standard transformer architecture, showing an encoder on the left and a decoder on the right.]

The Whisper architecture is based on an encoder-decoder transformer.[1]

Input audio is resampled to 16,000 Hz and converted to an 80-channel log-magnitude Mel spectrogram using 25 ms windows with a 10 ms stride. The spectrogram is then normalized to a [-1, 1] range with near-zero mean.
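
The following is a minimal Python sketch of this preprocessing step, using the librosa library rather than OpenAI's own implementation; the clamping and rescaling constants are illustrative assumptions chosen to yield a near-zero-mean signal in roughly [-1, 1]:

```python
import numpy as np
import librosa

SAMPLE_RATE = 16_000          # audio is resampled to 16 kHz
N_FFT = 400                   # 25 ms window at 16 kHz
HOP_LENGTH = 160              # 10 ms stride at 16 kHz
N_MELS = 80                   # 80 Mel channels

def log_mel_spectrogram(path: str) -> np.ndarray:
    """Compute an 80-channel log-magnitude Mel spectrogram roughly
    matching the preprocessing described above."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE)  # load and resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS, power=2.0,
    )
    log_spec = np.log10(np.maximum(mel, 1e-10))    # log magnitude, floored to avoid log(0)
    # Clamp the dynamic range and rescale to approximately [-1, 1]
    # (these constants are assumptions for illustration).
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    log_spec = (log_spec + 4.0) / 4.0
    return log_spec  # shape: (80, n_frames)
```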

The encoder takes this Mel spectrogram as input. The spectrogram first passes through two convolutional layers, and sinusoidal positional embeddings are added. The result is then processed by a series of Transformer encoder blocks (with pre-activation residual connections), and the encoder's output is layer-normalized.
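
A condensed PyTorch sketch of an encoder with this structure (two convolutions, sinusoidal positional embeddings, pre-norm Transformer blocks, final layer norm) is shown below; the widths, head counts, and layer counts are illustrative, since the actual values differ between Whisper model sizes:

```python
import math
import torch
import torch.nn as nn

def sinusoids(length: int, channels: int) -> torch.Tensor:
    """Fixed sinusoidal positional embeddings."""
    log_timescale = math.log(10000) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, None] * inv_timescales[None, :]
    return torch.cat([scaled_time.sin(), scaled_time.cos()], dim=1)

class AudioEncoder(nn.Module):
    """Sketch of the encoder: two convolutions over the Mel spectrogram,
    sinusoidal positions, pre-norm Transformer blocks, final layer norm."""
    def __init__(self, n_mels=80, n_ctx=1500, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        self.register_buffer("pos_emb", sinusoids(n_ctx, d_model))
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            norm_first=True, batch_first=True,   # pre-activation residual blocks
        )
        self.blocks = nn.TransformerEncoder(block, n_layers, norm=nn.LayerNorm(d_model))

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 80, n_frames)
        x = torch.nn.functional.gelu(self.conv1(mel))
        x = torch.nn.functional.gelu(self.conv2(x))   # stride 2 halves the frame rate
        x = x.permute(0, 2, 1)                        # (batch, frames, d_model)
        x = x + self.pos_emb[: x.shape[1]]            # add sinusoidal positions
        return self.blocks(x)                         # layer-normalized output
```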

The decoder is a standard Transformer decoder. It has the same width and number of Transformer blocks as the encoder. It uses learned positional embeddings and tied input-output token representations (the same weight matrix is used for both the input and output embeddings). It uses a byte-pair encoding tokenizer of the same kind as used in GPT-2. English-only models use the GPT-2 vocabulary, while multilingual models use a re-trained multilingual vocabulary of the same size.

Special tokens are used to allow the decoder to perform multiple tasks (a sketch of the resulting prompt format follows the list):

  • Tokens that denote the language (one unique token per language).
  • Tokens that specify the task (<|transcribe|> or <|translate|>).
  • A token that indicates that no timestamps are present (<|notimestamps|>). If this token is absent, the decoder predicts timestamps relative to the segment, quantized to 20 ms intervals.
  • <|nospeech|> for voice activity detection.
  • <|startoftranscript|> and <|endoftranscript|>. Any text that appears before <|startoftranscript|> is not generated by the decoder but is given to it as context; the loss is only computed over the non-contextual parts of the sequence, i.e. the tokens between these two special tokens.
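
The sketch below illustrates how such a special-token prefix might be assembled for the decoder; the helper function and its arguments are hypothetical, and only the token strings are taken from the list above:

```python
def build_decoder_prompt(language: str, task: str = "transcribe",
                         timestamps: bool = True, context: str = "") -> list[str]:
    """Assemble the special-token prefix the decoder is conditioned on.
    `context` is prior text given to the model but excluded from the loss."""
    assert task in ("transcribe", "translate")
    prompt = []
    if context:
        # Text before <|startoftranscript|> is context only; no loss is computed on it.
        prompt.append(context)
    prompt.append("<|startoftranscript|>")
    prompt.append(f"<|{language}|>")          # one unique token per language, e.g. <|en|>
    prompt.append(f"<|{task}|>")              # <|transcribe|> or <|translate|>
    if not timestamps:
        prompt.append("<|notimestamps|>")     # otherwise timestamps are predicted in 20 ms steps
    return prompt

# Example: transcribe English audio without timestamps
print(build_decoder_prompt("en", "transcribe", timestamps=False))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
```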

Training data


The training dataset consists of 680,000 hours of labeled audio-transcript pairs sourced from the internet using semi-supervised learning. This includes 117,000 hours in 96 non-English languages and 125,000 hours of X→English translation data, where X stands for any non-English language.[1]

Preprocessing involved standardization of transcripts; filtering to remove machine-generated transcripts using heuristics (e.g., punctuation, capitalization); language identification and matching against transcripts; fuzzy deduplication; and deduplication against evaluation datasets to avoid data contamination. Speechless segments were also included to allow training for voice activity detection. Audio files remaining after filtering were broken into 30-second segments, each paired with the subset of the transcript that occurs within that time. If the predicted spoken language differed from the language of the transcript associated with the audio, that audio-transcript pair was not used for training the speech recognition models, but instead for training translation.
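
A simplified sketch of the 30-second segmentation step is shown below; the timestamped-transcript format is a hypothetical assumption, and the filtering and language-matching logic of the real pipeline is omitted:

```python
SEGMENT_SECONDS = 30
SAMPLE_RATE = 16_000

def segment_audio(audio, words):
    """Split audio into 30-second segments and pair each with the transcript
    words whose timestamps fall inside that window.
    `audio` is a 1-D array of samples at 16 kHz; `words` is a list of
    (start_sec, end_sec, text) tuples (a hypothetical timestamped transcript)."""
    seg_len = SEGMENT_SECONDS * SAMPLE_RATE
    segments = []
    for i in range(0, len(audio), seg_len):
        start_sec = i / SAMPLE_RATE
        end_sec = start_sec + SEGMENT_SECONDS
        text = " ".join(t for (s, e, t) in words if s >= start_sec and e <= end_sec)
        segments.append((audio[i:i + seg_len], text))
    return segments
```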

The model was trained using the AdamW optimizer with gradient norm clipping and a linear learning rate decay with warmup, with a batch size of 256 segments. Training proceeded for 1 million updates (approximately 2–3 epochs). No data augmentation or regularization was used, except for the Large V2 model, which used SpecAugment, stochastic depth, and BPE dropout. Training used data parallelism with float16, dynamic loss scaling, and activation checkpointing.
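
A minimal PyTorch sketch of this optimization setup (AdamW, gradient norm clipping, linear decay with warmup) follows; the stand-in model, peak learning rate, and warmup length are assumptions rather than values reported for Whisper:

```python
import torch

# Stand-in model and illustrative hyperparameters; the peak learning rate and
# warmup length here are assumptions, not values given in the article.
model = torch.nn.Linear(80, 80)
max_updates, warmup_updates, peak_lr = 1_000_000, 2_000, 1e-3

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    """Linear warmup, then linear decay to zero over 1 million updates."""
    if step < warmup_updates:
        return step / max(1, warmup_updates)
    return max(0.0, (max_updates - step) / (max_updates - warmup_updates))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative update on a dummy "batch of 256 segments".
batch = torch.randn(256, 80)
loss = model(batch).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```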

Post-training filtering


After training the first model, researchers ran it on different subsets of the training data, each representing a distinct source. Sources were ranked by a combination of their error rate and size, and manual inspection of the top-ranked sources (high error, large size) helped determine whether a source was low quality (e.g., partial transcriptions, inaccurate alignment). Low-quality sources were then removed. The model was also fine-tuned to suppress the prediction of speaker names.[1]
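
An illustrative sketch of ranking sources by a combination of error rate and size; the scoring function (a simple product) and the example figures are assumptions, as the exact combination used is not specified here:

```python
def rank_sources(stats):
    """Rank data sources for manual inspection, combining error rate and size.
    `stats` maps source name -> (word_error_rate, hours); combining the two
    as a simple product is an assumption made for illustration."""
    scored = {src: wer * hours for src, (wer, hours) in stats.items()}
    return sorted(scored, key=scored.get, reverse=True)

stats = {
    "podcast_feed": (0.32, 12_000),   # high error, large size -> inspect first
    "audiobooks":   (0.08, 30_000),
    "lectures":     (0.15, 4_000),
}
print(rank_sources(stats))   # ['podcast_feed', 'audiobooks', 'lectures']
```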

Capacity


Whisper does not outperform models that specialize in the LibriSpeech dataset, although when tested across many datasets it is more robust and makes 50% fewer errors than other models.[12][non-primary source needed] Whisper's error rate varies across languages, with a higher word error rate for languages that are not well represented in the training data.[13] The authors found that multi-task learning improved overall performance compared to models specialized for one task. They conjectured that the best Whisper model trained is still underfitting the dataset, and that larger models and longer training could result in better models.[1]
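
Word error rate compares a hypothesis transcript against a reference transcript at the word level; a toy example using the third-party jiwer package (not part of Whisper, and not mentioned in the sources) is shown below:

```python
# Requires the third-party `jiwer` package.
import jiwer

reference  = "whisper transcribes speech in many languages"
hypothesis = "whisper transcribe speech in many language"
print(jiwer.wer(reference, hypothesis))   # fraction of word-level errors, here 2/6 ≈ 0.33
```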

Third-party evaluations have found varying levels of AI hallucination. A study of transcripts of public meetings found hallucinations in eight out of every ten transcripts, while an engineer discovered hallucinations in "about half" of 100 hours of transcriptions and a developer identified them in "nearly every one" of 26,000 transcripts.[14] A study of 13,140 short audio segments (averaging 10 seconds) found 187 hallucinations (1.4%), 38% of which generated text that could be harmful because it inserted false references to things like race, non-existent medications, or violent events that were not in the audio.[14][15]

Applications


The model has been used as the base for many applications, such as a unified model for speech recognition and more general sound recognition.[16]
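
For reference, the open-source openai/whisper package linked in the infobox exposes a simple transcription interface; a minimal usage sketch, with a placeholder audio path:

```python
# Minimal usage of the open-source openai/whisper package (pip install -U openai-whisper).
import whisper

model = whisper.load_model("base")            # downloads the "base" checkpoint
result = model.transcribe("audio.mp3")        # path to a local audio file (placeholder)
print(result["language"], result["text"])     # detected language and transcription
```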

References

  1. ^ a b c d e f g Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022-12-06). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  2. ^ Golla, Ramsri Goutham (2023-03-06). "Here Are Six Practical Use Cases for the New Whisper API". Slator. Archived from the original on 2023-03-25. Retrieved 2023-08-12.
  3. ^ Wiggers, Kyle (September 21, 2022). "OpenAI open-sources Whisper, a multilingual speech recognition system". TechCrunch. Archived from the original on February 12, 2023. Retrieved February 12, 2023.
  4. ^ Yu, Dong; Deng, Li (2014). Automatic speech recognition: a deep learning approach. Signals and communication technology (2015th ed.). London, Heidelberg: Springer. p. 9. ISBN 978-1-4471-5778-6.
  5. ^ Latif, Siddique; Zaidi, Aun; Cuayahuitl, Heriberto; Shamshad, Fahad; Shoukat, Moazzam; Qadir, Junaid (2023). "Transformers in Speech Processing: A Survey". arXiv:2303.11607v1 [cs.CL].
  6. ^ Kamath, Uday; Graham, Kenneth L.; Emara, Wael (2022). Transformers for machine learning: a deep dive. Chapman & Hall/CRC machine learning & pattern recognition (First ed.). Boca Raton, London, New York: CRC Press, Taylor & Francis Group. pp. xix. ISBN 978-0-367-76734-1.
  7. ^ Paaß, Gerhard; Giesselbach, Sven (2023-02-16). "Foundation Models for Speech, Images, Videos, and Control". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 313–382. arXiv:2302.08575. doi:10.1007/978-3-031-23190-2_7. ISBN 978-3-031-23189-6. S2CID 257019816.
  8. ^ Davis, Wes (2024-04-06). "OpenAI transcribed over a million hours of YouTube videos to train GPT-4". The Verge. Retrieved 2024-04-20.
  9. ^ "Announcing the large-v2 model · openai/whisper · Discussion #661". GitHub. Retrieved 2024-01-08.
  10. ^ OpenAI DevDay: Opening Keynote, 6 November 2023, retrieved 2024-01-08.
  11. ^ John, Doe (2025-03-20). "OpenAI releases new AI voice models with customizable speaking styles". OpenAI. Retrieved 2026-01-10.
  12. ^ "Introducing Whisper". openai.com. 2022-09-21. Archived from the original on 2023-08-20. Retrieved 2023-08-21.
  13. ^ Wiggers, Kyle (2023-03-01). "OpenAI debuts Whisper API for speech-to-text transcription and translation". TechCrunch. Archived from the original on 2023-07-18. Retrieved 2023-08-21.
  14. ^ a b Burke, Garance; Schellmann, Hilke (2024-10-26). "Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said". AP News. Retrieved 2024-10-28.
  15. ^ Koenecke, Allison; Choi, Anna Seo Gyeong; Mei, Katelyn X.; Schellmann, Hilke; Sloane, Mona (2024-06-03). "Careless Whisper: Speech-to-Text Hallucination Harms". The 2024 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: ACM. pp. 1672–1681. arXiv:2402.08021. doi:10.1145/3630106.3658996. ISBN 979-8-4007-0450-5.
  16. ^ Gong, Yuan; Khurana, Sameer; Karlinsky, Leonid; Glass, James (2023). "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers". Interspeech 2023. pp. 2798–2802. arXiv:2307.03183. doi:10.21437/Interspeech.2023-2193.