Speech recognition is the technology that allows us to transform human speech into digital text. In this tutorial, you will learn how to perform automatic speech recognition with Python.
This is a hands-on guide; you will learn to use the SpeechRecognition library with Google Speech Recognition, the OpenAI Whisper API, and open-source Whisper models with the 🤗 Transformers library.
In this section, we will base our speech recognition system on this tutorial. The SpeechRecognition library offers many transcription engines, such as Google Speech Recognition, and that's what we'll be using.
Before we get started, let's install the required libraries:
$ pip install SpeechRecognition pydub

Open up a new file named speechrecognition.py, and add the following:
# importing libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# create a speech recognition object
r = sr.Recognizer()

The below function loads the audio file, performs speech recognition, and returns the text:
# a function to recognize speech in the audio file
# so that we don't repeat ourselves in other functions
def transcribe_audio(path):
    # use the audio file as the audio source
    with sr.AudioFile(path) as source:
        audio_listened = r.record(source)
        # try converting it to text
        text = r.recognize_google(audio_listened)
    return text
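Before moving on, you can sanity-check transcribe_audio() on a short clip. The file name below is just a placeholder, so point it at any short WAV file you have; the free Google Speech Recognition endpoint works best on short audio, which is exactly why we chunk longer files next:

try:
    # placeholder file name -- replace with your own short WAV file
    print(transcribe_audio("short-clip.wav"))
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as e:
    print("Could not reach the API:", e)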
Next, we make a function to split the audio file into chunks on silence:

# a function that splits the audio file into chunks on silence
# and applies speech recognition
def get_large_audio_transcription_on_silence(path):
    """Split the large audio file into chunks
    and apply speech recognition on each of these chunks"""
    # open the audio file using pydub
    sound = AudioSegment.from_file(path)
    # split audio where silence is 500 milliseconds or more and get chunks
    chunks = split_on_silence(sound,
        # experiment with this value for your target audio file
        min_silence_len=500,
        # adjust this per requirement
        silence_thresh=sound.dBFS-14,
        # keep 500 ms of silence at the ends of each chunk, adjustable as well
        keep_silence=500,
    )
    folder_name = "audio-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # export the audio chunk and save it in
        # the `folder_name` directory
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # try converting it to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text
    # return the text for all chunks detected
    return whole_text

Let's give it a try:
print(get_large_audio_transcription_on_silence("7601-291468-0006.wav"))

Note: You can get all the tutorial files here.
Output:
audio-chunks\chunk1.wav : His abode which you had fixed in a bowery or country seat.
audio-chunks\chunk2.wav : Have a short distance from the city.
audio-chunks\chunk3.wav : Just at what is now called dutch street.
audio-chunks\chunk4.wav : Soon abounded with proofs of his ingenuity.
audio-chunks\chunk5.wav : Patent smokejack.
audio-chunks\chunk6.wav : It required a horse to work some.
audio-chunks\chunk7.wav : Dutch oven roasted meat without fire.
audio-chunks\chunk8.wav : Carts that went before the horses.
audio-chunks\chunk9.wav : Weather cox that turned against the wind and other wrongheaded contrivances.
audio-chunks\chunk10.wav : So just understand confound it all beholders.
His abode which you had fixed in a bowery or country seat. Have a short distance from the city. Just at what is now called dutch street. Soon abounded with proofs of his ingenuity. Patent smokejack. It required a horse to work some. Dutch oven roasted meat without fire. Carts that went before the horses. Weather cox that turned against the wind and other wrongheaded contrivances. So just understand confound it all beholders.

Related: How to Convert Text to Speech in Python.
If you want a more reliable API, I suggest you use OpenAI's Whisper API. To get started, you have to sign up for an OpenAI account here.

Once you have your account, go to the API keys page, and create an API key:

Once you've done that, install the openai library for Python:
$ pip install openai

Now copy the API key to a new Python file, named whisper_api.py:
import openai

# API key
openai.api_key = "<API_KEY>"

Put your API key there, and add the following code:
def get_openai_api_transcription(audio_filename):
    # open the audio file
    with open(audio_filename, "rb") as audio_file:
        # transcribe the audio file; whisper-1 is the model name
        transcription = openai.Audio.transcribe("whisper-1", audio_file)
    return transcription

if __name__ == "__main__":
    transcription = get_openai_api_transcription("7601-291468-0006.wav")
    print(transcription.get("text"))

As you can see, using the OpenAI API is ridiculously simple: we use the openai.Audio.transcribe() method to perform speech recognition. The returned object is an OpenAIObject, from which we can extract the text using the Python dict's get() method.
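Note that the snippet above targets the pre-1.0 openai package. If you have a newer release of the library (openai>=1.0), the legacy openai.Audio.transcribe() interface no longer exists; an equivalent call looks roughly like this (a minimal sketch, assuming openai 1.x and the same whisper-1 model):

from openai import OpenAI

# by default, the client reads the key from the OPENAI_API_KEY environment variable
client = OpenAI(api_key="<API_KEY>")

def get_openai_api_transcription(audio_filename):
    # open the audio file in binary mode and send it to the Whisper API
    with open(audio_filename, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    # newer clients return an object with a .text attribute instead of a dict
    return transcription.text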
Here's the output:

His abode, which he had fixed at a bowery or country seat at a short distance from the city, just at what is now called Dutch Street, soon abounded with proofs of his ingenuity — patent smoke-jacks that required a horse to work them, Dutch ovens that roasted meat without fire, carts that went before the horses, weathercocks that turned against the wind, and other wrong-headed contrivances that astonished and confounded all beholders.

For larger audio files, we can simply use the get_large_audio_transcription_on_silence() function from above, but replace transcribe_audio() with the get_openai_api_transcription() function; you can get the complete code here.
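If you want a sketch of that swap without grabbing the complete code, it looks roughly like this (reusing the pydub imports and chunking parameters from the first section; the helper name is illustrative):

def get_large_audio_transcription_openai(path, folder_name="audio-chunks"):
    """Split a long audio file on silence and send each chunk to the Whisper API."""
    sound = AudioSegment.from_file(path)
    # same silence-based chunking as before
    chunks = split_on_silence(sound, min_silence_len=500,
                              silence_thresh=sound.dBFS - 14, keep_silence=500)
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    for i, audio_chunk in enumerate(chunks, start=1):
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # the only change: call the Whisper API instead of recognize_google()
        transcription = get_openai_api_transcription(chunk_filename)
        whole_text += transcription.get("text", "")
    return whole_text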
Whisper is a general-purpose open-source speech recognition transformer model, trained on a large dataset of diverse, weakly supervised audio (680,000 hours) covering multiple languages and multiple tasks (speech recognition, speech translation, language identification, and voice activity detection).
Whisper models demonstrate a strong ability to generalize to different datasets and domains without the need for fine-tuning. As stated in the Whisper paper: "The goal of Whisper is to develop a single robust speech processing system that works reliably without the need for dataset-specific fine-tuning to achieve high-quality results on specific distributions".
To get started, let's install the required libraries for this:
$ pip install transformers==4.28.1 soundfile sentencepiece torchaudio pydub

We will be using the Huggingface Transformers library to load our Whisper models. Please check this tutorial if you want a more detailed guide on this.
Open up a new Python file named transformers_whisper.py, and add the following code:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# whisper_model_name = "openai/whisper-tiny.en"   # English-only, ~ 151 MB
# whisper_model_name = "openai/whisper-base.en"   # English-only, ~ 290 MB
# whisper_model_name = "openai/whisper-small.en"  # English-only, ~ 967 MB
# whisper_model_name = "openai/whisper-medium.en" # English-only, ~ 3.06 GB
whisper_model_name = "openai/whisper-tiny"        # multilingual, ~ 151 MB
# whisper_model_name = "openai/whisper-base"      # multilingual, ~ 290 MB
# whisper_model_name = "openai/whisper-small"     # multilingual, ~ 967 MB
# whisper_model_name = "openai/whisper-medium"    # multilingual, ~ 3.06 GB
# whisper_model_name = "openai/whisper-large-v2"  # multilingual, ~ 6.17 GB

We have imported WhisperProcessor for processing the audio before inference, and WhisperForConditionalGeneration to perform speech recognition on the processed audio.
You can see the different model versions along with their sizes. The bigger the model, the better the transcription, so keep that in mind. For demonstration purposes, I'm going to use the smallest one, openai/whisper-tiny; it should work on a regular laptop. Let's load it:
# load the model and the processor
whisper_processor = WhisperProcessor.from_pretrained(whisper_model_name)
whisper_model = WhisperForConditionalGeneration.from_pretrained(whisper_model_name).to(device)

Now let's make the function responsible for loading an audio file:
def load_audio(audio_path):
    """Load the audio file & convert to a 16,000 Hz sampling rate"""
    # load our wav file
    speech, sr = torchaudio.load(audio_path)
    resampler = torchaudio.transforms.Resample(sr, 16000)
    speech = resampler(speech)
    return speech.squeeze()

The sampling rate must be fixed: Whisper only works with a 16,000 Hz sampling rate. If you skip this processing step, you'll end up with very strange and wrong transcriptions.
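To double-check what the model will receive, you can inspect a file's native sampling rate with torchaudio.info() before resampling (a quick sanity check on the tutorial's sample file; the printed rate depends on your actual audio):

info = torchaudio.info("7601-291468-0006.wav")
print("original sampling rate:", info.sample_rate)
speech = load_audio("7601-291468-0006.wav")
print("resampled waveform shape:", speech.shape)  # a 1-D tensor, now at 16 kHz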
Next, let's write the inference function:
def get_transcription_whisper(audio_path, model, processor, language="english", skip_special_tokens=True):
    # resample from whatever the audio sampling rate is to 16,000 Hz
    speech = load_audio(audio_path)
    # get the input features from the audio file
    input_features = processor(speech, return_tensors="pt", sampling_rate=16000).input_features.to(device)
    # get the forced decoder ids
    forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")
    # generate the transcription
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    # decode the predicted ids
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=skip_special_tokens)[0]
    return transcription

Inside this function, we load the audio file, preprocess it, generate the transcription, and finally decode the predicted IDs. Again, if you want more details, check this tutorial.
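As a side note, Whisper was also trained for speech translation, so the same machinery can translate non-English speech directly into English text; the only change is the task passed to get_decoder_prompt_ids(). Here is a minimal sketch of that variant (the function name is hypothetical, not part of this tutorial's files):

def get_translation_whisper(audio_path, model, processor, language="arabic"):
    # identical to get_transcription_whisper(), except task="translate",
    # which makes Whisper output English text for non-English speech
    speech = load_audio(audio_path)
    input_features = processor(speech, return_tensors="pt", sampling_rate=16000).input_features.to(device)
    forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="translate")
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]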
Let's try the function:
if __name__ == "__main__": english_transcription = get_transcription_whisper("7601-291468-0006.wav", whisper_model, whisper_processor, language="english", skip_special_tokens=True) print("English transcription:", english_transcription) arabic_transcription = get_transcription_whisper("arabic-audio.wav", whisper_model, whisper_processor, language="arabic", skip_special_tokens=True) print("Arabic transcription:", arabic_transcription) spanish_transcription = get_transcription_whisper("cual-es-la-fecha-cumple.mp3", whisper_model, whisper_processor, language="spanish", skip_special_tokens=True) print("Spanish transcription:", spanish_transcription)Output:
English transcription: His abode, which he had fixed at a bow-ray, or country seat, at a short distance from the city, just that what is now called Dutch Street, soon abounded with proofs of his ingenuity, patent-smoked jacks that required a horse to work them, Dutch ovens that roasted meat without fire, carts that went before the horses, weathercocks that turned against the wind, and other wrong-headed
Arabic transcription: ورجح التقرير الذي أعده معهد أبحاث هضبة التبت في الأكاديمية الصينية للعلوم أن تستمر درجات الحرارة ومستويات الرطوبة في الارتفاع طوال هذا القرن.
Spanish transcription: ¿Cuál es la fecha de tu cumpleaños?

For larger audio files, we can simply use the pipeline API offered by Huggingface to perform speech recognition:
from transformers import pipeline
import torch
import torchaudio

device = "cuda:0" if torch.cuda.is_available() else "cpu"
whisper_model_name = "openai/whisper-tiny"  # multilingual, ~ 151 MB

def load_audio(audio_path):
    """Load the audio file & convert to a 16,000 Hz sampling rate"""
    # load our wav file
    speech, sr = torchaudio.load(audio_path)
    resampler = torchaudio.transforms.Resample(sr, 16000)
    speech = resampler(speech)
    return speech.squeeze()

def get_long_transcription_whisper(audio_path, pipe, return_timestamps=True,
                                   chunk_length_s=10, stride_length_s=1):
    """Get the transcription of a long audio file using the Whisper model"""
    return pipe(load_audio(audio_path).numpy(),
                return_timestamps=return_timestamps,
                chunk_length_s=chunk_length_s,
                stride_length_s=stride_length_s)

if __name__ == "__main__":
    # initialize the pipeline
    pipe = pipeline("automatic-speech-recognition", model=whisper_model_name, device=device)
    # get the transcription of a sample long audio file
    output = get_long_transcription_whisper(
        "7601-291468-0006.wav", pipe, chunk_length_s=10, stride_length_s=2)
    print(f"Transcription: {output}")
    print("="*50)
    for chunk in output["chunks"]:
        # print the timestamp and the text
        print(chunk["timestamp"], ":", chunk["text"])

Here's the output when running the script:
Transcription: {'text': ' His abode, which he had fixed at a bowray, or country seat, at a short distance from the city, just at what is now called Dutch Street, soon abounded with proofs of his ingenuity. Patent smoked jacks that required a horse to work them. Dutch ovens that roasted meat without fire, carts that went before the horses, weathercocks that turned against the wind, and other wrong-headed quadrivances stonished and confounded all beholders.', 'chunks': [{'timestamp': (0.0, 12.56), 'text': ' His abode, which he had fixed at a bowray, or country seat, at a short distance from the city, just at what is now called Dutch Street, soon abounded'}, {'timestamp': (12.56, 15.5), 'text': ' with proofs of his ingenuity.'}, {'timestamp': (15.5, 34.64), 'text': ' Patent smoked jacks that required a horse to work them. Dutch ovens that roasted meat without fire, carts that went before the horses, weathercocks that turned against the wind, and other wrong-headed quadrivances stonished and confounded all beholders.'}]}
==================================================
(0.0, 12.56) : His abode, which he had fixed at a bowray, or country seat, at a short distance from the city, just at what is now called Dutch Street, soon abounded
(12.56, 15.5) : with proofs of his ingenuity.
(15.5, 34.64) : Patent smoked jacks that required a horse to work them. Dutch ovens that roasted meat without fire, carts that went before the horses, weathercocks that turned against the wind, and other wrong-headed quadrivances stonished and confounded all beholders.

This guide provided a comprehensive tutorial on how to perform automatic speech recognition using Python. It covered three main approaches: using the SpeechRecognition library with Google Speech Recognition, leveraging the Whisper API from OpenAI, and performing inference directly on open-source Whisper models using the 🤗 Transformers library.
The tutorial included code examples for each method, demonstrating how to transcribe audio files, split large audio files into chunks, and utilize different models for improved transcription accuracy.
Additionally, the article showcased the multilingual capabilities of the Whisper model and provided a pipeline for transcribing large audio files using the 🤗 Transformers library.
You can get the complete code for this tutorial here.

Learn also: Speech Recognition using Transformers in Python.
Happy transcribing ♥

