Detect different speakers in an audio recording
This page describes how to get labels for different speakers in audio data transcribed by Cloud Speech-to-Text.

Sometimes, audio data contains samples of more than one person talking. For example, audio from a telephone call usually features voices from two or more people. A transcription of the call ideally includes who speaks at which times.
Speaker diarization
Cloud Speech-to-Text can recognize multiple speakers in the same audio clip. When you send an audio transcription request to Cloud Speech-to-Text, you can include a parameter telling Cloud Speech-to-Text to identify the different speakers in the audio sample. This feature, called speaker diarization, detects when speakers change and labels the individual voices detected in the audio by number.
When you enable speaker diarization in your transcription request, Cloud Speech-to-Text attempts to distinguish the different voices included in the audio sample. The transcription result tags each word with a number assigned to an individual speaker. Words spoken by the same speaker bear the same number. A transcription result can include numbers for as many speakers as Cloud Speech-to-Text can uniquely identify in the audio sample.
When you use speaker diarization, Cloud Speech-to-Text produces a running aggregate of all the results provided in the transcription. Each result includes the words from the previous result. Thus, the words array in the final result provides the complete, diarized results of the transcription.
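This aggregation behavior can be illustrated with plain Python against a parsed JSON response. This is a minimal sketch: the field names (`results`, `alternatives`, `words`, `speakerLabel`) match the sample response shown on this page, but the response object here is hand-made for illustration.

```python
# Sketch of the running-aggregate behavior: each result repeats the
# words of the previous result, so the last result's words array alone
# carries the full diarized word list. Field names match the REST
# response; the values are invented for illustration.
response = {
    "results": [
        {"alternatives": [{"words": [
            {"word": "hi", "speakerLabel": "2"},
        ]}]},
        {"alternatives": [{"words": [
            {"word": "hi", "speakerLabel": "2"},
            {"word": "certainly", "speakerLabel": "1"},
        ]}]},
    ]
}

# Only the final result is needed to recover every diarized word.
all_words = response["results"][-1]["alternatives"][0]["words"]
print([w["word"] for w in all_words])  # ['hi', 'certainly']
```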
Review the language support page to see if this feature is available for your language.
Enable speaker diarization in a request
To enable speaker diarization, set the diarization_config field in RecognitionFeatures. Set the min_speaker_count and max_speaker_count values according to how many speakers you expect in the transcript.
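As a sketch, the diarization portion of a recognize request body can be assembled as follows. The JSON field names (`diarizationConfig`, `minSpeakerCount`, `maxSpeakerCount`) come from the REST example on this page; the speaker counts and the sample URI are illustrative values you would replace with your own.

```python
import json

# Sketch of the diarization portion of a recognize request body.
# minSpeakerCount/maxSpeakerCount bound how many distinct voices the
# service tries to separate; set them to your expected speaker range.
request_body = {
    "config": {
        "features": {
            "diarizationConfig": {
                "minSpeakerCount": 2,
                "maxSpeakerCount": 6,
            }
        }
    },
    "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav",
}

print(json.dumps(request_body, indent=2))
```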
Cloud Speech-to-Text supports speaker diarization for all speech recognition methods: speech:recognize and streaming recognition.
Use a local file
The following code samples demonstrate how to enable speaker diarization in a transcription request to Cloud Speech-to-Text using a local file.
REST
Refer to the speech:recognize API endpoint for complete details.
To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.
```shell
curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v2/projects/{project}/locations/{location}/recognizers/{recognizer}:recognize \
    --data '{
      "config": {
        "features": {
          "diarizationConfig": {
            "minSpeakerCount": 2,
            "maxSpeakerCount": 2
          }
        }
      },
      "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
    }' > speaker-diarization.txt
```
If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format, saved to a file named speaker-diarization.txt.
```json
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hi I'd like to buy a Chromecast and I was wondering whether you could help me with that certainly which color would you like we have blue black and red uh let's go with the black one would you like the new Chromecast Ultra model or the regular Chrome Cast regular Chromecast is fine thank you okay sure we like to ship it regular or Express Express please terrific it's on the way thank you thank you very much bye",
          "confidence": 0.92142606,
          "words": [
            { "startOffset": "0s", "endOffset": "1.100s", "word": "hi", "speakerLabel": "2" },
            { "startOffset": "1.100s", "endOffset": "2s", "word": "I'd", "speakerLabel": "2" },
            { "startOffset": "2s", "endOffset": "2s", "word": "like", "speakerLabel": "2" },
            { "startOffset": "2s", "endOffset": "2.100s", "word": "to", "speakerLabel": "2" },
            ...
            { "startOffset": "6.500s", "endOffset": "6.900s", "word": "certainly", "speakerLabel": "1" },
            { "startOffset": "6.900s", "endOffset": "7.300s", "word": "which", "speakerLabel": "1" },
            { "startOffset": "7.300s", "endOffset": "7.500s", "word": "color", "speakerLabel": "1" },
            ...
          ]
        }
      ],
      "languageCode": "en-us"
    }
  ]
}
```
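The diarized words can then be grouped into per-speaker turns, similar to what the Go sample on this page does. This is a sketch assuming word objects with the `word` and `speakerLabel` fields from the response above; the word list here is a shortened, hand-made excerpt of that response.

```python
import itertools

# Shortened excerpt of the diarized words from the sample response.
words = [
    {"word": "hi", "speakerLabel": "2"},
    {"word": "I'd", "speakerLabel": "2"},
    {"word": "like", "speakerLabel": "2"},
    {"word": "certainly", "speakerLabel": "1"},
    {"word": "which", "speakerLabel": "1"},
]

# itertools.groupby collapses consecutive words that share a speaker
# label into one turn, preserving the order in which speakers alternate.
turns = [
    f"Speaker {label}: " + " ".join(w["word"] for w in group)
    for label, group in itertools.groupby(words, key=lambda w: w["speakerLabel"])
]
for turn in turns:
    print(turn)
# Speaker 2: hi I'd like
# Speaker 1: certainly which
```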
Go
To learn how to install and use the client library for Cloud Speech-to-Text, see Cloud Speech-to-Text client libraries. For more information, see the Cloud Speech-to-Text Go API reference documentation.
To authenticate to Cloud Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
```go
import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribe_diarization transcribes a local audio file using speaker diarization.
func transcribe_diarization(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	diarizationConfig := &speechpb.SpeakerDiarizationConfig{
		EnableSpeakerDiarization: true,
		MinSpeakerCount:          2,
		MaxSpeakerCount:          2,
	}

	recognitionConfig := &speechpb.RecognitionConfig{
		Encoding:          speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz:   8000,
		LanguageCode:      "en-US",
		DiarizationConfig: diarizationConfig,
	}

	// Get the contents of the local audio file.
	content, err := os.ReadFile("../resources/commercial_mono.wav")
	if err != nil {
		return fmt.Errorf("error reading file %w", err)
	}
	audio := &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Content{Content: content},
	}

	longRunningRecognizeRequest := &speechpb.LongRunningRecognizeRequest{
		Config: recognitionConfig,
		Audio:  audio,
	}

	operation, err := client.LongRunningRecognize(ctx, longRunningRecognizeRequest)
	if err != nil {
		return fmt.Errorf("error running recognize %w", err)
	}

	response, err := operation.Wait(ctx)
	if err != nil {
		return err
	}

	// Speaker tags are only included in the last result object, which has
	// only one alternative.
	alternative := response.Results[len(response.Results)-1].Alternatives[0]

	wordInfo := alternative.GetWords()[0]
	currentSpeakerTag := wordInfo.GetSpeakerTag()

	var speakerWords strings.Builder
	speakerWords.WriteString(fmt.Sprintf("Speaker %d: %s", wordInfo.GetSpeakerTag(), wordInfo.GetWord()))

	// For each word, collect all the words associated with one speaker. Once
	// the speaker changes, add a new line with the new speaker and their
	// spoken words.
	for i := 1; i < len(alternative.Words); i++ {
		wordInfo := alternative.Words[i]
		if currentSpeakerTag == wordInfo.GetSpeakerTag() {
			speakerWords.WriteString(" ")
			speakerWords.WriteString(wordInfo.GetWord())
		} else {
			speakerWords.WriteString(fmt.Sprintf("\nSpeaker %d: %s", wordInfo.GetSpeakerTag(), wordInfo.GetWord()))
			currentSpeakerTag = wordInfo.GetSpeakerTag()
		}
	}
	fmt.Fprint(w, speakerWords.String())
	return nil
}
```

Python
To learn how to install and use the client library for Cloud STT, seeCloud STT client libraries. For more information, see theCloud STTPython API reference documentation.
To authenticate to Cloud STT, set up Application Default Credentials. For more information, seeSet up authentication for a local development environment.
```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

speech_file = "resources/commercial_mono.wav"

with open(speech_file, "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)

diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=10,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    diarization_config=diarization_config,
)

print("Waiting for operation to complete...")
response = client.recognize(config=config, audio=audio)

# The transcript within each result is separate and sequential per result.
# However, the words list within an alternative includes all the words
# from all the results thus far. Thus, to get all the words with speaker
# tags, you only have to take the words list from the last result:
result = response.results[-1]

words_info = result.alternatives[0].words

# Printing out the output:
for word_info in words_info:
    print(f"word: '{word_info.word}', speaker_tag: {word_info.speaker_tag}")
```

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-17 UTC.