CN113516987B - Speaker recognition method, speaker recognition device, storage medium and equipment

Speaker recognition method, speaker recognition device, storage medium and equipment

Info

Publication number
CN113516987B
CN113516987B
Authority
CN
China
Prior art keywords
target
speaker
voice
sample
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110807643.7A
Other languages
Chinese (zh)
Other versions
CN113516987A (en)
Inventor
田敬广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202110807643.7A
Publication of CN113516987A
Application granted
Publication of CN113516987B
Status: Active
Anticipated expiration

Abstract

The application discloses a speaker recognition method, a speaker recognition device, a storage medium and a device, wherein the speaker recognition method comprises the following steps: first, acquiring the target voice to be recognized, determining the sampling rate of the target voice, and extracting a first acoustic feature from the target voice; then processing the first acoustic feature based on the sampling rate to obtain a second acoustic feature, inputting the second acoustic feature into a pre-constructed speaker recognition model, and recognizing a target characterization vector of the target speaker, where the speaker recognition model is obtained by jointly training on voices with different sampling rates; finally, identifying the target speaker according to the target characterization vector to obtain a recognition result for the target speaker. Because it is the second acoustic feature that is input into the pre-constructed speaker recognition model, no effect is lost when high-frequency voice acoustic features are input, and the effect degradation caused by inputting low-frequency voice acoustic features is compensated, thereby improving the accuracy of the recognition result.

Description

Speaker recognition method, speaker recognition device, storage medium and equipment
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speaker recognition method, apparatus, storage medium, and device.
Background
With the continuous breakthroughs in artificial intelligence technology and the increasing popularity of intelligent terminal devices, human-computer interaction occurs more and more frequently in people's daily work and life. Voice interaction, as the next generation of human-computer interaction, brings great convenience to people's lives; particularly important is the technology of recognizing a speaker based on voice, known as speaker recognition. For example, speaker recognition may be applied in fields that require accurately confirming a speaker's identity from voice data, such as court trials, remote financial services, security, and voice search.
The traditional speaker recognition approach trains and maintains separate speaker recognition models for voices at two different sampling rates, wideband and narrowband. The deployment cost is high, and similarity matching cannot be performed across the two speaker recognition models, so the accuracy of the recognition result is low.
Therefore, how to perform recognition with a single model while improving the accuracy of the speaker recognition result is an urgent technical problem to be solved.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a speaker recognition method, apparatus, storage medium and device, which can effectively improve accuracy of recognition results when performing speaker recognition.
The embodiment of the application provides a speaker identification method, which comprises the following steps:
acquiring target voice to be recognized, and determining the sampling rate of the target voice;
extracting a first acoustic feature from the target voice; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
inputting the second acoustic features into a pre-constructed speaker recognition model, and recognizing to obtain a target characterization vector of a target speaker; the speaker recognition model is obtained by jointly training voices with different sampling rates;
and identifying the target speaker according to the target characterization vector to obtain an identification result of the target speaker.
In one possible implementation manner, the speaker recognition model is constructed as follows:
acquiring a first sample voice corresponding to a first sampling rate and a teacher speaker recognition model; the teacher speaker recognition model is obtained based on voice training of a first sampling rate;
acquiring a second sample voice corresponding to a second sampling rate; extracting acoustic features of the second sample voice from the second sample voice; the first sample voice and the second sample voice belong to the same sample speaker;
inputting the acoustic characteristics of the first sample voice into the teacher speaker recognition model to obtain a first sample characterization vector; inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector;
training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In a possible implementation manner, the processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature includes:
when the sampling rate of the target voice is determined to be the first sampling rate, directly taking the first acoustic feature as a second acoustic feature;
and when the sampling rate of the target voice is determined to be the second sampling rate, processing the first acoustic feature to obtain a second acoustic feature.
In a possible implementation, the first sampling rate is higher than the second sampling rate, and the first acoustic feature comprises a logarithmic mel-filter bank FBANK feature; and when the sampling rate of the target voice is determined to be the second sampling rate, processing the first acoustic feature to obtain a second acoustic feature, including:
adjusting the number of filters for filtering the power spectrum of the first acoustic feature to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with the low-frequency band region of the acoustic feature of the voice corresponding to the first sampling rate;
and zero padding the difference dimension of the adjusted first acoustic feature and the acoustic feature of the voice corresponding to the first sampling rate so that the dimension of the first acoustic feature after zero padding is the same as the dimension of the acoustic feature of the voice corresponding to the first sampling rate, and taking the first acoustic feature after zero padding as a second acoustic feature.
In a possible implementation manner, the inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model includes: and after the acoustic features corresponding to the second sample voice are processed, inputting the initial speaker recognition model.
In a possible implementation manner, the training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector, and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model includes:
calculating cosine similarity between the first sample characterization vector and the second sample characterization vector as a first cosine loss;
calculating cosine similarity between the first sample characterization vector and the third sample characterization vector as a second cosine loss;
and calculating the sum value of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum value to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one possible implementation, the target speech includes M segments of speech; m is a positive integer greater than 1; extracting a first acoustic feature from the target voice; and processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature, including:
respectively extracting M first acoustic features of the M sections of voice from the M sections of voice; processing the M first acoustic features based on the respective sampling rates of the M sections of voice to obtain M second acoustic features;
the second acoustic feature is input to a pre-constructed speaker recognition model, and a target characterization vector of a target speaker is obtained through recognition, and the method comprises the following steps:
respectively inputting the M second acoustic features into a pre-constructed speaker recognition model, and recognizing to obtain M target characterization vectors corresponding to a target speaker;
and calculating the average value of the M target characterization vectors, and taking the average value as a final target characterization vector corresponding to the target speaker.
In a possible implementation manner, the identifying the target speaker according to the target token vector to obtain an identification result of the target speaker includes:
calculating the similarity between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker;
judging whether the similarity is higher than a preset threshold, if so, determining that the target speaker is the preset speaker; if not, determining that the target speaker is not the preset speaker.
In a possible implementation manner, the identifying the target speaker according to the target token vector to obtain an identification result of the target speaker includes:
calculating N similarity between the target characterization vector of the target speaker and N preset characterization vectors of N preset speakers; the N is a positive integer greater than 1;
and selecting the maximum similarity from the N similarities, and determining the target speaker as the preset speaker corresponding to the maximum similarity.
The embodiment of the application also provides a speaker identification device, which comprises:
the first acquisition unit is used for acquiring target voice to be recognized; and determining a sampling rate of the target speech;
a processing unit for extracting a first acoustic feature from the target speech; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
the first recognition unit is used for inputting the second acoustic characteristics into a pre-constructed speaker recognition model, and recognizing to obtain a target characterization vector of a target speaker; the speaker recognition model is obtained by jointly training voices with different sampling rates;
and the second recognition unit is used for recognizing the target speaker according to the target characterization vector to obtain a recognition result of the target speaker.
In a possible implementation manner, the apparatus further includes:
the second acquisition unit is used for acquiring a first sample voice corresponding to the first sampling rate and a teacher speaker recognition model; the teacher speaker recognition model is obtained based on voice training of a first sampling rate;
the third acquisition unit is used for acquiring a second sample voice corresponding to the second sampling rate; extracting acoustic features of the second sample voice from the second sample voice; the first sample voice and the second sample voice belong to the same sample speaker;
the obtaining unit is used for inputting the acoustic characteristics of the first sample voice into the teacher speaker recognition model to obtain a first sample characterization vector; inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector;
the training unit is used for training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In a possible implementation manner, the processing unit includes:
a first processing subunit, configured to directly take the first acoustic feature as a second acoustic feature when determining that the sampling rate of the target speech is the first sampling rate;
and the second processing subunit is used for processing the first acoustic feature to obtain a second acoustic feature when the sampling rate of the target voice is determined to be the second sampling rate.
In a possible implementation, the first sampling rate is higher than the second sampling rate, and the first acoustic feature comprises a logarithmic mel-filter bank FBANK feature; the second processing subunit includes:
an adjusting subunit, configured to adjust the number of filters for filtering the power spectrum of the first acoustic feature, so as to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with a low-frequency band region of an acoustic feature of the voice corresponding to the first sampling rate;
and the zero-filling subunit is used for zero-filling the difference dimension between the adjusted first acoustic feature and the acoustic feature of the voice corresponding to the first sampling rate so that the dimension of the first acoustic feature after zero filling is the same as the dimension of the acoustic feature of the voice corresponding to the first sampling rate, and taking the first acoustic feature after zero filling as a second acoustic feature.
In a possible implementation manner, the obtaining unit is specifically configured to:
and after the acoustic features corresponding to the second sample voice are processed, inputting the initial speaker recognition model.
In a possible implementation manner, the training unit includes:
a first calculating subunit, configured to calculate, as a first cosine loss, a cosine similarity between the first sample characterization vector and the second sample characterization vector;
a second calculating subunit, configured to calculate, as a second cosine loss, a cosine similarity between the first sample characterization vector and the third sample characterization vector;
and the training subunit is used for calculating the sum value of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum value, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one possible implementation, the target speech includes M segments of speech; m is a positive integer greater than 1; the processing unit is specifically configured to:
respectively extracting M first acoustic features of the M sections of voice from the M sections of voice; processing the M first acoustic features based on the respective sampling rates of the M sections of voice to obtain M second acoustic features;
the first recognition unit includes:
the recognition subunit is used for respectively inputting the M second acoustic features into a pre-constructed speaker recognition model to obtain M target characterization vectors corresponding to a target speaker through recognition;
and the third calculation subunit is used for calculating the average value of the M target characterization vectors and taking the average value as a final target characterization vector corresponding to the target speaker.
In a possible implementation manner, the second identifying unit includes:
a fourth calculating subunit, configured to calculate a similarity between a target token vector of the target speaker and a preset token vector of a preset speaker;
the first determining subunit is configured to determine whether the similarity is higher than a preset threshold, and if yes, determine that the target speaker is the preset speaker; if not, determining that the target speaker is not the preset speaker.
In a possible implementation manner, the second identifying unit includes:
a fifth calculating subunit, configured to calculate N similarities between the target token vector of the target speaker and N preset token vectors of N preset speakers; the N is a positive integer greater than 1;
and the second determining subunit is used for selecting the maximum similarity from the N similarities and determining the target speaker as a preset speaker corresponding to the maximum similarity.
The embodiment of the application also provides speaker identification equipment, which comprises: a processor, memory, system bus;
the processor and the memory are connected through the system bus;
the memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform any of the implementations of the speaker recognition method described above.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores instructions, and when the instructions are run on the terminal equipment, the terminal equipment is caused to execute any implementation mode of the speaker identification method.
The embodiment of the application also provides a computer program product, which when being run on a terminal device, causes the terminal device to execute any implementation mode of the speaker identification method.
The embodiments of the present application provide a speaker recognition method, apparatus, storage medium and device. First, the target voice to be recognized is acquired, the sampling rate of the target voice is determined, and a first acoustic feature is extracted from the target voice; the first acoustic feature is processed based on the sampling rate of the target voice to obtain a second acoustic feature, the second acoustic feature is input into a pre-constructed speaker recognition model, and a target characterization vector of the target speaker is obtained through recognition, where the speaker recognition model is obtained by jointly training on voices with different sampling rates; the target speaker is then identified according to the target characterization vector to obtain a recognition result for the target speaker. Thus, in the embodiments of the present application, inputting the second acoustic feature corresponding to the target voice into the pre-constructed speaker recognition model ensures no loss of effect when high-frequency voice acoustic features are input, while compensating for the effect degradation caused by inputting low-frequency voice acoustic features, and the target characterization vector of the target speaker can be predicted. The high-frequency information missing from low-frequency voice acoustic features is thereby compensated without increasing the number of parameters of the speaker recognition model, and the same speaker recognition model achieves good recognition results on both low-frequency and high-frequency target voice data, improving the accuracy of the speaker recognition result.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a speaker recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of constructing a speaker recognition model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a speaker recognition model according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a speaker recognition device according to an embodiment of the present application.
Detailed Description
With the rapid development of intelligent recognition technology, more and more scenarios require biometric technology to recognize speakers, such as financial security, smart homes, and judicial administration. The traditional speaker recognition approach trains and maintains separate speaker recognition models for voices at two different sampling rates, wideband and narrowband. The deployment cost is high, and speaker models trained on voices with different sampling rates cannot be used for similarity matching with each other, so the accuracy of the recognition result is low.
In this regard, the existing speaker recognition method usually trains a mixed bandwidth speaker recognition model to recognize voices with different sampling rates, but the obtained speaker recognition result is not satisfactory. In particular, existing methods of training mixed bandwidth speaker recognition models generally include the following three types:
the first method is a downsampling method, which directly downsamples wideband speech into narrowband speech, extracts narrowband acoustic features, and uniformly trains a speaker recognition model by using the narrowband acoustic features to perform speaker recognition, but because the downsampling method ignores high-frequency information in the wideband speech, related research shows that the high-frequency information in the speech is helpful for distinguishing speakers, the method sacrifices the effect of the speaker recognition model, and the accuracy of the model recognition result is lower.
The second is the upsampling method, which directly upsamples narrowband speech into wideband speech, extracts wideband acoustic features, and trains a single speaker recognition model on these wideband acoustic features to perform speaker recognition. Although the upsampling method does not lose the high-frequency information of wideband speech, it does not compensate for the high-frequency information missing from narrowband speech, so the accuracy of the model's recognition result is lower than for wideband speech recognition. A common shortcoming of these two methods is that the training data must carry speaker labels, so large-scale unlabeled speaker data cannot be used. As is well known, manually annotating training data with speaker labels incurs enormous time and economic cost, which limits the scale of the training data set, so the accuracy of the model training result cannot be guaranteed.
The third is a bandwidth expansion method, which extracts acoustic features of narrowband speech and wideband speech respectively, trains a bandwidth expansion neural network, converts the narrowband acoustic features into wideband acoustic features, recovers missing high-frequency band information, and trains a speaker recognition model by using the wideband acoustic features to perform speaker recognition.
Therefore, both the conventional speaker recognition method and the existing speaker recognition method are not accurate enough for speaker recognition.
In order to overcome the above-mentioned defects, the present application provides a speaker recognition method. First, the target voice to be recognized is acquired, the sampling rate of the target voice is determined, and a first acoustic feature is extracted from the target voice; the first acoustic feature is processed based on the sampling rate of the target voice to obtain a second acoustic feature, the second acoustic feature is input into a pre-constructed speaker recognition model, and a target characterization vector of the target speaker is obtained through recognition, where the speaker recognition model is obtained by jointly training on voices with different sampling rates; the target speaker is then identified according to the target characterization vector to obtain a recognition result for the target speaker. Thus, in the embodiments of the present application, inputting the second acoustic feature corresponding to the target voice into the pre-constructed speaker recognition model ensures no loss of effect when high-frequency voice acoustic features are input, while compensating for the effect degradation caused by inputting low-frequency voice acoustic features, and the target characterization vector of the target speaker can be predicted. The high-frequency information missing from low-frequency voice acoustic features is thereby compensated without increasing the number of parameters of the speaker recognition model, and the same speaker recognition model achieves good recognition results on both low-frequency and high-frequency target voice data, improving the accuracy of the speaker recognition result.
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
First embodiment
Referring to fig. 1, a flow chart of a speaker identification method provided in this embodiment includes the following steps:
s101: and acquiring target voice to be recognized, and determining the sampling rate of the target voice.
In this embodiment, any speaker that needs to be identified is defined as a target speaker, and a voice that the target speaker needs to identify is defined as a target voice. It should be noted that, the present embodiment is not limited to the language type of the target voice, for example, the target voice may be a voice formed by chinese or a voice formed by english, etc.; meanwhile, the length of the target voice is not limited in this embodiment, for example, the target voice may be one sentence, or multiple sentences.
It can be understood that the target voice can be obtained by recording according to actual needs; for example, a telephone call from people's daily life or a conference recording can be used as the target voice. The sampling rate of the target voice is determined when the target voice is acquired, so that the target voice can be processed by the scheme provided in this embodiment to recognize the identity of the target speaker who spoke it.
The sampling rate (i.e., sampling frequency) refers to how many samples the recording device takes from the analog signal per unit time; the higher the sampling frequency, the more naturally the sampled waveform restores the original sound wave. The sampling rate is expressed in hertz (Hz). Different voices may correspond to a variety of different sampling rates. For example, for a telephone signal with a 4000Hz bandwidth in a network, according to the Nyquist law, the signal must be sampled at a rate of 8000Hz for the sampled digital signal to fully retain the information in the telephone signal. With the development of computer network technology, a sampling rate of 16000Hz is widely used for sampling voices such as Internet audio.
It will be appreciated that the sampling rate may comprise a variety of different sampling rates, such as high frequency (e.g., 16000Hz, etc.) and low frequency (e.g., 8000Hz, etc.), with corresponding sampled speech being high frequency speech and low frequency speech, respectively. The high frequency speech may be referred to as wideband speech, the low frequency speech may be referred to as narrowband speech, the wideband speech may have a higher sampling rate than the narrowband speech, e.g., the wideband speech may have a sampling rate twice that of the narrowband speech (e.g., 16000Hz is twice 8000 Hz), etc.
S102: extracting a first acoustic feature from the target speech; and processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature.
In this embodiment, after the target voice to be recognized is obtained and its sampling rate is determined, the identity information of the target speaker who spoke it must be recognized accurately. First, an acoustic feature representing the voiceprint information of the target voice is extracted from the target voice using a feature extraction method and defined as the first acoustic feature. The first acoustic feature is then processed according to the sampling rate of the target voice to obtain the second acoustic feature. The resulting second acoustic feature can then serve as the basis for recognition, enabling effective recognition of the target voice through the subsequent steps S103-S104 and, in turn, recognition of the target speaker's identity.
Specifically, when extracting the first acoustic feature of the target voice, the target voice is first framed to obtain a corresponding sequence of voice frames, and the framed sequence is then pre-emphasized; the acoustic features of each voice frame are extracted in turn, where an acoustic feature refers to feature data representing the voiceprint information of the corresponding voice frame. For example, the acoustic feature may be a Mel-scale Frequency Cepstral Coefficients (MFCC) feature or a Log Mel-Filter Bank (FBANK) feature.
It should be noted that the embodiments of the present application do not limit the extraction method of the first acoustic feature of the target voice, nor a specific extraction process; an appropriate extraction method may be selected according to the actual situation, and the corresponding feature extraction operation performed. For ease of understanding, the following description of this embodiment takes the FBANK feature as the first acoustic feature of the target voice.
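For illustration only, the following is a minimal Python sketch of FBANK feature extraction using the librosa library; the frame length, frame shift, and filter count are assumed example values, not parameters fixed by this embodiment:

```python
import numpy as np
import librosa

def extract_fbank(wav_path, n_mels=40, frame_len=0.025, frame_shift=0.01):
    """Extract log Mel-filter bank (FBANK) features from a speech file.

    n_mels, frame_len and frame_shift are illustrative assumptions,
    not values mandated by the patent.
    """
    y, sr = librosa.load(wav_path, sr=None)        # keep the native sampling rate
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])     # pre-emphasis
    mel_spec = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(sr * frame_len),
        hop_length=int(sr * frame_shift),
        n_mels=n_mels,
        power=2.0)                                 # power spectrum -> Mel filtering
    fbank = np.log(mel_spec + 1e-10).T             # (num_frames, n_mels) log energies
    return fbank, sr
```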
Further, in an alternative implementation, when the sampling rate of the target voice is the first sampling rate, for example when the target voice is wideband speech, the first acoustic feature (such as the FBANK feature) extracted from the target voice is not processed further but is used directly as the second acoustic feature, so that the subsequent steps S103-S104 can effectively recognize the target voice and, in turn, the identity of the target speaker.
Alternatively, in another implementation, when the sampling rate of the target voice is the second sampling rate, for example when the target voice is narrowband speech, the difference between the narrowband acoustic feature and the wideband acoustic feature must be reduced. After the FBANK feature of the target voice is extracted as the first acoustic feature, the number of filters for filtering the power spectrum of the first acoustic feature is adjusted to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with the low-frequency band region of the acoustic feature of the voice corresponding to the first sampling rate. The difference dimensions between the adjusted first acoustic feature and the acoustic feature of the voice corresponding to the first sampling rate are then zero-padded, so that the zero-padded first acoustic feature has the same dimensionality as the acoustic feature of the voice corresponding to the first sampling rate. The zero-padded first acoustic feature is used as the second acoustic feature for the subsequent steps S103-S104, enabling effective recognition of the target voice and, in turn, of the target speaker's identity.
For example: for narrowband target speech with a sampling rate of 8000Hz, the corresponding FBANK features can represent information in the 0-4000Hz band, while for speech with a sampling rate of 16000Hz, the corresponding FBANK features can represent information in the 0-8000Hz band. Compared with speech at a 16000Hz sampling rate, target speech at an 8000Hz sampling rate is thus missing the information in the 4000-8000Hz band.
To address this, according to the conversion formula between frequency f and Mel-scale frequency m, m = 2595 log10(1 + f/700), the relationship between the numbers of Mel filters for narrowband speech and wideband speech can be calculated as follows:

M_N = M_W × log10(1 + f_N/700) / log10(1 + f_W/700)   (1)

where f_N denotes the frequency-domain upper bound of the FBANK features of the narrowband target speech, f_W denotes the frequency-domain upper bound of the FBANK features of the wideband speech, and M_N and M_W denote the numbers of filters used to filter the power spectra of the narrowband target speech FBANK features and the wideband speech FBANK features, respectively.
Illustrating: when fW =8000、fN =4000、MW When=40, M can be calculated according to the above formula (1)N =30.2. At this time, a rounding operation may be performed to divide MN Forced value is 30, and f is calculated backN Is 3978. Thus, when the FBANK feature for speech extraction at a 16000Hz sampling rate is 40 dimensions and the FBANK feature for speech extraction at a 8000Hz sampling rate is 30 dimensions, the two are aligned in the 0-3978Hz frequency band. Then, zero values of 10 dimensions can be supplemented for 30-dimensional FBANK extracted from target voice with 8000Hz sampling rate, so that the dimensions of the FBANK characteristics extracted from target voice with 8000Hz sampling rate and wideband voice with 16000Hz sampling rate are 40 dimensions, and the difference between the two is further reduced.
S103: inputting the second acoustic feature into a pre-constructed speaker recognition model, and recognizing to obtain a target characterization vector of the target speaker, where the speaker recognition model is obtained by jointly training on voices with different sampling rates. Typically, if the target speech to be recognized includes speech corresponding to the first sampling rate and speech corresponding to the second sampling rate, the speaker recognition model is likewise trained on sample speech at the first sampling rate and sample speech at the second sampling rate.
In this embodiment, after the second acoustic feature of the target voice is obtained in step S102, in order to effectively improve the accuracy of the recognition result, the second acoustic feature may be further input into a pre-constructed speaker recognition model, so as to recognize and obtain the target characterization vector of the target speaker when speaking the voice content of the target voice, so as to execute the subsequent step S104. It should be noted that, the specific format of the target token vector may be set according to practical situations, which is not limited in this embodiment, for example, the target token vector may be a 256-dimensional vector or the like.
Compared with acoustic features (such as FBANK features) at the frame level, the target characterization vector represents the sentence-level acoustic information of the target voice, comprehensively considers the relation between each voice frame and the context of the voice frame, and can more accurately characterize the voice information of the target voice. The speaker recognition model is trained by using voices with different sampling rates (e.g., wideband voices with a first sampling rate and narrowband voices with a second sampling rate). Therefore, no matter whether the target voice is the voice with the first sampling rate or the voice with the second sampling rate, after the corresponding second acoustic feature is input into the speaker recognition model, the target characterization vector for more accurately characterizing the personalized voice information of the target voice can be obtained, and then the target characterization vector can be utilized to recognize the target speaker to which the target voice belongs through the subsequent step S104 so as to determine the identity information of the target speaker.
Next, the present embodiment will be described with reference to a process for constructing a speaker recognition model, as shown in fig. 2, which shows a schematic flow chart for constructing a speaker recognition model according to the present embodiment, where the flow chart includes the following steps A1-A4:
step A1: acquiring a first sample voice corresponding to a first sampling rate and a teacher speaker recognition model; the teacher speaker recognition model is obtained based on sample voice training corresponding to the first sampling rate.
In this embodiment, a large amount of preparatory work is needed before the speaker recognition model can be constructed. First, a large number of voices corresponding to a first sampling rate, such as wideband voice data uttered by users while speaking, must be collected; for example, the voice data corresponding to the first sampling rate may be picked up by a microphone array, and the pickup device may be a tablet computer or an intelligent hardware device such as a smart speaker, television, or air conditioner. Noise reduction is generally required after a large amount of high-frequency voice data is collected. Each collected piece of high-frequency (e.g., wideband) voice data from each user can then be used as a first sample voice, and these first sample voices can be used to train the teacher speaker recognition model for the subsequent step A2.
The implementation of training the teacher speaker recognition model with the first sample voices may specifically include the following steps A11-A12. It should be noted that the following steps describe the training process of the teacher speaker recognition model taking the first sample voice as a wideband sample voice as an example:
step A11: wideband acoustic features characterizing acoustic information of wideband sample speech are extracted from the wideband sample speech.
In this embodiment, after the wideband sample voices are obtained, they cannot be used directly to train the teacher speaker recognition model. Instead, a method similar to the extraction of the second acoustic feature of the target voice in step S102 is applied, with the wideband sample voice taking the place of the target voice, so that the wideband acoustic features of each wideband sample voice can be extracted; for the relevant details, refer to the description of step S102, which is not repeated here.
Step A12: training according to broadband acoustic characteristics of broadband sample voice and speaker identification tags corresponding to the broadband sample voice, and generating a teacher speaker identification model.
In this embodiment, a neural network may first be selected as the initialized speaker recognition model and its model parameters initialized, such as the neural network shown in the left-hand diagram of fig. 3. It should be noted that this embodiment does not limit the specific network structure of the model, which may be any form of neural network, for example a convolutional neural network (CNN), a recurrent neural network (RNN), a deep neural network (DNN), or an x-vector system structure.
Then, as shown in fig. 3, the initialized speaker recognition model (i.e., the neural network shown in the left-hand diagram of fig. 3) can be trained round by round, sequentially using the wideband acoustic features corresponding to each wideband sample voice to update the parameters; after multiple rounds of parameter updating (i.e., once the training end condition is met, for example the preset number of training rounds is reached or the variation of the model parameters falls below a preset threshold), the teacher speaker recognition model is obtained.
Specifically, during training, an alternative implementation may use a given objective function to construct the teacher speaker recognition model and update the network parameters of the model. The objective function used in this embodiment is as follows:

L_s = -(1/M) Σ_{i=1}^{M} log( exp(w_{y_i}^T x_i + b_{y_i}) / Σ_{j=1}^{N} exp(w_j^T x_i + b_j) )   (2)

where x_i denotes the acoustic feature vector of the i-th wideband sample voice; y_i denotes the manually annotated speaker label corresponding to the i-th wideband sample voice; w_j and b are model parameters, specifically, w_j denotes the j-th column of the model classification layer weight matrix and b denotes the bias term; and M and N denote the total number of wideband sample voices and the total number of speakers corresponding to these wideband sample voices, respectively.

When the teacher speaker recognition model is trained with the objective function in formula (2), the model parameters (i.e., w and b) are updated continuously according to the change in the value of L_s; when the value of L_s meets the requirement, for example when its variation becomes small, updating of the model parameters stops and training of the teacher speaker recognition model is complete.
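As an illustrative sketch only, the following PyTorch code trains a placeholder embedding network with the softmax cross-entropy objective of formula (2); the network structure, dimensions, and optimizer settings are assumptions, not values specified by this embodiment:

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Placeholder embedding network; the patent allows CNN/RNN/DNN/x-vector."""
    def __init__(self, feat_dim=40, emb_dim=256, num_speakers=1000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))
        self.classifier = nn.Linear(emb_dim, num_speakers)  # weights w_j, bias b

    def forward(self, x):
        emb = self.encoder(x).mean(dim=1)   # pool frames -> utterance-level vector
        return emb, self.classifier(emb)

model = SpeakerNet()
criterion = nn.CrossEntropyLoss()           # softmax cross-entropy, as in formula (2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(feats, labels):
    """feats: (batch, frames, feat_dim); labels: (batch,) speaker indices."""
    _, logits = model(feats)
    loss = criterion(logits, labels)        # L_s
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```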
Alternatively, an existing model already trained on wideband speech may be used as the teacher speaker recognition model in this embodiment, provided it was trained on wideband acoustic features of wideband speech. However, it must be ensured that the second sample voices (e.g., narrowband speech) at the second sampling rate, used later when training the student speaker recognition model, and the wideband speech used to train the teacher speaker recognition model belong to the same sample speakers. Moreover, the sampling rate (i.e., the second sampling rate) of the second sample voice (e.g., narrowband sample voice) input when training the student speaker recognition model is lower than the sampling rate (i.e., the first sampling rate) of the first sample voice (e.g., wideband sample voice) input to the teacher network.
Step A2: acquiring a second sample voice corresponding to a second sampling rate; extracting acoustic characteristics of the second sample voice from the second sample voice; wherein the first sample speech and the second sample speech belong to the same sample speaker.
In this embodiment, constructing the speaker recognition model requires not only a large number of first sample voices (such as wideband sample voices) corresponding to the first sampling rate but also a large number of second sample voices (such as narrowband sample voices) corresponding to a second sampling rate, where the first sampling rate is higher than the second sampling rate. After the second sample voices are obtained, a method similar to the extraction of the second acoustic feature of the target voice in step S102 is applied, with the second sample voice taking the place of the target voice, so that the acoustic features of each second sample voice can be extracted; for details, see step S102. The second sample voices and first sample voices belonging to the same sample speaker, for example speech at an 8000Hz sampling rate and speech at a 16000Hz sampling rate uttered by the same sample speaker, can then be used to train the final speaker recognition model through the subsequent steps A3-A4.
Step A3: inputting the acoustic characteristics of the first sample voice into a teacher speaker recognition model to obtain a first sample characterization vector; and inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector.
In this embodiment, after the first sample voice (e.g., wideband sample voice) is obtained in step A1, its acoustic features can be extracted and input into the teacher speaker recognition model to recognize the sample characterization vector corresponding to the first sample voice (defined here as the first sample characterization vector), as shown in the right-hand diagram of fig. 3. Meanwhile, after the second sample voice (e.g., narrowband sample voice) is obtained in step A2, its acoustic features are obtained by a method similar to the extraction of the second acoustic feature of the target voice in step S102, with the second sample voice taking the place of the target voice. These acoustic features, after processing, are input together with the acoustic features of the first sample voice into the initial speaker recognition model, which recognizes the sample characterization vector corresponding to the first sample voice (defined here as the second sample characterization vector) and the sample characterization vector corresponding to the second sample voice (defined here as the third sample characterization vector), as shown in the right-hand diagram of fig. 3.
The network structures of the initial speaker recognition model and the teacher speaker recognition model are the same, and model parameters (i.e. w and b) of the teacher speaker recognition model obtained through training in the steps A1-A2 are loaded as initial parameters.
Step A4: training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In this embodiment, after the first, second and third sample characterization vectors are obtained in step A3, the knowledge distillation idea is applied to the characterization vectors: the first sample characterization vector output by the teacher speaker recognition model directly constrains the second and third sample characterization vectors output by the initial speaker recognition model (regarded as the student speaker recognition model), requiring these characterization vectors to be as similar as possible, which trains the initial speaker recognition model. During training, the network parameters of the teacher speaker recognition model are kept fixed and only the network parameters of the initial speaker recognition model are updated. After training is complete, the student speaker recognition model is obtained and used as the final speaker recognition model.
In one possible implementation of the embodiments of the present application, the specific implementation of step A4 may include: first, calculating the cosine similarity between the first sample characterization vector and the second sample characterization vector as the first cosine loss, and at the same time calculating the cosine similarity between the first sample characterization vector and the third sample characterization vector as the second cosine loss; then calculating the sum of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to this sum to generate the student speaker recognition model, and using the student speaker recognition model as the final speaker recognition model.
Specifically, in this implementation manner, in order to train a speaker recognition model with a better recognition effect, a specific calculation formula of a sum value of the first cosine loss and the second cosine loss in the training process is as follows:
L_total = L_cos(t_wb, s_wb) + L_cos(t_wb, s_nb)   (3)

where L_cos(t_wb, s_wb) denotes the cosine loss between the first sample characterization vector output by the teacher speaker recognition network and the second sample characterization vector output by the student speaker recognition model, i.e., the first cosine loss; L_cos(t_wb, s_nb) denotes the cosine loss between the first sample characterization vector output by the teacher speaker recognition network and the third sample characterization vector output by the student speaker recognition model, i.e., the second cosine loss; and L_total denotes the sum of the first cosine loss and the second cosine loss.
The first cosine loss and the second cosine loss are each computed as a negative mean cosine similarity:

L_cos(t_wb, s) = -(1/M) Σ_{i=1}^{M} cos(t_i^wb, s_i)   (4)

where t_i^wb denotes the first sample characterization vector, output by the teacher speaker recognition network, of the i-th first sample voice (e.g., wideband sample voice); s_i denotes the second or third sample characterization vector, output by the student speaker recognition network, of the i-th sample voice (i.e., the wideband sample voice corresponding to the first sample voice or the narrowband sample voice corresponding to the second sample voice); and M denotes the total number of sample voices.
When the student speaker recognition model is trained using formulas (3) and (4), the parameters of the student speaker recognition model are updated continuously according to the value of L_total; when the value of L_total meets the requirement, for example when its variation becomes small, updating of the model parameters stops, training of the student speaker recognition model is complete, and the trained student speaker recognition model is used as the final speaker recognition model.
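A minimal sketch of the distillation step in formulas (3)-(4), assuming the teacher and student share the SpeakerNet structure sketched above; the teacher is frozen and only the student parameters are updated:

```python
import torch
import torch.nn.functional as F

teacher = SpeakerNet()
student = SpeakerNet()
student.load_state_dict(teacher.state_dict())   # teacher parameters as initial values
for p in teacher.parameters():
    p.requires_grad = False                     # teacher network stays fixed

optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

def distill_step(wb_feats, nb_feats):
    """wb_feats / nb_feats: parallel wideband and (padded) narrowband features."""
    with torch.no_grad():
        t_wb, _ = teacher(wb_feats)             # first sample characterization vector
    s_wb, _ = student(wb_feats)                 # second sample characterization vector
    s_nb, _ = student(nb_feats)                 # third sample characterization vector
    # negative mean cosine similarities, per formula (4), summed per formula (3)
    loss = -(F.cosine_similarity(t_wb, s_wb).mean()
             + F.cosine_similarity(t_wb, s_nb).mean())   # L_total
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```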
It should be noted that training the student speaker recognition model with the teacher speaker recognition model uses parallel data of first sample voices (e.g., wideband sample voices) and second sample voices (e.g., narrowband sample voices), such as a large number of voices uttered by the same sample speakers at both 8000Hz and 16000Hz sampling rates. When the collected training data cannot meet this condition, the collected first sample voices (e.g., wideband sample voices) can be downsampled to obtain parallel second sample voices (e.g., narrowband sample voices) to complete the training data for model training.
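A sketch of producing parallel narrowband data by downsampling, assuming the librosa resampling routine:

```python
import librosa

def make_parallel_narrowband(wav_path, wide_sr=16000, narrow_sr=8000):
    """Downsample a wideband recording to obtain its parallel narrowband version."""
    y_wb, _ = librosa.load(wav_path, sr=wide_sr)
    y_nb = librosa.resample(y_wb, orig_sr=wide_sr, target_sr=narrow_sr)
    return y_wb, y_nb
```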
Thus, through steps A1-A4, the characterization vectors output by the teacher speaker recognition model guide the training of the student speaker recognition model without requiring the training data to carry speaker labels, and after training the student speaker recognition model can be retained as the final speaker recognition model. This unsupervised training approach enables the resulting speaker recognition model to output, for input low-frequency (e.g., narrowband) or high-frequency (e.g., wideband) target voices, a more accurate target characterization vector characterizing the individual voice information of the target voice, which the subsequent step S104 can then use to recognize the target speaker to whom the target voice belongs more accurately and determine the target speaker's identity information.
This speaker recognition model not only suffers no loss of effect when high-frequency voice acoustic features are subsequently input, but also compensates for the effect degradation caused by inputting low-frequency voice acoustic features, so a single speaker recognition model achieves good recognition results on both low-frequency and high-frequency voice data. That is, through teacher-student model learning, the high-frequency information missing from low-frequency voice acoustic features can be compensated without increasing the number of speaker recognition models, improving the accuracy of the recognition result.
In addition, in one possible implementation of the embodiments of the present application, when the target speech obtained in step S101 includes M segments of speech (where M is a positive integer greater than 1), step S102 may extract M first acoustic features from the M segments of speech respectively and process them based on the respective sampling rates of the M segments to obtain M second acoustic features. In step S103, the M second acoustic features can then be input respectively into the speaker recognition model trained through steps A1-A4, and M target characterization vectors corresponding to the target speaker when speaking the voice content of the M segments are obtained through recognition; the average of the M target characterization vectors is computed and used as the final target characterization vector corresponding to the target speaker for the subsequent step S104.
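A minimal sketch of the multi-segment averaging described above (the 256-dimensional size is only the example format mentioned in step S103):

```python
import numpy as np

def average_characterization(vectors):
    """vectors: list of M target characterization vectors (e.g., 256-dim each).
    Returns their mean as the final target characterization vector."""
    return np.mean(np.stack(vectors, axis=0), axis=0)
```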
S104: identifying the target speaker according to the target characterization vector to obtain an identification result of the target speaker.
In this embodiment, after the target characterization vector of the target speaker when speaking the voice content of the target voice is obtained in step S103, the target characterization vector can be further processed and the target speaker identified according to the processing result, yielding the recognition result for the target speaker.
Specifically, in an alternative implementation, the procedure of step S104 may include the following steps B1-B2:
Step B1: calculating the similarity between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker.
In this implementation manner, when the identity of the target speaker needs to be confirmed to determine whether the target speaker is a certain preset speaker, after the target token vector of the target speaker when speaking the target voice is obtained in step S103, the similarity between the target token vector of the target speaker and the preset token vector of the preset speaker may be further calculated, where the specific calculation formula is as follows:
$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} \tag{5}$$

wherein $v_1$ represents the target characterization vector of the target speaker; $v_2$ represents the preset characterization vector of the preset speaker; and $\cos(v_1, v_2)$ represents the similarity between the two. The higher the value of $\cos(v_1, v_2)$, the more similar the target speaker is to the preset speaker, i.e., the greater the likelihood that the target speaker and the preset speaker are the same person; conversely, the smaller the value of $\cos(v_1, v_2)$, the less similar they are, i.e., the smaller that likelihood.
Step B2: judging whether the similarity is higher than a preset threshold, if so, determining that the target speaker is a preset speaker; if not, determining that the target speaker is not the preset speaker.
In this implementation, after the similarity $\cos(v_1, v_2)$ between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker is calculated in step B1, it is further necessary to determine whether $\cos(v_1, v_2)$ is higher than a preset threshold; if so, the target speaker is determined to be the preset speaker; if not, the target speaker is determined not to be the preset speaker.
The preset threshold is the critical value that defines whether the target speaker and the preset speaker are the same person. Its specific value may be set according to the actual situation, which is not limited in this embodiment of the application: for example, it may be 0.8, the value corresponding to the equal error rate, the value corresponding to the minimum detection cost function, or another value determined empirically for the actual application scenario. When the similarity between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker exceeds this critical value, the two are indicated to be the same person; otherwise, they are not.
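A minimal sketch of steps B1-B2, assuming the characterization vectors are NumPy arrays; the 0.8 default threshold is just the example value mentioned above:

```python
import numpy as np

def is_same_speaker(target_vec: np.ndarray, preset_vec: np.ndarray,
                    threshold: float = 0.8) -> bool:
    """Step B1: cosine similarity per formula (5); Step B2: threshold decision."""
    similarity = np.dot(target_vec, preset_vec) / (
        np.linalg.norm(target_vec) * np.linalg.norm(preset_vec))
    return similarity > threshold  # True: target speaker is the preset speaker
```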
Alternatively, the implementation of step S104 may further include the following steps C1-C2:
step C1: calculating N similarity between the target characterization vector of the target speaker and N preset characterization vectors of N preset speakers; wherein N is a positive integer greater than 1.
In this implementation, when the target speaker needs to be identified among N preset speakers (where N is a positive integer greater than 1), after the target characterization vector of the target speaker when speaking the target voice is obtained in step S103, the N similarities between the target characterization vector of the target speaker and the N preset characterization vectors of the N preset speakers may be further calculated using the above formula (5), so as to execute the subsequent step C2.
Step C2: selecting the maximum similarity from the N similarities, and determining the preset speaker corresponding to the maximum similarity as the target speaker.
In this implementation manner, after calculating N similarities between the target token vector of the target speaker and N preset token vectors of N preset speakers through step C1, the maximum similarity may be further selected from the N similarities, and the preset speaker corresponding to the maximum similarity is determined as the target speaker.
For example, assume there are three preset speakers A, B, and C, and the similarities between the target characterization vector of the target speaker and the preset characterization vectors of speakers A, B, and C are 0.1, 0.84, and 0.22, respectively. The highest similarity, 0.84, can then be selected from these, and the identity of the target speaker determined to be preset speaker B accordingly.
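Steps C1-C2 can be sketched as follows, reusing cosine similarity per formula (5); the dictionary of preset speakers mirrors the A/B/C example above:

```python
import numpy as np

def identify_speaker(target_vec, preset_vecs: dict):
    """Step C1: N similarities via formula (5); Step C2: pick the maximum."""
    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = {name: cos_sim(target_vec, v) for name, v in preset_vecs.items()}
    return max(sims, key=sims.get)  # preset speaker with the highest similarity

# e.g. with similarities {"A": 0.10, "B": 0.84, "C": 0.22} the result is "B"
```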
Therefore, using the pre-constructed speaker recognition model, both low-frequency and high-frequency voice can be processed, and no up-sampling, down-sampling, or bandwidth extension of the voice is needed at recognition time, so the effect is better than existing recognition schemes. Meanwhile, through steps A1-A4, the characterization vectors output by the teacher speaker recognition model guide the training of the student speaker recognition model, and no speaker labeling of the training data is required.
In the model training stage, the teacher speaker recognition model and the student speaker recognition model share exactly the same network structure. The teacher model is first trained on the acoustic features of high-frequency voice; its trained network parameters are then loaded and kept fixed during training, and only the network parameters of the student model are updated. A COS criterion is adopted so that the characterization vectors output by the student model, for both low-frequency and high-frequency voice acoustic features, are similar to the characterization vectors output by the teacher model for high-frequency voice acoustic features. After training is completed, the student speaker recognition model is retained as the final mixed-bandwidth speaker recognition model. This not only ensures no effect loss when the speaker recognition model subsequently receives high-frequency voice acoustic features (such as wideband acoustic features), but also compensates for the effect reduction caused by input low-frequency voice acoustic features (such as narrowband acoustic features).
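One way this training step could look is sketched below in PyTorch; the module and optimizer wiring are assumptions for illustration (the patent does not specify a framework), while the frozen teacher and the summed cosine losses follow steps A1-A4 described above:

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, wb_feat, nb_feat):
    """One teacher-student update with the COS criterion.

    teacher, student: networks with identical structure; only the student's
    parameters are registered in `optimizer`, so the teacher stays fixed.
    wb_feat: high-frequency (wideband) acoustic features.
    nb_feat: parallel low-frequency features processed to the same dimension.
    """
    with torch.no_grad():                 # teacher parameters receive no gradient
        t_vec = teacher(wb_feat)          # first sample characterization vector
    s_wb = student(wb_feat)               # second sample characterization vector
    s_nb = student(nb_feat)               # third sample characterization vector
    # First and second cosine losses; their sum is the training objective,
    # pulling both student outputs toward the teacher's wideband vector.
    loss = (1.0 - F.cosine_similarity(t_vec, s_wb, dim=-1)).mean() \
         + (1.0 - F.cosine_similarity(t_vec, s_nb, dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```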
In summary, in the speaker recognition method provided in this embodiment, a target voice to be recognized is first obtained, its sampling rate is determined, and a first acoustic feature is extracted from it; the first acoustic feature is processed based on the sampling rate to obtain a second acoustic feature, which is input into a pre-constructed speaker recognition model to recognize a target characterization vector of the target speaker, the speaker recognition model being obtained by jointly training voices with different sampling rates; the target speaker is then identified according to the target characterization vector to obtain the identification result. In this way, inputting the second acoustic feature corresponding to the target voice into the pre-constructed speaker recognition model both ensures no effect loss when high-frequency voice acoustic features are input and compensates for the effect reduction caused by low-frequency voice acoustic features when predicting the target characterization vector. The high-frequency information missing from low-frequency voice acoustic features is thus compensated without increasing the parameter count of the speaker recognition model, and the same speaker recognition model achieves a good recognition effect on both low-frequency and high-frequency target voice data, improving the accuracy of the speaker recognition result.
Second embodiment
This embodiment describes a speaker recognition device; for related content, refer to the above method embodiment.
Referring to fig. 4, a schematic diagram of a speaker recognition device according to the present embodiment is provided, and the device 400 includes:
a first obtaining unit 401, configured to obtain a target voice to be recognized; and determining a sampling rate of the target speech;
a processing unit 402, configured to extract a first acoustic feature from the target speech; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
a first recognition unit 403, configured to input the second acoustic feature to a pre-constructed speaker recognition model, and recognize to obtain a target token vector of a target speaker; the speaker recognition model is obtained by jointly training voices with different sampling rates;
and the second identifying unit 404 is configured to identify the target speaker according to the target token vector, so as to obtain an identification result of the target speaker.
In one implementation of this embodiment, the apparatus further includes:
the second acquisition unit is used for acquiring a first sample voice corresponding to the first sampling rate and a teacher speaker recognition model; the teacher speaker recognition model is obtained based on voice training of a first sampling rate;
a third acquisition unit, configured to acquire a second sample voice corresponding to the second sampling rate, and to extract acoustic features of the second sample voice from the second sample voice; the first sample voice and the second sample voice belong to the same sample speaker;
the obtaining unit is used for inputting the acoustic characteristics of the first sample voice into the teacher speaker recognition model to obtain a first sample characterization vector; inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector;
the training unit is used for training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one implementation of the present embodiment, the processing unit 402 includes:
a first processing subunit, configured to directly take the first acoustic feature as a second acoustic feature when determining that the sampling rate of the target speech is the first sampling rate;
and a second processing subunit, configured to process the first acoustic feature to obtain a second acoustic feature when the sampling rate of the target voice is determined to be the second sampling rate.
In one implementation of this embodiment, the first sampling rate is higher than the second sampling rate, and the first acoustic feature comprises a logarithmic mel-filter bank FBANK feature; the second processing subunit includes:
an adjusting subunit, configured to adjust the number of filters for filtering the power spectrum of the first acoustic feature, so as to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with a low-frequency band region of an acoustic feature of the voice corresponding to the first sampling rate;
and a zero-filling subunit, configured to zero-fill the dimension difference between the adjusted first acoustic feature and the acoustic feature of voice corresponding to the first sampling rate, so that the zero-filled first acoustic feature has the same dimension as the acoustic feature of voice at the first sampling rate, and to take the zero-filled first acoustic feature as the second acoustic feature.
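As a rough illustration of this zero-filling step, assuming FBANK features held as NumPy arrays of shape (frames, filters); the concrete filter counts below are hypothetical, not values taken from the patent:

```python
import numpy as np

def align_narrowband_fbank(nb_fbank: np.ndarray, wideband_dim: int = 64) -> np.ndarray:
    """Zero-pad an adjusted narrowband FBANK feature up to the wideband dimension.

    nb_fbank: (frames, nb_dim) FBANK feature whose filters were chosen to align
              with the low-frequency band of the wideband filter bank.
    wideband_dim: number of filters in the wideband FBANK feature (assumed 64 here).
    """
    frames, nb_dim = nb_fbank.shape
    pad = wideband_dim - nb_dim  # the difference dimensions to fill
    assert pad >= 0, "narrowband feature must not exceed the wideband dimension"
    # High-frequency filter outputs that narrowband speech lacks are set to zero.
    return np.concatenate([nb_fbank, np.zeros((frames, pad))], axis=1)
```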
In one implementation manner of this embodiment, the obtaining unit is specifically configured to:
process the acoustic features corresponding to the second sample voice, and then input the processed features into the initial speaker recognition model.
In one implementation of this embodiment, the training unit includes:
a first calculating subunit, configured to calculate, as a first cosine loss, a cosine similarity between the first sample characterization vector and the second sample characterization vector;
a second calculating subunit, configured to calculate, as a second cosine loss, a cosine similarity between the first sample characterization vector and the third sample characterization vector;
and the training subunit is used for calculating the sum value of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum value, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one implementation of this embodiment, the target speech includes M segments of speech; m is a positive integer greater than 1; the processing unit 402 is specifically configured to:
respectively extracting M first acoustic features of the M sections of voice from the M sections of voice; processing the M first acoustic features based on the respective sampling rates of the M sections of voice to obtain M second acoustic features;
The first recognition unit 403 includes:
the recognition subunit is used for respectively inputting the M second acoustic features into a pre-constructed speaker recognition model to obtain M target characterization vectors corresponding to a target speaker through recognition;
and the third calculation subunit is used for calculating the average value of the M target characterization vectors and taking the average value as a final target characterization vector corresponding to the target speaker.
In one implementation of this embodiment, the second identifying unit 404 includes:
a fourth calculating subunit, configured to calculate a similarity between a target token vector of the target speaker and a preset token vector of a preset speaker;
the first determining subunit is configured to determine whether the similarity is higher than a preset threshold, and if yes, determine that the target speaker is the preset speaker; if not, determining that the target speaker is not the preset speaker.
In one implementation of this embodiment, the second identifying unit 404 includes:
a fifth calculating subunit, configured to calculate N similarities between the target token vector of the target speaker and N preset token vectors of N preset speakers; the N is a positive integer greater than 1;
and a second determining subunit, configured to select the maximum similarity from the N similarities and determine the preset speaker corresponding to the maximum similarity as the target speaker.
Further, the embodiment of the application also provides a speaker identification device, which comprises: a processor, memory, system bus;
the processor and the memory are connected through the system bus;
the memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform any of the implementations of the speaker recognition method described above.
Further, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores instructions, which when executed on a terminal device, cause the terminal device to execute any implementation method of the speaker identification method.
Further, embodiments of the present application also provide a computer program product, which when run on a terminal device, causes the terminal device to perform any one of the implementation methods of the above-mentioned speaker recognition method.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus necessary general purpose hardware platforms. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to perform the methods described in the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present description, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
It is further noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
