Disclosure of Invention
In view of the above, the present application provides an identity authentication method and apparatus that avoid the problems of malicious copying and reduced recognition accuracy, thereby achieving identity authentication of persons with high accuracy.
The application provides an identity authentication method, which comprises the following steps:
step one, establishing a voice authentication database;
step two, receiving and storing a voice file to be authenticated;
step three, identifying the authenticity of the voice file to be authenticated; if the authenticity is true, entering step four; if the authenticity is false, directly outputting an authentication failure message and ending;
step four, extracting voice signal characteristics and reconstructing reliable voice characteristics in the voice signals;
step five, authenticating the reliable feature vector of the voice and outputting an identity authentication result.
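The overall flow of the five steps can be sketched as follows. Every helper here is a simplified, hypothetical stand-in (the real authenticity check, feature extraction, and matching rules are detailed later in the description); only the control flow mirrors the claimed steps.

```python
FIRST_THRESHOLD = 0.53  # the first threshold preferred by experiment later in the text

def replay_score(signal):
    # Stand-in for the copy-feature computation of formulas (3-1) to (3-3).
    return max(signal) - min(signal) if signal else 1.0

def extract_features(signal):
    # Stand-in for the feature extraction of the voice recognition library.
    return [abs(s) for s in signal]

def authenticate(signal, database):
    """Steps three to five: authenticity check, feature extraction, matching."""
    if replay_score(signal) > FIRST_THRESHOLD:
        return "authentication failed"              # step three: copied file
    reliable = extract_features(signal)             # step four (simplified)
    return "pass" if reliable in database else "fail"  # step five: compare
```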
In an embodiment of the present application, the first step includes: collecting reliable voice of all persons, extracting features of the reliable voice, and recording the voice feature information in the voice authentication database.
In a specific embodiment of the present application, the third step is: judging whether the specific features of the voice file caused by copying exceed a set first threshold value.
In a specific embodiment of the present application, calculating the specific characteristics brought by the replication includes:
Let x_q(n) denote the n-th sample of the q-th frame of a speech signal comprising T frames; the discrete Fourier transform of the q-th frame is:

X(q,k) = \sum_{n=0}^{N-1} x_q(n)\, e^{-j 2\pi n k / N}    formula (3-1)

wherein N is the number of states of a Markov chain in the speech signal;

the expression for the mean frame is then:

\bar{X}(k) = \frac{1}{T} \sum_{q=1}^{T} \left| X(q,k) \right|    formula (3-2)

The difference between the original real voice and the reproduced voice in the frequency part is extracted by formula (3-3):

D(k) = filter(\bar{X}(k))    formula (3-3)

wherein filter(·) is an arbitrary filtering function in the prior art.
In a specific embodiment of the present application, the step four includes:
extracting the feature vector X of the speech signal x input in daily life, by adopting the voice feature extraction algorithm used in establishing the voice recognition library;

dividing the input speech signal x into a speech vector x_s and a noise vector x_n;

dividing the feature vector X into a speech vector X_s and a noise vector X_n according to the mean and variance of the Gaussian function of the feature vector X, and calculating the probability of the v-th Gaussian component given the speech vector X_s:

p(v \mid X_s) = \frac{p(v)\,\mathcal{N}(X_s; \mu_v, \Sigma_v)}{\sum_{v'=1}^{V} p(v')\,\mathcal{N}(X_s; \mu_{v'}, \Sigma_{v'})}    (formula 4-2)

taking the speech vector X_s of the speech signal as the reliable feature vector of the input speech signal.
The application also discloses an identity authentication device, the device includes:
the voice authentication database 1 is used for storing voice characteristic information of all personnel;
the acquisition module 2 is used for acquiring voice information of a person to be authenticated;
the authenticity identification module 3 is used for identifying the authenticity of the voice file to be authenticated;
the characteristic extraction module 4 is used for extracting the characteristics of the voice signals and reconstructing the characteristics in the voice signals;
and the authentication module 5 is used for authenticating the reliable feature vector of the voice and outputting an identity authentication result.
In a specific embodiment of the present application, the voice authentication database 1 is used for collecting reliable voices of all people, extracting features of the reliable voices, and recording voice feature information in the voice authentication database.
In an embodiment of the present application, the authenticity verification module 3 is specifically configured to verify whether a specific feature of the voice file due to copying exceeds a set first threshold.
In a specific embodiment of the present application, calculating the specific characteristics brought by the replication includes:
Let x_q(n) denote the n-th sample of the q-th frame of a speech signal comprising T frames; the discrete Fourier transform of the q-th frame is:

X(q,k) = \sum_{n=0}^{N-1} x_q(n)\, e^{-j 2\pi n k / N}    formula (3-1)

wherein N is the number of states of a Markov chain in the speech signal;

the expression for the mean frame is then:

\bar{X}(k) = \frac{1}{T} \sum_{q=1}^{T} \left| X(q,k) \right|    formula (3-2)

The difference between the original real voice and the reproduced voice in the frequency part is extracted by formula (3-3):

D(k) = filter(\bar{X}(k))    formula (3-3)

wherein filter(·) is an arbitrary filtering function in the prior art.
In a specific embodiment of the present application, the authentication module 5 includes:
extracting the feature vector X of the speech signal x input in daily life, by adopting the voice feature extraction algorithm used in establishing the voice recognition library;

dividing the input speech signal x into a speech vector x_s and a noise vector x_n;

dividing the feature vector X into a speech vector X_s and a noise vector X_n according to the mean and variance of the Gaussian function of the feature vector X, and calculating the probability of the v-th Gaussian component given the speech vector X_s:

p(v \mid X_s) = \frac{p(v)\,\mathcal{N}(X_s; \mu_v, \Sigma_v)}{\sum_{v'=1}^{V} p(v')\,\mathcal{N}(X_s; \mu_{v'}, \Sigma_{v'})}    (formula 4-2)

taking the speech vector X_s of the speech signal as the reliable feature vector of the input speech signal.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.
In order to solve the problem that the voice recognition accuracy rate is reduced due to malicious copying and environmental noise in identity authentication by using voice in the prior art, the application discloses an identity authentication method and device.
The following further describes the present application with reference to the drawings.
As shown in fig. 1, before performing voice authentication, a voice authentication database needs to be established first, and the establishment process of the voice database may utilize any existing voice feature extraction technology, and preferably, the following establishment process may be used:
Voice information of all n persons who need to be authenticated is collected as required. In order to ensure the accuracy of voice recognition, the voice information of all the persons under different conditions can be continuously added so as to improve the recognition rate.
The collected voice information of person i corresponds to the audio signal x(i) stored in the original voice storage area of the voice recognition database, i = 1, ..., n (i is a positive integer).
The voice feature extraction process comprises the following steps:
for each audio signal x (i) stored in the original speech storage area, the following is performed:
(1) The audio signal x(i) is divided into a series of successive frames, and a Fourier transform is applied to each frame of the signal.
(2) Processing the audio signal using a filter to reduce mutual leakage of spectral energy between adjacent frequency bands; the filter function used in the filter is:
g(t) = t^{n-1}\, e^{-2\pi B t} \cos(2\pi f_c t + \theta)\, u(t)    formula (1)

Wherein:

the parameter θ is the initial phase of the filter, and n is the order of the filter;

when t < 0, u(t) = 0; when t > 0, u(t) = 1;

B = 1.019 · ERB(f_c), where ERB(f_c) is the equivalent rectangular bandwidth of the filter, whose relationship with the filter center frequency f_c is:

ERB(f_c) = 24.7 + 0.108 f_c    formula (2).
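As an illustrative sketch, assuming formula (1) takes the gammatone-style impulse response consistent with the parameters described (order n, initial phase θ, bandwidth B = 1.019 · ERB(f_c)), formulas (1) and (2) may be computed as:

```python
import math

def erb(fc):
    """Equivalent rectangular bandwidth of formula (2): ERB(fc) = 24.7 + 0.108 * fc."""
    return 24.7 + 0.108 * fc

def gammatone(t, fc, order=4, theta=0.0):
    """Assumed impulse response of formula (1):
    g(t) = t^(n-1) * exp(-2*pi*B*t) * cos(2*pi*fc*t + theta) * u(t),
    with B = 1.019 * ERB(fc); u(t) zeroes the response for t < 0."""
    if t < 0:
        return 0.0  # u(t) = 0 when t < 0
    b = 1.019 * erb(fc)
    return (t ** (order - 1)) * math.exp(-2 * math.pi * b * t) \
        * math.cos(2 * math.pi * fc * t + theta)
```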
(3) The mean deviation of the audio signal is removed.
After framing of the audio signal, a certain number of frames are grouped into one segment; in the present invention, a segment of 7 frames is preferred, and this number can be set according to the processing capacity of the system.

Most speech recognition systems use a frame length of 20 ms to 30 ms; in the present invention, a 26.5 ms Hamming window with an overlapping frame length of 10 ms is preferred. The intermediate quantity Q(i, j) of each frame is obtained by calculating the average value of the frame energies P(i, j') within the segment:

Q(i,j) = \frac{1}{2M+1} \sum_{j'=j-M}^{j+M} P(i,j')    formula (3)

In formula (3), since the present invention prefers 7 frames to constitute one segment, M = 3; i is the channel number, j is the index of the frame under consideration, and j' is the index of a frame within the segment.
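The segment averaging of formula (3) can be sketched as follows; this is a minimal illustration in which boundary frames are averaged over the neighbours that exist, a detail the disclosure does not specify.

```python
def medium_time_power(P, i, j, M=3):
    """Formula (3): average the frame energies P[i][j'] over the 2M+1 frames
    of the segment centred on frame j; M = 3 gives the preferred 7-frame segment.
    P is a per-channel list of frame energies."""
    J = len(P[i])
    # Clamp the segment to the available frames at the signal boundaries.
    segment = [P[i][jp] for jp in range(max(0, j - M), min(J, j + M + 1))]
    return sum(segment) / len(segment)
```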
In the noise energy removal process, the degree to which the voice signal is corrupted can be represented by the ratio of the arithmetic mean to the geometric mean (AM/GM), taken in logarithm:

g(i) = \log\!\left( \frac{\frac{1}{J}\sum_{j=1}^{J} Q(i,j)}{\left(\prod_{j=1}^{J} \left(Q(i,j)+z\right)\right)^{1/J}} \right)    formula (4)

In formula (4), z is a flooring coefficient introduced to avoid a negative-infinite estimate, ensuring that the deviation of the calculation result stays within an allowable range; J is the total number of frames.
Assuming that b(i) is the deviation caused by background noise, where i denotes the channel index, the intermediate quantity Q'(i, j | b(i)) obtained by removing the deviation is:

Q'(i,j \mid b(i)) = \max\big(Q(i,j) - b(i),\ 0\big)    formula (5)

It is then possible to obtain:

g(i \mid b(i)) = \log\!\left( \frac{\frac{1}{J}\sum_{j=1}^{J} Q'(i,j \mid b(i))}{\left(\prod_{j=1}^{J} \left(Q'(i,j \mid b(i))+z\right)\right)^{1/J}} \right)    formula (6)

For formula (6), when the AM/GM ratio under noise is closest to the AM/GM value of the clean acoustic signal, the estimated value of b(i) can be found as:

\hat{b}(i) = \arg\min_{b(i)} \left| g(i \mid b(i)) - g_{cl}(i) \right|    formula (7)

wherein g_{cl}(i) represents the value of g(i) in the clean acoustic signal. After \hat{b}(i) is obtained by calculating formula (7) for each channel, the noise-removal ratio for each time-frequency bin signal (i, j) is:

w(i,j) = \frac{Q'(i,j \mid \hat{b}(i))}{Q(i,j)}    formula (8)

For smooth computation, the noise-removal ratios of channels i − N to i + N are averaged, and the final function after adjustment is:

\tilde{w}(i,j) = \frac{1}{2N+1} \sum_{i'=i-N}^{i+N} w(i',j)    formula (10)

All audio signals in the filters are processed using formula (10), and the signal with the intermediate deviation removed is taken as the output of the filter.
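A minimal sketch of the AM/GM-based deviation estimation described above, assuming Q' = max(Q − b, 0) and a grid search over candidate deviations; both are assumptions, since the disclosure does not fix the subtraction floor or the search strategy.

```python
import math

def am_gm_log_ratio(q, z=1e-10):
    """Log of (arithmetic mean / geometric mean) of the medium-time powers q,
    with flooring coefficient z to avoid log(0)."""
    J = len(q)
    am = sum(q) / J
    log_gm = sum(math.log(x + z) for x in q) / J
    return math.log(am + z) - log_gm

def estimate_bias(q, g_clean, candidates):
    """Pick the bias b whose AM/GM ratio after subtraction is closest to the
    clean-speech value g_clean (a grid-search reading of the argmin)."""
    def ratio_after(b):
        return am_gm_log_ratio([max(x - b, 0.0) for x in q])
    return min(candidates, key=lambda b: abs(ratio_after(b) - g_clean))

def removal_ratios(q, b):
    """Per-bin noise-removal ratio w = Q' / Q for one channel."""
    return [max(x - b, 0.0) / x for x in q]

def smoothed_ratio(w, i, j, N=2):
    """Average the removal ratios of channels i-N to i+N (clamped at the edges)."""
    chans = list(range(max(0, i - N), min(len(w), i + N + 1)))
    return sum(w[ip][j] for ip in chans) / len(chans)
```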
(4) A nonlinear power-function operation is performed on the audio signal data output by all the filters; the power function used is:

y = x^{1/15}    formula (11).
(5) Discrete cosine transform is further performed on the output of step (4) to obtain the voice characteristic parameters.
Since Discrete Cosine Transform (DCT) is a well-known processing method in the speech processing field, it is not described herein in detail.
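Steps (4) and (5) can be sketched as follows. The compressive exponent 1/15 is an assumed value (the original formula is not preserved in this text), and dct_ii is a plain unnormalized DCT-II rather than an optimized library routine.

```python
import math

def power_law(x, exponent=1.0 / 15.0):
    """Nonlinear power-function operation of step (4): y = x ** exponent,
    applied element-wise; the exponent value is an assumption."""
    return [v ** exponent for v in x]

def dct_ii(x):
    """Plain DCT-II of step (5), yielding the voice characteristic parameters."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]
```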
The calculated speech features are stored in a database.
As shown in fig. 2, the present application discloses an identity authentication method, which includes the following steps:
step 1: and establishing a voice authentication database.
The voice authentication database may be built using any voice feature extraction technique known in the art, or may be built in the preferred manner described above.
Step 2: and receiving and storing the voice file to be authenticated.
Voice prompt information can be set in the voice acquisition equipment to prompt a person to be identified to input a voice file. For example, the voice of a person is collected by a microphone. Other speech acquisition devices may also be employed.
Step 3: the authenticity of the voice file to be authenticated is identified; if the authenticity is true, Step 4 is entered; if the authenticity is false, an authentication failure message is output directly and the process ends.
In general, a fake voice file is obtained by copying without the consent of the person concerned. However, copying a voice file repeatedly inevitably changes the feature information in the file, and such changes usually accompany the signal uniformly throughout the entire voice file. Authenticity verification is therefore performed by checking whether the copying-specific features in the voice file exceed a set first threshold value.
First, let x_q(n) denote the n-th sample of the q-th frame of a speech signal comprising T frames (N being the number of states of a Markov chain in the speech signal); the discrete Fourier transform of the q-th frame is:

X(q,k) = \sum_{n=0}^{N-1} x_q(n)\, e^{-j 2\pi n k / N}    formula (3-1)

The expression for the mean frame is then:

\bar{X}(k) = \frac{1}{T} \sum_{q=1}^{T} \left| X(q,k) \right|    formula (3-2)

In general, there is a difference in the frequency part between original real voice and reproduced voice, and such a difference can be extracted by formula (3-3):

D(k) = filter(\bar{X}(k))    formula (3-3)

wherein filter(·) is any filtering function in the prior art, such as the filtering function in formula (1).
In the present application, the first threshold is preferably set to 0.53, as determined by experiment.
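A hedged sketch of the authenticity check of formulas (3-1) to (3-3): a per-frame DFT, the mean frame, and an arbitrary filtering function whose output is reduced to a scalar compared against the first threshold. The scalar reduction (averaging the filtered output) is an assumption; the disclosure leaves the exact comparison rule open.

```python
import cmath

def mean_spectrum(frames, N):
    """Formulas (3-1)/(3-2): N-point DFT magnitude of each frame, averaged
    over the T frames to give the mean frame."""
    T = len(frames)
    mean = [0.0] * N
    for frame in frames:
        for k in range(N):
            X = sum(frame[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                    for n in range(N))
            mean[k] += abs(X) / T
    return mean

def copy_score(frames, N, filt):
    """Formula (3-3): apply a filtering function to the mean frame, then
    reduce it to a scalar (assumed: the average of the filtered values)."""
    filtered = filt(mean_spectrum(frames, N))
    return sum(filtered) / len(filtered)

def is_authentic(frames, N, filt, threshold=0.53):
    """True when the copying-specific score does not exceed the first threshold."""
    return copy_score(frames, N, filt) <= threshold
```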
Step 4: features of the speech signal are extracted and features in the speech signal are reconstructed.
Generally, the voice signals used in constructing the voice authentication database are collected professionally in a quiet environment, whereas various noises may exist in the daily living environment during voice input in the actual authentication process. If feature extraction is performed directly on a voice signal input under noisy conditions, the features are contaminated by the noise information, which reduces the accuracy of identity authentication.
The feature vector X of the speech signal x input in daily life is extracted by adopting the voice feature extraction algorithm used in establishing the voice recognition library. Thus, the speech signal x input in daily life can be divided into a speech vector x_s and a noise vector x_n.
A model of the prior probability p(X) of the voice signal is established and then obtained through combination and training:

p(X) = \sum_{v=1}^{V} p(v)\, \mathcal{N}(X; \mu_v, \Sigma_v)    (formula 4-1)

Where V is the number of mixture components, v is the component index, p(v) is the prior probability of component v, and \mathcal{N}(X; \mu_v, \Sigma_v) represents the v-th Gaussian distribution with mean vector \mu_v and diagonal covariance matrix \Sigma_v. Given the features of a speech signal, the feature vector is divided into a speech vector X_s and a noise vector X_n according to the mean and variance of its Gaussian function; then the probability of the v-th component given the speech vector X_s is calculated:

p(v \mid X_s) = \frac{p(v)\,\mathcal{N}(X_s; \mu_v, \Sigma_v)}{\sum_{v'=1}^{V} p(v')\,\mathcal{N}(X_s; \mu_{v'}, \Sigma_{v'})}    (formula 4-2)

During reconstruction, the speech vectors X_s of the speech signal are retained as the reliable feature vectors of the input speech signal.
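Reading formula (4-2) as the component probability of a Gaussian mixture model, the selection of reliable speech features can be sketched as follows. This is a one-dimensional simplification, and which components count as "speech" is assumed to be given; the disclosure does not spell out either detail.

```python
import math

def gaussian(x, mean, var):
    """1-D Gaussian density, the diagonal-covariance building block of (4-1)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def component_posterior(x, priors, means, variances):
    """Formula (4-2): probability of each mixture component v given x, under
    the mixture p(x) = sum_v p(v) N(x; mu_v, sigma_v) of formula (4-1)."""
    joint = [p * gaussian(x, m, s) for p, m, s in zip(priors, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def reliable_features(xs, priors, means, variances, speech_components):
    """Keep only the coefficients most likely generated by a speech component
    (the assumed reconstruction rule)."""
    keep = []
    for x in xs:
        post = component_posterior(x, priors, means, variances)
        if max(range(len(post)), key=post.__getitem__) in speech_components:
            keep.append(x)
    return keep
```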
Step 5: the reliable feature vector of the voice is authenticated and an identity authentication result is output.
The reliable feature vector of the voice signal is input into the voice authentication database for comparison. If a matching voice record is found in the voice authentication database, the verification passes; if no matching record is found, the verification fails.
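The comparison against the voice authentication database can be sketched as a nearest-match search. Cosine similarity and the 0.9 decision threshold are illustrative assumptions, as the disclosure does not specify the matching rule.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_identity(reliable, database, threshold=0.9):
    """Compare the reliable feature vector with every enrolled vector; return
    the identity of the best match above the threshold, else None (fail)."""
    best_id, best_sim = None, threshold
    for person_id, enrolled in database.items():
        sim = cosine_similarity(reliable, enrolled)
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    return best_id
```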
As shown in fig. 3, the present application also provides an identity authentication apparatus, which includes:
and the voice authentication database 1 is used for storing voice characteristic information of all the persons.
The voice authentication database may be created using any of the voice feature extraction techniques known in the art, or may be created in the preferred manner described below.
As shown in fig. 1, before performing voice authentication, a voice authentication database needs to be established first, and the establishment process of the voice database may utilize any existing voice feature extraction technology, and preferably, the following establishment process may be used:
Voice information of all n persons who need to be authenticated is collected as required. In order to ensure the accuracy of voice recognition, the voice information of all the persons under different conditions can be continuously added so as to improve the recognition rate.
The collected voice information of person i corresponds to the audio signal x(i) stored in the original voice storage area of the voice recognition database, i = 1, ..., n (i is a positive integer).
The voice feature extraction process comprises the following steps:
for each audio signal x (i) stored in the original speech storage area, the following is performed:
(1) The audio signal x(i) is divided into a series of successive frames, and a Fourier transform is applied to each frame of the signal.
(2) Processing the audio signal using a filter to reduce mutual leakage of spectral energy between adjacent frequency bands; the filter function used in the filter is:
g(t) = t^{n-1}\, e^{-2\pi B t} \cos(2\pi f_c t + \theta)\, u(t)    formula (1)

Wherein:

the parameter θ is the initial phase of the filter, and n is the order of the filter;

when t < 0, u(t) = 0; when t > 0, u(t) = 1;

B = 1.019 · ERB(f_c), where ERB(f_c) is the equivalent rectangular bandwidth of the filter, whose relationship with the filter center frequency f_c is:

ERB(f_c) = 24.7 + 0.108 f_c    formula (2).
(3) The mean deviation of the audio signal is removed.
After framing of the audio signal, a certain number of frames are grouped into one segment; in the present invention, a segment of 7 frames is preferred, and this number can be set according to the processing capacity of the system.

Most speech recognition systems use a frame length of 20 ms to 30 ms; in the present invention, a 26.5 ms Hamming window with an overlapping frame length of 10 ms is preferred. The intermediate quantity Q(i, j) of each frame is obtained by calculating the average value of the frame energies P(i, j') within the segment:

Q(i,j) = \frac{1}{2M+1} \sum_{j'=j-M}^{j+M} P(i,j')    formula (3)

In formula (3), since the present invention prefers 7 frames to constitute one segment, M = 3; i is the channel number, j is the index of the frame under consideration, and j' is the index of a frame within the segment.
In the noise energy removal process, the degree to which the voice signal is corrupted can be represented by the ratio of the arithmetic mean to the geometric mean (AM/GM), taken in logarithm:

g(i) = \log\!\left( \frac{\frac{1}{J}\sum_{j=1}^{J} Q(i,j)}{\left(\prod_{j=1}^{J} \left(Q(i,j)+z\right)\right)^{1/J}} \right)    formula (4)

In formula (4), z is a flooring coefficient introduced to avoid a negative-infinite estimate, ensuring that the deviation of the calculation result stays within an allowable range; J is the total number of frames.
Assuming that b(i) is the deviation caused by background noise, where i denotes the channel index, the intermediate quantity Q'(i, j | b(i)) obtained by removing the deviation is:

Q'(i,j \mid b(i)) = \max\big(Q(i,j) - b(i),\ 0\big)    formula (5)

It is then possible to obtain:

g(i \mid b(i)) = \log\!\left( \frac{\frac{1}{J}\sum_{j=1}^{J} Q'(i,j \mid b(i))}{\left(\prod_{j=1}^{J} \left(Q'(i,j \mid b(i))+z\right)\right)^{1/J}} \right)    formula (6)

For formula (6), when the AM/GM ratio under noise is closest to the AM/GM value of the clean acoustic signal, the estimated value of b(i) can be found as:

\hat{b}(i) = \arg\min_{b(i)} \left| g(i \mid b(i)) - g_{cl}(i) \right|    formula (7)

wherein g_{cl}(i) represents the value of g(i) in the clean acoustic signal. After \hat{b}(i) is obtained by calculating formula (7) for each channel, the noise-removal ratio for each time-frequency bin signal (i, j) is:

w(i,j) = \frac{Q'(i,j \mid \hat{b}(i))}{Q(i,j)}    formula (8)

For smooth computation, the noise-removal ratios of channels i − N to i + N are averaged, and the final function after adjustment is:

\tilde{w}(i,j) = \frac{1}{2N+1} \sum_{i'=i-N}^{i+N} w(i',j)    formula (10)

All audio signals in the filters are processed using formula (10), and the signal with the intermediate deviation removed is taken as the output of the filter.
(4) A nonlinear power-function operation is performed on the audio signal data output by all the filters; the power function used is:

y = x^{1/15}    formula (11).
(5) Discrete cosine transform is further performed on the output of step (4) to obtain the voice characteristic parameters.
Since Discrete Cosine Transform (DCT) is a well-known processing method in the speech processing field, it is not described herein in detail.
The calculated speech features are stored in a database.
And the acquisition module 2 is used for acquiring the voice information of the personnel to be authenticated.
Voice prompt information can be set in the voice acquisition equipment to prompt a person to be identified to input a voice file. For example, the voice of a person is collected by a microphone. Other speech acquisition devices may also be employed.
And the authenticity identification module 3 is used for identifying the authenticity of the voice file to be authenticated.
In general, a fake voice file is obtained by copying without the consent of the person concerned. However, copying a voice file repeatedly inevitably changes the feature information in the file, and such changes usually accompany the signal uniformly throughout the entire voice file. Authenticity verification is therefore performed by checking whether the copying-specific features in the voice file exceed a set first threshold value.
First, let x_q(n) denote the n-th sample of the q-th frame of a speech signal comprising T frames (N being the number of states of a Markov chain in the speech signal); the discrete Fourier transform of the q-th frame is:

X(q,k) = \sum_{n=0}^{N-1} x_q(n)\, e^{-j 2\pi n k / N}    formula (3-1)

The expression for the mean frame is then:

\bar{X}(k) = \frac{1}{T} \sum_{q=1}^{T} \left| X(q,k) \right|    formula (3-2)

In general, there is a difference in the frequency part between original real voice and reproduced voice, and such a difference can be extracted by formula (3-3):

D(k) = filter(\bar{X}(k))    formula (3-3)

wherein filter(·) is any filtering function in the prior art, such as the filtering function in formula (1).
In the present application, the first threshold is preferably set to 0.53, as determined by experiment.
And the characteristic extraction module 4 is used for extracting the characteristics of the voice signals and reconstructing the characteristics in the voice signals.
Generally, the voice signals used in constructing the voice authentication database are collected professionally in a quiet environment, whereas various noises may exist in the daily living environment during voice input in the actual authentication process. If feature extraction is performed directly on a voice signal input under noisy conditions, the features are contaminated by the noise information, which reduces the accuracy of identity authentication.
The feature vector X of the speech signal x input in daily life is extracted by adopting the voice feature extraction algorithm used in establishing the voice recognition library. Thus, the speech signal x input in daily life can be divided into a speech vector x_s and a noise vector x_n.
A model of the prior probability p(X) of the voice signal is established and then obtained through combination and training:

p(X) = \sum_{v=1}^{V} p(v)\, \mathcal{N}(X; \mu_v, \Sigma_v)    (formula 4-1)

Where V is the number of mixture components, v is the component index, p(v) is the prior probability of component v, and \mathcal{N}(X; \mu_v, \Sigma_v) represents the v-th Gaussian distribution with mean vector \mu_v and diagonal covariance matrix \Sigma_v. Given the features of a speech signal, the feature vector is divided into a speech vector X_s and a noise vector X_n according to the mean and variance of its Gaussian function; then the probability of the v-th component given the speech vector X_s is calculated:

p(v \mid X_s) = \frac{p(v)\,\mathcal{N}(X_s; \mu_v, \Sigma_v)}{\sum_{v'=1}^{V} p(v')\,\mathcal{N}(X_s; \mu_{v'}, \Sigma_{v'})}    (formula 4-2)

During reconstruction, the speech vectors X_s of the speech signal are retained as the reliable feature vectors of the input speech signal.
And the authentication module 5 is used for authenticating the reliable feature vector of the voice and outputting an identity authentication result.
The reliable feature vector of the voice signal is input into the voice authentication database for comparison. If a matching voice record is found in the voice authentication database, the verification passes; if no matching record is found, the verification fails.
Of course, it is not necessary for any particular embodiment of the invention to achieve all of the above advantages at the same time.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device), or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.