Disclosure of Invention
In view of the above, the present invention provides a voice-recognition-based intelligent door lock recognition system and method that combine voice recognition accuracy with intelligent control flexibility, which solve the above technical problems and are specifically implemented by adopting the following technical schemes.
In a first aspect, the present invention provides an intelligent door lock recognition system based on speech recognition, including:
the preprocessing module is used for preprocessing the voice information sent by the intelligent door lock to obtain a voice signal;
the feature extraction module is used for extracting physiological features and personalized information representing a speaker from the voice signal, enhancing the voice signal according to the physiological features and the personalized information, and performing speech enhancement on the voice signal to obtain test data, wherein the physiological features comprise gender and age, and the personalized information comprises pitch and timbre;
the model training module is used for inputting the test data into a deep neural network for training, so as to carry out voiceprint recognition on the voice signal and obtain a recognition result, wherein the recognition result comprises the user identity and message information;
and the identification judging module is used for successfully matching the identification result with a pre-constructed corpus to obtain a control instruction, and intelligently controlling the door lock according to the control instruction.
As a further improvement of the above technical solution, the identification judgment module includes an intelligent door lock unit, a middleware unit and an application unit;
the intelligent door lock unit is used for recording the door lock state, receiving a control instruction issued by the middleware unit and uploading door lock information, the middleware unit is used for transmitting the door lock state information uploaded by the door lock, receiving the operation of the intelligent door lock unit and various index data of the intelligent door lock unit and transmitting the operation and various index data to the corpus, and the corpus analyzes the user behaviors and monitors the real-time condition of the middleware unit according to the index data; the application unit is used for displaying door lock information, operating the door lock state and upgrading the switch firmware to realize man-machine interaction.
As a further improvement of the above technical solution, the execution process of the model training module includes:
comparing the test template corresponding to the test data with the training template, identifying by comparing similarity measures between the test template and the training template, and calculating the pronunciation speed of the voice data in the test data by combining a dynamic time warping algorithm;
a function f[(m_i, n_i)] corresponding to the grid point (m_i, n_i) is preset; there is then a path cost function d[(m_i, n_i)], with f[(m_i, n_i)] = (m_{i-1}, n_{i-1}); the initialization is n_i = i (i = 1, 2, ..., N), n_1 = m_1 = 1, f[(m, n)] = (0, 0), where R is a parallelogram constraint on the admissible paths;
f[(m_i, n_i)] and d[(m_i, n_i)] are calculated by recursion, the predecessor m taking the values m_i, m_i − 1 and m_i − 2 permitted by the constraint; when i = N, the total path cost d[(M, N)] over the whole speech signal is obtained, and the point (M, N) is traced back through (m_{i-1}, n_{i-1}) = f[(m_i, n_i)] (i = N, N−1, ..., 3, 2), thereby obtaining the optimum path, which ends when (m_{i-1}, n_{i-1}) = (0, 0).
As a further improvement of the technical scheme, vector-quantized feature data are adopted to represent the overall feature vector: a plurality of sampled signals are classified, each class forms one vector, and that vector is quantized. The input feature vector set X = {x_i | i = 1, 2, ..., T} is preset, n represents the number of iterations, L_i(n) represents the i-th sub-cell of the n-th iteration, and Y_i(n) represents the codeword of that sub-cell; there are J codewords in total, the maximum iteration number is M′, and the iteration threshold is set as ε. The specific process comprises the following steps:
selecting the centroid of all input feature vectors as the initialization codebook {Y_i(0) | 1 ≤ i ≤ J}; splitting each cell in two with the small threshold, Y_i(1) = Y_i(0) − ε and Y_i(2) = Y_i(0) + ε, where Y_i(1) and Y_i(2) respectively represent the two cell codewords obtained by splitting, so that the number of cells is doubled, with n = 0;
setting n = n + 1 and computing the distance between the feature vector of each frame and the current codewords: x ∈ C_i(n) if d(x, Y_i(n − 1)) ≤ d(x, Y_j(n − 1)), i ≠ j, 1 ≤ i, j ≤ M′, where d represents the Euclidean distance between vectors;
when n = M′ or the distortion change rate falls below the threshold ε, the iteration ends: if the codebook has not yet reached J codewords, the process returns to splitting each cell in two; otherwise the best codebook {Y_i | i = 1, 2, ..., M′} is saved and output; if the distortion change rate is still greater than the threshold, the clustering step at n = n + 1 is repeated.
As a further improvement of the above technical solution, the executing process of the feature extraction module includes:
performing voice enhancement on the voice signal by adopting spectral subtraction, and subtracting the noise power spectrum from the power spectrum of the voice signal to obtain a pure voice spectrum;
let s(t) denote the clean speech signal, n′(t) the additive noise signal, and y(t) the noisy speech signal; then y(t) = s(t) + n′(t), and Y(ω), S(ω) and N′(ω) denote the Fourier transforms of y(t), s(t) and n′(t), respectively, with Y(ω) = S(ω) + N′(ω);
if the speech signal and the additive noise are independent, |Y(ω)|² = |S(ω)|² + |N′(ω)|²; if P_y(ω), P_s(ω) and P_n′(ω) represent the power spectra of y(t), s(t) and n′(t), respectively, then P_y(ω) = P_s(ω) + P_n′(ω); the noise power spectrum P_n′(ω) is estimated from the speech-free but noisy segment before the utterance, and P_s(ω) = P_y(ω) − P_n′(ω); the subtracted power spectrum P_s(ω) is taken as the clean speech power spectrum, from which the denoised speech time-domain signal is recovered to obtain the test data.
As a further improvement of the above technical solution, recovering the noise-reduced speech time domain signal from the power spectrum to obtain test data, including:
from the power spectrum |X(e^{jw})|² of the noisy speech x_0[n_0], the power spectrum of the clean speech is estimated using the expression |Ŝ(e^{jw})|² = |X(e^{jw})|² − |N̂_0(e^{jw})|², where x_0[n_0] represents the discrete sequence of the input signal; if additive noise n_0[n_0] is added to the speech signal s[n_0], then, since the noise is uncorrelated with the signal and non-stationary, with a rate of change smaller than that of the signal, x_0[n_0] = s[n_0] + n_0[n_0] is obtained, and Fourier transforming yields the expression |X(e^{jw})|² ≈ |S(e^{jw})|² + |N̂_0(e^{jw})|², where |N̂_0(e^{jw})|² is a statistical estimate of |N_0(e^{jw})|² taken during non-speech segments, |X(e^{jw})|² represents the power spectrum of the noisy speech, and |S(e^{jw})|² represents the power spectrum of the speech signal; in the non-speech segments, the estimate of the noise power spectrum is updated.
As a further improvement of the above technical solution, the input noise is estimated from the noisy speech, the noise is removed from the noisy speech by a spectral subtraction algorithm to obtain an estimated value of the speech signal, the transfer function of the Wiener filter is re-estimated from the output signal, and the background noise is updated in both speech and non-speech segments, which includes:
calculating an initial smoothed estimate of the background-noise magnitude spectrum: the first N_no frames of the noisy signal are preset to be a pure noise signal, and the statistical average of the magnitudes is used for the estimate; the recursive expression is D_n(e^{jw}) = ((n − 1)·D_{n−1}(e^{jw}) + |X_n(e^{jw})|)/n, n = 1, ..., N_no, where D_n(e^{jw}) represents the n-th statistical estimate of the background-noise magnitude, its initial value is D_0(e^{jw}) = 0, |X_n(e^{jw})|² represents the power spectrum of the n-th frame of the noisy speech signal, and the initial smoothed estimate of the background-noise magnitude spectrum is D_{N_no}(e^{jw}), |X_{N_no}(e^{jw})| representing the magnitude spectrum of the N_no-th frame of the noisy speech signal;
letting the frame variable n = N_no + 1, the smoothed estimate of the noise power spectrum and the estimate of the signal power spectrum are computed for the current frame;
filtering the magnitude spectrum of the noisy speech signal to obtain an estimate of the background-noise magnitude spectrum of the current frame, and calculating the magnitude-spectrum estimate of the signal obtained by spectral subtraction;
updating the smoothed background-noise estimate with the current k-th-frame noise-magnitude-spectrum estimate N_k(e^{jw}) as D_k(e^{jw}) = p·D_{k−1}(e^{jw}) + (1 − p)·N_k(e^{jw}), where p represents the scale factor; if p is set reasonably, the rate of change of the speech signal and the rate of change of the noise signal can be separated, the background-noise estimate changing slowly and the signal estimate changing quickly;
computing a smoothed estimate of the signal magnitude spectrum; setting the frame variable n = n + 1, and if n > the total frame number N, ending the procedure and outputting the estimate of the clean-speech magnitude spectrum; otherwise repeating the above steps.
As a further improvement of the above technical solution, the execution process of the preprocessing module includes:
the sampled data of the voice signal is preset as {Q_K | K = 1, 2, ..., n}, with n representing the total number of samples; let Δt = 1 and use the polynomial q(K) = Σ_{j=0}^{m} α_j·K^j, K ∈ [1, n], whose undetermined coefficients α_j (j = 0, 1, 2, ..., m) are to be calculated; the function q(K) is required to minimize the sum of squared errors E = Σ_{K=1}^{n} (q(K) − Q_K)² with respect to the discrete sampled data Q_K; the extremum condition on E is ∂E/∂α_i = 0; taking the partial derivative of E with respect to each α_i in turn and collecting terms yields a system of m + 1 linear equations, from which the m + 1 undetermined coefficients α_j (j = 0, 1, ..., m) are calculated, where m represents the order of the chosen polynomial and the indices satisfy 0 ≤ i, j ≤ m; when m = 0 the trend term is a constant, namely the arithmetic mean of the signal sampling data; when m = 1 it represents a linear trend term; and when m ≥ 2 it represents a curvilinear trend term.
As a further improvement of the technical scheme, the method for successfully matching the recognition result with the pre-constructed corpus to obtain the control instruction comprises the following steps:
a gain vector in the r-dimensional Euclidean space R^r is obtained by a vectorization algorithm; according to a preset criterion, a finite number of vectors Y_1, Y_2, ..., Y_M in the r-dimensional space are found to represent it, where X denotes the input vector, Y_i denotes a quantization vector or codeword, and {Y_i | i = 1, 2, ..., M} denotes the codebook; the number of codewords M is called the codebook capacity, and the criterion of vector quantization on the training data is to minimize the quantization distortion at the given codebook capacity;
when the speech of the speaker to be recognized arrives, a group of feature vectors is extracted in the same way as at training time and quantized with the N codebooks of the N speakers, respectively; the codebook closest to the group of feature vectors in the feature space is found, and the corresponding speaker i serves as the recognition result.
In a second aspect, the present invention also provides an intelligent door lock recognition method based on voice recognition, which includes the following steps:
the voice information sent by the intelligent door lock is obtained, and the voice information is preprocessed to obtain a voice signal;
extracting physiological characteristics and personalized information representing a speaker from the voice signal, enhancing the voice signal according to the physiological characteristics and the personalized information, and performing speech enhancement on the voice signal to obtain test data, wherein the physiological characteristics comprise gender and age, and the personalized information comprises pitch and timbre;
inputting the test data into a deep neural network, training and carrying out voiceprint recognition on the voice signal to obtain a recognition result, wherein the recognition result comprises user identity and message information;
and successfully matching the identification result with a pre-constructed corpus to obtain a control instruction, and intelligently controlling the door lock according to the control instruction.
The invention provides an intelligent door lock recognition system and method based on voice recognition. The voice information sent by the intelligent door lock is preprocessed to obtain a voice signal; physiological features and personalized information representing the speaker are extracted from the voice signal, the voice signal is enhanced according to the physiological features and the personalized information, and speech enhancement is performed to obtain test data; the test data are input into a deep neural network, which is trained to perform voiceprint recognition on the voice signal and obtain a recognition result; the recognition result is matched against a pre-constructed corpus to obtain a control instruction, and the door lock is intelligently controlled according to the control instruction. The intelligent door lock can thus rapidly acquire voice information, accurately authenticate it and exclude noise, thereby realizing flexible control of the intelligent door lock and improving the user experience.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Referring to fig. 1, the invention provides an intelligent door lock recognition system based on voice recognition, which comprises:
the preprocessing module is used for preprocessing the voice information sent by the intelligent door lock to obtain a voice signal;
the feature extraction module is used for extracting physiological features and personalized information representing a speaker from the voice signal, enhancing the voice signal according to the physiological features and the personalized information, and performing speech enhancement on the voice signal to obtain test data, wherein the physiological features comprise gender and age, and the personalized information comprises pitch and timbre;
the model training module is used for inputting the test data into a deep neural network for training, so as to carry out voiceprint recognition on the voice signal and obtain a recognition result, wherein the recognition result comprises the user identity and message information;
and the identification judging module is used for successfully matching the identification result with a pre-constructed corpus to obtain a control instruction, and intelligently controlling the door lock according to the control instruction.
In this embodiment, the identification judgment module includes an intelligent door lock unit, a middleware unit and an application unit. The intelligent door lock unit records the door lock state, receives control instructions issued by the middleware unit and uploads door lock information; the middleware unit forwards the door lock state information uploaded by the lock, receives the operations and various index data of the intelligent door lock unit, and transmits them to the corpus, which analyzes user behavior and monitors the real-time condition of the middleware unit according to the index data; the application unit displays door lock information, operates the door lock state and upgrades the switch firmware to realize man-machine interaction. Voiceprint recognition identifies a speaker according to the spectrum of the voice; even for the same person, the voiceprint spectrum changes with emotion or illness, a noisy environment exerts a non-negligible influence, and the recognition is vulnerable to malicious attack. Safety is the primary standard for measuring the performance of an intelligent door lock, and is also a necessary condition for its popularization. The voice signal is an important carrier of human information and emotional communication; it is the perception by the auditory organs of the mechanical vibration of the acoustic medium, and is the most important, effective, common and convenient mode of human communication.
People inevitably experience interference from the surrounding environment, noise introduced by transmission errors, electrical noise within the communication device and other speakers during voice communication, which will eventually result in the received voice signal not being a clean original voice signal, but being a noisy voice signal contaminated with noise. By preprocessing the collected voice information, noise interference can be primarily eliminated, and favorable conditions are provided for follow-up accurate judgment.
It should be noted that, in order to obtain as pure a speech signal as possible from the noisy speech signal and reduce noise interference, speech enhancement is required; its purpose is to extract the purest possible original speech from the noisy speech. Spectral subtraction is used for the enhancement, which has a small computational load and is easy to implement in real time: if the noise is stationary or slowly varying additive noise, and the speech signal and the noise are mutually independent, the noise amplitude spectrum is subtracted from the amplitude spectrum of the noisy speech to obtain a purer speech spectrum. Since the human ear is insensitive to the phase of speech, what matters for intelligibility and quality is the short-time spectral amplitude of the speech signal rather than the phase; spectral subtraction exploits this perceptual characteristic of the human ear by replacing the phase of the estimated speech with the phase of the original noisy speech.
It should be appreciated that speech information is often interspersed with a wide variety of background noise and with pauses introduced between the words of an utterance; effectively detecting the start and end points of speech greatly reduces the amount of data to be processed subsequently and improves recognition speed and recognition rate. In pure speech, the start and end points can be detected using the short-time average energy: the energy of the voice signal changes continuously over time, with the energy of the initial consonant high and the energy of the final sound low; if the voice signal is clean, the starting point of the speech can be detected using short-time energy alone, distinguishing the boundary between initials and finals and between non-speech and speech segments, thereby effectively improving the accuracy of voice recognition.
Optionally, the executing process of the model training module includes:
comparing the test template corresponding to the test data with the training template, identifying by comparing similarity measures between the test template and the training template, and calculating the pronunciation speed of the voice data in the test data by combining a dynamic time warping algorithm;
a function f[(m_i, n_i)] corresponding to the grid point (m_i, n_i) is preset; there is then a path cost function d[(m_i, n_i)], with f[(m_i, n_i)] = (m_{i-1}, n_{i-1}); the initialization is n_i = i (i = 1, 2, ..., N), n_1 = m_1 = 1, f[(m, n)] = (0, 0), where R is a parallelogram constraint on the admissible paths;
f[(m_i, n_i)] and d[(m_i, n_i)] are calculated by recursion, the predecessor m taking the values m_i, m_i − 1 and m_i − 2 permitted by the constraint; when i = N, the total path cost d[(M, N)] over the whole speech signal is obtained, and the point (M, N) is traced back through (m_{i-1}, n_{i-1}) = f[(m_i, n_i)] (i = N, N−1, ..., 3, 2), thereby obtaining the optimum path, which ends when (m_{i-1}, n_{i-1}) = (0, 0).
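As a concrete illustration of the recursion and backtracking described above, the following Python sketch implements dynamic time warping between a test template and a training template; the local step pattern (predecessors m_i, m_i − 1, m_i − 2 in the previous column) and the Euclidean frame distance are assumptions for the example, not requirements of the embodiment:

```python
import numpy as np

def dtw(test, train):
    """Dynamic time warping between a test and a training template.

    test, train: feature sequences of shape (M, dim) and (N, dim).
    Returns the total path cost d[(M, N)] and the backtracked
    optimal path from (0, 0) to (M-1, N-1) (0-indexed grid points).
    """
    M, N = len(test), len(train)
    # frame-to-frame Euclidean distances on the M x N grid
    dist = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    d = np.full((M, N), np.inf)         # accumulated path cost d[(m, n)]
    f = np.zeros((M, N, 2), dtype=int)  # predecessor function f[(m, n)]
    d[0, 0] = dist[0, 0]
    for n in range(1, N):
        for m in range(M):
            # assumed local step pattern: predecessors m, m-1, m-2
            # in the previous column n-1
            cands = [(m - s, n - 1) for s in (0, 1, 2) if m - s >= 0]
            pm, pn = min(cands, key=lambda c: d[c])
            if np.isfinite(d[pm, pn]):
                d[m, n] = d[pm, pn] + dist[m, n]
                f[m, n] = (pm, pn)
    # trace the endpoint back through f until the origin is reached
    path, (m, n) = [], (M - 1, N - 1)
    while (m, n) != (0, 0):
        path.append((m, n))
        m, n = f[m, n]
    path.append((0, 0))
    return d[M - 1, N - 1], path[::-1]
```

For two identical sequences the cost is zero and the optimal path is the diagonal, which also checks the backtracking.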
In this embodiment, vector-quantized feature data are used to represent the overall feature vector: a plurality of sampled signals are classified, each class forms one vector, and that vector is quantized. The input feature vector set X = {x_i | i = 1, 2, ..., T} is preset, n represents the number of iterations, L_i(n) represents the i-th sub-cell of the n-th iteration, and Y_i(n) represents the codeword of that sub-cell; there are J codewords in total, the maximum iteration number is M′, and the iteration threshold is set as ε. The specific process comprises the following steps: selecting the centroid of all input feature vectors as the initialization codebook {Y_i(0) | 1 ≤ i ≤ J}; splitting each cell in two with the small threshold, Y_i(1) = Y_i(0) − ε and Y_i(2) = Y_i(0) + ε, where Y_i(1) and Y_i(2) respectively represent the two cell codewords obtained by splitting, so that the number of cells is doubled, with n = 0; setting n = n + 1 and computing the distance between the feature vector of each frame and the current codewords: x ∈ C_i(n) if d(x, Y_i(n − 1)) ≤ d(x, Y_j(n − 1)), i ≠ j, 1 ≤ i, j ≤ M′, where d represents the Euclidean distance between vectors; when n = M′ or the distortion change rate falls below the threshold ε, the iteration ends: if the codebook has not yet reached J codewords, the process returns to splitting each cell in two; otherwise the best codebook {Y_i | i = 1, 2, ..., M′} is saved and output; if the distortion change rate is still greater than the threshold, the clustering step at n = n + 1 is repeated. Vector quantization represents the overall feature vector with a small number of representative codewords; a plurality of signals are classified, each class is represented by one vector, and that vector is quantized, which greatly compresses the data volume and improves speech-recognition efficiency.
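The splitting-and-clustering procedure above can be sketched as follows; the function name `lbg_codebook`, the power-of-two target size J and the exact distortion change-rate stopping rule are illustrative assumptions:

```python
import numpy as np

def lbg_codebook(X, J=4, eps=0.01, max_iter=20, tol=1e-3):
    """LBG codebook training by cell splitting and re-clustering.

    X: (T, dim) input feature vectors; J: target codebook size
    (assumed a power of two); eps: splitting threshold; tol: the
    distortion change-rate threshold ending each refinement loop.
    """
    # centroid of all input feature vectors is the initial codebook
    codebook = X.mean(axis=0, keepdims=True)
    while len(codebook) < J:
        # split each cell in two: Y - eps and Y + eps (cells double)
        codebook = np.vstack([codebook - eps, codebook + eps])
        prev_D = np.inf
        for _ in range(max_iter):
            # assign each vector to its nearest codeword (Euclidean d)
            d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            D = d[np.arange(len(X)), labels].mean()  # average distortion
            if np.isfinite(prev_D) and prev_D - D <= tol * prev_D:
                break  # distortion change rate fell below the threshold
            prev_D = D
            # move every codeword to the centroid of its cell
            for i in range(len(codebook)):
                if np.any(labels == i):
                    codebook[i] = X[labels == i].mean(axis=0)
    return codebook
```

On two well-separated clusters the trained codewords converge to the cluster centroids, which is the intended compression behavior.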
Optionally, the executing process of the feature extraction module includes:
performing voice enhancement on the voice signal by adopting spectral subtraction, and subtracting the noise power spectrum from the power spectrum of the voice signal to obtain a pure voice spectrum;
let s(t) denote the clean speech signal, n′(t) the additive noise signal, and y(t) the noisy speech signal; then y(t) = s(t) + n′(t), and Y(ω), S(ω) and N′(ω) denote the Fourier transforms of y(t), s(t) and n′(t), respectively, with Y(ω) = S(ω) + N′(ω);
if the speech signal and the additive noise are independent, |Y(ω)|² = |S(ω)|² + |N′(ω)|²; if P_y(ω), P_s(ω) and P_n′(ω) represent the power spectra of y(t), s(t) and n′(t), respectively, then P_y(ω) = P_s(ω) + P_n′(ω); the noise power spectrum P_n′(ω) is estimated from the speech-free but noisy segment before the utterance, and P_s(ω) = P_y(ω) − P_n′(ω); the subtracted power spectrum P_s(ω) is taken as the clean speech power spectrum, from which the denoised speech time-domain signal is recovered to obtain the test data.
In this embodiment, recovering the denoised speech time-domain signal from the power spectrum to obtain the test data includes: from the power spectrum |X(e^{jw})|² of the noisy speech x_0[n_0], the power spectrum of the clean speech is estimated using the expression |Ŝ(e^{jw})|² = |X(e^{jw})|² − |N̂_0(e^{jw})|², where x_0[n_0] represents the discrete sequence of the input signal; if additive noise n_0[n_0] is added to the speech signal s[n_0], then, since the noise is uncorrelated with the signal and non-stationary, with a rate of change smaller than that of the signal, x_0[n_0] = s[n_0] + n_0[n_0] is obtained, and Fourier transforming yields the expression |X(e^{jw})|² ≈ |S(e^{jw})|² + |N̂_0(e^{jw})|², where |N̂_0(e^{jw})|² is a statistical estimate of |N_0(e^{jw})|² taken during non-speech segments, |X(e^{jw})|² represents the power spectrum of the noisy speech, and |S(e^{jw})|² represents the power spectrum of the speech signal; in the non-speech segments, the estimate of the noise power spectrum is updated. Feature extraction obtains parameters representing the physiological and psychological characteristics of the person in the speech signal, so that the speech information can be analyzed rapidly and effectively and noise interference eliminated.
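The spectral-subtraction recovery described above can be sketched as follows; the frame length, the use of the leading frames as the noise-only segment, and half-wave rectification of negative power values are assumptions of this example, while reusing the noisy phase follows the description in this document:

```python
import numpy as np

def spectral_subtract(noisy, noise_frames, frame_len=256):
    """Basic power-spectral subtraction, frame by frame.

    noisy: 1-D noisy speech samples; the first `noise_frames` frames
    are assumed speech-free and give the noise power-spectrum estimate.
    The noisy phase is reused to recover the time-domain signal.
    """
    frames = noisy[: len(noisy) // frame_len * frame_len].reshape(-1, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    power = np.abs(spec) ** 2
    # P_n'(w): average power spectrum over the leading noise-only frames
    noise_pow = power[:noise_frames].mean(axis=0)
    # P_s(w) = P_y(w) - P_n'(w), negatives clamped to zero
    clean_pow = np.maximum(power - noise_pow, 0.0)
    # rebuild the spectrum with the original noisy phase and invert
    clean_spec = np.sqrt(clean_pow) * np.exp(1j * np.angle(spec))
    return np.fft.irfft(clean_spec, n=frame_len, axis=1).ravel()
```

Because each magnitude can only shrink, the output energy never exceeds the input energy, which is a quick sanity check on the subtraction.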
Optionally, the input noise is estimated from the noisy speech, the noise is removed from the noisy speech by a spectral subtraction algorithm to obtain an estimated value of the speech signal, the transfer function of the Wiener filter is re-estimated from the output signal, and the background noise is updated in both speech and non-speech segments, which includes:
calculating an initial smoothed estimate of the background-noise magnitude spectrum: the first N_no frames of the noisy signal are preset to be a pure noise signal, and the statistical average of the magnitudes is used for the estimate; the recursive expression is D_n(e^{jw}) = ((n − 1)·D_{n−1}(e^{jw}) + |X_n(e^{jw})|)/n, n = 1, ..., N_no, where D_n(e^{jw}) represents the n-th statistical estimate of the background-noise magnitude, its initial value is D_0(e^{jw}) = 0, |X_n(e^{jw})|² represents the power spectrum of the n-th frame of the noisy speech signal, and the initial smoothed estimate of the background-noise magnitude spectrum is D_{N_no}(e^{jw}), |X_{N_no}(e^{jw})| representing the magnitude spectrum of the N_no-th frame of the noisy speech signal;
letting the frame variable n = N_no + 1, the smoothed estimate of the noise power spectrum and the estimate of the signal power spectrum are computed for the current frame;
filtering the magnitude spectrum of the noisy speech signal to obtain an estimate of the background-noise magnitude spectrum of the current frame, and calculating the magnitude-spectrum estimate of the signal obtained by spectral subtraction;
updating the smoothed background-noise estimate with the current k-th-frame noise-magnitude-spectrum estimate N_k(e^{jw}) as D_k(e^{jw}) = p·D_{k−1}(e^{jw}) + (1 − p)·N_k(e^{jw}), where p represents the scale factor; if p is set reasonably, the rate of change of the speech signal and the rate of change of the noise signal can be separated, the background-noise estimate changing slowly and the signal estimate changing quickly;
computing a smoothed estimate of the signal magnitude spectrum; setting the frame variable n = n + 1, and if n > the total frame number N, ending the procedure and outputting the estimate of the clean-speech magnitude spectrum; otherwise repeating the above steps.
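A minimal sketch of the noise-tracking loop above; the recursive magnitude average for initialization and the first-order smoother with scale factor p for the update are reconstructions consistent with the description, and the function and variable names are illustrative:

```python
import numpy as np

def track_noise(frames_mag, n_no, p=0.98):
    """Track a smoothed background-noise magnitude-spectrum estimate.

    frames_mag: (N, bins) magnitude spectra of the noisy speech frames.
    The first n_no frames are assumed pure noise; their statistical
    (running) average initialises the estimate, which is then updated
    frame by frame with a first-order smoother controlled by p.
    """
    noise = np.zeros(frames_mag.shape[1])
    # recursive statistical average over the leading noise-only frames
    for n in range(1, n_no + 1):
        noise = ((n - 1) * noise + frames_mag[n - 1]) / n
    estimates = []
    for k in range(n_no, len(frames_mag)):
        # spectral subtraction gives the frame's signal-magnitude estimate
        sig = np.maximum(frames_mag[k] - noise, 0.0)
        # what subtraction attributes to noise in the current frame
        frame_noise = frames_mag[k] - sig
        # p close to 1: noise estimate changes slowly, signal quickly
        noise = p * noise + (1 - p) * frame_noise
        estimates.append(sig)
    return np.array(estimates), noise
```

With a constant noise floor the estimate stays on that floor and the signal estimate stays at zero, matching the slow/fast separation the scale factor is meant to achieve.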
In this embodiment, spectral subtraction directly subtracts the noise spectrum from the noisy speech signal and reconstructs the enhanced speech using the phase of the noisy speech. Under the condition that the additive noise and the short-time stationary speech signal are mutually independent, the noise power spectrum is subtracted from the amplitude spectrum of the noisy speech to obtain a purer speech spectrum. Voice-activity detection is combined to identify the speech-free segments, in which the characteristics of the background noise are estimated; since the noise cannot be updated during speech segments, the accuracy of the background-noise estimate is affected, and the accuracy of the voice-activity detection also influences the noise estimate. The input noise in the noisy speech is therefore estimated, the noise is removed by spectral subtraction to obtain an estimate of the speech signal, and the transfer function of the Wiener filter is re-estimated from the output signal, forming a feedback structure. The background noise can then be updated in both speech and non-speech segments, making its estimation more accurate.
Optionally, the execution of the preprocessing module includes:
the sampled data of the voice signal is preset as {Q_K | K = 1, 2, ..., n}, with n representing the total number of samples; let Δt = 1 and use the polynomial q(K) = Σ_{j=0}^{m} α_j·K^j, K ∈ [1, n], whose undetermined coefficients α_j (j = 0, 1, 2, ..., m) are to be calculated; the function q(K) is required to minimize the sum of squared errors E = Σ_{K=1}^{n} (q(K) − Q_K)² with respect to the discrete sampled data Q_K; the extremum condition on E is ∂E/∂α_i = 0; taking the partial derivative of E with respect to each α_i in turn and collecting terms yields a system of m + 1 linear equations, from which the m + 1 undetermined coefficients α_j (j = 0, 1, ..., m) are calculated, where m represents the order of the chosen polynomial and the indices satisfy 0 ≤ i, j ≤ m; when m = 0 the trend term is a constant, namely the arithmetic mean of the signal sampling data; when m = 1 it represents a linear trend term; and when m ≥ 2 it represents a curvilinear trend term.
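The least-squares trend-term computation above can be sketched as follows; solving the (m + 1)-equation normal system via `numpy.linalg.lstsq` rather than by hand is an implementation choice, and the function name is illustrative:

```python
import numpy as np

def remove_trend(samples, m=1):
    """Least-squares polynomial trend removal for sampled speech data.

    samples: {Q_K}, K = 1..n, with sampling step dt = 1.  Fits a
    polynomial of order m by minimising the sum of squared errors E
    and returns the detrended data and the coefficients a_0..a_m.
    """
    n = len(samples)
    K = np.arange(1, n + 1, dtype=float)
    # design matrix of the (m+1)-unknown linear system: columns K^j
    A = np.vander(K, m + 1, increasing=True)
    # least squares solves the normal equations A^T A a = A^T Q,
    # which is exactly the extremum condition dE/da_i = 0
    alpha, *_ = np.linalg.lstsq(A, samples, rcond=None)
    trend = A @ alpha
    return samples - trend, alpha
```

For data that is exactly a linear trend, the m = 1 fit recovers the coefficients and the detrended residual is zero.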
In this embodiment, successfully matching the recognition result with the pre-constructed corpus to obtain the control instruction includes: a gain vector in the r-dimensional Euclidean space R^r is obtained by a vectorization algorithm; according to a preset criterion, a finite number of vectors Y_1, Y_2, ..., Y_M in the r-dimensional space are found to represent it, where X denotes the input vector, Y_i denotes a quantization vector or codeword, and {Y_i | i = 1, 2, ..., M} denotes the codebook; the number of codewords M is called the codebook capacity, and the criterion of vector quantization on the training data is to minimize the quantization distortion at the given codebook capacity. When the speech of the speaker to be recognized arrives, a group of feature vectors is extracted in the same way as at training time and quantized with the N codebooks of the N speakers, respectively; the codebook closest to the group of feature vectors in the feature space is found, and the corresponding speaker i serves as the recognition result.
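The codebook-matching step can be sketched as follows; scoring each enrolled speaker by the average nearest-codeword distortion is the standard vector-quantization criterion implied above, and all names are illustrative:

```python
import numpy as np

def identify_speaker(features, codebooks):
    """Identify a speaker by vector-quantization distortion.

    features: (T, dim) feature vectors of the utterance to recognise.
    codebooks: list of N (M, dim) codebooks, one per enrolled speaker.
    Returns the index i of the codebook with the lowest average
    quantization distortion, i.e. the recognised speaker.
    """
    distortions = []
    for cb in codebooks:
        # distance of every feature vector to every codeword
        d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
        # quantize each vector to its nearest codeword, average the error
        distortions.append(d.min(axis=1).mean())
    return int(np.argmin(distortions))
```

Features drawn near one speaker's codewords yield the lowest distortion for that speaker's codebook, so its index is returned.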
It should be noted that the analysis and processing of a voice signal is rarely performed directly on the raw signal; according to frequency analysis, the voice signal can be decomposed into several signals of separate frequencies, the speech features in the signal are analyzed, and operations such as feature extraction, removal and substitution are performed, which are the key steps of voice-signal processing. Automatic voiceprint recognition determines the identity of a speaker either through probability matching in a subspace or by extracting deep-neural-network features for identity matching; each speaker can be described by modeling with a hidden Markov model and a Gaussian mixture model, or adapted from a previous model by means of a universal background model, so that voiceprint differences and channel differences are modeled separately, the channel differences are better compensated, and system performance is enhanced.
Referring to fig. 2, the invention also provides an intelligent door lock recognition method based on voice recognition, which comprises the following steps:
s1: the voice information sent by the intelligent door lock is obtained, and the voice information is preprocessed to obtain a voice signal;
s2: extracting physiological characteristics and personalized information representing a speaker from the voice signal, enhancing the voice signal according to the physiological characteristics and the personalized information, and performing speech enhancement on the voice signal to obtain test data, wherein the physiological characteristics comprise gender and age, and the personalized information comprises pitch and timbre;
s3: inputting the test data into a deep neural network, training and carrying out voiceprint recognition on the voice signal to obtain a recognition result, wherein the recognition result comprises user identity and message information;
and S4, successfully matching the identification result with a pre-constructed corpus to obtain a control instruction, and intelligently controlling the door lock according to the control instruction.
In this embodiment, the voice information sent by the intelligent door lock is obtained and preprocessed to obtain a voice signal; physiological features and personalized information representing the speaker are extracted from the voice signal, the voice signal is enhanced according to the physiological features and the personalized information, and speech enhancement is performed to obtain test data; the test data are input into a deep neural network, which is trained to perform voiceprint recognition on the voice signal and obtain a recognition result; the recognition result is matched against a pre-constructed corpus to obtain a control instruction, and the door lock is intelligently controlled according to the control instruction, so that the intelligent door lock can rapidly acquire voice information, accurately authenticate it and exclude noise, thereby realizing flexible control of the intelligent door lock and improving the user experience.
Any particular values in all examples shown and described herein are to be construed as merely illustrative and not a limitation, and thus other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The above examples merely represent a few embodiments of the present invention, which are described in more detail and are not to be construed as limiting the scope of the present invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention.