Interactive multifunctional elderly care robot recognition system
Technical Field
The invention relates to the technical field of robots, in particular to an interactive multifunctional elderly care robot recognition system.
Background
At present, population aging in China is increasingly severe. Against this background, elderly care robots are urgently needed in daily life: they can markedly improve the quality of life of the elderly, help relieve the labor shortage in elderly care, and ease the care burden on adult children, thereby alleviating problems brought about by population aging.
In the prior art, the design of the recognition system of an elderly care robot has certain defects. In actual use, the robot serves both specific persons (the elderly) and non-specific persons (family members or visitors). The recognition system must therefore address two questions: for a specific population, how to verify that the sender of a command is indeed a registered user; and for a non-specific population, how to distinguish different voiceprint characteristics and guarantee the accuracy of voiceprint recognition.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides an interactive multifunctional elderly care robot recognition system.
In order to achieve the purpose, the invention adopts the following technical scheme:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
Preferably, the endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
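As a concrete illustration of the endpoint detection step, the short-time-energy detector below marks a frame as speech when its energy exceeds a fraction of the peak frame energy. The patent does not specify the detection algorithm, so the 10%-of-peak threshold is an assumption; the frame length and shift are borrowed from step A1.

```python
import numpy as np

def detect_endpoints(signal, frame_len=512, hop=128, energy_ratio=0.1):
    """Return (start, end) sample indices of the speech region, or None.

    Illustrative scheme: a frame counts as speech when its short-time
    energy exceeds `energy_ratio` times the maximum frame energy.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.array([np.sum(signal[i * hop:i * hop + frame_len] ** 2.0)
                       for i in range(n_frames)])
    voiced = np.where(energy > energy_ratio * energy.max())[0]
    if voiced.size == 0:
        return None                      # no speech in this segment
    start = voiced[0] * hop              # first sample of first voiced frame
    end = voiced[-1] * hop + frame_len   # last sample of last voiced frame
    return start, end
```

On a recording of silence, then tone, then silence, the detector returns a pair of sample indices bracketing the tone.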
Preferably, the feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
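Steps A2-A7 can be sketched for a single 512-sample frame as follows. The 16 kHz sampling rate and the triangular shape of the Mel filters are assumptions for illustration; the patent fixes only the frame length, the 40 filters and the 13 output dimensions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def extract_features(frame, sr=16000, n_mels=40, n_ceps=13):
    """Steps A2-A7 for one frame (sr is an assumed sampling rate)."""
    T = len(frame)                                   # 512 samples per frame (A1)
    # A2: pre-emphasis, y(n) = s(n) - 0.97*s(n-1)
    y = np.empty(T)
    y[0] = 0.03 * frame[0]
    y[1:] = frame[1:] - 0.97 * frame[:-1]
    # A3: Hamming window softens the frame-boundary discontinuities
    y *= np.hamming(T)
    # A4: FFT, then squared magnitude -> real-domain energy spectrum
    spec = np.abs(np.fft.rfft(y, n=T)) ** 2
    # A5: 40 triangular filters spaced evenly on the Mel scale (assumed shape)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((T + 1) * hz_pts / sr).astype(int)
    filt = np.zeros(n_mels)
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, hi):
            if k < mid and mid > lo:
                filt[m - 1] += spec[k] * (k - lo) / (mid - lo)
            elif k >= mid and hi > mid:
                filt[m - 1] += spec[k] * (hi - k) / (hi - mid)
    # A6: Mel logarithmic energy, Mel(i) = ln(Filt(i))
    log_mel = np.log(filt + 1e-10)
    # A7: DCT keeps the first 13 cepstral coefficients
    i = np.arange(n_ceps)[:, None]
    j = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * i * (2 * j + 1) / (2 * n_mels))
    return dct @ log_mel
```

Feeding one 512-sample frame through `extract_features` yields a 13-dimensional cepstral vector, the per-frame feature used by the recognition modules below.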
Preferably, the model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
Preferably, in the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
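The K-Means initialisation (B2) followed by expectation-maximisation (B3) can be illustrated with a small diagonal-covariance Gaussian mixture. The component count and the synthetic data below are toy stand-ins for the 1024 mixtures and roughly 3 hours of speech described above.

```python
import numpy as np

def train_gmm(X, k=8, n_iter=5, seed=0):
    """Minimal diagonal-covariance GMM trained as in B2-B3: a short K-Means
    pass seeds the component means, then EM refines weights, means and
    variances. Returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # B2: a few K-Means iterations give the initial clustering
    means = X[rng.choice(n, k, replace=False)]
    for _ in range(5):
        dist = ((X[:, None, :] - means[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(0)
    weights = np.full(k, 1.0 / k)
    variances = np.ones((k, d)) * X.var(0)
    # B3: expectation-maximisation refines the mixture
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame
        logp = (np.log(weights)[None]
                - 0.5 * np.log(2 * np.pi * variances).sum(1)[None]
                - 0.5 * (((X[:, None, :] - means[None]) ** 2)
                         / variances[None]).sum(-1))
        logp -= logp.max(1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```

The same routine trains the background model on the pooled data and, with a different data set, the per-person models.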
Preferably, in B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
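The patent's score, interpreted as the average per-frame log-likelihood ratio between a specific-person model and the background model, can be sketched as follows; the model names in the usage example are hypothetical.

```python
import numpy as np

def llr_score(frames_logp_spk, frames_logp_ubm):
    """Average per-frame log-likelihood ratio used in B5:
    Score = (1/N) * sum_i [ln P(x_i|spk) - ln P(x_i|UBM)]."""
    spk = np.asarray(frames_logp_spk, dtype=float)
    ubm = np.asarray(frames_logp_ubm, dtype=float)
    return float(np.mean(spk - ubm))

def identify(frame_logps_per_model, frames_logp_ubm):
    """Score every model in the specific-person library and pick the best."""
    scores = {name: llr_score(lp, frames_logp_ubm)
              for name, lp in frame_logps_per_model.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Subtracting the background log-probability normalizes away how generically "speech-like" the utterance is, so models are compared only on speaker-specific evidence.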
Preferably, the model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
Preferably, the non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
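The generation loop C1-C5 can be sketched directly: `generate_sequence` below samples an observation sequence of length T from an HMM λ = (π, A, B) with N states and M observation symbols.

```python
import numpy as np

def generate_sequence(pi, A, B, T, seed=0):
    """Sample an observation sequence from the HMM (pi, A, B), as in C1-C5."""
    rng = np.random.default_rng(seed)
    n_states = len(pi)
    n_symbols = B.shape[1]
    state = rng.choice(n_states, p=pi)           # C1: initial state from pi
    observations = []
    for _ in range(T):                           # C2/C5: t = 1 .. T
        # C3: emit a symbol from this state's observation distribution
        observations.append(rng.choice(n_symbols, p=B[state]))
        # C4: move to the next state via the transition matrix A
        state = rng.choice(n_states, p=A[state])
    return observations
```

With deterministic (0/1) distributions the sampler reduces to a fixed walk, which makes the C1-C5 control flow easy to check by hand.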
Preferably, the speech acquisition and processing module further comprises a post-processing module, which effectively limits the influence of environmental noise on speech recognition. The method comprises the following steps:
D1: Based on the observation that human auditory perception is insensitive to slow changes of the excitation source, an empirical filter H(z) = 0.1 × (2 + z^(-1) - z^(-3) - 2z^(-4))/z^4 × (1 - 0.98z^(-1)) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: High-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, forming an improved 39-dimensional feature;
D3: Cepstral mean subtraction: the cepstral mean is removed and the signal correspondingly adjusted, for denoising;
D4: Gaussian regularization is applied to the speech features.
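Steps D2-D4 can be sketched on a matrix of per-frame 13-dimensional cepstral features (D1 is a filter applied earlier in the chain). Using `np.gradient` as the frame-to-frame difference operator is an illustrative choice; the patent does not specify the difference formula.

```python
import numpy as np

def postprocess(ceps):
    """D2-D4 on a (frames x 13) cepstral matrix: append first- and
    second-order differences (13 -> 39 dims), subtract the cepstral mean,
    and normalize each dimension to zero mean / unit variance."""
    # D2: first- and second-order differences across adjacent frames
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    feats = np.concatenate([ceps, d1, d2], axis=1)
    # D3: cepstral mean subtraction removes stationary channel effects
    feats -= feats.mean(axis=0, keepdims=True)
    # D4: Gaussian (variance) normalization of each feature dimension
    feats /= feats.std(axis=0, keepdims=True) + 1e-10
    return feats
```

After this step every feature dimension is zero-mean and unit-variance, which keeps any one dimension from dominating the model scores.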
Preferably, the robot further comprises an autonomous function subsystem comprising a help request module, a danger alarm module and a notification module. When the robot becomes stuck, trips or gets lost while traveling, the help request module, after the condition has been judged by the host computer, sends a corresponding request for help; the danger alarm module, using the robot's various sensors, raises an alarm when it detects abnormal temperature, abnormal air composition or similar conditions; the notification module reminds the user of the schedules, times and events the user has set. The robot further comprises an entertainment function subsystem comprising a playing module and a wireless communication module: when the system detects an instruction such as "play music" or "tell a joke", the host computer controls the wireless communication module to connect to the Internet and perform the corresponding search; after searching and downloading, the text, audio or video is played through the playing module.
The invention has the beneficial effects that:
1. The invention controls specific and non-specific populations with different subsystems: speech recognition for specific persons uses a trained model, while recognition for non-specific persons uses comparison against a vocabulary library, which greatly improves the accuracy of speech recognition.
2. The invention determines the start point and end point of the speech by endpoint detection of the speech signal before extracting feature points, which greatly reduces the computation load on the host computer and improves the response speed of the whole system.
3. The invention removes residual noise from the speech data by post-processing the extracted features, and periodically updates the specific-person model library, further improving the accuracy of speech recognition.
Drawings
Fig. 1 is a schematic diagram of the feature extraction process in an interactive multifunctional elderly care robot recognition system according to the present invention.
Detailed Description
The technical solution of the present patent will be described in further detail with reference to the following embodiments.
In the description of this patent, it should be noted that, unless otherwise expressly specified or limited, the terms "mounted", "connected" and "disposed" are to be understood broadly: for example, fixedly connected or disposed, detachably connected or disposed, or integrally connected or disposed. The specific meaning of these terms in this patent can be understood by those of ordinary skill in the art as the case may be.
Example 1:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
The endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
The feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
The model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
In the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
In B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
The model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
Example 2:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
The endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
The feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
The model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
In the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
In B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
The model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
The speech acquisition and processing module further comprises a post-processing module, which effectively limits the influence of environmental noise on speech recognition. The method comprises the following steps:
D1: Based on the observation that human auditory perception is insensitive to slow changes of the excitation source, an empirical filter H(z) = 0.1 × (2 + z^(-1) - z^(-3) - 2z^(-4))/z^4 × (1 - 0.98z^(-1)) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: High-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, forming an improved 39-dimensional feature;
D3: Cepstral mean subtraction: the cepstral mean is removed and the signal correspondingly adjusted, for denoising;
D4: Gaussian regularization is applied to the speech features.
Example 3:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
The endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
The feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
The model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
In the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
In B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
The model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
The speech acquisition and processing module further comprises a post-processing module, which effectively limits the influence of environmental noise on speech recognition. The method comprises the following steps:
D1: Based on the observation that human auditory perception is insensitive to slow changes of the excitation source, an empirical filter H(z) = 0.1 × (2 + z^(-1) - z^(-3) - 2z^(-4))/z^4 × (1 - 0.98z^(-1)) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: High-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, forming an improved 39-dimensional feature;
D3: Cepstral mean subtraction: the cepstral mean is removed and the signal correspondingly adjusted, for denoising;
D4: Gaussian regularization is applied to the speech features.
The system further comprises an autonomous function subsystem comprising a help request module, a danger alarm module and a notification module. When the robot becomes stuck, trips or gets lost while traveling, the help request module, after the condition has been judged by the host computer, sends a corresponding request for help; the danger alarm module, using the robot's various sensors, raises an alarm when it detects abnormal temperature, abnormal air composition or similar conditions; the notification module reminds the user of the schedules, times and events the user has set.
Example 4:
the utility model provides a multi-functional old nursing robot identification system of interactive, includes that specific person discerns branch system, unspecific person discerns branch system and speech acquisition processing module, specific person discerns branch system and unspecific person discerns branch system sharing same speech acquisition primary processing module, speech acquisition processing module includes that endpoint detection module and characteristic draw the module, specific person discerns branch system and includes model training and identification module and model update module.
The endpoint detection module is characterized by endpoint detection and can determine the starting point and the end point of a voice in a section of voice signal containing the voice and distinguish the voice signal from a non-voice signal.
The feature extraction module comprises three stages: front-end processing, feature extraction, and feature normalization. The specific steps are as follows:
A1: framing: the robot microphone collects the voice signal, and the voice stream is divided into frames with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: pre-emphasis, to compensate for the attenuation of high-frequency sounds by the mouth; each frame is processed as: y(0) = 0.03 × s(0), y(n) = s(n) − 0.97 × s(n−1) for n = 1, 2, …, 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: windowing: a Hamming window attenuates the discontinuities at frame boundaries introduced by framing: y'(n) = y(n) × [0.54 − 0.46 × cos(2πn/(T−1))], where n = 0, 1, …, T−1 and T = 512.
A4: fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy: X(k) = Σ_{n=0}^{N−1} y'(n) e^(−j2πnk/N), where k = 0, 1, …, N−1 and N = 512; squaring then yields the real-domain energy: X_k = |X(k)|².
A5: MEL energy: 40-dimensional MEL frequency sub-band energies are obtained through a bank of 40 MEL filters;
A6: MEL logarithmic energy: the logarithm of each MEL frequency sub-band energy is taken: mel(i) = ln(filt(i)), i = 1, 2, …, 40;
A7: discrete cosine transform: c(d) = Σ_{i=0}^{M−1} mel(i+1) × cos(πd(i + 0.5)/M), where d = 0, 1, …, D−1; M = 40; D = 13.
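The front end of steps A1–A7 can be sketched in Python with numpy. This is a minimal illustration, not the invention's implementation: the 16 kHz sampling rate and all function names are assumptions not stated in the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    # A5: 40 triangular filters spaced evenly on the mel scale (sr assumed)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def frames(signal, size=512, shift=128):
    # A1: frame length 512 samples, frame shift 128 samples
    n = 1 + max(0, (len(signal) - size) // shift)
    return np.stack([signal[i * shift:i * shift + size] for i in range(n)])

def mfcc_frame(frame, fb, n_ceps=13):
    # A2: pre-emphasis y(n) = s(n) - 0.97*s(n-1), y(0) = 0.03*s(0)
    y = np.append(0.03 * frame[0], frame[1:] - 0.97 * frame[:-1])
    # A3: Hamming window
    y = y * np.hamming(len(y))
    # A4: FFT and squaring to real-domain energy
    power = np.abs(np.fft.rfft(y)) ** 2
    # A5/A6: mel sub-band energies and their logarithm
    mel_energy = np.log(fb @ power + 1e-10)
    # A7: DCT-II, keeping the first 13 coefficients
    M = len(mel_energy)
    i = np.arange(M)
    return np.array([np.sum(mel_energy * np.cos(np.pi * d * (i + 0.5) / M))
                     for d in range(n_ceps)])
```

Each 512-sample frame thus yields a 13-dimensional cepstral vector, matching D = 13 in step A7.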
The model training and recognition module adopts a Gaussian mixture–universal background model (GMM-UBM). It models the human vocal system through unsupervised learning and constructs a distinct model for each specific person from that person's voice characteristics, by adaptively training a pre-trained background model on a small amount of voice data from the user to be recognized. The specific process is as follows:
B1: data preparation: about 3 hours of voice data are prepared;
B2: initial clustering of the data with the K-Means algorithm;
B3: expectation maximization (EM), further optimizing the result of step B2;
B4: modeling the specific-person targets;
B5: voice recognition.
In the voice data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender, and feature extraction is performed on each group separately. In B2 the number of clusters is 1024, and 5 iterations yield 1024 clusters whose centers and variances form the components of the Gaussian mixture model.
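Steps B2–B4 can be sketched with numpy as follows. This is a minimal illustration under stated assumptions, not the invention's implementation: it uses a small K-Means routine, diagonal-covariance EM, and a simple MAP adaptation of the UBM means; the relevance factor r = 16 and all names are assumptions (the text's system uses 1024 components and gender-split data).

```python
import numpy as np

def kmeans(X, k, iters=5, seed=0):
    # B2: initial clustering; 5 iterations as in the text
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, labels

def em_diag_gmm(X, means, iters=5):
    # B3: EM refinement of a diagonal-covariance GMM seeded by K-Means
    k, d = means.shape
    weights = np.full(k, 1.0 / k)
    variances = np.tile(X.var(0), (k, 1))
    for _ in range(iters):
        # E-step: responsibilities of each Gaussian for each frame
        logp = (-0.5 * (((X[:, None] - means) ** 2) / variances
                        + np.log(2 * np.pi * variances)).sum(-1)
                + np.log(weights))
        logp -= logp.max(1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances

def map_adapt_means(X, weights, means, variances, r=16.0):
    # B4: specific-person model via MAP adaptation of the UBM means
    # (r is an assumed relevance factor, not given in the text)
    logp = (-0.5 * (((X[:, None] - means) ** 2) / variances
                    + np.log(2 * np.pi * variances)).sum(-1) + np.log(weights))
    logp -= logp.max(1, keepdims=True)
    resp = np.exp(logp)
    resp /= resp.sum(1, keepdims=True)
    nk = resp.sum(0)
    ex = (resp.T @ X) / np.maximum(nk, 1e-10)[:, None]
    alpha = (nk / (nk + r))[:, None]
    return alpha * ex + (1 - alpha) * means
```

Components with little adaptation data (small nk) stay close to the background model, which is the usual rationale for MAP adaptation over full retraining on sparse user data.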
In B5, when the user issues a command to the robot or communicates with it by voice, the robot computes the score of the collected voice data on each model in the specific-person model library; after necessary normalization of the scores, the specific person to whom the highest-scoring model belongs is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spr) − ln P(x_i | λ_UBM)], where ln P(x_i | λ_spr) is the log-probability of the i-th speech frame on a specific-person model, ln P(x_i | λ_UBM) is the log-probability of the i-th frame on the UBM (universal background) model, and N is the number of frames of speech to be recognized.
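The B5 scoring rule (the average per-frame log-likelihood ratio against the background model) can be sketched as follows; this is a self-contained illustration with assumed helper names, where each model is a (weights, means, variances) triple of a diagonal-covariance GMM.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    # Per-frame log-likelihood ln P(x_i | λ) under a diagonal-covariance GMM,
    # computed with the log-sum-exp trick for numerical stability.
    logp = (-0.5 * (((X[:, None] - means) ** 2) / variances
                    + np.log(2 * np.pi * variances)).sum(-1) + np.log(weights))
    m = logp.max(1, keepdims=True)
    return (m + np.log(np.exp(logp - m).sum(1, keepdims=True))).ravel()

def llr_score(X, spk, ubm):
    # Score = (1/N) * sum_i [ln P(x_i|λ_spr) - ln P(x_i|λ_UBM)]
    return float(np.mean(gmm_loglik(X, *spk) - gmm_loglik(X, *ubm)))

def identify(X, models, ubm):
    # The highest-scoring specific-person model gives the recognized identity
    scores = {name: llr_score(X, m, ubm) for name, m in models.items()}
    return max(scores, key=scores.get), scores
```

Subtracting the UBM term normalizes away frame difficulty, so scores from different utterances become comparable before the maximum is taken.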
The model update module updates the specific-person models in the recognition system: whenever the system confirms a speaker's identity with high confidence, the voice data are stored in a database together with the identity, time, and other information, and after a period of time the system retrains the specific-person models and updates the database based on the recently acquired data.
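The update workflow described above can be sketched as a small buffering class; the threshold, retrain trigger, and all names here are invented for illustration and are not specified by the text.

```python
import time
from collections import defaultdict

class ModelUpdater:
    """Stores high-confidence utterances and periodically retrains
    the corresponding specific-person model (illustrative sketch)."""

    def __init__(self, threshold=0.5, retrain_after=50):
        self.threshold = threshold          # minimum score to trust an identification
        self.retrain_after = retrain_after  # utterances collected before retraining
        self.db = defaultdict(list)         # speaker -> [(timestamp, features), ...]

    def on_recognized(self, speaker, score, features):
        # Only confident identifications enter the database, as in the text
        if score >= self.threshold:
            self.db[speaker].append((time.time(), features))
            if len(self.db[speaker]) >= self.retrain_after:
                self.retrain(speaker)

    def retrain(self, speaker):
        # Placeholder: here the specific-person GMM would be re-adapted
        # from the accumulated features, then the buffer cleared.
        self.db[speaker].clear()
```

Gating on the score keeps mislabeled audio out of the training set, which is the main risk of this kind of unsupervised model refresh.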
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model (HMM). In actual use, a vocabulary library is first established and the parameters are specified: the number of states N, the number of observation symbols M, and three probability distributions: the state transition probability matrix A, the observation symbol probability matrix B, and the initial state probability vector π. Once a concrete model λ = (π, A, B) is established, then for given N, M, A, B, and π the model can generate an observation sequence O = (o_1, o_2, …, o_T), where T is the number of observation symbols, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: select an initial state S_i according to the initial state distribution π;
C2: set t = 1;
C3: select an output symbol according to the observation symbol probability distribution of state S_i;
C4: move to a new state S_j according to the transition probabilities of state S_i;
C5: set t = t + 1; if t < T, jump to C3, otherwise finish;
C6: score each model against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the highest-scoring model.
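The generation procedure C1–C5 and the model scoring of C6 can be sketched as follows; the forward algorithm stands in for the unspecified scoring method, and all names are assumptions.

```python
import numpy as np

def generate_observations(pi, A, B, T, seed=0):
    """Generate O = (o1, ..., oT) from λ = (π, A, B).
    pi: initial state distribution (N,), A: transition matrix (N, N),
    B: observation symbol probabilities (N, M)."""
    rng = np.random.default_rng(seed)
    N, M = B.shape
    s = rng.choice(N, p=pi)                    # C1: initial state from π
    obs = []
    for _ in range(T):                         # C2/C5: t = 1 .. T
        obs.append(int(rng.choice(M, p=B[s]))) # C3: emit a symbol from B[s]
        s = rng.choice(N, p=A[s])              # C4: move to a new state via A[s]
    return obs

def forward_loglik(pi, A, B, obs):
    # C6: score ln P(O | λ) with the scaled forward algorithm
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll
```

In recognition, `forward_loglik` would be evaluated once per vocabulary word's model, and the word whose model scores highest is returned, as step C6 describes.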
The voice acquisition and processing module further comprises a post-processing module, which effectively controls the influence of environmental noise on speech recognition through the following steps:
D1: exploiting the characteristic that human auditory perception is insensitive to slow changes in the excitation source, an empirical filter is applied: H(z) = 0.1 × (2 + z^(-1) − z^(-3) − 2z^(-4)) / z^4 × (1 − 0.98z^(-1)), reducing the proportion of energy outside the frequency range of the human voice;
D2: high-order differences: first- and second-order differences are computed between adjacent 13-dimensional feature vectors, forming improved 39-dimensional features;
D3: cepstral mean subtraction: the cepstral mean is removed and the original signal adjusted accordingly, suppressing noise;
D4: Gaussian regularization of the speech features.
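Steps D2–D4 can be sketched as follows (the D1 filter is omitted; `np.gradient` is used as a simple stand-in for the first- and second-order differences, and all function names are assumptions):

```python
import numpy as np

def add_deltas(ceps):
    # D2: append first- and second-order differences -> 39-dim features
    # ceps: (T, 13) frame-level cepstra
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([ceps, d1, d2])

def cmn(feats):
    # D3: cepstral mean subtraction to suppress stationary channel effects
    return feats - feats.mean(0, keepdims=True)

def gaussianize(feats):
    # D4: per-dimension mean/variance normalization as a simple form of
    # "Gaussian regularization" of the features
    return cmn(feats) / (feats.std(0, keepdims=True) + 1e-10)
```

The 13 static coefficients plus their deltas and delta-deltas give the 39-dimensional vector mentioned in D2.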
The system further comprises an autonomous function subsystem, which includes a help request module, a danger alarm module, and a notification module. In the help request module, when the robot encounters a situation such as being blocked, tripping, or becoming lost while moving, the upper computer makes a judgment and the robot then issues a corresponding request for help. The danger alarm module combines the robot's various sensors and raises an alarm when it detects conditions such as abnormal temperature or abnormal air composition. The notification module provides reminders for the various schedules, times, and events set by the user.
The system also comprises an entertainment function subsystem, which includes a playing module and a wireless communication module. When the system detects an instruction such as "play music" or "tell a joke", the upper computer controls the wireless communication module to connect to the Internet and perform the corresponding search; after searching and downloading, the text, audio, or video is played through the playing module.
The above description covers only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification that a person skilled in the art could readily conceive within the technical scope disclosed herein, according to the technical solutions and inventive concept of the present invention, shall fall within the scope of protection of the present invention.