Interactive multifunctional elderly care robot recognition system
Technical Field
The invention relates to the technical field of robots, in particular to an interactive multifunctional elderly care robot recognition system.
Background
At present, population aging in China is increasingly severe. Against this background, elderly care robots are urgently needed in daily life: they can markedly improve the quality of life of the elderly, help relieve the labor shortage in elderly care, and ease the care burden on adult children, thereby alleviating problems brought about by population aging.
In the prior art, the design of the recognition system of an elderly care robot has certain defects. In actual use, the robot serves both specific persons (the elderly) and non-specific persons (family members or visitors). The recognition system must therefore address two questions: for a specific population, how to verify that the sender of a command is indeed a registered user; and for a non-specific population, how to distinguish different voiceprint characteristics and guarantee the accuracy of voiceprint recognition.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides an interactive multifunctional elderly care robot recognition system.
In order to achieve the purpose, the invention adopts the following technical scheme:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
Preferably, the endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
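As a concrete illustration of the endpoint detection step, the short-time-energy detector below marks a frame as speech when its energy exceeds a fraction of the peak frame energy. The patent does not specify the detection algorithm, so the 10%-of-peak threshold is an assumption; the frame length and shift are borrowed from step A1.

```python
import numpy as np

def detect_endpoints(signal, frame_len=512, hop=128, energy_ratio=0.1):
    """Return (start, end) sample indices of the speech region, or None.

    Illustrative scheme: a frame counts as speech when its short-time
    energy exceeds `energy_ratio` times the maximum frame energy.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.array([np.sum(signal[i * hop:i * hop + frame_len] ** 2.0)
                       for i in range(n_frames)])
    voiced = np.where(energy > energy_ratio * energy.max())[0]
    if voiced.size == 0:
        return None                      # no speech in this segment
    start = voiced[0] * hop              # first sample of first voiced frame
    end = voiced[-1] * hop + frame_len   # last sample of last voiced frame
    return start, end
```

On a recording of silence, then tone, then silence, the detector returns a pair of sample indices bracketing the tone.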
Preferably, the feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
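Steps A2-A7 can be sketched for a single 512-sample frame as follows. The 16 kHz sampling rate and the triangular shape of the Mel filters are assumptions for illustration; the patent fixes only the frame length, the 40 filters and the 13 output dimensions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def extract_features(frame, sr=16000, n_mels=40, n_ceps=13):
    """Steps A2-A7 for one frame (sr is an assumed sampling rate)."""
    T = len(frame)                                   # 512 samples per frame (A1)
    # A2: pre-emphasis, y(n) = s(n) - 0.97*s(n-1)
    y = np.empty(T)
    y[0] = 0.03 * frame[0]
    y[1:] = frame[1:] - 0.97 * frame[:-1]
    # A3: Hamming window softens the frame-boundary discontinuities
    y *= np.hamming(T)
    # A4: FFT, then squared magnitude -> real-domain energy spectrum
    spec = np.abs(np.fft.rfft(y, n=T)) ** 2
    # A5: 40 triangular filters spaced evenly on the Mel scale (assumed shape)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((T + 1) * hz_pts / sr).astype(int)
    filt = np.zeros(n_mels)
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, hi):
            if k < mid and mid > lo:
                filt[m - 1] += spec[k] * (k - lo) / (mid - lo)
            elif k >= mid and hi > mid:
                filt[m - 1] += spec[k] * (hi - k) / (hi - mid)
    # A6: Mel logarithmic energy, Mel(i) = ln(Filt(i))
    log_mel = np.log(filt + 1e-10)
    # A7: DCT keeps the first 13 cepstral coefficients
    i = np.arange(n_ceps)[:, None]
    j = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * i * (2 * j + 1) / (2 * n_mels))
    return dct @ log_mel
```

Feeding one 512-sample frame through `extract_features` yields a 13-dimensional cepstral vector, the per-frame feature used by the recognition modules below.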
Preferably, the model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
Preferably, in the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
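The K-Means initialisation (B2) followed by expectation-maximisation (B3) can be illustrated with a small diagonal-covariance Gaussian mixture. The component count and the synthetic data below are toy stand-ins for the 1024 mixtures and roughly 3 hours of speech described above.

```python
import numpy as np

def train_gmm(X, k=8, n_iter=5, seed=0):
    """Minimal diagonal-covariance GMM trained as in B2-B3: a short K-Means
    pass seeds the component means, then EM refines weights, means and
    variances. Returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # B2: a few K-Means iterations give the initial clustering
    means = X[rng.choice(n, k, replace=False)]
    for _ in range(5):
        dist = ((X[:, None, :] - means[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(0)
    weights = np.full(k, 1.0 / k)
    variances = np.ones((k, d)) * X.var(0)
    # B3: expectation-maximisation refines the mixture
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame
        logp = (np.log(weights)[None]
                - 0.5 * np.log(2 * np.pi * variances).sum(1)[None]
                - 0.5 * (((X[:, None, :] - means[None]) ** 2)
                         / variances[None]).sum(-1))
        logp -= logp.max(1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```

The same routine trains the background model on the pooled data and, with a different data set, the per-person models.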
Preferably, in B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
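The patent's score, interpreted as the average per-frame log-likelihood ratio between a specific-person model and the background model, can be sketched as follows; the model names in the usage example are hypothetical.

```python
import numpy as np

def llr_score(frames_logp_spk, frames_logp_ubm):
    """Average per-frame log-likelihood ratio used in B5:
    Score = (1/N) * sum_i [ln P(x_i|spk) - ln P(x_i|UBM)]."""
    spk = np.asarray(frames_logp_spk, dtype=float)
    ubm = np.asarray(frames_logp_ubm, dtype=float)
    return float(np.mean(spk - ubm))

def identify(frame_logps_per_model, frames_logp_ubm):
    """Score every model in the specific-person library and pick the best."""
    scores = {name: llr_score(lp, frames_logp_ubm)
              for name, lp in frame_logps_per_model.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Subtracting the background log-probability normalizes away how generically "speech-like" the utterance is, so models are compared only on speaker-specific evidence.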
Preferably, the model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
Preferably, the non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
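The generation loop C1-C5 can be sketched directly: `generate_sequence` below samples an observation sequence of length T from an HMM λ = (π, A, B) with N states and M observation symbols.

```python
import numpy as np

def generate_sequence(pi, A, B, T, seed=0):
    """Sample an observation sequence from the HMM (pi, A, B), as in C1-C5."""
    rng = np.random.default_rng(seed)
    n_states = len(pi)
    n_symbols = B.shape[1]
    state = rng.choice(n_states, p=pi)           # C1: initial state from pi
    observations = []
    for _ in range(T):                           # C2/C5: t = 1 .. T
        # C3: emit a symbol from this state's observation distribution
        observations.append(rng.choice(n_symbols, p=B[state]))
        # C4: move to the next state via the transition matrix A
        state = rng.choice(n_states, p=A[state])
    return observations
```

With deterministic (0/1) distributions the sampler reduces to a fixed walk, which makes the C1-C5 control flow easy to check by hand.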
Preferably, the speech acquisition and processing module further comprises a post-processing module, which effectively limits the influence of environmental noise on speech recognition. The method comprises the following steps:
D1: Based on the observation that human auditory perception is insensitive to slow changes of the excitation source, an empirical filter H(z) = 0.1 × (2 + z^(-1) - z^(-3) - 2z^(-4))/z^4 × (1 - 0.98z^(-1)) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: High-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, forming an improved 39-dimensional feature;
D3: Cepstral mean subtraction: the cepstral mean is removed and the signal correspondingly adjusted, for denoising;
D4: Gaussian regularization is applied to the speech features.
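Steps D2-D4 can be sketched on a matrix of per-frame 13-dimensional cepstral features (D1 is a filter applied earlier in the chain). Using `np.gradient` as the frame-to-frame difference operator is an illustrative choice; the patent does not specify the difference formula.

```python
import numpy as np

def postprocess(ceps):
    """D2-D4 on a (frames x 13) cepstral matrix: append first- and
    second-order differences (13 -> 39 dims), subtract the cepstral mean,
    and normalize each dimension to zero mean / unit variance."""
    # D2: first- and second-order differences across adjacent frames
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    feats = np.concatenate([ceps, d1, d2], axis=1)
    # D3: cepstral mean subtraction removes stationary channel effects
    feats -= feats.mean(axis=0, keepdims=True)
    # D4: Gaussian (variance) normalization of each feature dimension
    feats /= feats.std(axis=0, keepdims=True) + 1e-10
    return feats
```

After this step every feature dimension is zero-mean and unit-variance, which keeps any one dimension from dominating the model scores.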
Preferably, the robot further comprises an autonomous function subsystem comprising a help request module, a danger alarm module and a notification module. When the robot becomes stuck, trips or gets lost while traveling, the help request module, after the condition has been judged by the host computer, sends a corresponding request for help; the danger alarm module, using the robot's various sensors, raises an alarm when it detects abnormal temperature, abnormal air composition or similar conditions; the notification module reminds the user of the schedules, times and events the user has set. The robot further comprises an entertainment function subsystem comprising a playing module and a wireless communication module: when the system detects an instruction such as "play music" or "tell a joke", the host computer controls the wireless communication module to connect to the Internet and perform the corresponding search; after searching and downloading, the text, audio or video is played through the playing module.
The invention has the beneficial effects that:
1. The invention controls specific and non-specific populations with different subsystems: speech recognition for specific persons uses a trained model, while recognition for non-specific persons uses comparison against a vocabulary library, which greatly improves the accuracy of speech recognition.
2. The invention determines the start point and end point of the speech by endpoint detection of the speech signal before extracting feature points, which greatly reduces the computation load on the host computer and improves the response speed of the whole system.
3. The invention removes residual noise from the speech data by post-processing the extracted features, and periodically updates the specific-person model library, further improving the accuracy of speech recognition.
Drawings
Fig. 1 is a schematic diagram of the feature extraction process in an interactive multifunctional elderly care robot recognition system according to the present invention.
Detailed Description
The technical solution of the present patent will be described in further detail with reference to the following embodiments.
In the description of this patent, it should be noted that, unless otherwise expressly specified or limited, the terms "mounted", "connected" and "disposed" are to be understood broadly: for example, fixedly connected or disposed, detachably connected or disposed, or integrally connected or disposed. The specific meaning of these terms in this patent can be understood by those of ordinary skill in the art as the case may be.
Example 1:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
The endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
The feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
The model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
In the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
In B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
The model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
Example 2:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
The endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
The feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
The model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
In the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
In B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
The model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
The speech acquisition and processing module further comprises a post-processing module, which effectively limits the influence of environmental noise on speech recognition. The method comprises the following steps:
D1: Based on the observation that human auditory perception is insensitive to slow changes of the excitation source, an empirical filter H(z) = 0.1 × (2 + z^(-1) - z^(-3) - 2z^(-4))/z^4 × (1 - 0.98z^(-1)) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: High-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, forming an improved 39-dimensional feature;
D3: Cepstral mean subtraction: the cepstral mean is removed and the signal correspondingly adjusted, for denoising;
D4: Gaussian regularization is applied to the speech features.
Example 3:
An interactive multifunctional elderly care robot recognition system comprises a specific-person recognition subsystem, a non-specific-person recognition subsystem and a speech acquisition and processing module. The specific-person and non-specific-person recognition subsystems share the same speech acquisition and primary processing module. The speech acquisition and processing module comprises an endpoint detection module and a feature extraction module, and the specific-person recognition subsystem comprises a model training and recognition module and a model updating module.
The endpoint detection module performs endpoint detection: within a segment of signal containing speech, it determines the start point and end point of the speech and distinguishes the speech signal from the non-speech signal.
The feature extraction module comprises three processes: front-end processing, feature extraction and feature normalization. The specific steps are as follows:
A1: Framing: the robot microphone collects the speech signal, and the voice stream is divided into frames, with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: Pre-emphasis, to compensate for the attenuation of high-frequency sounds by the lips; each frame is processed as y(0) = 0.03 × s(0), y(n) = s(n) - 0.97 × s(n-1), n = 1, 2, ..., 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: Windowing: a Hamming window, w(n) = 0.54 - 0.46 × cos(2πn/(T-1)), attenuates the discontinuities introduced at the frame boundaries by framing, where n = 0, 1, ..., T-1 and T = 512;
A4: Fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy, X_k = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N), where k = 0, 1, ..., N-1 and N = 512; squaring the magnitude yields the real-domain energy X_k = |X_k|²;
A5: Mel energy: 40-dimensional Mel frequency sub-band energies Filt(i) are obtained through a bank of 40 Mel filters;
A6: Mel logarithmic energy: the logarithm of each Mel frequency sub-band energy is taken: Mel(i) = ln(Filt(i)), i = 1, 2, ..., 40;
A7: Discrete cosine transform: C(i) = Σ_{j=1}^{M} Mel(j)·cos(π·i·(2j-1)/(2M)), where i = 0, 1, ..., D-1, M = 40 and D = 13, yielding a 13-dimensional cepstral feature.
The model training and recognition module adopts a Gaussian mixture-universal background model (GMM-UBM). It fits the human vocal apparatus through unsupervised learning, constructing a distinct specific-person model from the voice characteristics of each person, and it adapts the pre-trained background model with a small amount of speech data from the user to be recognized. The specific process is as follows:
B1: Data preparation: about 3 hours of speech data are prepared;
B2: Initial clustering of the data using the K-Means algorithm;
B3: Expectation maximization: the result of step B2 is further optimized;
B4: Modeling of the specific-person target;
B5: Speech recognition.
In the speech data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender and feature extraction is performed on each group separately. In B2, the number of classes is 1024; after 5 iterations, 1024 classes, each comprising a class center and a variance, are obtained as the components of the Gaussian mixture model.
In B5, when the user issues a command to the robot or converses with it, the robot computes the score of the collected speech data on each model in the specific-person model library; after the necessary normalization of the scores, the specific person whose model scores highest is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spk) - ln P(x_i | λ_UBM)]
where ln P(x_i | λ_spk) is the log-probability of the i-th frame of speech on the specific-person model, ln P(x_i | λ_UBM) is the log-probability of that frame on the universal background model (UBM), and N is the number of frames of the speech to be recognized.
The model updating module updates the specific-person models of the recognition system: each time the system confirms a speaker's identity with a high success rate, the corresponding speech data are stored in the database together with identity and time information; after a period of time, the system retrains the specific-person models on the recently acquired data and updates the database.
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model. In actual use, a vocabulary library is first established and the model parameters are given: the number of states N, the number of observation symbols M, and three probability distributions, namely the state transition probability matrix A, the observation symbol probability matrix B and the initial state probability vector π. Once a specific model λ = (π, A, B) is established, for given N, M, A, B and π the model can generate an observation sequence O = (o_1, o_2, ..., o_T), where T is the number of observation symbols emitted, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: An initial state S_i is selected according to the initial state distribution π;
C2: t is set to 1;
C3: An output symbol is selected according to the observation symbol probability distribution of state S_i;
C4: A transition is made to a new state S_j according to the transition probabilities of state S_i;
C5: t is set to t + 1; if t < T, jump to C3, otherwise end;
C6: Each model is scored against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the model with the highest score.
The speech acquisition and processing module further comprises a post-processing module, which effectively limits the influence of environmental noise on speech recognition. The method comprises the following steps:
D1: Based on the observation that human auditory perception is insensitive to slow changes of the excitation source, an empirical filter H(z) = 0.1 × (2 + z^(-1) - z^(-3) - 2z^(-4))/z^4 × (1 - 0.98z^(-1)) is applied to reduce the proportion of energy outside the human voice frequency range;
D2: High-order differences: first- and second-order differences are computed between adjacent frames of the 13-dimensional features, forming an improved 39-dimensional feature;
D3: Cepstral mean subtraction: the cepstral mean is removed and the signal correspondingly adjusted, for denoising;
D4: Gaussian regularization is applied to the speech features.
The system further comprises an autonomous function subsystem comprising a help request module, a danger alarm module and a notification module. When the robot becomes stuck, trips or gets lost while traveling, the help request module, after the condition has been judged by the host computer, sends a corresponding request for help; the danger alarm module, using the robot's various sensors, raises an alarm when it detects abnormal temperature, abnormal air composition or similar conditions; the notification module reminds the user of the schedules, times and events the user has set.
Example 4:
the utility model provides a multi-functional old nursing robot identification system of interactive, includes that specific person discerns branch system, unspecific person discerns branch system and speech acquisition processing module, specific person discerns branch system and unspecific person discerns branch system sharing same speech acquisition primary processing module, speech acquisition processing module includes that endpoint detection module and characteristic draw the module, specific person discerns branch system and includes model training and identification module and model update module.
The endpoint detection module is characterized by endpoint detection and can determine the starting point and the end point of a voice in a section of voice signal containing the voice and distinguish the voice signal from a non-voice signal.
The feature extraction module comprises three stages: front-end processing, feature extraction, and feature normalization. The specific steps are as follows:
A1: framing: the robot microphone collects the voice signal, and the voice stream is divided into frames with a frame length of 512 sampling points and a frame shift of 128 sampling points;
A2: pre-emphasis, to compensate for the attenuation of high-frequency sounds by the mouth; each frame is processed as: y(0) = 0.03 × s(0), y(n) = s(n) − 0.97 × s(n−1) for n = 1, 2, …, 511, where s(n) denotes the (n+1)-th sample point in a frame;
A3: windowing: a Hamming window attenuates the discontinuities at frame boundaries introduced by framing: y'(n) = y(n) × [0.54 − 0.46 × cos(2πn/(T−1))], where n = 0, 1, …, T−1 and T = 512.
A4: fast Fourier transform: a radix-2 discrete Fourier transform converts time-domain energy into frequency-domain energy: X(k) = Σ_{n=0}^{N−1} y'(n) e^(−j2πnk/N), where k = 0, 1, …, N−1 and N = 512; squaring then yields the real-domain energy: X_k = |X(k)|².
A5: MEL energy: 40-dimensional MEL frequency sub-band energies are obtained through a bank of 40 MEL filters;
A6: MEL logarithmic energy: the logarithm of each MEL frequency sub-band energy is taken: mel(i) = ln(filt(i)), i = 1, 2, …, 40;
A7: discrete cosine transform: c(d) = Σ_{i=0}^{M−1} mel(i+1) × cos(πd(i + 0.5)/M), where d = 0, 1, …, D−1; M = 40; D = 13.
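The front end of steps A1–A7 can be sketched in Python with numpy. This is a minimal illustration, not the invention's implementation: the 16 kHz sampling rate and all function names are assumptions not stated in the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    # A5: 40 triangular filters spaced evenly on the mel scale (sr assumed)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def frames(signal, size=512, shift=128):
    # A1: frame length 512 samples, frame shift 128 samples
    n = 1 + max(0, (len(signal) - size) // shift)
    return np.stack([signal[i * shift:i * shift + size] for i in range(n)])

def mfcc_frame(frame, fb, n_ceps=13):
    # A2: pre-emphasis y(n) = s(n) - 0.97*s(n-1), y(0) = 0.03*s(0)
    y = np.append(0.03 * frame[0], frame[1:] - 0.97 * frame[:-1])
    # A3: Hamming window
    y = y * np.hamming(len(y))
    # A4: FFT and squaring to real-domain energy
    power = np.abs(np.fft.rfft(y)) ** 2
    # A5/A6: mel sub-band energies and their logarithm
    mel_energy = np.log(fb @ power + 1e-10)
    # A7: DCT-II, keeping the first 13 coefficients
    M = len(mel_energy)
    i = np.arange(M)
    return np.array([np.sum(mel_energy * np.cos(np.pi * d * (i + 0.5) / M))
                     for d in range(n_ceps)])
```

Each 512-sample frame thus yields a 13-dimensional cepstral vector, matching D = 13 in step A7.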
The model training and recognition module adopts a Gaussian mixture–universal background model (GMM-UBM). It models the human vocal system through unsupervised learning and constructs a distinct model for each specific person from that person's voice characteristics, by adaptively training a pre-trained background model on a small amount of voice data from the user to be recognized. The specific process is as follows:
B1: data preparation: about 3 hours of voice data are prepared;
B2: initial clustering of the data with the K-Means algorithm;
B3: expectation maximization (EM), further optimizing the result of step B2;
B4: modeling the specific-person targets;
B5: voice recognition.
In the voice data of B1, the number of specific persons is 80-100; the data are divided into two groups by gender, and feature extraction is performed on each group separately. In B2 the number of clusters is 1024, and 5 iterations yield 1024 clusters whose centers and variances form the components of the Gaussian mixture model.
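Steps B2–B4 can be sketched with numpy as follows. This is a minimal illustration under stated assumptions, not the invention's implementation: it uses a small K-Means routine, diagonal-covariance EM, and a simple MAP adaptation of the UBM means; the relevance factor r = 16 and all names are assumptions (the text's system uses 1024 components and gender-split data).

```python
import numpy as np

def kmeans(X, k, iters=5, seed=0):
    # B2: initial clustering; 5 iterations as in the text
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, labels

def em_diag_gmm(X, means, iters=5):
    # B3: EM refinement of a diagonal-covariance GMM seeded by K-Means
    k, d = means.shape
    weights = np.full(k, 1.0 / k)
    variances = np.tile(X.var(0), (k, 1))
    for _ in range(iters):
        # E-step: responsibilities of each Gaussian for each frame
        logp = (-0.5 * (((X[:, None] - means) ** 2) / variances
                        + np.log(2 * np.pi * variances)).sum(-1)
                + np.log(weights))
        logp -= logp.max(1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances

def map_adapt_means(X, weights, means, variances, r=16.0):
    # B4: specific-person model via MAP adaptation of the UBM means
    # (r is an assumed relevance factor, not given in the text)
    logp = (-0.5 * (((X[:, None] - means) ** 2) / variances
                    + np.log(2 * np.pi * variances)).sum(-1) + np.log(weights))
    logp -= logp.max(1, keepdims=True)
    resp = np.exp(logp)
    resp /= resp.sum(1, keepdims=True)
    nk = resp.sum(0)
    ex = (resp.T @ X) / np.maximum(nk, 1e-10)[:, None]
    alpha = (nk / (nk + r))[:, None]
    return alpha * ex + (1 - alpha) * means
```

Components with little adaptation data (small nk) stay close to the background model, which is the usual rationale for MAP adaptation over full retraining on sparse user data.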
In B5, when the user issues a command to the robot or communicates with it by voice, the robot computes the score of the collected voice data on each model in the specific-person model library; after necessary normalization of the scores, the specific person to whom the highest-scoring model belongs is the recognition result of the system. The score is computed as follows:
Score = (1/N) × Σ_{i=1}^{N} [ln P(x_i | λ_spr) − ln P(x_i | λ_UBM)], where ln P(x_i | λ_spr) is the log-probability of the i-th speech frame on a specific-person model, ln P(x_i | λ_UBM) is the log-probability of the i-th frame on the UBM (universal background) model, and N is the number of frames of speech to be recognized.
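The B5 scoring rule (the average per-frame log-likelihood ratio against the background model) can be sketched as follows; this is a self-contained illustration with assumed helper names, where each model is a (weights, means, variances) triple of a diagonal-covariance GMM.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    # Per-frame log-likelihood ln P(x_i | λ) under a diagonal-covariance GMM,
    # computed with the log-sum-exp trick for numerical stability.
    logp = (-0.5 * (((X[:, None] - means) ** 2) / variances
                    + np.log(2 * np.pi * variances)).sum(-1) + np.log(weights))
    m = logp.max(1, keepdims=True)
    return (m + np.log(np.exp(logp - m).sum(1, keepdims=True))).ravel()

def llr_score(X, spk, ubm):
    # Score = (1/N) * sum_i [ln P(x_i|λ_spr) - ln P(x_i|λ_UBM)]
    return float(np.mean(gmm_loglik(X, *spk) - gmm_loglik(X, *ubm)))

def identify(X, models, ubm):
    # The highest-scoring specific-person model gives the recognized identity
    scores = {name: llr_score(X, m, ubm) for name, m in models.items()}
    return max(scores, key=scores.get), scores
```

Subtracting the UBM term normalizes away frame difficulty, so scores from different utterances become comparable before the maximum is taken.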
The model update module updates the specific-person models in the recognition system: whenever the system confirms a speaker's identity with high confidence, the voice data are stored in a database together with the identity, time, and other information, and after a period of time the system retrains the specific-person models and updates the database based on the recently acquired data.
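The update workflow described above can be sketched as a small buffering class; the threshold, retrain trigger, and all names here are invented for illustration and are not specified by the text.

```python
import time
from collections import defaultdict

class ModelUpdater:
    """Stores high-confidence utterances and periodically retrains
    the corresponding specific-person model (illustrative sketch)."""

    def __init__(self, threshold=0.5, retrain_after=50):
        self.threshold = threshold          # minimum score to trust an identification
        self.retrain_after = retrain_after  # utterances collected before retraining
        self.db = defaultdict(list)         # speaker -> [(timestamp, features), ...]

    def on_recognized(self, speaker, score, features):
        # Only confident identifications enter the database, as in the text
        if score >= self.threshold:
            self.db[speaker].append((time.time(), features))
            if len(self.db[speaker]) >= self.retrain_after:
                self.retrain(speaker)

    def retrain(self, speaker):
        # Placeholder: here the specific-person GMM would be re-adapted
        # from the accumulated features, then the buffer cleared.
        self.db[speaker].clear()
```

Gating on the score keeps mislabeled audio out of the training set, which is the main risk of this kind of unsupervised model refresh.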
The non-specific-person recognition subsystem performs non-specific-person speech recognition with a hidden Markov model (HMM). In actual use, a vocabulary library is first established and the parameters are specified: the number of states N, the number of observation symbols M, and three probability distributions: the state transition probability matrix A, the observation symbol probability matrix B, and the initial state probability vector π. Once a concrete model λ = (π, A, B) is established, then for given N, M, A, B, and π the model can generate an observation sequence O = (o_1, o_2, …, o_T), where T is the number of observation symbols, i.e. the length of the observation sequence. The sequence is generated as follows:
C1: select an initial state S_i according to the initial state distribution π;
C2: set t = 1;
C3: select an output symbol according to the observation symbol probability distribution of state S_i;
C4: move to a new state S_j according to the transition probabilities of state S_i;
C5: set t = t + 1; if t < T, jump to C3, otherwise finish;
C6: score each model against the observation sequence; the recognition result is the word in the vocabulary library corresponding to the highest-scoring model.
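The generation procedure C1–C5 and the model scoring of C6 can be sketched as follows; the forward algorithm stands in for the unspecified scoring method, and all names are assumptions.

```python
import numpy as np

def generate_observations(pi, A, B, T, seed=0):
    """Generate O = (o1, ..., oT) from λ = (π, A, B).
    pi: initial state distribution (N,), A: transition matrix (N, N),
    B: observation symbol probabilities (N, M)."""
    rng = np.random.default_rng(seed)
    N, M = B.shape
    s = rng.choice(N, p=pi)                    # C1: initial state from π
    obs = []
    for _ in range(T):                         # C2/C5: t = 1 .. T
        obs.append(int(rng.choice(M, p=B[s]))) # C3: emit a symbol from B[s]
        s = rng.choice(N, p=A[s])              # C4: move to a new state via A[s]
    return obs

def forward_loglik(pi, A, B, obs):
    # C6: score ln P(O | λ) with the scaled forward algorithm
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll
```

In recognition, `forward_loglik` would be evaluated once per vocabulary word's model, and the word whose model scores highest is returned, as step C6 describes.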
The voice acquisition and processing module further comprises a post-processing module, which effectively controls the influence of environmental noise on speech recognition through the following steps:
D1: exploiting the characteristic that human auditory perception is insensitive to slow changes in the excitation source, an empirical filter is applied: H(z) = 0.1 × (2 + z^(-1) − z^(-3) − 2z^(-4)) / z^4 × (1 − 0.98z^(-1)), reducing the proportion of energy outside the frequency range of the human voice;
D2: high-order differences: first- and second-order differences are computed between adjacent 13-dimensional feature vectors, forming improved 39-dimensional features;
D3: cepstral mean subtraction: the cepstral mean is removed and the original signal adjusted accordingly, suppressing noise;
D4: Gaussian regularization of the speech features.
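Steps D2–D4 can be sketched as follows (the D1 filter is omitted; `np.gradient` is used as a simple stand-in for the first- and second-order differences, and all function names are assumptions):

```python
import numpy as np

def add_deltas(ceps):
    # D2: append first- and second-order differences -> 39-dim features
    # ceps: (T, 13) frame-level cepstra
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([ceps, d1, d2])

def cmn(feats):
    # D3: cepstral mean subtraction to suppress stationary channel effects
    return feats - feats.mean(0, keepdims=True)

def gaussianize(feats):
    # D4: per-dimension mean/variance normalization as a simple form of
    # "Gaussian regularization" of the features
    return cmn(feats) / (feats.std(0, keepdims=True) + 1e-10)
```

The 13 static coefficients plus their deltas and delta-deltas give the 39-dimensional vector mentioned in D2.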
The system further comprises an autonomous function subsystem, which includes a help request module, a danger alarm module, and a notification module. In the help request module, when the robot encounters a situation such as being blocked, tripping, or becoming lost while moving, the upper computer makes a judgment and the robot then issues a corresponding request for help. The danger alarm module combines the robot's various sensors and raises an alarm when it detects conditions such as abnormal temperature or abnormal air composition. The notification module provides reminders for the various schedules, times, and events set by the user.
The system also comprises an entertainment function subsystem, which includes a playing module and a wireless communication module. When the system detects an instruction such as "play music" or "tell a joke", the upper computer controls the wireless communication module to connect to the Internet and perform the corresponding search; after searching and downloading, the text, audio, or video is played through the playing module.
The above description covers only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification that a person skilled in the art could readily conceive within the technical scope disclosed herein, according to the technical solutions and inventive concept of the present invention, shall fall within the scope of protection of the present invention.