Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The terms "first", "second" and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any indication of the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise.
A voiceprint is an important speech feature that can be used to distinguish and identify different users, and it has the characteristics of specificity and relative stability. After a person reaches adulthood, his or her voice remains relatively stable over a long period. Voiceprint recognition is a biometric technology that automatically identifies a speaker from voice parameters in the speech waveform that reflect the speaker's physiological and behavioral characteristics. The voiceprint recognition technology of this application can be applied to waking an intelligent robot.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a voiceprint recognition method according to the present application, where the voiceprint recognition method specifically includes the following steps:
S1, establishing a voiceprint library containing voiceprint features.
In this embodiment, presetting a voiceprint library containing voiceprint features improves the accuracy of the robot in the subsequent voice recognition process. Because the ambient voice data (noise) is removed, the voiceprint library has a higher signal-to-noise ratio than a recording that includes noise. Referring to fig. 2, step S1 further includes the following sub-steps:
S11, collecting voice data, wherein the voice data includes ambient voice data and human voice data.
In this embodiment, voice data including human voice data first needs to be collected, and the collection environment is generally required to be relatively quiet, which reduces the influence of noise in the collected recording. Optionally, in this embodiment, the collected voice data may be regarded as containing only human voice data and ambient voice data (noise). The human voice data may come from a single person, that is, the robot identifies the voiceprint of only that person. Of course, in other embodiments, the human voice data may come from multiple persons, that is, the robot may recognize the voiceprints of multiple persons.
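By way of illustration only, the following minimal sketch records one enrollment clip; it assumes the Python `sounddevice` and `soundfile` packages and a default microphone, none of which are prescribed by this embodiment.

```python
# Minimal collection sketch for step S11. Assumptions: the Python
# packages `sounddevice` and `soundfile` are installed and a default
# microphone is available; this embodiment does not prescribe any
# particular capture hardware or software.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000  # 16 kHz is a common rate for speech processing
DURATION_S = 5       # length of one enrollment recording, in seconds

def collect_voice_data(path: str) -> None:
    """Record a short clip, preferably in a quiet environment."""
    recording = sd.rec(int(DURATION_S * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1)
    sd.wait()  # block until the recording finishes
    sf.write(path, recording, SAMPLE_RATE)

collect_voice_data("enrollment_clip.wav")
```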
S12, filtering and denoising the voice data to obtain the human voice data.
In practice, when voice data is collected, the environment may be filled with various ambient noises, such as the noise of a running refrigerator, of passing vehicles, of an operating air conditioner, and some human-voice noise (i.e., speech from persons whose voiceprints the robot does not recognize); these ambient noises are often captured along with the voice data. To reduce the influence of ambient noise on subsequent voiceprint extraction and speech recognition, in this embodiment the collected voice data is filtered and denoised so as to filter out the ambient voice data, i.e., the noise portion. Optionally, the filtering and denoising may be implemented by hardware or by a software algorithm, which a technician may select according to the actual situation; this is not further limited herein.
Optionally, in a specific filtering and denoising process, if the robot identifies only one voiceprint, the other human voices in the collected voice data may be filtered out as well; correspondingly, if the robot can identify a plurality of voiceprints, a plurality of pieces of human voice data are obtained after the filtering and denoising process.
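As one illustrative possibility only, the sketch below stands in for the unspecified filtering and denoising of step S12 with a simple Butterworth band-pass filter restricted to the speech band; a real system might instead use dedicated hardware or a more sophisticated noise-suppression algorithm, as noted above.

```python
# Illustrative denoising sketch for step S12. Assumption: a 4th-order
# Butterworth band-pass over the classic telephony speech band stands
# in for the filtering/denoising method, which this embodiment
# deliberately leaves open.
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_speech(signal: np.ndarray, fs: int,
                    low_hz: float = 300.0,
                    high_hz: float = 3400.0) -> np.ndarray:
    """Attenuate energy outside the speech band to suppress ambient noise."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, signal)

# Usage: human_voice = bandpass_speech(raw_voice_data, fs=16000)
```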
Optionally, after the collected voice data is filtered and denoised to obtain human voice data with relatively little ambient noise, step S13 may be performed.
S13, extracting the voiceprint features from the human voice data to obtain a voiceprint library.
Optionally, after the human voice data with relatively little ambient noise is obtained, feature extraction needs to be performed on the processed human voice data. The task of feature extraction is to extract and select acoustic or linguistic features that, for a speaker's voiceprint, have strong separability, high stability, and similar characteristics. Optionally, in this embodiment, any of the following approaches may be used:
1. Spectral envelope parameters: the speech signal is passed through a filter bank, the filter outputs are sampled at a suitable rate, and the samples are used as voiceprint features.
2. Parameters based on the physiological structure of the vocal organs (such as the glottis, vocal tract, and nasal cavity) are used as voiceprint features; for example, the pitch contour, formant frequencies and bandwidths, and their trajectories contained in the speech signal are extracted as voiceprint features.
3. Linear prediction coefficients (LPC): various parameters derived from linear prediction, such as the linear prediction coefficients, autocorrelation coefficients, reflection coefficients, log area ratios, the linear prediction residual, and combinations thereof, are used as voiceprint features.
4. Parameters reflecting auditory characteristics, which simulate the human ear's perception of sound frequency, including Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), deep features (Deep Feature), power-normalized cepstral coefficients (PNCC), and the like.
5. Lexical features: speaker-dependent word n-grams and phoneme n-grams.
6. Prosodic features: pitch and energy contours described by n-grams.
7. Language, dialect and accent information, channel information, etc.
Of course, combining different features can improve the performance of a practical system; the effect is better when the correlations among the combined parameters are small, because the parameters then reflect different aspects of the speech signal. The method may use any one of the above approaches, or a combination thereof, to extract the voiceprint features from the human voice data, which is not specifically limited here. Optionally, after the voiceprint features of the human voice data are extracted, a voiceprint model library is constructed and stored in a processing center or control center of the robot. The voiceprint library may include the voiceprint feature of only one person or the voiceprint features of multiple persons; a minimal extraction sketch follows.
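As a minimal sketch of approach 4 above, the following extracts MFCCs with the `librosa` package and averages them over time into one fixed-length vector per speaker; the time-averaging is a deliberate simplification of the speaker models this embodiment leaves open, and the file names are hypothetical.

```python
# Feature-extraction sketch for step S13 (approach 4: MFCC).
# Assumptions: the Python package `librosa` is installed, and averaging
# per-frame MFCCs into a single vector is a simplified stand-in for a
# full speaker model.
import numpy as np
import librosa

def extract_voiceprint(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return one fixed-length voiceprint feature for a speech clip."""
    y, sr = librosa.load(path, sr=16000)                     # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                 # collapse over time

# A voiceprint library keyed by speaker, as built in step S13
# (file names are hypothetical):
voiceprint_library = {
    "alice": extract_voiceprint("alice_enroll.wav"),
    "bob":   extract_voiceprint("bob_enroll.wav"),
}
```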
Optionally, the establishment of the voiceprint library may be repeated multiple times; that is, when voice data of the same person or of multiple persons is collected, the collection environment may be varied, or voice data may be collected while the speaker is in different states (such as having a cold or undergoing emotional changes), so as to improve the accuracy of the robot's subsequent voiceprint recognition.
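Continuing the sketch above, repeated enrollment could be folded into the library by averaging the features from several recordings of the same speaker taken in different environments or states; this pooling rule is an assumption for illustration, not a method prescribed by the embodiment.

```python
# Repeated-enrollment sketch: pool features from several clips of the
# same speaker (different rooms, having a cold, different moods, ...).
# Averaging is an illustrative assumption, not a prescribed method;
# `extract_voiceprint` and `voiceprint_library` come from the sketch above.
def enroll_speaker(paths: list) -> np.ndarray:
    feats = [extract_voiceprint(p) for p in paths]
    return np.mean(feats, axis=0)

voiceprint_library["alice"] = enroll_speaker(
    ["alice_quiet.wav", "alice_cold.wav", "alice_outdoors.wav"])
```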
S2, acquiring a first voiceprint feature in the current voice data.
In the robot's subsequent speech recognition process, the current voice data must first be acquired so that voiceprint matching can be performed on it. Referring to fig. 3, step S2 further includes the following sub-steps:
S21, processing the current voice data to obtain first human voice data.
Similarly, after the current voice data is obtained, it needs to be filtered and denoised to obtain the first human voice data. The term "first" is used for descriptive purposes only; that is, the first human voice data may include the voice data of a plurality of speakers, or of only one speaker.
Optionally, in this embodiment, for details of the filtering and denoising of the current voice data, reference may be made to the foregoing description, which is not repeated here.
S22, extracting the voiceprint features from the first human voice data to obtain a first voiceprint feature.
Optionally, after the first human voice data with relatively little ambient noise is obtained, feature extraction is further performed on the processed first human voice data to obtain the first voiceprint feature. Similarly, "first" is used for descriptive purposes only; that is, the first voiceprint feature may include the voiceprint features of a plurality of speakers, or of only one speaker. For the detailed voiceprint extraction process, reference may likewise be made to the foregoing description, which is not repeated here.
S3, judging whether the first voiceprint feature matches a voiceprint feature in the voiceprint library.
After the first voiceprint feature of the current voice data is acquired, it needs to be matched against the voiceprint features in the voiceprint library. Many matching methods exist, such as probability statistics, dynamic time warping, vector quantization, hidden Markov models, and artificial neural networks. The present application may adopt any of these methods to match the first voiceprint feature against the voiceprint library and thereby determine whether the first voiceprint feature exists in the library; if so, step S4 is executed, and if not, the flow returns to step S2. Of course, in other embodiments, the first voiceprint feature may be matched against the voiceprint library in other manners, which is not further limited herein. A minimal matching sketch follows.
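The sketch below uses cosine similarity with a fixed threshold as a deliberately minimal stand-in for the probability-statistics, dynamic-time-warping, vector-quantization, hidden-Markov-model, or neural-network matchers listed above; the threshold value is hypothetical and would be tuned empirically.

```python
# Matching sketch for step S3. Assumption: cosine similarity against
# the averaged-MFCC voiceprints of the earlier sketches, with a
# hypothetical threshold, stands in for the matchers named above.
import numpy as np

MATCH_THRESHOLD = 0.85  # hypothetical; tuned empirically in practice

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_voiceprint(first_feature: np.ndarray, library: dict):
    """Return the best-matching speaker name, or None (return to S2)."""
    best_name, best_score = None, MATCH_THRESHOLD
    for name, enrolled in library.items():
        score = cosine_similarity(first_feature, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```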
S4, if matched, extracting the first human voice data from the current voice data for voice recognition.
In step S4, when it is determined that a voiceprint feature matching the first voiceprint feature of the first human voice data in the current voice data exists in the voiceprint library, the first human voice data may be extracted from the current voice data for voice recognition, so as to wake the robot.
Optionally, if the current voice data and the voiceprint library each involve a plurality of persons, that is, the robot can recognize the voice data of a plurality of persons, the robot may be awakened by setting wake-up voice data: the acquired voice data includes a keyword or wake phrase for awakening the robot, and the robot is awakened after recognizing that keyword or wake phrase.
In a specific application scenario of the present application, when the robot recognizes that the current voice data contains a plurality of pieces of human voice data matching the voiceprint library, it may further process the first human voice data and extract the piece containing the wake-up voice data in order to wake the robot. For example, suppose the collected first human voice data comes from three different speakers saying "welcome", "turn on the air conditioner", and "start program" respectively, and the first voiceprint features extracted from all three exist in the voiceprint library; if the wake-up voice data is "start program", the human voice data containing "start program" is automatically extracted, thereby waking the robot.
In this embodiment, when the robot recognizes that the current voice data contains a plurality of pieces of human voice data matching the voiceprint library, it may give priority to recognizing the wake-up voice data (the keyword or wake phrase) in order to wake the robot, as in the sketch below.
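A sketch of this selection, assuming a hypothetical `transcribe` function that stands in for whatever speech recognizer the robot uses, and "start program" as the wake phrase from the example above:

```python
# Wake-up selection sketch: among clips whose voiceprints matched the
# library, keep the one containing the wake phrase. The `transcribe`
# function is hypothetical and stands in for the robot's recognizer.
WAKE_PHRASE = "start program"   # the wake-up voice data from the example

def transcribe(clip) -> str:
    """Placeholder for the robot's speech recognizer."""
    raise NotImplementedError

def select_wake_clip(matched_clips):
    for clip in matched_clips:
        if WAKE_PHRASE in transcribe(clip).lower():
            return clip   # this clip wakes the robot
    return None           # no wake phrase heard; the robot stays asleep
```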
If it is judged that the first voiceprint feature in the current voice data does not match any voiceprint feature in the voiceprint library, the flow returns to step S2 to continue acquiring a first voiceprint feature from the current voice data.
In the above embodiment, the accuracy and the signal-to-noise ratio of the subsequent voice recognition can be improved by comparing and matching the first voiceprint feature in the obtained current voice data with the voiceprint feature in the preset voiceprint library.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of a voiceprint recognition device according to the present application. As shown in fig. 4, the device includes a processor 11 and a memory 12, and the processor 11 is connected to the memory 12.
The processor 11 is configured to acquire a first voiceprint feature in current voice data; judge whether the first voiceprint feature matches a voiceprint feature in the voiceprint library; and, if so, extract the first human voice data from the current voice data for voice recognition.
The processor 11 is further configured to establish a voiceprint library containing voiceprint features.
The processor 11 may also be referred to as a CPU (Central Processing Unit). The processor 11 may be an integrated circuit chip having signal processing capabilities. The processor 11 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The processor in the device may execute the corresponding steps in the method embodiments described above; details are not repeated here, and reference may be made to the description of the corresponding steps.
In the above embodiment, the accuracy and the signal-to-noise ratio of the subsequent voice recognition can be improved by comparing and matching the first voiceprint feature in the obtained current voice data with the voiceprint feature in the preset voiceprint library.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a storage device according to the present application. The storage device of the present application stores a program file 21 capable of implementing all of the methods described above. The program file 21 may be stored in the storage device in the form of a software product and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage device includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, or terminal devices such as a computer, a server, a mobile phone, or a tablet.
In summary, it will be readily understood by those skilled in the art that the present application provides a voiceprint recognition method, a voiceprint recognition device, and a storage device; by comparing and matching the first voiceprint feature in the acquired current voice data against the voiceprint features in the preset voiceprint library, the accuracy and signal-to-noise ratio of subsequent voice recognition can be improved.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.