Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a data processing scheme. According to the audio features respectively corresponding to the voice data of K voice frames in a target time window, a first command word hit by the voice data of the target time window in a command word set can be determined, which is equivalent to preliminarily determining the command word contained in the continuously input voice data. A characteristic time window is then determined based on feature information of the first command word, such as the command word length, so that a second command word hit by the voice data of the characteristic time window in the command word set is determined according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the characteristic time window, which is equivalent to determining a new time window and performing secondary verification on whether the continuously input voice data contains the command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. Therefore, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to perform secondary verification on whether the voice data contains the command word, so that the accuracy of detecting the command word in the voice data can be improved.
In a possible implementation manner, the embodiment of the present application can be applied to a data processing system. Please refer to fig. 1, which is a schematic structural diagram of a data processing system provided in the embodiment of the present application. As shown in fig. 1, the data processing system may include a voice initiating object and a data processing device. The voice initiating object may be used to send voice data to the data processing device, and may be a user or a device that needs to request the data processing device to respond, which is not limited herein. The data processing device may execute the data processing scheme and perform corresponding operations based on the received voice data; for example, the data processing device may be an in-vehicle system, a smart speaker, a smart appliance, or the like. That is to say, after the voice initiating object outputs voice data, the data processing device may receive the voice data, detect a command word in the voice data based on the data processing scheme, and then execute the operation corresponding to the detected command word. It is to be understood that, before the data processing device detects the voice data, a command word set may be preset, where the command word set includes one or more command words, and each command word may be associated with a corresponding operation. For example, the command word "turn on the air conditioner" is associated with an operation of turning on the air conditioner, and when the data processing device detects voice data including the command word "turn on the air conditioner", the data processing device may perform the operation of turning on the air conditioner. According to the data processing scheme, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to perform secondary verification on whether the voice data contains the command word, so that the accuracy of the data processing device in the data processing system in detecting the command word in the voice data can be improved, and a user can conveniently and accurately instruct the data processing device by voice to perform the corresponding operation.
It should be noted that, before collecting the relevant data of the user and in the process of collecting the relevant data of the user, the present application may display a prompt interface or a popup window, or output voice prompt information, where the prompt interface, the popup window, or the voice prompt information is used to prompt the user that the relevant data is currently being collected. In this way, the present application only starts to execute the relevant step of obtaining the relevant data of the user after obtaining a confirmation operation issued by the user on the prompt interface or the popup window; otherwise (that is, when the confirmation operation issued by the user on the prompt interface or the popup window is not obtained), the relevant step of obtaining the relevant data of the user is ended, that is, the relevant data of the user is not obtained. In other words, all user data collected in the present application is collected with the approval and authorization of the user, and the collection, use, and processing of the relevant user data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
In one possible implementation, the embodiments of the present application may be applied in the field of Artificial Intelligence (AI). Artificial intelligence is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence involves studying the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive subject that relates to a wide range of fields, including both hardware-level technologies and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning, and the like.
In a possible implementation manner, the embodiment of the present application can also be applied in the field of voice technology, such as detecting the command word hit by voice data as described above. The key technologies of speech technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes in the future.
The technical scheme of the application can be applied to an electronic device, such as the data processing device described above. The electronic device may be a terminal, a server, or another device for performing data processing, and the present application is not limited thereto. Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, an aircraft, an intelligent speaker, and the like.
It should be understood that the foregoing scenarios are only examples, and do not constitute a limitation on application scenarios of the technical solutions provided in the embodiments of the present application, and the technical solutions of the present application may also be applied to other scenarios. For example, as can be known by those skilled in the art, with the evolution of system architecture and the emergence of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
Based on the above description, the embodiments of the present application provide a data processing method. Referring to fig. 2, fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S201, determining a target time window corresponding to the current voice frame, and acquiring audio features corresponding to voice data of K voice frames in the target time window respectively.
The current speech frame may be any one of the acquired speech frames. It is to be understood that the obtained speech may be real-time speech, and for real-time continuously input speech data, the current speech frame may be a latest speech frame in the continuously input speech data. The obtained speech may also be non-real-time speech, for example, for a whole pre-generated speech data, each speech frame may also be determined as the current speech frame in sequence according to the sequence of each speech frame in the speech data.
A speech frame may include several sampling points; that is, the speech data of several consecutive sampling points form the speech data of one speech frame. It will be appreciated that the time difference between adjacent sampling points is the same. Two adjacent speech frames may have partially repeated sampling points, or completely different sampling points, which is not limited herein. For example, in a section of 10s of input speech data, one sampling point is determined every 10ms, and every 20 consecutive sampling points are determined as one speech frame; for example, in this section of 10s speech data, the 1st to 20th sampling points are determined as one speech frame, the 21st to 40th sampling points are determined as one speech frame, and so on, thereby obtaining a plurality of speech frames. For another example, in order to avoid excessive variation of the audio data between two adjacent speech frames, overlapped sampling points may be provided between two adjacent speech frames; for example, in the 10s speech data, the 1st to 20th sampling points are determined as one speech frame, the 15th to 35th sampling points are determined as one speech frame, the 30th to 50th sampling points are determined as one speech frame, and so on, thereby obtaining a plurality of speech frames.
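For illustration only, the following is a minimal sketch of grouping sampling points into speech frames with or without overlapping sampling points; the frame size, hop size, and function name are assumptions and not part of the embodiments.

```python
from typing import List

def split_into_frames(samples: List[float], frame_size: int = 20, hop: int = 20) -> List[List[float]]:
    """Group consecutive sampling points into speech frames.

    hop == frame_size -> adjacent frames share no sampling points;
    hop <  frame_size -> adjacent frames overlap, which smooths the variation
                         of audio data between neighbouring frames.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frames.append(samples[start:start + frame_size])
    return frames

# Non-overlapping frames: sampling points 1-20, 21-40, ...
frames_a = split_into_frames(list(range(1, 1001)), frame_size=20, hop=20)
# Overlapping frames (hop of 15): sampling points 1-20, 16-35, 31-50, ...
frames_b = split_into_frames(list(range(1, 1001)), frame_size=20, hop=15)
```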
The target time window corresponding to the current speech frame may be a time window in which the current speech frame is used as a reference speech frame, that is, the target time window includes the current speech frame. For example, K speech frames may be included in the target time window, where K is a positive integer; that is, K may be the number of all speech frames in the target time window. Optionally, the K speech frames may also be part of the speech frames selected from all speech frames in the target time window, that is, K may be less than or equal to the number of all speech frames in the target time window. For example, after the target time window is determined, the energy of each speech frame in the target time window is calculated, and the speech frames with energy lower than a certain threshold are then removed to obtain the K speech frames, thereby filtering out some low-volume speech frames and reducing the amount of calculation in the subsequent processing. The reference speech frame of a target time window indicates that the time window is divided based on that reference speech frame, which may be, for example, the first speech frame, the last speech frame, or the speech frame at the center of the time window, and is not limited herein. The first speech frame and the last speech frame are described according to the time sequence: the first speech frame is the speech frame with the earliest input time in the time window, and the last speech frame is the speech frame with the latest input time in the time window. The target time window corresponding to the current speech frame may therefore be a time window in which the current speech frame is used as the first speech frame, the last speech frame, or the speech frame at the center position, which is not limited herein. K may be preset, or may be determined based on the length of the obtained speech, or may be determined based on the length of the command words in the command word set, such as the maximum length or the average length, which is not limited here.
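The optional energy-based selection of the K speech frames described above could be sketched as follows; the energy definition (mean squared amplitude) and the threshold are illustrative assumptions.

```python
import numpy as np

def filter_low_energy_frames(window_frames: np.ndarray, energy_threshold: float) -> np.ndarray:
    """window_frames: array of shape (num_frames, samples_per_frame).
    Returns the K speech frames whose energy reaches the threshold."""
    energies = np.mean(window_frames ** 2, axis=1)   # per-frame energy estimate
    return window_frames[energies >= energy_threshold]
```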
Optionally, the target time window corresponding to the current speech frame may not include the current speech frame. For example, when the reference speech frame is the first speech frame of a time window, the next speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the first speech frame of the target time window is the next speech frame of the current speech frame; for another example, when the reference speech frame is the last speech frame of a time window, the previous speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the last speech frame of the target time window is the previous speech frame of the current speech frame, and so on, which is not described herein again.
In the present application, the determination of the subsequent target time window and the characteristic time window is mainly described by taking the case where the current speech frame is used as the last speech frame (i.e. the reference speech frame) of the corresponding target time window as an example. For example, the continuously input speech data includes the 1st, 2nd, 3rd, ..., nth speech frames. If the current speech frame is the 200th speech frame, the reference speech frame is the last speech frame of the time window, and the size of the target time window is 100 speech frames (that is, the target time window corresponding to the current speech frame includes 100 speech frames, i.e., K is 100), then the time window of size 100 with the 200th speech frame as its last speech frame may be determined as the target time window corresponding to the 200th speech frame; that is, the 100 speech frames before the 200th speech frame (the 100th to 200th speech frames) are determined as the speech frames in the target time window corresponding to the 200th speech frame.
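As a hedged sketch of this windowing convention (the current speech frame as the last speech frame of the target time window), the following helper returns the frame indices of the window; the 1-based indexing and the function name are assumptions.

```python
def target_window_indices(current_frame: int, window_size: int) -> range:
    """Indices (1-based) of the speech frames in the target time window,
    i.e. the window_size most recent frames ending at the current frame."""
    first = max(1, current_frame - window_size + 1)
    return range(first, current_frame + 1)

# With the 200th speech frame as the current frame and a window size of 100,
# the window covers the 100 most recent frames ending at frame 200.
window = target_window_indices(200, 100)
print(len(window), window[-1])   # 100 200
```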
As another example, the target time window is described with reference to an illustration. Please refer to fig. 3, which is a schematic diagram of the effect of the target time window provided in the embodiment of the present application. As shown in (1) in fig. 3, each speech frame in the received voice data may be represented as one square block. If the gray square block shown as 301 in fig. 3 is determined as the current speech frame and the size of the preset target time window is 8 speech frames, the 8 speech frames up to and including 301 may be determined as the target time window corresponding to 301 (as shown by 302 in fig. 3). As voice data is continuously input, if no hit command word is detected based on the time window shown by 302, a new current speech frame may be determined based on a sliding window. For example, when the sliding step is 1, the speech frame next to the speech frame shown by 301 may be determined as the new current speech frame (as shown by 303 in (2) of fig. 3), so that the 8 speech frames up to and including 303 may be determined as the target time window corresponding to 303 (as shown by 304 in fig. 3), and so on, thereby realizing detection of command words in the continuously input voice data.
The audio features respectively corresponding to the speech data of the K speech frames in the target time window are obtained, where a corresponding audio feature can be determined for the speech data of each speech frame. In one possible implementation, the audio feature may be an FBank feature (an audio feature of speech data). Specifically, the voice data of one speech frame is a time-domain signal; to obtain the FBank feature corresponding to that speech frame, the time-domain signal of the voice data of the speech frame may be converted into a frequency-domain signal through Fourier transform, and the corresponding FBank feature is then determined based on the calculated frequency-domain signal, which is not described herein again. It will be appreciated that the audio feature may also be a feature determined in other ways, such as an MFCC feature (an audio feature of speech data), which is not limited herein.
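A simplified sketch of FBank extraction for one speech frame is given below: the time-domain signal is Fourier-transformed, the power spectrum is taken, and a mel filterbank is applied. The use of librosa for the mel filters, the sample rate, and the number of filters are assumptions; a real implementation would normally also apply pre-emphasis and windowing.

```python
import numpy as np
import librosa

def fbank_feature(frame: np.ndarray, sample_rate: int = 16000, n_mels: int = 40) -> np.ndarray:
    """Return log mel filterbank (FBank) energies for one speech frame."""
    n_fft = len(frame)
    power_spectrum = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2       # frequency-domain signal
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ power_spectrum + 1e-10)                  # FBank feature vector
```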
S202, determining a first command word hit by the voice data of the target time window in the command word set according to the audio characteristics corresponding to the voice data of the K voice frames respectively.
The first command word refers to a command word hit by the voice data in the target time window, and is also called a command word hit by the target time window, and the first command word belongs to the command word set. It can be understood that the premise of determining the first command word hit by the voice data of the target time window in the command word set is that the voice data of the target time window has the hit command word in the command word set, and if the voice data of the target time window does not have the hit command word in the command word set, the first command word hit by the voice data of the target time window in the command word set cannot be determined. The voice data of the target time window is short for the voice data of the K voice frames in the target time window, for example, a command word hit by the voice data of the target time window in the command word set may refer to a command word hit by the voice data of the K voice frames of the target time window in the command word set; the command word hit by the voice data of the target time window in the command word set can also be briefly described as the command word hit by the target time window in the command word set.
As described above, the command word set includes at least one command word, and any command word in the command word set may have a plurality of syllables. A syllable is the most natural phonetic unit perceived by hearing, and is formed by combining one or several phonemes according to certain rules. In Mandarin, except for individual cases, one Chinese character corresponds to one syllable; for example, the command word "turn on the air conditioner" includes 4 syllables.
In a possible implementation manner, when a first command word hit by the voice data of the target time window in the command word set is determined, a first confidence coefficient corresponding to the voice data of the target time window and each command word may be determined according to audio features corresponding to the voice data of the K voice frames, and then the hit first command word is determined based on the first confidence coefficient corresponding to each command word. Here, each command word herein refers to each command word in the above-mentioned command word set. The first confidence level may characterize a likelihood that the speech data of the target time window is each command word, and each command word may have a corresponding first confidence level.
Specifically, determining the hit first command word based on the first confidence corresponding to each command word may be: if command words with the first confidence degree larger than or equal to the first threshold exist in the command word set, determining the command words with the first confidence degree larger than or equal to the first threshold as the first command words hit by the voice data of the target time window in the command word set. The first threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable first threshold may be set to determine the first command word. Optionally, for better performance, different first thresholds may be set for command words of different lengths, thereby balancing the detection rate and the false detection rate for command words of different command lengths. It is to be understood that, if there are a plurality of first confidence levels that are greater than or equal to the first threshold, the command word corresponding to each first confidence level that is greater than or equal to the first threshold may be determined as the first command word, that is, the number of the first command words may be multiple.
If no command word with a first confidence greater than or equal to the first threshold exists in the command word set, this indicates that the voice data of the target time window does not hit any command word.
For example, if the command word set includes command word 1, command word 2, command word 3, and command word 4, a first confidence corresponding to each command word is obtained according to the audio features of the K speech frames in the target time window, where the first confidence corresponding to command word 1 is 0.3, the first confidence corresponding to command word 2 is 0.75, the first confidence corresponding to command word 3 is 0.45, and the first confidence corresponding to command word 4 is 0.66. If the first threshold is 0.6, there are command words in the command word set whose first confidence is greater than or equal to the first threshold, namely command word 2 and command word 4, so command word 2 and command word 4 are the first command words hit by the voice data of the target time window in the command word set.
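The threshold comparison in this example could be sketched as follows; the dictionary of confidences reuses the example values above, and the function name is an assumption (a per-length threshold could be substituted for the single first threshold).

```python
def select_first_command_words(first_confidences: dict, first_threshold: float = 0.6) -> list:
    """Return every command word whose first confidence reaches the first threshold."""
    return [word for word, conf in first_confidences.items() if conf >= first_threshold]

first_confidences = {"command word 1": 0.3, "command word 2": 0.75,
                     "command word 3": 0.45, "command word 4": 0.66}
print(select_first_command_words(first_confidences))   # ['command word 2', 'command word 4']
```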
In a possible implementation manner, if the speech data of the target time window does not hit any command word, the subsequent operations may not be performed; instead, the target time window corresponding to a new current speech frame is determined, and whether the audio data of the new target time window hits a command word is then detected, and so on, so as to detect whether the audio data of the target time window corresponding to each speech frame hits a command word. In addition, when it is detected that the target time window does not hit any command word, the subsequent secondary verification step is simply not executed, which improves data processing efficiency.
S203, determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics corresponding to the voice data of the plurality of voice frames in the characteristic time window.
The characteristic time window may be a time window for performing secondary verification on the command word, and the characteristic time window may include a plurality of speech frames. The characteristic time window and the target time window may include overlapping speech frames; the speech frames they include may or may not be completely the same, which is not limited herein. The audio features respectively corresponding to the speech data of the multiple speech frames in the characteristic time window may be determined for the speech data of each speech frame, and the audio features may be FBank features.
It can be understood that the step S203 is executed on the premise that the command word hit in the voice data of the target time window is detected, which is equivalent to determining a new time window (i.e. the characteristic time window) after detecting that the voice data of the target time window hits the first command word, so as to implement secondary verification through the characteristic time window, and improve the accuracy of detecting the command word.
In a possible embodiment, the range of the characteristic time window associated with the current speech frame needs to cover, as much as possible, the speech frames of the first command word in the speech data, so the first number of speech frames preceding the current speech frame can be determined based on the command word length (length for short) of the first command word, and the characteristic time window is determined according to the first number of speech frames preceding the current speech frame. Here, the command word length refers to the number of syllables in the command word. For a common Chinese command word, one character corresponds to one syllable; for example, the command word "turn on air conditioner" includes four characters, corresponding to 4 syllables, that is, the command word length is 4. It is understood that, if there are multiple first command words, the number of speech frames included in the characteristic time window may be determined based on the command word length of the first command word with the largest length.
Specifically, the characteristic time window may be determined according to the command word length of the first command word and the target preset value. The method specifically comprises the following steps:
Determining a first number according to the command word length of the first command word and a target preset value. The target preset value may be a preset value: because of factors such as speaking speed, the pronunciation of one character (one syllable) may span multiple speech frames, and the number of speech frames occupied by a command word is generally greater than or equal to the number of syllables of the command word. The first number can therefore be determined with the help of the target preset value, so that the size of the characteristic time window covers, as much as possible, the speech frames involved by the first command word. In a possible implementation manner, the first number may be obtained by multiplying the command word length of the first command word by the target preset value, so that the number of speech frames included in the obtained characteristic time window is the first number. For example, if the length of the first command word is 4 and the target preset value is 25, the first number may be 4 × 25 = 100, that is, the characteristic time window includes 100 speech frames.
Determining the characteristic time window associated with the current speech frame according to the first number of speech frames before the current speech frame. Here, the first number of speech frames before the current speech frame includes the current speech frame; that is, the current speech frame is used as the last frame of the characteristic time window. For example, the continuously input speech data includes the 1st, 2nd, 3rd, ..., nth speech frames. If the current speech frame is the 120th speech frame and the first number is 100, the time window of size 100 with the 120th speech frame as its last speech frame may be determined as the characteristic time window associated with the 120th speech frame; that is, the 100 speech frames before the 120th speech frame (the 20th to 120th speech frames) are determined as the speech frames in the characteristic time window associated with the 120th speech frame.
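A minimal sketch of this computation is given below: the first number is the command word length multiplied by the target preset value, and the characteristic time window is the corresponding number of most recent speech frames ending at the current frame. The value of 25 frames per syllable follows the example above; names are assumptions.

```python
def characteristic_window(current_frame: int, command_word_length: int,
                          target_preset: int = 25) -> range:
    """Frame indices (1-based) of the characteristic time window ending at the current frame."""
    first_number = command_word_length * target_preset      # e.g. 4 * 25 = 100
    first = max(1, current_frame - first_number + 1)
    return range(first, current_frame + 1)

# Current frame 120 and a first command word of length 4 give a window of the
# 100 most recent speech frames ending at the 120th frame.
window = characteristic_window(120, 4)
print(len(window), window[-1])   # 100 120
```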
In a possible implementation manner, the command word set includes command words with different command word lengths, and there are situations such as command words with the same prefix or easily confused similar words. For example, "turn on heating" and "turn on heating mode" are two command words with the same prefix but different indicated operations. In the actual processing, since the voice data is input frame by frame, it is likely that when the current speech frame is the speech immediately after "turn on heating" has been input, the command word "turn on heating" is hit based on the target time window corresponding to the current speech frame, while the command word actually intended to be triggered is "turn on heating mode". Therefore, speech frames after "turn on heating" can also be included in the characteristic time window; that is, when the characteristic time window associated with the current speech frame is determined, speech frames after the current speech frame can also be determined as speech frames in the characteristic time window, so that more accurate command word detection is performed. In other words, a delay-waiting strategy is introduced when the characteristic time window is determined, which avoids the early false recognition that may occur when the command word is determined only through the target time window.
Specifically, determining the characteristic time window associated with the current speech frame based on the command word length of the first command word may include the following steps. First, determine a first number according to the command word length of the first command word and a target preset value; this step may refer to the above description, and is not described herein again. Second, determine the characteristic time window associated with the current speech frame according to the first number of speech frames before the current speech frame and a second number of speech frames after the current speech frame. The first number of speech frames before the current speech frame includes the current speech frame, and the second number of speech frames after the current speech frame also includes the current speech frame, but the plurality of speech frames in the characteristic time window only include the current speech frame once. The second number may be a preset numerical value, for example an empirical value, or the second number may be determined according to the command word lengths of the longest command word in the command word set and the first command word. Specifically, the length difference may be obtained by subtracting the command word length of the first command word from the command word length of the longest command word, and the length difference may then be multiplied by the target preset value to obtain the second number. For example, if the command word length of the longest command word is 8 and the command word length of the first command word is 5, the length difference is 8 - 5 = 3; if the target preset value is 25, then 3 × 25 = 75, and the second number may be 75. An example of how to determine the characteristic time window is given here: the continuously input speech data includes n speech frames, and if the current speech frame is the 120th speech frame, the first number is 100, and the second number is 75, then the 100 speech frames before the 120th speech frame (the 20th to 120th speech frames) and the 75 speech frames after the 120th speech frame (the 120th to 195th speech frames) may be determined as the speech frames in the characteristic time window associated with the 120th speech frame, that is, the speech frames in the characteristic time window include the 20th to 195th speech frames.
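The delay-waiting variant could be sketched as follows, with the second number derived from the length gap between the longest command word and the first command word; all names are assumptions.

```python
def characteristic_window_with_delay(current_frame: int, first_number: int,
                                     second_number: int) -> range:
    """Characteristic time window covering first_number frames up to and including
    the current frame plus second_number frames after it (1-based indices)."""
    first = max(1, current_frame - first_number + 1)
    return range(first, current_frame + second_number + 1)

# Example values from above: first number 100, second number (8 - 5) * 25 = 75,
# current frame 120; the window then runs roughly from frame 21 to frame 195.
second_number = (8 - 5) * 25
window = characteristic_window_with_delay(120, 100, second_number)
print(window[0], window[-1])   # 21 195
```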
In a possible implementation manner, when the electronic device receives continuously input voice data, the audio feature of each speech frame can be extracted and cached in a storage area; after the characteristic time window is determined, the audio features corresponding to the speech frames in the characteristic time window can be read directly from the storage area, so that the audio features of the speech frames do not need to be repeatedly calculated, which can improve the efficiency of data processing. It can be understood that the number of audio features cached in the storage area can be determined according to the number of speech frames in the maximum characteristic time window, so that it can be ensured that, for the characteristic time window determined based on any first command word, the audio features of the speech frames in that characteristic time window can be quickly acquired from the storage area. The maximum characteristic time window may be the characteristic time window determined based on the command word length of the command word with the largest length among the command words. It will be appreciated that, in order to avoid caching too much data, as voice data is input, the audio feature of the speech frame with the earliest input time can be deleted each time a new speech frame is input, thereby avoiding waste of storage space.
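A hedged sketch of such a feature cache is shown below, sized by the number of speech frames in the maximum characteristic time window; the use of collections.deque (which automatically discards the earliest-input feature) is an assumption about the data structure, not part of the embodiments.

```python
from collections import deque

class FeatureCache:
    """Caches per-frame audio features, bounded by the maximum characteristic time window."""

    def __init__(self, max_window_frames: int):
        self._features = deque(maxlen=max_window_frames)

    def push(self, frame_feature) -> None:
        """Cache the feature of the newest speech frame; once full, the feature
        of the earliest-input speech frame is dropped automatically."""
        self._features.append(frame_feature)

    def last_n(self, n: int) -> list:
        """Audio features of the last n speech frames, e.g. for a characteristic time window."""
        return list(self._features)[-n:]
```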
And S204, determining second command words hit by the voice data of the characteristic time window in the command word set based on the audio characteristics respectively corresponding to the voice data of the plurality of voice frames in the characteristic time window.
The second command word refers to a command word hit by the voice data in the characteristic time window, and the second command word belongs to the command word set. Optionally, after the second command word is determined, the operation indicated by the second command word may be performed. The voice data of the characteristic time window is short for the voice data of the voice frames in the characteristic time window, for example, a command word hit by the voice data of the characteristic time window in the command word set may refer to a command word hit by the voice data of a plurality of voice frames of the characteristic time window in the command word set; the second command word hit by the speech data of the characteristic time window in the command word set can also be briefly described as the second command word hit by the characteristic time window in the command word set.
It can be understood that the second command word hit by the voice data of the characteristic time window in the command word set is determined on the premise that the hit command word exists in the command word set in the voice data of the characteristic time window, and if the hit command word does not exist in the command word set in the voice data of the characteristic time window, the second command word hit by the voice data of the characteristic time window in the command word set cannot be determined. Therefore, whether the voice data of the characteristic time window hits the command word in the command word set or not can be detected, which is equivalent to performing secondary verification on whether the command word hits in the continuously input voice data, so that the detection result of the voice data of the characteristic time window is used as a final detection result, and if a second command word hitting in the command word set is detected, the operation indicated by the second command word is executed. For example, if the second command word "turn on heating" hit by the voice data of the characteristic time window is detected, an operation of turning on heating may be performed.
In one possible embodiment, if the speech data of the characteristic time window does not hit a second command word in the command word set, no operation may be performed. The target time window of a new current speech frame may then be determined, and the above steps are repeated until it is determined, based on the audio features of the speech frames in the characteristic time window associated with the new current speech frame, whether the speech data of that characteristic time window hits a command word in the command word set, and so on, thereby realizing detection for the time window corresponding to each speech frame.
In a possible implementation manner, when the second command word hit in the characteristic time window is detected, the second command word may be further used for other purposes, such as training other models by the extracted command word, storing the extracted command word, and the like, which is not limited herein.
In one possible implementation, a command word may further include information such as time information and place information, so that the corresponding operation may be performed according to the detected second command word at the time indicated by its time information or at the place indicated by its place information. For example, when it is detected that the hit command word is "turn on the air conditioner at 10 o'clock", where 10 o'clock is the time information of the command word, the operation of turning on the air conditioner may be performed at 10 o'clock.
An example of how command word detection for voice data may be implemented is set forth here. First, voice data can be received, a target time window corresponding to the current speech frame in the received voice data is determined, and whether the target time window hits a command word in the command word set is then determined; specifically, this determination can be made through the audio features of the voice data of each speech frame in the target time window. If the target time window does not hit a command word in the command word set, no operation is performed, and the target time window of a new current speech frame is determined. If the target time window hits a command word in the command word set, secondary verification is performed: specifically, the characteristic time window associated with the current speech frame is determined, and whether the characteristic time window hits a command word in the command word set is then determined. If the characteristic time window does not hit a command word in the command word set, no operation is performed, and the target time window of a new current speech frame is determined; if the characteristic time window hits a command word in the command word set, the operation indicated by the hit command word is performed. Secondary verification is thus achieved by determining the characteristic time window, which improves the accuracy of detection of command words in the voice data, as sketched below.
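For illustration, the overall two-stage flow can be sketched as follows. The window and detection helpers are passed in as callables because their internals are described elsewhere in this application; every name here is an assumption, not a disclosed interface.

```python
from typing import Callable, Optional, Sequence

def process_current_frame(
    current_frame: int,
    command_word_set: Sequence[str],
    target_window_for: Callable[[int], range],
    characteristic_window_for: Callable[[int, str], range],
    detect_in_window: Callable[[range, Sequence[str]], Optional[str]],
    execute: Callable[[str], None],
) -> Optional[str]:
    target_window = target_window_for(current_frame)                       # S201
    first_word = detect_in_window(target_window, command_word_set)         # S202
    if first_word is None:
        return None            # no hit: move on to the next current speech frame
    feature_window = characteristic_window_for(current_frame, first_word)  # S203
    second_word = detect_in_window(feature_window, command_word_set)       # S204
    if second_word is None:
        return None            # secondary verification failed: perform no operation
    execute(second_word)       # perform the operation indicated by the second command word
    return second_word
```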
In a possible scenario, the method and the device can be applied to detecting whether the received voice data hits the command word or not under the condition that the electronic device is already awakened. That is, after the electronic device has been awakened by the voice initiating object through the awakening word, the hit command word is detected based on the received voice data.
In a possible scenario, the present application may also be applied to a scenario in which the electronic device does not need to be woken up, that is, the electronic device directly determines whether a command word is hit according to the received voice data without being woken up by a wake-up word, which is equivalent to waking up the electronic device and executing the operation indicated by the command word when it is detected that the received voice data hits a command word in the command word set. The command words in the command word set are preset, the electronic device is triggered to execute the corresponding operation only when the voice data contains a command word, and the accuracy of command word detection is high, so the voice initiating object can quickly instruct the electronic device to execute the corresponding operation through a voice instruction, without first waking up the device and then issuing the instruction. It can be understood that, in order to reduce the false recognition rate of the command words, some less common words may be used in the command words of the preset command word set, or less common word groups may be added to the command words, thereby greatly improving the interactive experience.
The embodiment of the application provides a data processing scheme. According to the audio features respectively corresponding to the voice data of K voice frames in a target time window, a first command word hit by the voice data of the target time window in a command word set can be determined, which is equivalent to preliminarily determining the command word contained in the continuously input voice data. A characteristic time window is then determined based on feature information of the first command word, such as the command word length, so that a second command word hit by the voice data of the characteristic time window in the command word set is determined according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the characteristic time window, which is equivalent to determining a new time window and performing secondary verification on whether the continuously input voice data contains the command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. Therefore, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to perform secondary verification on whether the voice data contains the command word, so that the accuracy of detecting the command word in the voice data can be improved.
Referring to fig. 4, fig. 4 is a schematic flow chart of another data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S401, determining a target time window corresponding to the current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window.
The related description of step S401 may refer to the related description of step S201, and is not described herein again.
S402, determining a first command word hit by the voice data of the target time window in the command word set according to the audio characteristics corresponding to the voice data of the K voice frames.
In one possible embodiment, each command word in the command word set has a corresponding syllable identification sequence, which refers to a sequence composed of the syllable identifications of the syllables that the command word has and can be used to characterize those syllables. In a possible embodiment, the syllable identification sequence of each command word may be determined through a pronunciation dictionary, which is a pre-processed dictionary that may include a mapping relationship between each character in the command words and the syllable identification of its syllable, so that the syllable identification of each syllable of each command word, that is, the syllables of the command word, may be determined according to the pronunciation dictionary. It will be appreciated that different characters may have the same syllable; for example, the command words "play song" and "cancel heating" both include the syllable "qu".
In one possible implementation, as described above, the command word set includes at least one command word, each command word having a plurality of syllables, then step S402 may include the following steps:
Determining, according to the audio features respectively corresponding to the voice data of the K speech frames, the probability that each of the K speech frames corresponds to each syllable output unit in a syllable output unit set, where the syllable output unit set is determined based on the plurality of syllables that each command word has, and different syllable output units correspond to different syllables. The syllable output unit set refers to a set of classification items by which the syllable corresponding to the speech data of each speech frame can be classified, and the set includes a plurality of syllable output units. For example, if the syllable output unit set includes syllable output units A, B, and C, the speech data of each speech frame can be classified into A, B, or C, so that the probabilities that the K speech frames correspond to the syllable output units A, B, and C can be determined. The syllable output unit set determined based on the plurality of syllables that each command word has may be determined based on the syllable identifications of those syllables; specifically, the union of the syllable identifications of the syllables that each command word has is determined, and each syllable identification in the union corresponds to one syllable output unit.
In one embodiment, the syllable output unit set further includes a garbage syllable output unit, so that in the subsequent classification process, syllables that do not belong to the syllables of the command words in the command word set can be classified into the garbage syllable output unit. For example, the command word set includes command word 1, command word 2, and command word 3; the syllable identifications of the syllables of command word 1 are s1, s2, s3, and s4; the syllable identifications of the syllables of command word 2 are s1, s4, s5, and s6; and the syllable identifications of the syllables of command word 3 are s7, s2, s3, and s1. The syllable identifications corresponding to the syllables of command words 1 to 3 can thus be determined as s1, s2, s3, s4, s5, s6, and s7, and the syllable output units corresponding to these syllables, together with the garbage syllable output unit, are determined as the syllable output unit set.
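A minimal sketch of assembling the syllable output unit set is given below; the toy pronunciation dictionary simply mirrors the s1 to s7 identifiers of the example, and the '<garbage>' label for the garbage syllable output unit is an assumption.

```python
pronunciation_dict = {   # toy mapping from command words to syllable identifications
    "command word 1": ["s1", "s2", "s3", "s4"],
    "command word 2": ["s1", "s4", "s5", "s6"],
    "command word 3": ["s7", "s2", "s3", "s1"],
}

def build_syllable_output_units(command_words, dictionary) -> list:
    """Union of the syllable identifications of all command words, plus one garbage unit."""
    units = set()
    for word in command_words:
        units.update(dictionary[word])
    return sorted(units) + ["<garbage>"]

print(build_syllable_output_units(pronunciation_dict.keys(), pronunciation_dict))
# ['s1', 's2', 's3', 's4', 's5', 's6', 's7', '<garbage>']
```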
Determining, according to the probability that each of the K speech frames corresponds to each syllable output unit, the first confidence corresponding to the voice data of the target time window and each command word. The first confidence of any command word can be obtained from the product of the maximum probabilities corresponding to the syllables that the command word has; that is, for each syllable of the command word, the maximum probability over the K speech frames is taken, and the first confidence is determined according to the product of these maximum probabilities.
And if the command word set has the command word with the first confidence coefficient larger than or equal to the first threshold value, determining the command word with the first confidence coefficient larger than or equal to the first threshold value as the first command word hit by the voice data of the target time window in the command word set. This step can refer to the above description, and is not described herein again.
In a possible implementation manner, any command word in the command word set may be denoted as a target command word; then, determining the first confidence corresponding to the voice data of the target time window and each command word according to the probability that each of the K speech frames corresponds to each syllable output unit may specifically include the following steps. First, determine the syllable output unit corresponding to each syllable of the target command word as a target syllable output unit, so as to obtain a plurality of target syllable output units corresponding to the target command word. The target syllable output units are the syllable output units corresponding to the syllables of the target command word, and can be determined through the syllable identification sequence of the target command word: since each syllable output unit has a corresponding syllable, the target syllable output units can be determined from the plurality of syllable output units through the syllables in the syllable identification sequence. For example, if the target command word is "turn on heating", the syllable identifications of the syllables included in the target command word are determined from the pronunciation dictionary as s1, s2, s3, and s4 (which may be referred to as the syllable identification sequence of the target command word), and the syllable output units corresponding to s1, s2, s3, and s4 can be collectively determined as the target syllable output units according to the syllable identification sequence.
Secondly, determine, from the probabilities that the K speech frames respectively correspond to each syllable output unit, the probabilities that the K speech frames correspond to each target syllable output unit, so as to obtain K candidate probabilities corresponding to each target syllable output unit. Here, a candidate probability is the probability that one of the speech frames corresponds to a target syllable output unit. For example, if the target syllable output units are the syllable output units corresponding to s1, s2, s3, and s4 (referred to here as syllable output units s1, s2, s3, and s4), the probabilities that the K speech frames correspond to s1, to s2, to s3, and to s4 can be determined respectively; that is, a total of K × 4 candidate probabilities are obtained.
Thirdly, determine, from the K candidate probabilities corresponding to each target syllable output unit, the maximum candidate probability corresponding to that target syllable output unit, and determine the first confidence corresponding to the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit. Specifically, the first confidence corresponding to the voice data of the target time window and the target command word may be determined through the product of the maximum candidate probabilities respectively corresponding to the target syllable output units; for example, the product of these maximum candidate probabilities may be directly determined as the first confidence, or the first confidence may be obtained through other mathematical calculations, which is not limited here. For example, the probabilities that s1 corresponds to the K speech frames are {G1_1, G1_2, G1_3, ..., G1_K}, among which the maximum is the probability G1_10 corresponding to the 10th speech frame in the target time window; the probabilities that s2 corresponds to the K speech frames are {G2_1, G2_2, G2_3, ..., G2_K}, among which the maximum is the probability G2_25 corresponding to the 25th speech frame in the target time window; the probabilities that s3 corresponds to the K speech frames are {G3_1, G3_2, G3_3, ..., G3_K}, among which the maximum is the probability G3_34 corresponding to the 34th speech frame in the target time window; and the probabilities that s4 corresponds to the K speech frames are {G4_1, G4_2, G4_3, ..., G4_K}, among which the maximum is the probability G4_39 corresponding to the 39th speech frame in the target time window. The product of G1_10, G2_25, G3_34, and G4_39 can then determine the first confidence corresponding to the voice data of the target time window and the target command word. It is understood that, by performing the above operations on each command word in the command word set, the first confidence corresponding to each command word can be determined.
In one possible embodiment, the first confidence of the speech data of the target time window corresponding to the target command word is determined according to the maximum candidate probability corresponding to each target syllable output unit, and may be calculated by the following formula (Formula 1):

C = \prod_{i=1}^{n-1} \max_{1 \le j \le K} p_{ij}    (Formula 1)

where C represents the first confidence that the audio data of the target time window corresponds to the target command word; n-1 represents the number of target syllable output units corresponding to the target command word, and n represents the total number of target syllable output units and garbage syllable output units; i denotes the i-th target syllable output unit, and j denotes the j-th speech frame of the target time window, so that p_{ij} indicates the probability of the i-th target syllable output unit for the j-th speech frame, and \max_{1 \le j \le K} p_{ij} represents the maximum candidate probability of the i-th target syllable output unit over the K speech frames. The product of the maximum candidate probabilities respectively corresponding to the target syllable output units thus gives, based on Formula 1, the first confidence of the audio data of the target time window corresponding to the target command word.
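A small numerical sketch of Formula 1 follows: the first confidence is the product, over the target syllable output units of the command word, of each unit's maximum probability across the K speech frames. The probability matrix and unit indices are toy assumptions.

```python
import numpy as np

def first_confidence(probs: np.ndarray, target_unit_indices: list) -> float:
    """probs: shape (K, num_syllable_output_units); returns the product of the
    per-unit maximum probabilities over the K speech frames (Formula 1)."""
    max_per_unit = probs[:, target_unit_indices].max(axis=0)
    return float(np.prod(max_per_unit))

# Toy example: K = 4 speech frames, 3 syllable output units,
# and a command word whose target syllable output units are units 0 and 2.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.2, 0.2],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.1, 0.7]])
print(first_confidence(probs, [0, 2]))   # 0.6 * 0.7 = 0.42
```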
In a possible implementation manner, the first command word is determined by a trained primary detection network; how the primary detection network determines the first command word may refer to the above description, and is not described herein again. In one implementation, the trained primary detection network may be divided into an acoustic model and a confidence generation module. The acoustic model is used for executing the step of determining, according to the audio features respectively corresponding to the voice data of the K speech frames, the probability that each of the K speech frames corresponds to each syllable output unit in the syllable output unit set. The acoustic model is typically a deep neural network, such as a DNN model, a CNN model, or an LSTM model (all neural network models), which is not limited here. The confidence generation module may be configured to perform the step of determining, according to the probabilities that the K speech frames correspond to each syllable output unit, the first confidence corresponding to the voice data of the target time window and each command word, and details are not repeated here. Optionally, the dimension of the result output by the primary detection network is the number of command words in the command word set, and each dimension corresponds to the first confidence of one command word.
For example, please refer to fig. 5, which is a schematic diagram of a framework of a primary detection network according to an embodiment of the present application. As shown in fig. 5, the speech data of the K speech frames within the target time window may first be obtained (as shown at 501 in fig. 5), and the audio feature of each speech frame is then determined based on 501 (as shown at 502 in fig. 5) and input into the acoustic model of the trained primary detection network (as shown at 503 in fig. 5). The result output by the acoustic model is then input to the confidence generation module (shown as 504 in fig. 5), so that the confidence generation module, in conjunction with a pronunciation dictionary (shown as 505 in fig. 5), determines the target syllable output units corresponding to the syllables of each command word and then determines the first confidence corresponding to each command word, such as the command word 1 confidence, the command word 2 confidence, ..., the command word m confidence, so as to obtain the first command word hit by the audio data of the target time window. It is to be understood that if no command word has a first confidence greater than or equal to the first threshold, the audio data of the target time window does not hit a first command word.
In a possible implementation manner, before determining the first command word by the trained primary detection network, the training of the primary detection network is required, which may specifically include the following steps:
Firstly, first sample voice data is obtained, where the first sample voice data carries syllable output unit labels. The first sample voice data is used for training the primary detection network; it may be voice data containing a command word, that is, positive sample data, or voice data not containing a command word, that is, negative sample data, so that training on both positive and negative sample data achieves a better training effect. The syllable output unit labels annotate the syllable output unit actually corresponding to each speech frame in the first sample voice data. It can be understood that, if a speech frame in the first sample voice data actually corresponds to a syllable of a command word in the command word set, the syllable output unit actually corresponding to that speech frame is the syllable output unit corresponding to that syllable; if the speech frame does not actually correspond to any syllable of the command words in the command word set, the syllable output unit actually corresponding to that speech frame is the garbage syllable output unit.
And secondly, an initial primary detection network is called to determine a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data. The initial primary detection network also includes an acoustic model, and the predicted syllable output unit can be determined by this acoustic model; specifically, the probability that each voice frame corresponds to each syllable output unit in the syllable output unit set is determined according to the audio features corresponding to the voice data of each voice frame in the first sample voice data, and the predicted syllable output unit is then determined based on these probabilities. The audio features corresponding to the voice data of each voice frame in the first sample voice data are determined in the same manner as the audio features corresponding to each voice frame in the target time window, which is not repeated here.
And thirdly, training is performed based on the predicted syllable output unit and the syllable output unit label respectively corresponding to the voice data of each voice frame in the first sample voice data, to obtain the trained primary detection network. In the training process, the network parameters of the initial primary detection network are adjusted so that the predicted syllable output unit corresponding to each voice frame gradually approaches the actual syllable output unit marked by the syllable output unit label, so that the trained primary detection network can accurately predict the probability that each voice frame corresponds to each syllable output unit. It can be understood that the predicted syllable output unit is determined by the acoustic model in the primary detection network, that is, training the primary detection network amounts to adjusting the model parameters of the acoustic model in the primary detection network.
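As a hedged illustration of this frame-level training, the following sketch uses PyTorch with an assumed small fully-connected acoustic model and a cross-entropy loss over the syllable output unit labels; the architecture, the feature dimension of 40, the learning rate, and the data loader interface are all assumptions rather than details specified by the embodiment.

```python
import torch
import torch.nn as nn

def train_primary_network(first_sample_loader, num_syllable_units, feat_dim=40, epochs=5):
    """Frame-level training sketch for the acoustic model of the primary detection network.

    first_sample_loader yields (features, unit_labels): features is a
    (num_frames, feat_dim) float tensor of per-frame audio features, and
    unit_labels is a (num_frames,) long tensor of syllable output unit labels
    (the garbage syllable output unit included).
    """
    acoustic_model = nn.Sequential(
        nn.Linear(feat_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, num_syllable_units),   # one output per syllable output unit
    )
    optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, unit_labels in first_sample_loader:
            logits = acoustic_model(features)       # predicted syllable output unit scores per frame
            loss = criterion(logits, unit_labels)   # compare with the syllable output unit labels
            optimizer.zero_grad()
            loss.backward()                         # adjust the network parameters so the
            optimizer.step()                        # prediction approaches the label
    return acoustic_model
```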
In a possible implementation manner, if the determination of whether a command word is hit is implemented by a Keyword/Filler HMM model (a command word detection model), the primary detection network may be the Keyword/Filler HMM model. In this case, the probabilities that the K voice frames correspond to each syllable output unit in the syllable output unit set are determined according to the audio features corresponding to the voice data of the K voice frames, an optimal decoding path is then determined based on the probability corresponding to each syllable output unit, and whether the optimal decoding path passes through the HMM path (hidden Markov model path) of a command word is judged to determine whether a command word is hit; alternatively, a confidence corresponding to each HMM path is determined based on the probability corresponding to each syllable output unit to determine whether a command word is hit, which is not limited herein. An HMM path may be a command word HMM path or a filler HMM path, where each command word HMM path may be formed by connecting in series the HMM states corresponding to the plurality of syllables of a command word, and the filler HMM path is formed by a set of well-designed HMM states corresponding to non-command-word pronunciation units. It is thus possible to determine the confidence with respect to each HMM path based on the probability corresponding to each syllable output unit, thereby determining whether a command word is hit, and which command word is hit.
S403, determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics corresponding to the voice data of the plurality of voice frames in the characteristic time window.
The related description of step S403 may refer to the related description of step S203, which is not described herein again.
Optionally, the first number may also be determined in other manners: for example, the first number may be a preset number, or the first number may be determined according to the earliest occurrence time of the first command word in the target time window, which is not limited herein. The characteristic time window associated with the current voice frame is then determined based on the first number.
In a possible implementation manner, as described above, the first number may be a preset number. The preset number should cover the first command word as far as possible, and may be set based on the longest command word length in the command word set. Specifically, the preset number may be determined based on the longest command word length and a target preset value, the preset number is then determined as the first number, and the characteristic time window is determined according to the first number of voice frames before the current voice frame.
In a possible implementation, the first number may also be determined according to the earliest occurrence time of the first command word in the target time window, and determining the characteristic time window may specifically include the following steps. Firstly, a syllable output unit set is obtained, where the syllable output unit set is determined based on the plurality of syllables of each command word, and the syllables corresponding to different syllable output units are different. Secondly, according to the audio features respectively corresponding to the voice data of the K voice frames, the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set is determined. The description of the first and second steps is referred to above and is not repeated here. Thirdly, the syllable output units corresponding to the syllables of the command word hit by the voice data of the target time window are determined as verification syllable output units, and for each verification syllable output unit, the voice frame with the highest probability for that verification syllable output unit among the K voice frames is determined as a target voice frame. A target voice frame is equivalent to a voice frame in which a syllable of the first command word is detected among the K voice frames, that is, the occurrence time of the first command word can be determined. Fourthly, the characteristic time window associated with the current voice frame is determined according to the voice frames between the target voice frame and the current voice frame. Specifically, the characteristic time window associated with the current voice frame may be determined according to the target voice frame having the largest number of voice frames between it and the current voice frame; that is, the target voice frame spaced from the current voice frame by the largest number of voice frames is used to represent the earliest occurrence time of the first command word in the target time window, the first number is the number of voice frames between the current voice frame and this target voice frame, and the voice frames between the current voice frame and this target voice frame are determined as the voice frames in the characteristic time window. It can be understood that the voice frames between the current voice frame and the target voice frame include the current voice frame and the target voice frame. In this way, a more accurate characteristic time window can be determined, and command word detection on the voice data in the characteristic time window is more accurate. For example, the continuously input voice data includes the 1st, 2nd, 3rd, ..., nth voice frames; if the current voice frame is the 120th voice frame and the target voice frame having the largest number of voice frames between it and the current voice frame is the 20th voice frame, the 20th to 120th voice frames are determined as the voice frames in the characteristic time window associated with the 120th voice frame.
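A minimal sketch of this computation, assuming the per-frame syllable output unit probabilities for the target time window are available as a NumPy array and that frame indices are counted over the continuously input voice data, might look as follows; the function and parameter names are illustrative only.

```python
import numpy as np

def feature_time_window(frame_probs, verification_unit_ids, current_frame_idx):
    """Determine the characteristic time window from the earliest occurrence of the
    first command word (sketch).

    frame_probs: (K, U) probabilities of the K frames in the target time window,
                 which ends at current_frame_idx, for each syllable output unit.
    verification_unit_ids: syllable output units of the hit first command word.
    Returns (start_frame_idx, current_frame_idx), both inclusive.
    """
    K = frame_probs.shape[0]
    first_frame_of_window = current_frame_idx - K + 1
    # For each verification syllable output unit, the frame with the highest
    # probability is a target voice frame.
    target_frames = [first_frame_of_window + int(np.argmax(frame_probs[:, u]))
                     for u in verification_unit_ids]
    # The target frame farthest from the current frame marks the earliest occurrence
    # of the first command word; frames from it to the current frame (both included)
    # form the characteristic time window.
    earliest = min(target_frames)
    return earliest, current_frame_idx
```

With the example above, if the current voice frame is the 120th frame and the earliest target voice frame is the 20th, the function would return (20, 120).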
S404, determining a second confidence corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics corresponding to the voice data of the voice frames in the characteristic time window.
Each command word here refers to each command word in the command word set described above. The second confidence may characterize the likelihood that the voice data of the characteristic time window is each command word, and each command word may have a corresponding second confidence.
In one possible implementation, when determining the second confidence level that the speech data of the characteristic time window corresponds to each command word, the second confidence level that the speech data of the characteristic time window corresponds to the garbage class may also be determined, that is, the probability that the speech data of the characteristic time window is not a command word is characterized by the second confidence level that the speech data of the characteristic time window corresponds to the garbage class.
S405, if the command word with the second confidence degree larger than or equal to the second threshold exists in the command word set, determining the command word with the second confidence degree larger than or equal to the second threshold and the maximum second confidence degree as the second command word hit by the voice data of the characteristic time window in the command word set, and executing the operation indicated by the second command word.
The second threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable second threshold may be set to determine the second command word. It is to be understood that if there is no command word in the command word set whose second confidence is greater than or equal to the second threshold, it is determined that there is no hit second command word in the command word set by the speech data of the characteristic time window. Optionally, after the second command word is determined, the operation indicated by the second command word may be performed.
In a possible implementation manner, if a second confidence corresponding to the garbage class is determined when the second confidences are determined, the maximum second confidence may be determined among the second confidences other than the second confidence corresponding to the garbage class. If this maximum second confidence is greater than or equal to the second threshold, the command word corresponding to it is determined as the hit second command word; if this maximum second confidence is less than the second threshold, the voice data of the characteristic time window is classified into the garbage class, that is, the voice data of the characteristic time window has no hit second command word in the command word set.
For example, if the command word set includes command word 1, command word 2, command word 3, and command word 4, the second confidence corresponding to each command word is obtained based on the audio features in the characteristic time window, where the second confidence corresponding to command word 1 is 0.3, the second confidence corresponding to command word 2 is 0.73, the second confidence corresponding to command word 3 is 0.42, the second confidence corresponding to command word 4 is 0.58, and the second confidence corresponding to the garbage class is 0.61. If the preset second threshold is 0.6, a command word with a second confidence greater than or equal to the second threshold, namely command word 2, exists in the command word set, and command word 2 is the second command word hit by the voice data of the characteristic time window in the command word set, that is, the input voice data hits command word 2, so that the operation indicated by command word 2 can be executed. If the preset second threshold is 0.75, no command word with a second confidence greater than or equal to the second threshold exists in the command word set, and it is determined that the voice data of the characteristic time window has no hit command word in the command word set, that is, the voice data of the characteristic time window is classified into the garbage class; a new current voice frame is then determined, and the above steps are repeated, so as to implement command word detection.
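The selection logic of this example can be sketched as follows; the dictionary keys, the "garbage" label, and the function name are illustrative assumptions, not identifiers defined by the embodiment.

```python
def pick_second_command_word(second_confidences, second_threshold):
    """Sketch of steps S404-S405: second_confidences maps each command word,
    plus the key "garbage", to its second confidence."""
    candidates = {w: c for w, c in second_confidences.items() if w != "garbage"}
    best_word = max(candidates, key=candidates.get)
    if candidates[best_word] >= second_threshold:
        return best_word    # hit: the operation indicated by this command word can be executed
    return None             # classified into the garbage class, no hit second command word

second_confidences = {"command word 1": 0.3, "command word 2": 0.73,
                      "command word 3": 0.42, "command word 4": 0.58,
                      "garbage": 0.61}
print(pick_second_command_word(second_confidences, 0.6))    # command word 2
print(pick_second_command_word(second_confidences, 0.75))   # None
```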
In a possible embodiment, the second command word is determined by a trained secondary detection network, and the secondary detection network may be a deep neural network, such as a CLDNN model (a neural network model). How the secondary detection network determines the second command word may refer to the related description of steps S404-S405, which is not repeated here. In one implementation, when the secondary detection network is called to determine the hit second command word according to the audio features corresponding to the voice data of the voice frames in the characteristic time window, the voice data of the voice frames in the characteristic time window may be input sequentially, so as to obtain the second confidence corresponding to the voice data of the characteristic time window and each command word. Optionally, the dimension of the result output by the secondary detection network is the number of command words in the command word set plus 1, where the added 1 is the dimension of the second confidence corresponding to the garbage class.
In a possible implementation manner, before determining the second command word through the trained secondary detection network, the training of the secondary detection network is required, which may specifically include the following steps:
Firstly, second sample voice data is obtained, and the second sample voice data carries a command word label. The second sample voice data may be positive sample data or negative sample data. The positive sample data may be audio data in a characteristic time window determined based on the trained primary detection network. The negative sample data may be voice data containing various non-command words. The negative sample data may also be audio data with interference noise, such as audio data with added music or television noise, or synthesized or real audio data in various far-field environments, so that the accuracy of command word detection in far-field or noisy environments can be improved. It can be understood that, in the training process of the primary detection network, the adopted negative sample data does not include audio data with various interference noises, because training the primary detection network with such audio data would rather degrade its classification effect on the syllable output units; therefore, the secondary detection network is trained with audio data carrying interference noise, which improves the accuracy of command word detection under conditions with more interference factors, effectively compensates for the shortcoming of the primary detection network, and gives the secondary detection network good complementarity to the primary detection network. The command word label labels the command word actually corresponding to the second sample voice data; it can be understood that, if the second sample voice data actually has a corresponding command word, the command word label labels that command word, and if the second sample voice data has no corresponding command word, the command word label labels that the second sample voice data actually belongs to the garbage class.
And secondly, an initial secondary detection network is called to determine a predicted command word corresponding to the second sample voice data. Specifically, a second confidence corresponding to the second sample voice data and each command word is determined according to the audio features corresponding to the voice data of each voice frame in the second sample voice data, and the predicted command word corresponding to the second sample voice data is then determined based on the second confidence corresponding to each command word. The audio features corresponding to the voice data of each voice frame in the second sample voice data are calculated in the same manner as the audio features corresponding to the voice frames in the target time window, which is not repeated here.
And thirdly, training is performed based on the predicted command word and the command word label to obtain the trained secondary detection network. In the training process, the network parameters of the initial secondary detection network are adjusted so that the predicted command word corresponding to the second sample voice data gradually approaches the actual command word marked by the command word label, so that the trained secondary detection network can accurately predict the command word corresponding to the voice data in each characteristic time window.
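As an illustrative counterpart to the primary-network training sketch above, the following assumes a small GRU-based classifier standing in for the CLDNN-style secondary detection network, trained with utterance-level command word labels where one extra class represents the garbage category; the architecture, hyper-parameters, and data loader interface are assumptions.

```python
import torch
import torch.nn as nn

def train_secondary_network(second_sample_loader, num_command_words, feat_dim=40, epochs=5):
    """Utterance-level training sketch for the secondary detection network.

    second_sample_loader yields (features, word_label): features is a
    (num_frames, feat_dim) float tensor for one characteristic time window and
    word_label is a scalar long tensor in [0, num_command_words], where index
    num_command_words stands for the garbage class.
    """
    encoder = nn.GRU(feat_dim, 128, batch_first=True)
    classifier = nn.Linear(128, num_command_words + 1)    # +1 dimension for the garbage class
    params = list(encoder.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, word_label in second_sample_loader:
            _, hidden = encoder(features.unsqueeze(0))     # feed the window frame by frame
            logits = classifier(hidden[-1])                # one prediction per characteristic time window
            loss = criterion(logits, word_label.view(1))   # compare with the command word label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder, classifier
```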
Optionally, the method may further determine a third confidence corresponding to the voice data of the characteristic time window and each command word based on the first confidence corresponding to each command word and the second confidence corresponding to each command word, and further determine, based on the third confidence, the second command word hit by the voice data of the characteristic time window. If a command word whose third confidence is greater than or equal to a preset threshold exists in the command word set, the command word whose third confidence is greater than or equal to the preset threshold and is the largest is determined as the second command word hit by the voice data of the characteristic time window in the command word set. If no command word with a third confidence greater than or equal to the preset threshold exists in the command word set, no operation is executed and the target time window of a new current voice frame is determined. In this way, the finally hit command word is determined by combining the first confidence and the second confidence, which can improve the accuracy of command word detection.
In a possible implementation manner, determining, based on the first confidence corresponding to each command word and the second confidence corresponding to each command word, the third confidence corresponding to the voice data of the characteristic time window and each command word may specifically be: splicing the first confidence corresponding to each command word and the second confidence corresponding to each command word to obtain a verification feature, and further determining, based on the verification feature, the third confidence corresponding to the voice data of the characteristic time window and each command word. Determining the third confidence based on the verification feature may be performed by a trained neural network, such as a simple multi-layer DNN network (a neural network model).
In a possible embodiment, the third confidence corresponding to the voice data of the characteristic time window and each command word may alternatively be obtained by performing a mathematical calculation on the first confidence and the second confidence corresponding to each command word; for example, the third confidence corresponding to a command word may be determined based on the average or weighted average of its first confidence and second confidence. Optionally, since the voice frames covered by the characteristic time window may be more accurate, a higher weight may be assigned to the second confidence when determining the weighted average of the first confidence and the second confidence.
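A weighted-average fusion of this kind might be sketched as follows; the weight of 0.7 on the second confidence is only an assumed example of giving it the higher weight, and the function name and the "garbage" key are illustrative.

```python
def third_confidences(first_conf, second_conf, w_second=0.7):
    """Fuse the two stages per command word by weighted average (sketch).
    first_conf and second_conf map command words to their first/second confidences;
    a possible "garbage" entry in second_conf is ignored here."""
    return {word: (1.0 - w_second) * first_conf.get(word, 0.0) + w_second * c
            for word, c in second_conf.items() if word != "garbage"}
```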
It can be understood that, because electronic devices that need to detect instructions in voice data generally have low hardware configurations in terms of the CPU (central processing unit), memory, flash memory and the like, there are relatively strict requirements on the resource occupation of each function. In the present application, command word detection in voice data is mainly performed by the trained primary detection network and secondary detection network, whose network structures are relatively simple, so that the resource occupation on the electronic device is relatively small while the command word detection performance can be effectively improved. In contrast, recognizing the full content of the received voice data based on general speech recognition technology requires a larger-scale acoustic model and language model to achieve a good recognition effect, that is, a good recognition effect can be achieved only by occupying more device resources.
How command word detection on voice data is implemented through secondary verification is explained below with an example; please refer to fig. 6, which is a framework diagram of a data processing method provided by an embodiment of the present application. As shown in fig. 6, the flow of the entire data processing method may be abstracted into a first-level verification (shown as 601 in fig. 6) and a second-level verification (shown as 602 in fig. 6). Voice data is input into the first-level verification, which may specifically include determining, by the trained primary detection network, the first confidence of each command word based on the audio features of the voice data in the target time window corresponding to the current voice frame, and then performing threshold judgment to determine the hit first command word. The characteristic time window is then determined based on the first command word, and the audio features of the voice data in the characteristic time window associated with the current voice frame are obtained. The audio features corresponding to the characteristic time window are input into the trained secondary detection network to obtain the second confidence of each command word, so that the second command word hit by the characteristic time window is determined through threshold judgment; through this secondary verification of the voice data, the accuracy of command word detection can be improved.
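Putting the two stages together, a highly simplified sketch of the flow in fig. 6 could look as follows; `primary` and `secondary` are assumed callables wrapping the trained networks and returning a mapping from command words to confidences, `word_frame_lengths` and `margin` stand in for the command word length and the target preset value, and all names are illustrative rather than part of the embodiment.

```python
def detect_command_word(stream_features, current_idx, K, word_frame_lengths, margin,
                        primary, secondary, first_threshold, second_threshold):
    """Two-stage detection sketch for one current voice frame."""
    # First-level verification (601): target time window of K frames ending at the current frame.
    target_window = stream_features[current_idx - K + 1:current_idx + 1]
    first_conf = primary(target_window)
    hits = {w: c for w, c in first_conf.items() if c >= first_threshold}
    if not hits:
        return None                              # no hit first command word, move to the next frame
    first_word = max(hits, key=hits.get)
    # Second-level verification (602): characteristic time window associated with the
    # current frame, sized from the length of the hit first command word plus a margin.
    first_number = word_frame_lengths[first_word] + margin
    feature_window = stream_features[current_idx - first_number:current_idx + 1]
    second_conf = {w: c for w, c in secondary(feature_window).items() if w != "garbage"}
    best = max(second_conf, key=second_conf.get)
    return best if second_conf[best] >= second_threshold else None
```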
The embodiment of the application provides a data processing scheme, which can determine a first command word hit by voice data of a target time window in a command word set according to audio features corresponding to the voice data of K voice frames in the target time window, which is equivalent to preliminarily determining the command word contained in the continuously input voice data, and further determine a feature time window based on feature information of the first command word, such as command word length, so that the audio features corresponding to the voice data of a plurality of voice frames in the feature time window respectively determine a second command word hit by the voice data of the feature time window in the command word set, which is equivalent to determining a new feature time window, and performing secondary verification on whether the command word is contained in the continuously input voice data or not. Alternatively, after the second command word is detected, the operation indicated by the second command word may be performed. Therefore, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to carry out secondary verification on whether the voice data contains the command word, so that the accuracy of detecting the command word of the voice data can be improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. Alternatively, the data processing apparatus may be disposed in the electronic device. As shown in fig. 7, the data processing apparatus described in the present embodiment may include:
an obtaining unit 701, configured to determine a target time window corresponding to a current speech frame, and obtain audio features corresponding to speech data of K speech frames in the target time window, where K is a positive integer;
a processing unit 702, configured to determine, according to audio features corresponding to the voice data of the K voice frames, a first command word hit by the voice data of the target time window in a command word set;
the processing unit 702 is further configured to determine a characteristic time window associated with the current speech frame based on the command word length of the first command word, and obtain audio characteristics corresponding to speech data of a plurality of speech frames in the characteristic time window;
the processing unit 702 is further configured to determine, based on the audio features respectively corresponding to the voice data of the multiple voice frames in the characteristic time window, a second command word hit by the voice data of the characteristic time window in the command word set.
In one implementation, the set of command words includes at least one command word, each command word having a plurality of syllables; the processing unit 702 is specifically configured to:
determining the probability that the K voice frames respectively correspond to each syllable output unit in a syllable output unit set according to the audio characteristics respectively corresponding to the voice data of the K voice frames; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
determining a first confidence coefficient corresponding to the voice data of the target time window and each command word according to the probability that the K voice frames correspond to each syllable output unit respectively;
if command words with the first confidence degree larger than or equal to a first threshold value exist in the command word set, determining the command words with the first confidence degree larger than or equal to the first threshold value as the first command words hit by the voice data of the target time window in the command word set.
In one implementation, any one command word in the command word set is represented as a target command word; the processing unit 702 is specifically configured to:
determining a syllable output unit corresponding to each syllable of the target command word as a target syllable output unit to obtain a plurality of target syllable output units corresponding to the target command word;
determining the probability that the K voice frames respectively correspond to each target syllable output unit from the probability that the K voice frames respectively correspond to each syllable output unit to obtain K candidate probabilities respectively corresponding to each target syllable output unit;
and determining the maximum candidate probability corresponding to each target syllable output unit from the K candidate probabilities corresponding to each target syllable output unit, and determining the first confidence coefficient corresponding to the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit.
In one implementation, the set of command words includes at least one command word; the processing unit 702 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics corresponding to the voice data of the voice frames in the characteristic time window;
and if command words with the second confidence degree larger than or equal to a second threshold exist in the command word set, determining the command words with the second confidence degree larger than or equal to the second threshold and the maximum second confidence degree as the second command words hit by the voice data of the characteristic time window in the command word set.
In an implementation manner, the processing unit 702 is specifically configured to:
determining a first number according to the command word length of the first command word and a target preset value;
determining a characteristic time window associated with the current speech frame based on the first number of speech frames preceding the current speech frame and a second number of speech frames following the current speech frame.
In one implementation, the first command word is determined by a trained primary detection network, and the processing unit 702 is further configured to:
acquiring first sample voice data, wherein the first sample voice data carries a syllable output unit label;
calling an initial primary detection network, and determining a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data;
and training based on the predicted syllable output unit and the syllable output unit label which respectively correspond to the voice data of each voice frame in the first sample voice data to obtain the trained primary detection network.
In one implementation, the second command word is determined by a trained secondary detection network, and the processing unit 702 is further configured to:
acquiring second sample voice data, wherein the second sample voice data carries a command word label;
calling a secondary detection network to determine a prediction command word corresponding to the second sample voice data;
and training based on the predicted command words and the command word labels to obtain the trained secondary detection network.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device described in this embodiment includes: a processor 801 and a memory 802. Optionally, the electronic device may further include a network interface or a power supply module. The processor 801 and the memory 802 can exchange data with each other.
The processor 801 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The network interface may include an input device, such as a control panel, a microphone, or a receiver, and/or an output device, such as a display screen or a transmitter, which is not limited herein.
The memory 802, which may include a read-only memory and a random access memory, provides program instructions and data to the processor 801. A portion of the memory 802 may also include a non-volatile random access memory. The processor 801 is configured to perform the following when calling the program instructions:
determining a target time window corresponding to a current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, wherein K is a positive integer;
determining a first command word hit by the voice data of the target time window in a command word set according to the audio characteristics corresponding to the voice data of the K voice frames respectively;
determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics corresponding to the voice data of a plurality of voice frames in the characteristic time window;
and determining second command words hit by the voice data of the characteristic time window in the command word set based on the audio characteristics corresponding to the voice data of the plurality of voice frames in the characteristic time window respectively.
In one implementation, the set of command words includes at least one command word, each command word having a plurality of syllables; the processor 801 is specifically configured to:
determining the probability that the K voice frames respectively correspond to each syllable output unit in a syllable output unit set according to the audio characteristics respectively corresponding to the voice data of the K voice frames; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
determining a first confidence coefficient corresponding to the voice data of the target time window and each command word according to the probability that the K voice frames respectively correspond to each syllable output unit;
if command words with the first confidence degree larger than or equal to a first threshold value exist in the command word set, determining the command words with the first confidence degree larger than or equal to the first threshold value as the first command words hit by the voice data of the target time window in the command word set.
In one implementation, any one command word in the command word set is represented as a target command word; the processor 801 is specifically configured to:
determining a syllable output unit corresponding to each syllable of the target command word as a target syllable output unit to obtain a plurality of target syllable output units corresponding to the target command word;
determining the probability that the K voice frames respectively correspond to each target syllable output unit from the probability that the K voice frames respectively correspond to each syllable output unit to obtain K candidate probabilities respectively corresponding to each target syllable output unit;
and determining the maximum candidate probability corresponding to each target syllable output unit from the K candidate probabilities corresponding to each target syllable output unit, and determining the first confidence coefficient corresponding to the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit.
In one implementation, the set of command words includes at least one command word; the processor 801 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics corresponding to the voice data of the voice frames in the characteristic time window;
and if the command word with the second confidence coefficient larger than or equal to a second threshold exists in the command word set, determining the command word with the second confidence coefficient larger than or equal to the second threshold and the maximum second confidence coefficient as the second command word hit by the voice data of the characteristic time window in the command word set.
In one implementation, the processor 801 is specifically configured to:
determining a first number according to the command word length of the first command word and a target preset value;
determining a characteristic time window associated with the current speech frame based on the first number of speech frames preceding the current speech frame and a second number of speech frames following the current speech frame.
In one implementation, the first command word is determined by a trained primary detection network, and the processor 801 is further configured to:
acquiring first sample voice data, wherein the first sample voice data carries a syllable output unit label;
calling an initial primary detection network, and determining a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data;
and training based on the predicted syllable output unit and the syllable output unit label respectively corresponding to the voice data of each voice frame in the first sample voice data, to obtain the trained primary detection network.
In one implementation, the second command word is determined by a trained secondary detection network, and the processor 801 is further configured to:
acquiring second sample voice data, wherein the second sample voice data carries a command word label;
calling a secondary detection network to determine a prediction command word corresponding to the second sample voice data;
and training based on the predicted command words and the command word labels to obtain the trained secondary detection network.
Optionally, the program instructions may also implement other steps of the method in the above embodiments when executed by the processor, and details are not described here.
The present application further provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the above method, such as performing the above method performed by an electronic device, which is not described herein in detail.
Optionally, the storage medium, such as a computer-readable storage medium, referred to herein may be non-volatile or volatile.
Alternatively, the computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function, and the like, and the storage data area may store data created according to the use of a blockchain node, and the like. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, and an encryption algorithm. A blockchain (Blockchain) is essentially a decentralized database, a series of data blocks associated by cryptographic methods, where each data block contains information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Embodiments of the present application also provide a computer program product or a computer program, which includes computer instructions that, when executed by a processor, implement some or all of the steps of the above-described method. The computer instructions are stored in a computer readable storage medium, for example. The computer instructions are read by a processor of a computer device (i.e., the electronic device) from a computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps performed in the embodiments of the methods described above. For example, the computer device may be a terminal, or may be a server.
The data processing method, the data processing apparatus, the electronic device, and the storage medium provided in the embodiments of the present application are described in detail above. A specific example is used herein to explain the principle and the implementation of the present application, and the description of the above embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation to the present application.