Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a data processing scheme. According to the audio features respectively corresponding to the voice data of K voice frames in a target time window, a first command word hit by the voice data of the target time window in a command word set can be determined, which is equivalent to preliminarily determining the command word contained in the continuously input voice data. A characteristic time window is then determined based on feature information of the first command word, such as the command word length, so that a second command word hit by the voice data of the characteristic time window in the command word set is determined according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the characteristic time window, which is equivalent to determining a new time window and performing secondary verification on whether the continuously input voice data contains the command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. Therefore, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to perform secondary verification on whether the voice data contains the command word, so that the accuracy of detecting the command word in the voice data can be improved.
In a possible implementation manner, the embodiment of the present application can be applied to a data processing system. Please refer to fig. 1, which is a schematic structural diagram of a data processing system provided in the embodiment of the present application. As shown in fig. 1, the data processing system may include a voice initiating object and a data processing device. The voice initiating object may be used to send voice data to the data processing device, and may be a user or a device that needs to request the data processing device to respond, which is not limited herein. The data processing device may execute the data processing scheme and perform corresponding operations based on the received voice data; for example, the data processing device may be an in-vehicle system, a smart speaker, a smart appliance, or the like. That is to say, after the voice initiating object outputs voice data, the data processing device may receive the voice data, detect a command word in the voice data based on the data processing scheme, and then execute the operation corresponding to the detected command word. It is to be understood that, before the data processing device detects the voice data, a command word set may be preset, where the command word set includes one or more command words, and each command word may be associated with a corresponding operation. For example, the command word "turn on the air conditioner" is associated with an operation of turning on the air conditioner, and when the data processing device detects voice data including the command word "turn on the air conditioner", the data processing device may perform the operation of turning on the air conditioner. According to the data processing scheme, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to perform secondary verification on whether the voice data contains the command word, so that the accuracy of the data processing device in the data processing system in detecting the command word in the voice data can be improved, and a user can conveniently and accurately instruct the data processing device by voice to perform the corresponding operation.
It should be noted that, before collecting the relevant data of the user and in the process of collecting the relevant data of the user, the present application may display a prompt interface or a popup window, or output voice prompt information, where the prompt interface, the popup window, or the voice prompt information is used to prompt the user that the relevant data is currently being collected. In this way, the present application only starts to execute the relevant step of obtaining the relevant data of the user after obtaining a confirmation operation issued by the user on the prompt interface or the popup window; otherwise (that is, when the confirmation operation issued by the user on the prompt interface or the popup window is not obtained), the relevant step of obtaining the relevant data of the user is ended, that is, the relevant data of the user is not obtained. In other words, all user data collected in the present application is collected with the approval and authorization of the user, and the collection, use, and processing of the relevant user data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
In one possible implementation, the embodiments of the present application may be applied in the field of Artificial Intelligence (AI). Artificial intelligence is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence involves studying the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive subject that relates to a wide range of fields, including both hardware-level technologies and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning, and the like.
In a possible implementation manner, the embodiment of the present application can also be applied in the field of voice technology, such as detecting the command word hit by voice data as described above. The key technologies of speech technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes in the future.
The technical scheme of the application can be applied to an electronic device, such as the data processing device described above. The electronic device may be a terminal, a server, or another device for performing data processing, and the present application is not limited thereto. Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, an aircraft, an intelligent speaker, and the like.
It should be understood that the foregoing scenarios are only examples, and do not constitute a limitation on application scenarios of the technical solutions provided in the embodiments of the present application, and the technical solutions of the present application may also be applied to other scenarios. For example, as can be known by those skilled in the art, with the evolution of system architecture and the emergence of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
Based on the above description, the embodiments of the present application provide a data processing method. Referring to fig. 2, fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S201, determining a target time window corresponding to the current voice frame, and acquiring audio features corresponding to voice data of K voice frames in the target time window respectively.
The current speech frame may be any one of the acquired speech frames. It is to be understood that the obtained speech may be real-time speech, and for real-time continuously input speech data, the current speech frame may be a latest speech frame in the continuously input speech data. The obtained speech may also be non-real-time speech, for example, for a whole pre-generated speech data, each speech frame may also be determined as the current speech frame in sequence according to the sequence of each speech frame in the speech data.
A speech frame may include several sampling points; that is, the speech data of several consecutive sampling points form the speech data of one speech frame. It will be appreciated that the time difference between adjacent sampling points is the same. Two adjacent speech frames may have partially repeated sampling points, or completely different sampling points, which is not limited herein. For example, in a section of 10s of input speech data, one sampling point is determined every 10ms, and every 20 consecutive sampling points are determined as one speech frame; for example, in this section of 10s speech data, the 1st to 20th sampling points are determined as one speech frame, the 21st to 40th sampling points are determined as one speech frame, and so on, thereby obtaining a plurality of speech frames. For another example, in order to avoid excessive variation of the audio data between two adjacent speech frames, overlapped sampling points may be provided between two adjacent speech frames; for example, in the 10s speech data, the 1st to 20th sampling points are determined as one speech frame, the 15th to 35th sampling points are determined as one speech frame, the 30th to 50th sampling points are determined as one speech frame, and so on, thereby obtaining a plurality of speech frames.
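For illustration only, the following is a minimal sketch of grouping sampling points into speech frames with or without overlapping sampling points; the frame size, hop size, and function name are assumptions and not part of the embodiments.

```python
from typing import List

def split_into_frames(samples: List[float], frame_size: int = 20, hop: int = 20) -> List[List[float]]:
    """Group consecutive sampling points into speech frames.

    hop == frame_size -> adjacent frames share no sampling points;
    hop <  frame_size -> adjacent frames overlap, which smooths the variation
                         of audio data between neighbouring frames.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frames.append(samples[start:start + frame_size])
    return frames

# Non-overlapping frames: sampling points 1-20, 21-40, ...
frames_a = split_into_frames(list(range(1, 1001)), frame_size=20, hop=20)
# Overlapping frames (hop of 15): sampling points 1-20, 16-35, 31-50, ...
frames_b = split_into_frames(list(range(1, 1001)), frame_size=20, hop=15)
```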
The target time window corresponding to the current speech frame may be a time window in which the current speech frame is used as a reference speech frame, that is, the target time window includes the current speech frame. For example, K speech frames may be included in the target time window, where K is a positive integer; that is, K may be the number of all speech frames in the target time window. Optionally, the K speech frames may also be part of the speech frames selected from all speech frames in the target time window, that is, K may be less than or equal to the number of all speech frames in the target time window. For example, after the target time window is determined, the energy of each speech frame in the target time window is calculated, and the speech frames with energy lower than a certain threshold are then removed to obtain the K speech frames, thereby filtering out some low-volume speech frames and reducing the amount of calculation in the subsequent processing. The reference speech frame of a target time window indicates that the time window is divided based on that reference speech frame, which may be, for example, the first speech frame, the last speech frame, or the speech frame at the center of the time window, and is not limited herein. The first speech frame and the last speech frame are described according to the time sequence: the first speech frame is the speech frame with the earliest input time in the time window, and the last speech frame is the speech frame with the latest input time in the time window. The target time window corresponding to the current speech frame may therefore be a time window in which the current speech frame is used as the first speech frame, the last speech frame, or the speech frame at the center position, which is not limited herein. K may be preset, or may be determined based on the length of the obtained speech, or may be determined based on the length of the command words in the command word set, such as the maximum length or the average length, which is not limited here.
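The optional energy-based selection of the K speech frames described above could be sketched as follows; the energy definition (mean squared amplitude) and the threshold are illustrative assumptions.

```python
import numpy as np

def filter_low_energy_frames(window_frames: np.ndarray, energy_threshold: float) -> np.ndarray:
    """window_frames: array of shape (num_frames, samples_per_frame).
    Returns the K speech frames whose energy reaches the threshold."""
    energies = np.mean(window_frames ** 2, axis=1)   # per-frame energy estimate
    return window_frames[energies >= energy_threshold]
```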
Optionally, the target time window corresponding to the current speech frame may not include the current speech frame. For example, when the reference speech frame is the first speech frame of a time window, the next speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the first speech frame of the target time window is the next speech frame of the current speech frame; for another example, when the reference speech frame is the last speech frame of a time window, the previous speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the last speech frame of the target time window is the previous speech frame of the current speech frame, and so on, which is not described herein again.
In the present application, the determination of the subsequent target time window and the characteristic time window is mainly described by taking the case where the current speech frame is used as the last speech frame (i.e. the reference speech frame) of the corresponding target time window as an example. For example, the continuously input speech data includes the 1st, 2nd, 3rd, ..., nth speech frames. If the current speech frame is the 200th speech frame, the reference speech frame is the last speech frame of the time window, and the size of the target time window is 100 speech frames (that is, the target time window corresponding to the current speech frame includes 100 speech frames, i.e., K is 100), then the time window of size 100 with the 200th speech frame as its last speech frame may be determined as the target time window corresponding to the 200th speech frame; that is, the 100 speech frames before the 200th speech frame (the 100th to 200th speech frames) are determined as the speech frames in the target time window corresponding to the 200th speech frame.
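As a hedged sketch of this windowing convention (the current speech frame as the last speech frame of the target time window), the following helper returns the frame indices of the window; the 1-based indexing and the function name are assumptions.

```python
def target_window_indices(current_frame: int, window_size: int) -> range:
    """Indices (1-based) of the speech frames in the target time window,
    i.e. the window_size most recent frames ending at the current frame."""
    first = max(1, current_frame - window_size + 1)
    return range(first, current_frame + 1)

# With the 200th speech frame as the current frame and a window size of 100,
# the window covers the 100 most recent frames ending at frame 200.
window = target_window_indices(200, 100)
print(len(window), window[-1])   # 100 200
```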
As another example, the target time window is described with reference to an illustration. Please refer to fig. 3, which is a schematic diagram of the effect of the target time window provided in the embodiment of the present application. As shown in (1) in fig. 3, each speech frame in the received voice data may be represented as one square block. If the gray square block shown as 301 in fig. 3 is determined as the current speech frame and the size of the preset target time window is 8 speech frames, the 8 speech frames up to and including 301 may be determined as the target time window corresponding to 301 (as shown by 302 in fig. 3). As voice data is continuously input, if no hit command word is detected based on the time window shown by 302, a new current speech frame may be determined based on a sliding window. For example, when the sliding step is 1, the speech frame next to the speech frame shown by 301 may be determined as the new current speech frame (as shown by 303 in (2) of fig. 3), so that the 8 speech frames up to and including 303 may be determined as the target time window corresponding to 303 (as shown by 304 in fig. 3), and so on, thereby realizing detection of command words in the continuously input voice data.
The audio features respectively corresponding to the speech data of the K speech frames in the target time window are obtained, where a corresponding audio feature can be determined for the speech data of each speech frame. In one possible implementation, the audio feature may be an FBank feature (an audio feature of speech data). Specifically, the voice data of one speech frame is a time-domain signal; to obtain the FBank feature corresponding to that speech frame, the time-domain signal of the voice data of the speech frame may be converted into a frequency-domain signal through Fourier transform, and the corresponding FBank feature is then determined based on the calculated frequency-domain signal, which is not described herein again. It will be appreciated that the audio feature may also be a feature determined in other ways, such as an MFCC feature (an audio feature of speech data), which is not limited herein.
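A simplified sketch of FBank extraction for one speech frame is given below: the time-domain signal is Fourier-transformed, the power spectrum is taken, and a mel filterbank is applied. The use of librosa for the mel filters, the sample rate, and the number of filters are assumptions; a real implementation would normally also apply pre-emphasis and windowing.

```python
import numpy as np
import librosa

def fbank_feature(frame: np.ndarray, sample_rate: int = 16000, n_mels: int = 40) -> np.ndarray:
    """Return log mel filterbank (FBank) energies for one speech frame."""
    n_fft = len(frame)
    power_spectrum = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2       # frequency-domain signal
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ power_spectrum + 1e-10)                  # FBank feature vector
```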
S202, determining a first command word hit by the voice data of the target time window in the command word set according to the audio characteristics corresponding to the voice data of the K voice frames respectively.
The first command word refers to a command word hit by the voice data in the target time window, and is also called a command word hit by the target time window, and the first command word belongs to the command word set. It can be understood that the premise of determining the first command word hit by the voice data of the target time window in the command word set is that the voice data of the target time window has the hit command word in the command word set, and if the voice data of the target time window does not have the hit command word in the command word set, the first command word hit by the voice data of the target time window in the command word set cannot be determined. The voice data of the target time window is short for the voice data of the K voice frames in the target time window, for example, a command word hit by the voice data of the target time window in the command word set may refer to a command word hit by the voice data of the K voice frames of the target time window in the command word set; the command word hit by the voice data of the target time window in the command word set can also be briefly described as the command word hit by the target time window in the command word set.
As described above, the command word set includes at least one command word, and any command word in the command word set may have a plurality of syllables. A syllable is the most natural phonetic unit perceived by hearing, and is formed by combining one or several phonemes according to certain rules. In Mandarin, except for individual cases, one Chinese character corresponds to one syllable; for example, the command word "turn on the air conditioner" includes 4 syllables.
In a possible implementation manner, when a first command word hit by the voice data of the target time window in the command word set is determined, a first confidence coefficient corresponding to the voice data of the target time window and each command word may be determined according to audio features corresponding to the voice data of the K voice frames, and then the hit first command word is determined based on the first confidence coefficient corresponding to each command word. Here, each command word herein refers to each command word in the above-mentioned command word set. The first confidence level may characterize a likelihood that the speech data of the target time window is each command word, and each command word may have a corresponding first confidence level.
Specifically, determining the hit first command word based on the first confidence corresponding to each command word may be: if command words with the first confidence degree larger than or equal to the first threshold exist in the command word set, determining the command words with the first confidence degree larger than or equal to the first threshold as the first command words hit by the voice data of the target time window in the command word set. The first threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable first threshold may be set to determine the first command word. Optionally, for better performance, different first thresholds may be set for command words of different lengths, thereby balancing the detection rate and the false detection rate for command words of different command lengths. It is to be understood that, if there are a plurality of first confidence levels that are greater than or equal to the first threshold, the command word corresponding to each first confidence level that is greater than or equal to the first threshold may be determined as the first command word, that is, the number of the first command words may be multiple.
If no command word with a first confidence greater than or equal to the first threshold exists in the command word set, this indicates that the voice data of the target time window does not hit any command word.
For example, if the command word set includes command word 1, command word 2, command word 3, and command word 4, a first confidence corresponding to each command word is obtained according to the audio features of the K speech frames in the target time window, where the first confidence corresponding to command word 1 is 0.3, the first confidence corresponding to command word 2 is 0.75, the first confidence corresponding to command word 3 is 0.45, and the first confidence corresponding to command word 4 is 0.66. If the first threshold is 0.6, there are command words in the command word set whose first confidence is greater than or equal to the first threshold, namely command word 2 and command word 4, so command word 2 and command word 4 are the first command words hit by the voice data of the target time window in the command word set.
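The threshold comparison in this example could be sketched as follows; the dictionary of confidences reuses the example values above, and the function name is an assumption (a per-length threshold could be substituted for the single first threshold).

```python
def select_first_command_words(first_confidences: dict, first_threshold: float = 0.6) -> list:
    """Return every command word whose first confidence reaches the first threshold."""
    return [word for word, conf in first_confidences.items() if conf >= first_threshold]

first_confidences = {"command word 1": 0.3, "command word 2": 0.75,
                     "command word 3": 0.45, "command word 4": 0.66}
print(select_first_command_words(first_confidences))   # ['command word 2', 'command word 4']
```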
In a possible implementation manner, if the speech data of the target time window does not hit any command word, the subsequent operations may not be performed; instead, the target time window corresponding to a new current speech frame is determined, and whether the audio data of the new target time window hits a command word is then detected, and so on, so as to detect whether the audio data of the target time window corresponding to each speech frame hits a command word. In addition, when it is detected that the target time window does not hit any command word, the subsequent secondary verification step is simply not executed, which improves data processing efficiency.
S203, determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics corresponding to the voice data of the plurality of voice frames in the characteristic time window.
The characteristic time window may be a time window for performing secondary verification on the command word, and the characteristic time window may include a plurality of speech frames. The characteristic time window and the target time window may include overlapping speech frames; the speech frames they include may or may not be completely the same, which is not limited herein. The audio features respectively corresponding to the speech data of the multiple speech frames in the characteristic time window may be determined for the speech data of each speech frame, and the audio features may be FBank features.
It can be understood that the step S203 is executed on the premise that the command word hit in the voice data of the target time window is detected, which is equivalent to determining a new time window (i.e. the characteristic time window) after detecting that the voice data of the target time window hits the first command word, so as to implement secondary verification through the characteristic time window, and improve the accuracy of detecting the command word.
In a possible embodiment, the range of the characteristic time window associated with the current speech frame needs to cover, as much as possible, the speech frames of the first command word in the speech data, so the first number of speech frames preceding the current speech frame can be determined based on the command word length (length for short) of the first command word, and the characteristic time window is determined according to the first number of speech frames preceding the current speech frame. Here, the command word length refers to the number of syllables in the command word. For a common Chinese command word, one character corresponds to one syllable; for example, the command word "turn on air conditioner" includes four characters, corresponding to 4 syllables, that is, the command word length is 4. It is understood that, if there are multiple first command words, the number of speech frames included in the characteristic time window may be determined based on the command word length of the first command word with the largest length.
Specifically, the characteristic time window may be determined according to the command word length of the first command word and the target preset value. The method specifically comprises the following steps:
Determining a first number according to the command word length of the first command word and a target preset value. The target preset value may be a preset value: because of factors such as speaking speed, the pronunciation of one character (one syllable) may span multiple speech frames, and the number of speech frames occupied by a command word is generally greater than or equal to the number of syllables of the command word. The first number can therefore be determined with the help of the target preset value, so that the size of the characteristic time window covers, as much as possible, the speech frames involved by the first command word. In a possible implementation manner, the first number may be obtained by multiplying the command word length of the first command word by the target preset value, so that the number of speech frames included in the obtained characteristic time window is the first number. For example, if the length of the first command word is 4 and the target preset value is 25, the first number may be 4 × 25 = 100, that is, the characteristic time window includes 100 speech frames.
Determining the characteristic time window associated with the current speech frame according to the first number of speech frames before the current speech frame. Here, the first number of speech frames before the current speech frame includes the current speech frame; that is, the current speech frame is used as the last frame of the characteristic time window. For example, the continuously input speech data includes the 1st, 2nd, 3rd, ..., nth speech frames. If the current speech frame is the 120th speech frame and the first number is 100, the time window of size 100 with the 120th speech frame as its last speech frame may be determined as the characteristic time window associated with the 120th speech frame; that is, the 100 speech frames before the 120th speech frame (the 20th to 120th speech frames) are determined as the speech frames in the characteristic time window associated with the 120th speech frame.
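A minimal sketch of this computation is given below: the first number is the command word length multiplied by the target preset value, and the characteristic time window is the corresponding number of most recent speech frames ending at the current frame. The value of 25 frames per syllable follows the example above; names are assumptions.

```python
def characteristic_window(current_frame: int, command_word_length: int,
                          target_preset: int = 25) -> range:
    """Frame indices (1-based) of the characteristic time window ending at the current frame."""
    first_number = command_word_length * target_preset      # e.g. 4 * 25 = 100
    first = max(1, current_frame - first_number + 1)
    return range(first, current_frame + 1)

# Current frame 120 and a first command word of length 4 give a window of the
# 100 most recent speech frames ending at the 120th frame.
window = characteristic_window(120, 4)
print(len(window), window[-1])   # 100 120
```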
In a possible implementation manner, the command word set includes command words with different command word lengths, and there are situations such as command words with the same prefix or easily confused similar words. For example, "turn on heating" and "turn on heating mode" are two command words with the same prefix but different indicated operations. In the actual processing, since the voice data is input frame by frame, it is likely that when the current speech frame is the speech immediately after "turn on heating" has been input, the command word "turn on heating" is hit based on the target time window corresponding to the current speech frame, while the command word actually intended to be triggered is "turn on heating mode". Therefore, speech frames after "turn on heating" can also be included in the characteristic time window; that is, when the characteristic time window associated with the current speech frame is determined, speech frames after the current speech frame can also be determined as speech frames in the characteristic time window, so that more accurate command word detection is performed. In other words, a delay-waiting strategy is introduced when the characteristic time window is determined, which avoids the early false recognition that may occur when the command word is determined only through the target time window.
Specifically, determining the characteristic time window associated with the current speech frame based on the command word length of the first command word may include the following steps. First, determine a first number according to the command word length of the first command word and a target preset value; this step may refer to the above description, and is not described herein again. Second, determine the characteristic time window associated with the current speech frame according to the first number of speech frames before the current speech frame and a second number of speech frames after the current speech frame. The first number of speech frames before the current speech frame includes the current speech frame, and the second number of speech frames after the current speech frame also includes the current speech frame, but the plurality of speech frames in the characteristic time window only include the current speech frame once. The second number may be a preset numerical value, for example an empirical value, or the second number may be determined according to the command word lengths of the longest command word in the command word set and the first command word. Specifically, the length difference may be obtained by subtracting the command word length of the first command word from the command word length of the longest command word, and the length difference may then be multiplied by the target preset value to obtain the second number. For example, if the command word length of the longest command word is 8 and the command word length of the first command word is 5, the length difference is 8 - 5 = 3; if the target preset value is 25, then 3 × 25 = 75, and the second number may be 75. An example of how to determine the characteristic time window is given here: the continuously input speech data includes n speech frames, and if the current speech frame is the 120th speech frame, the first number is 100, and the second number is 75, then the 100 speech frames before the 120th speech frame (the 20th to 120th speech frames) and the 75 speech frames after the 120th speech frame (the 120th to 195th speech frames) may be determined as the speech frames in the characteristic time window associated with the 120th speech frame, that is, the speech frames in the characteristic time window include the 20th to 195th speech frames.
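The delay-waiting variant could be sketched as follows, with the second number derived from the length gap between the longest command word and the first command word; all names are assumptions.

```python
def characteristic_window_with_delay(current_frame: int, first_number: int,
                                     second_number: int) -> range:
    """Characteristic time window covering first_number frames up to and including
    the current frame plus second_number frames after it (1-based indices)."""
    first = max(1, current_frame - first_number + 1)
    return range(first, current_frame + second_number + 1)

# Example values from above: first number 100, second number (8 - 5) * 25 = 75,
# current frame 120; the window then runs roughly from frame 21 to frame 195.
second_number = (8 - 5) * 25
window = characteristic_window_with_delay(120, 100, second_number)
print(window[0], window[-1])   # 21 195
```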
In a possible implementation manner, when the electronic device receives continuously input voice data, the audio feature of each speech frame can be extracted and cached in a storage area; after the characteristic time window is determined, the audio features corresponding to the speech frames in the characteristic time window can be read directly from the storage area, so that the audio features of the speech frames do not need to be repeatedly calculated, which can improve the efficiency of data processing. It can be understood that the number of audio features cached in the storage area can be determined according to the number of speech frames in the maximum characteristic time window, so that it can be ensured that, for the characteristic time window determined based on any first command word, the audio features of the speech frames in that characteristic time window can be quickly acquired from the storage area. The maximum characteristic time window may be the characteristic time window determined based on the command word length of the command word with the largest length among the command words. It will be appreciated that, in order to avoid caching too much data, as voice data is input, the audio feature of the speech frame with the earliest input time can be deleted each time a new speech frame is input, thereby avoiding waste of storage space.
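A hedged sketch of such a feature cache is shown below, sized by the number of speech frames in the maximum characteristic time window; the use of collections.deque (which automatically discards the earliest-input feature) is an assumption about the data structure, not part of the embodiments.

```python
from collections import deque

class FeatureCache:
    """Caches per-frame audio features, bounded by the maximum characteristic time window."""

    def __init__(self, max_window_frames: int):
        self._features = deque(maxlen=max_window_frames)

    def push(self, frame_feature) -> None:
        """Cache the feature of the newest speech frame; once full, the feature
        of the earliest-input speech frame is dropped automatically."""
        self._features.append(frame_feature)

    def last_n(self, n: int) -> list:
        """Audio features of the last n speech frames, e.g. for a characteristic time window."""
        return list(self._features)[-n:]
```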
And S204, determining second command words hit by the voice data of the characteristic time window in the command word set based on the audio characteristics respectively corresponding to the voice data of the plurality of voice frames in the characteristic time window.
The second command word refers to a command word hit by the voice data in the characteristic time window, and the second command word belongs to the command word set. Optionally, after the second command word is determined, the operation indicated by the second command word may be performed. The voice data of the characteristic time window is short for the voice data of the voice frames in the characteristic time window, for example, a command word hit by the voice data of the characteristic time window in the command word set may refer to a command word hit by the voice data of a plurality of voice frames of the characteristic time window in the command word set; the second command word hit by the speech data of the characteristic time window in the command word set can also be briefly described as the second command word hit by the characteristic time window in the command word set.
It can be understood that the second command word hit by the voice data of the characteristic time window in the command word set is determined on the premise that the hit command word exists in the command word set in the voice data of the characteristic time window, and if the hit command word does not exist in the command word set in the voice data of the characteristic time window, the second command word hit by the voice data of the characteristic time window in the command word set cannot be determined. Therefore, whether the voice data of the characteristic time window hits the command word in the command word set or not can be detected, which is equivalent to performing secondary verification on whether the command word hits in the continuously input voice data, so that the detection result of the voice data of the characteristic time window is used as a final detection result, and if a second command word hitting in the command word set is detected, the operation indicated by the second command word is executed. For example, if the second command word "turn on heating" hit by the voice data of the characteristic time window is detected, an operation of turning on heating may be performed.
In one possible embodiment, if the speech data of the characteristic time window does not hit a second command word in the command word set, no operation may be performed. The target time window of a new current speech frame may then be determined, and the above steps are repeated until it is determined, based on the audio features of the speech frames in the characteristic time window associated with the new current speech frame, whether the speech data of that characteristic time window hits a command word in the command word set, and so on, thereby realizing detection for the time window corresponding to each speech frame.
In a possible implementation manner, when the second command word hit in the characteristic time window is detected, the second command word may be further used for other purposes, such as training other models by the extracted command word, storing the extracted command word, and the like, which is not limited herein.
In one possible implementation, a command word may further include information such as time information and place information, so that the corresponding operation may be performed according to the detected second command word at the time indicated by its time information or at the place indicated by its place information. For example, when it is detected that the hit command word is "turn on the air conditioner at 10 o'clock", where 10 o'clock is the time information of the command word, the operation of turning on the air conditioner may be performed at 10 o'clock.
An example of how command word detection for voice data may be implemented is set forth here. First, voice data can be received, a target time window corresponding to the current speech frame in the received voice data is determined, and whether the target time window hits a command word in the command word set is then determined; specifically, this determination can be made through the audio features of the voice data of each speech frame in the target time window. If the target time window does not hit a command word in the command word set, no operation is performed, and the target time window of a new current speech frame is determined. If the target time window hits a command word in the command word set, secondary verification is performed: specifically, the characteristic time window associated with the current speech frame is determined, and whether the characteristic time window hits a command word in the command word set is then determined. If the characteristic time window does not hit a command word in the command word set, no operation is performed, and the target time window of a new current speech frame is determined; if the characteristic time window hits a command word in the command word set, the operation indicated by the hit command word is performed. Secondary verification is thus achieved by determining the characteristic time window, which improves the accuracy of detection of command words in the voice data, as sketched below.
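For illustration, the overall two-stage flow can be sketched as follows. The window and detection helpers are passed in as callables because their internals are described elsewhere in this application; every name here is an assumption, not a disclosed interface.

```python
from typing import Callable, Optional, Sequence

def process_current_frame(
    current_frame: int,
    command_word_set: Sequence[str],
    target_window_for: Callable[[int], range],
    characteristic_window_for: Callable[[int, str], range],
    detect_in_window: Callable[[range, Sequence[str]], Optional[str]],
    execute: Callable[[str], None],
) -> Optional[str]:
    target_window = target_window_for(current_frame)                       # S201
    first_word = detect_in_window(target_window, command_word_set)         # S202
    if first_word is None:
        return None            # no hit: move on to the next current speech frame
    feature_window = characteristic_window_for(current_frame, first_word)  # S203
    second_word = detect_in_window(feature_window, command_word_set)       # S204
    if second_word is None:
        return None            # secondary verification failed: perform no operation
    execute(second_word)       # perform the operation indicated by the second command word
    return second_word
```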
In a possible scenario, the method and the device can be applied to detecting whether the received voice data hits the command word or not under the condition that the electronic device is already awakened. That is, after the electronic device has been awakened by the voice initiating object through the awakening word, the hit command word is detected based on the received voice data.
In a possible scenario, the present application may also be applied to a scenario in which the electronic device does not need to be woken up, that is, the electronic device directly determines whether a command word is hit according to the received voice data without being woken up by a wake-up word, which is equivalent to waking up the electronic device and executing the operation indicated by the command word when it is detected that the received voice data hits a command word in the command word set. The command words in the command word set are preset, the electronic device is triggered to execute the corresponding operation only when the voice data contains a command word, and the accuracy of command word detection is high, so the voice initiating object can quickly instruct the electronic device to execute the corresponding operation through a voice instruction, without first waking up the device and then issuing the instruction. It can be understood that, in order to reduce the false recognition rate of the command words, some less common words may be used in the command words of the preset command word set, or less common word groups may be added to the command words, thereby greatly improving the interactive experience.
The embodiment of the application provides a data processing scheme. According to the audio features respectively corresponding to the voice data of K voice frames in a target time window, a first command word hit by the voice data of the target time window in a command word set can be determined, which is equivalent to preliminarily determining the command word contained in the continuously input voice data. A characteristic time window is then determined based on feature information of the first command word, such as the command word length, so that a second command word hit by the voice data of the characteristic time window in the command word set is determined according to the audio features respectively corresponding to the voice data of a plurality of voice frames in the characteristic time window, which is equivalent to determining a new time window and performing secondary verification on whether the continuously input voice data contains the command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. Therefore, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to perform secondary verification on whether the voice data contains the command word, so that the accuracy of detecting the command word in the voice data can be improved.
Referring to fig. 4, fig. 4 is a schematic flow chart of another data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S401, determining a target time window corresponding to the current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window.
The related description of step S401 may refer to the related description of step S201, and is not described herein again.
S402, determining a first command word hit by the voice data of the target time window in the command word set according to the audio characteristics corresponding to the voice data of the K voice frames.
In one possible embodiment, each command word in the command word set has a corresponding syllable identification sequence, which refers to a sequence composed of the syllable identifications of the syllables that the command word has and can be used to characterize those syllables. In a possible embodiment, the syllable identification sequence of each command word may be determined through a pronunciation dictionary, which is a pre-processed dictionary that may include a mapping relationship between each character in the command words and the syllable identification of its syllable, so that the syllable identification of each syllable of each command word, that is, the syllables of the command word, may be determined according to the pronunciation dictionary. It will be appreciated that different characters may have the same syllable; for example, the command words "play song" and "cancel heating" both include the syllable "qu".
In one possible implementation, as described above, the command word set includes at least one command word, each command word having a plurality of syllables, then step S402 may include the following steps:
Determining, according to the audio features respectively corresponding to the voice data of the K speech frames, the probability that each of the K speech frames corresponds to each syllable output unit in a syllable output unit set, where the syllable output unit set is determined based on the plurality of syllables that each command word has, and different syllable output units correspond to different syllables. The syllable output unit set refers to a set of classification items by which the syllable corresponding to the speech data of each speech frame can be classified, and the set includes a plurality of syllable output units. For example, if the syllable output unit set includes syllable output units A, B, and C, the speech data of each speech frame can be classified into A, B, or C, so that the probabilities that the K speech frames correspond to the syllable output units A, B, and C can be determined. The syllable output unit set determined based on the plurality of syllables that each command word has may be determined based on the syllable identifications of those syllables; specifically, the union of the syllable identifications of the syllables that each command word has is determined, and each syllable identification in the union corresponds to one syllable output unit.
In one embodiment, the syllable output unit set further includes a garbage syllable output unit, so that in the subsequent classification process, syllables that do not belong to the syllables of the command words in the command word set can be classified into the garbage syllable output unit. For example, the command word set includes command word 1, command word 2, and command word 3; the syllable identifications of the syllables of command word 1 are s1, s2, s3, and s4; the syllable identifications of the syllables of command word 2 are s1, s4, s5, and s6; and the syllable identifications of the syllables of command word 3 are s7, s2, s3, and s1. The syllable identifications corresponding to the syllables of command words 1 to 3 can thus be determined as s1, s2, s3, s4, s5, s6, and s7, and the syllable output units corresponding to these syllables, together with the garbage syllable output unit, are determined as the syllable output unit set.
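A minimal sketch of assembling the syllable output unit set is given below; the toy pronunciation dictionary simply mirrors the s1 to s7 identifiers of the example, and the '<garbage>' label for the garbage syllable output unit is an assumption.

```python
pronunciation_dict = {   # toy mapping from command words to syllable identifications
    "command word 1": ["s1", "s2", "s3", "s4"],
    "command word 2": ["s1", "s4", "s5", "s6"],
    "command word 3": ["s7", "s2", "s3", "s1"],
}

def build_syllable_output_units(command_words, dictionary) -> list:
    """Union of the syllable identifications of all command words, plus one garbage unit."""
    units = set()
    for word in command_words:
        units.update(dictionary[word])
    return sorted(units) + ["<garbage>"]

print(build_syllable_output_units(pronunciation_dict.keys(), pronunciation_dict))
# ['s1', 's2', 's3', 's4', 's5', 's6', 's7', '<garbage>']
```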
Determining, according to the probability that each of the K speech frames corresponds to each syllable output unit, the first confidence corresponding to the voice data of the target time window and each command word. The first confidence of any command word can be obtained from the product of the maximum probabilities corresponding to the syllables that the command word has; that is, for each syllable of the command word, the maximum probability over the K speech frames is taken, and the first confidence is determined according to the product of these maximum probabilities.
And if the command word set has the command word with the first confidence coefficient larger than or equal to the first threshold value, determining the command word with the first confidence coefficient larger than or equal to the first threshold value as the first command word hit by the voice data of the target time window in the command word set. This step can refer to the above description, and is not described herein again.
In a possible implementation manner, any command word in the command word set may be denoted as a target command word; then, determining the first confidence corresponding to the voice data of the target time window and each command word according to the probability that each of the K speech frames corresponds to each syllable output unit may specifically include the following steps. First, determine the syllable output unit corresponding to each syllable of the target command word as a target syllable output unit, so as to obtain a plurality of target syllable output units corresponding to the target command word. The target syllable output units are the syllable output units corresponding to the syllables of the target command word, and can be determined through the syllable identification sequence of the target command word: since each syllable output unit has a corresponding syllable, the target syllable output units can be determined from the plurality of syllable output units through the syllables in the syllable identification sequence. For example, if the target command word is "turn on heating", the syllable identifications of the syllables included in the target command word are determined from the pronunciation dictionary as s1, s2, s3, and s4 (which may be referred to as the syllable identification sequence of the target command word), and the syllable output units corresponding to s1, s2, s3, and s4 can be collectively determined as the target syllable output units according to the syllable identification sequence.
Secondly, determine, from the probabilities that the K speech frames respectively correspond to each syllable output unit, the probabilities that the K speech frames correspond to each target syllable output unit, so as to obtain K candidate probabilities corresponding to each target syllable output unit. Here, a candidate probability is the probability that one of the speech frames corresponds to a target syllable output unit. For example, if the target syllable output units are the syllable output units corresponding to s1, s2, s3, and s4 (referred to here as syllable output units s1, s2, s3, and s4), the probabilities that the K speech frames correspond to s1, to s2, to s3, and to s4 can be determined respectively; that is, a total of K × 4 candidate probabilities are obtained.
Thirdly, determine, from the K candidate probabilities corresponding to each target syllable output unit, the maximum candidate probability corresponding to that target syllable output unit, and determine the first confidence corresponding to the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit. Specifically, the first confidence corresponding to the voice data of the target time window and the target command word may be determined through the product of the maximum candidate probabilities respectively corresponding to the target syllable output units; for example, the product of these maximum candidate probabilities may be directly determined as the first confidence, or the first confidence may be obtained through other mathematical calculations, which is not limited here. For example, the probabilities that s1 corresponds to the K speech frames are {G1_1, G1_2, G1_3, ..., G1_K}, among which the maximum is the probability G1_10 corresponding to the 10th speech frame in the target time window; the probabilities that s2 corresponds to the K speech frames are {G2_1, G2_2, G2_3, ..., G2_K}, among which the maximum is the probability G2_25 corresponding to the 25th speech frame in the target time window; the probabilities that s3 corresponds to the K speech frames are {G3_1, G3_2, G3_3, ..., G3_K}, among which the maximum is the probability G3_34 corresponding to the 34th speech frame in the target time window; and the probabilities that s4 corresponds to the K speech frames are {G4_1, G4_2, G4_3, ..., G4_K}, among which the maximum is the probability G4_39 corresponding to the 39th speech frame in the target time window. The product of G1_10, G2_25, G3_34, and G4_39 can then determine the first confidence corresponding to the voice data of the target time window and the target command word. It is understood that, by performing the above operations on each command word in the command word set, the first confidence corresponding to each command word can be determined.
In one possible embodiment, the first confidence of the speech data of the target time window corresponding to the target command word is determined according to the maximum candidate probability corresponding to each target syllable output unit, and may be calculated by the following formula (Formula 1):

C = \prod_{i=1}^{n-1} \max_{1 \le j \le K} p_{ij}    (Formula 1)

where C represents the first confidence that the audio data of the target time window corresponds to the target command word; n-1 represents the number of target syllable output units corresponding to the target command word, and n represents the total number of target syllable output units and garbage syllable output units; i denotes the i-th target syllable output unit, and j denotes the j-th speech frame of the target time window, so that p_{ij} indicates the probability of the i-th target syllable output unit for the j-th speech frame, and \max_{1 \le j \le K} p_{ij} represents the maximum candidate probability of the i-th target syllable output unit over the K speech frames. The product of the maximum candidate probabilities respectively corresponding to the target syllable output units thus gives, based on Formula 1, the first confidence of the audio data of the target time window corresponding to the target command word.
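A small numerical sketch of Formula 1 follows: the first confidence is the product, over the target syllable output units of the command word, of each unit's maximum probability across the K speech frames. The probability matrix and unit indices are toy assumptions.

```python
import numpy as np

def first_confidence(probs: np.ndarray, target_unit_indices: list) -> float:
    """probs: shape (K, num_syllable_output_units); returns the product of the
    per-unit maximum probabilities over the K speech frames (Formula 1)."""
    max_per_unit = probs[:, target_unit_indices].max(axis=0)
    return float(np.prod(max_per_unit))

# Toy example: K = 4 speech frames, 3 syllable output units,
# and a command word whose target syllable output units are units 0 and 2.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.2, 0.2],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.1, 0.7]])
print(first_confidence(probs, [0, 2]))   # 0.6 * 0.7 = 0.42
```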
In a possible implementation manner, the first command word is determined by a trained primary detection network; how the primary detection network determines the first command word may refer to the above description, and is not described herein again. In one implementation, the trained primary detection network may be divided into an acoustic model and a confidence generation module. The acoustic model is used for executing the step of determining, according to the audio features respectively corresponding to the voice data of the K speech frames, the probability that each of the K speech frames corresponds to each syllable output unit in the syllable output unit set. The acoustic model is typically a deep neural network, such as a DNN model, a CNN model, or an LSTM model (all neural network models), which is not limited here. The confidence generation module may be configured to perform the step of determining, according to the probabilities that the K speech frames correspond to each syllable output unit, the first confidence corresponding to the voice data of the target time window and each command word, and details are not repeated here. Optionally, the dimension of the result output by the primary detection network is the number of command words in the command word set, and each dimension corresponds to the first confidence of one command word.
For example, please refer to fig. 5, which is a schematic diagram of a framework of a primary detection network according to an embodiment of the present application. As shown in fig. 5, the speech data of the K speech frames within the target time window may first be obtained (as shown at 501 in fig. 5), and the audio feature of each speech frame is then determined based on 501 (as shown at 502 in fig. 5) and input into the acoustic model of the trained primary detection network (as shown at 503 in fig. 5). The result output by the acoustic model is then input to the confidence generation module (shown as 504 in fig. 5), so that the confidence generation module, in conjunction with a pronunciation dictionary (shown as 505 in fig. 5), determines the target syllable output units corresponding to the syllables of each command word and then determines the first confidence corresponding to each command word, such as the command word 1 confidence, the command word 2 confidence, ..., the command word m confidence, so as to obtain the first command word hit by the audio data of the target time window. It is to be understood that if no command word has a first confidence greater than or equal to the first threshold, the audio data of the target time window does not hit a first command word.
In a possible implementation manner, before determining the first command word by the trained primary detection network, the training of the primary detection network is required, which may specifically include the following steps:
Firstly, first sample voice data is obtained, where the first sample voice data carries syllable output unit labels. The first sample voice data is used for training the primary detection network; it may be voice data containing a command word, that is, positive sample data, or voice data not containing a command word, that is, negative sample data, so that training on both positive and negative sample data achieves a better training effect. The syllable output unit labels annotate the syllable output unit actually corresponding to each speech frame in the first sample voice data. It can be understood that, if a speech frame in the first sample voice data actually corresponds to a syllable of a command word in the command word set, the syllable output unit actually corresponding to that speech frame is the syllable output unit corresponding to that syllable; if the speech frame does not actually correspond to any syllable of the command words in the command word set, the syllable output unit actually corresponding to that speech frame is the garbage syllable output unit.
And secondly, an initial primary detection network is called to determine a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data. The initial primary detection network also includes an acoustic model, and the predicted syllable output unit can be determined by this acoustic model; specifically, the probability that each voice frame corresponds to each syllable output unit in the syllable output unit set is determined according to the audio features corresponding to the voice data of each voice frame in the first sample voice data, and the predicted syllable output unit is then determined based on these probabilities. The audio features corresponding to the voice data of each voice frame in the first sample voice data are determined in the same manner as the audio features corresponding to each voice frame in the target time window, which is not repeated here.
And thirdly, training is performed based on the predicted syllable output unit and the syllable output unit label respectively corresponding to the voice data of each voice frame in the first sample voice data, to obtain the trained primary detection network. In the training process, the network parameters of the initial primary detection network are adjusted so that the predicted syllable output unit corresponding to each voice frame gradually approaches the actual syllable output unit marked by the syllable output unit label, so that the trained primary detection network can accurately predict the probability that each voice frame corresponds to each syllable output unit. It can be understood that the predicted syllable output unit is determined by the acoustic model in the primary detection network, that is, training the primary detection network amounts to adjusting the model parameters of the acoustic model in the primary detection network.
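As a hedged illustration of this frame-level training, the following sketch uses PyTorch with an assumed small fully-connected acoustic model and a cross-entropy loss over the syllable output unit labels; the architecture, the feature dimension of 40, the learning rate, and the data loader interface are all assumptions rather than details specified by the embodiment.

```python
import torch
import torch.nn as nn

def train_primary_network(first_sample_loader, num_syllable_units, feat_dim=40, epochs=5):
    """Frame-level training sketch for the acoustic model of the primary detection network.

    first_sample_loader yields (features, unit_labels): features is a
    (num_frames, feat_dim) float tensor of per-frame audio features, and
    unit_labels is a (num_frames,) long tensor of syllable output unit labels
    (the garbage syllable output unit included).
    """
    acoustic_model = nn.Sequential(
        nn.Linear(feat_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, num_syllable_units),   # one output per syllable output unit
    )
    optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, unit_labels in first_sample_loader:
            logits = acoustic_model(features)       # predicted syllable output unit scores per frame
            loss = criterion(logits, unit_labels)   # compare with the syllable output unit labels
            optimizer.zero_grad()
            loss.backward()                         # adjust the network parameters so the
            optimizer.step()                        # prediction approaches the label
    return acoustic_model
```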
In a possible implementation manner, if the determination of whether a command word is hit is implemented by a Keyword/Filler HMM model (a command word detection model), the primary detection network may be the Keyword/Filler HMM model. In this case, the probabilities that the K voice frames correspond to each syllable output unit in the syllable output unit set are determined according to the audio features corresponding to the voice data of the K voice frames, an optimal decoding path is then determined based on the probability corresponding to each syllable output unit, and whether the optimal decoding path passes through the HMM path (hidden Markov model path) of a command word is judged to determine whether a command word is hit; alternatively, a confidence corresponding to each HMM path is determined based on the probability corresponding to each syllable output unit to determine whether a command word is hit, which is not limited herein. An HMM path may be a command word HMM path or a filler HMM path, where each command word HMM path may be formed by connecting in series the HMM states corresponding to the plurality of syllables of a command word, and the filler HMM path is formed by a set of well-designed HMM states corresponding to non-command-word pronunciation units. It is thus possible to determine the confidence with respect to each HMM path based on the probability corresponding to each syllable output unit, thereby determining whether a command word is hit, and which command word is hit.
S403, determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics corresponding to the voice data of the plurality of voice frames in the characteristic time window.
The related description of step S403 may refer to the related description of step S203, which is not described herein again.
Optionally, the first number may also be determined in other manners: for example, the first number may be a preset number, or the first number may be determined according to the earliest occurrence time of the first command word in the target time window, which is not limited herein. The characteristic time window associated with the current voice frame is then determined based on the first number.
In a possible implementation manner, as described above, the first number may be a preset number. The preset number should cover the first command word as far as possible, and may be set based on the longest command word length in the command word set. Specifically, the preset number may be determined based on the longest command word length and a target preset value, the preset number is then determined as the first number, and the characteristic time window is determined according to the first number of voice frames before the current voice frame.
In a possible implementation, the first number may also be determined according to the earliest occurrence time of the first command word in the target time window, and determining the characteristic time window may specifically include the following steps. Firstly, a syllable output unit set is obtained, where the syllable output unit set is determined based on the plurality of syllables of each command word, and the syllables corresponding to different syllable output units are different. Secondly, according to the audio features respectively corresponding to the voice data of the K voice frames, the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set is determined. The description of the first and second steps is referred to above and is not repeated here. Thirdly, the syllable output units corresponding to the syllables of the command word hit by the voice data of the target time window are determined as verification syllable output units, and for each verification syllable output unit, the voice frame with the highest probability for that verification syllable output unit among the K voice frames is determined as a target voice frame. A target voice frame is equivalent to a voice frame in which a syllable of the first command word is detected among the K voice frames, that is, the occurrence time of the first command word can be determined. Fourthly, the characteristic time window associated with the current voice frame is determined according to the voice frames between the target voice frame and the current voice frame. Specifically, the characteristic time window associated with the current voice frame may be determined according to the target voice frame having the largest number of voice frames between it and the current voice frame; that is, the target voice frame spaced from the current voice frame by the largest number of voice frames is used to represent the earliest occurrence time of the first command word in the target time window, the first number is the number of voice frames between the current voice frame and this target voice frame, and the voice frames between the current voice frame and this target voice frame are determined as the voice frames in the characteristic time window. It can be understood that the voice frames between the current voice frame and the target voice frame include the current voice frame and the target voice frame. In this way, a more accurate characteristic time window can be determined, and command word detection on the voice data in the characteristic time window is more accurate. For example, the continuously input voice data includes the 1st, 2nd, 3rd, ..., nth voice frames; if the current voice frame is the 120th voice frame and the target voice frame having the largest number of voice frames between it and the current voice frame is the 20th voice frame, the 20th to 120th voice frames are determined as the voice frames in the characteristic time window associated with the 120th voice frame.
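A minimal sketch of this computation, assuming the per-frame syllable output unit probabilities for the target time window are available as a NumPy array and that frame indices are counted over the continuously input voice data, might look as follows; the function and parameter names are illustrative only.

```python
import numpy as np

def feature_time_window(frame_probs, verification_unit_ids, current_frame_idx):
    """Determine the characteristic time window from the earliest occurrence of the
    first command word (sketch).

    frame_probs: (K, U) probabilities of the K frames in the target time window,
                 which ends at current_frame_idx, for each syllable output unit.
    verification_unit_ids: syllable output units of the hit first command word.
    Returns (start_frame_idx, current_frame_idx), both inclusive.
    """
    K = frame_probs.shape[0]
    first_frame_of_window = current_frame_idx - K + 1
    # For each verification syllable output unit, the frame with the highest
    # probability is a target voice frame.
    target_frames = [first_frame_of_window + int(np.argmax(frame_probs[:, u]))
                     for u in verification_unit_ids]
    # The target frame farthest from the current frame marks the earliest occurrence
    # of the first command word; frames from it to the current frame (both included)
    # form the characteristic time window.
    earliest = min(target_frames)
    return earliest, current_frame_idx
```

With the example above, if the current voice frame is the 120th frame and the earliest target voice frame is the 20th, the function would return (20, 120).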
S404, determining a second confidence corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics corresponding to the voice data of the voice frames in the characteristic time window.
Each command word here refers to each command word in the command word set described above. The second confidence may characterize the likelihood that the voice data of the characteristic time window is each command word, and each command word may have a corresponding second confidence.
In one possible implementation, when determining the second confidence level that the speech data of the characteristic time window corresponds to each command word, the second confidence level that the speech data of the characteristic time window corresponds to the garbage class may also be determined, that is, the probability that the speech data of the characteristic time window is not a command word is characterized by the second confidence level that the speech data of the characteristic time window corresponds to the garbage class.
S405, if the command word with the second confidence degree larger than or equal to the second threshold exists in the command word set, determining the command word with the second confidence degree larger than or equal to the second threshold and the maximum second confidence degree as the second command word hit by the voice data of the characteristic time window in the command word set, and executing the operation indicated by the second command word.
The second threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable second threshold may be set to determine the second command word. It is to be understood that if there is no command word in the command word set whose second confidence is greater than or equal to the second threshold, it is determined that there is no hit second command word in the command word set by the speech data of the characteristic time window. Optionally, after the second command word is determined, the operation indicated by the second command word may be performed.
In a possible implementation manner, if a second confidence corresponding to the garbage class is determined when the second confidences are determined, the maximum second confidence may be determined among the second confidences other than the second confidence corresponding to the garbage class. If this maximum second confidence is greater than or equal to the second threshold, the command word corresponding to it is determined as the hit second command word; if this maximum second confidence is less than the second threshold, the voice data of the characteristic time window is classified into the garbage class, that is, the voice data of the characteristic time window has no hit second command word in the command word set.
For example, if the command word set includes command word 1, command word 2, command word 3, and command word 4, the second confidence corresponding to each command word is obtained based on the audio features in the characteristic time window, where the second confidence corresponding to command word 1 is 0.3, the second confidence corresponding to command word 2 is 0.73, the second confidence corresponding to command word 3 is 0.42, the second confidence corresponding to command word 4 is 0.58, and the second confidence corresponding to the garbage class is 0.61. If the preset second threshold is 0.6, a command word with a second confidence greater than or equal to the second threshold, namely command word 2, exists in the command word set, and command word 2 is the second command word hit by the voice data of the characteristic time window in the command word set, that is, the input voice data hits command word 2, so that the operation indicated by command word 2 can be executed. If the preset second threshold is 0.75, no command word with a second confidence greater than or equal to the second threshold exists in the command word set, and it is determined that the voice data of the characteristic time window has no hit command word in the command word set, that is, the voice data of the characteristic time window is classified into the garbage class; a new current voice frame is then determined, and the above steps are repeated, so as to implement command word detection.
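The selection logic of this example can be sketched as follows; the dictionary keys, the "garbage" label, and the function name are illustrative assumptions, not identifiers defined by the embodiment.

```python
def pick_second_command_word(second_confidences, second_threshold):
    """Sketch of steps S404-S405: second_confidences maps each command word,
    plus the key "garbage", to its second confidence."""
    candidates = {w: c for w, c in second_confidences.items() if w != "garbage"}
    best_word = max(candidates, key=candidates.get)
    if candidates[best_word] >= second_threshold:
        return best_word    # hit: the operation indicated by this command word can be executed
    return None             # classified into the garbage class, no hit second command word

second_confidences = {"command word 1": 0.3, "command word 2": 0.73,
                      "command word 3": 0.42, "command word 4": 0.58,
                      "garbage": 0.61}
print(pick_second_command_word(second_confidences, 0.6))    # command word 2
print(pick_second_command_word(second_confidences, 0.75))   # None
```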
In a possible embodiment, the second command word is determined by a trained secondary detection network, and the secondary detection network may be a deep neural network, such as a CLDNN model (a neural network model). How the secondary detection network determines the second command word may refer to the related description of steps S404-S405, which is not repeated here. In one implementation, when the secondary detection network is called to determine the hit second command word according to the audio features corresponding to the voice data of the voice frames in the characteristic time window, the voice data of the voice frames in the characteristic time window may be input sequentially, so as to obtain the second confidence corresponding to the voice data of the characteristic time window and each command word. Optionally, the dimension of the result output by the secondary detection network is the number of command words in the command word set plus 1, where the added 1 is the dimension of the second confidence corresponding to the garbage class.
In a possible implementation manner, before determining the second command word through the trained secondary detection network, the training of the secondary detection network is required, which may specifically include the following steps:
Firstly, second sample voice data is obtained, and the second sample voice data carries a command word label. The second sample voice data may be positive sample data or negative sample data. The positive sample data may be audio data in a characteristic time window determined based on the trained primary detection network. The negative sample data may be voice data containing various non-command words. The negative sample data may also be audio data with interference noise, such as audio data with added music or television noise, or synthesized or real audio data in various far-field environments, so that the accuracy of command word detection in far-field or noisy environments can be improved. It can be understood that, in the training process of the primary detection network, the adopted negative sample data does not include audio data with various interference noises, because training the primary detection network with such audio data would rather degrade its classification effect on the syllable output units; therefore, the secondary detection network is trained with audio data carrying interference noise, which improves the accuracy of command word detection under conditions with more interference factors, effectively compensates for the shortcoming of the primary detection network, and gives the secondary detection network good complementarity to the primary detection network. The command word label labels the command word actually corresponding to the second sample voice data; it can be understood that, if the second sample voice data actually has a corresponding command word, the command word label labels that command word, and if the second sample voice data has no corresponding command word, the command word label labels that the second sample voice data actually belongs to the garbage class.
And secondly, an initial secondary detection network is called to determine a predicted command word corresponding to the second sample voice data. Specifically, a second confidence corresponding to the second sample voice data and each command word is determined according to the audio features corresponding to the voice data of each voice frame in the second sample voice data, and the predicted command word corresponding to the second sample voice data is then determined based on the second confidence corresponding to each command word. The audio features corresponding to the voice data of each voice frame in the second sample voice data are calculated in the same manner as the audio features corresponding to the voice frames in the target time window, which is not repeated here.
And thirdly, training is performed based on the predicted command word and the command word label to obtain the trained secondary detection network. In the training process, the network parameters of the initial secondary detection network are adjusted so that the predicted command word corresponding to the second sample voice data gradually approaches the actual command word marked by the command word label, so that the trained secondary detection network can accurately predict the command word corresponding to the voice data in each characteristic time window.
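As an illustrative counterpart to the primary-network training sketch above, the following assumes a small GRU-based classifier standing in for the CLDNN-style secondary detection network, trained with utterance-level command word labels where one extra class represents the garbage category; the architecture, hyper-parameters, and data loader interface are assumptions.

```python
import torch
import torch.nn as nn

def train_secondary_network(second_sample_loader, num_command_words, feat_dim=40, epochs=5):
    """Utterance-level training sketch for the secondary detection network.

    second_sample_loader yields (features, word_label): features is a
    (num_frames, feat_dim) float tensor for one characteristic time window and
    word_label is a scalar long tensor in [0, num_command_words], where index
    num_command_words stands for the garbage class.
    """
    encoder = nn.GRU(feat_dim, 128, batch_first=True)
    classifier = nn.Linear(128, num_command_words + 1)    # +1 dimension for the garbage class
    params = list(encoder.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, word_label in second_sample_loader:
            _, hidden = encoder(features.unsqueeze(0))     # feed the window frame by frame
            logits = classifier(hidden[-1])                # one prediction per characteristic time window
            loss = criterion(logits, word_label.view(1))   # compare with the command word label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder, classifier
```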
Optionally, the method may further determine a third confidence corresponding to the voice data of the characteristic time window and each command word based on the first confidence corresponding to each command word and the second confidence corresponding to each command word, and further determine, based on the third confidence, the second command word hit by the voice data of the characteristic time window. If a command word whose third confidence is greater than or equal to a preset threshold exists in the command word set, the command word whose third confidence is greater than or equal to the preset threshold and is the largest is determined as the second command word hit by the voice data of the characteristic time window in the command word set. If no command word with a third confidence greater than or equal to the preset threshold exists in the command word set, no operation is executed and the target time window of a new current voice frame is determined. In this way, the finally hit command word is determined by combining the first confidence and the second confidence, which can improve the accuracy of command word detection.
In a possible implementation manner, determining, based on the first confidence corresponding to each command word and the second confidence corresponding to each command word, the third confidence corresponding to the voice data of the characteristic time window and each command word may specifically be: splicing the first confidence corresponding to each command word and the second confidence corresponding to each command word to obtain a verification feature, and further determining, based on the verification feature, the third confidence corresponding to the voice data of the characteristic time window and each command word. Determining the third confidence based on the verification feature may be performed by a trained neural network, such as a simple multi-layer DNN network (a neural network model).
In a possible embodiment, the third confidence corresponding to the voice data of the characteristic time window and each command word may alternatively be obtained by performing a mathematical calculation on the first confidence and the second confidence corresponding to each command word; for example, the third confidence corresponding to a command word may be determined based on the average or weighted average of its first confidence and second confidence. Optionally, since the voice frames covered by the characteristic time window may be more accurate, a higher weight may be assigned to the second confidence when determining the weighted average of the first confidence and the second confidence.
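A weighted-average fusion of this kind might be sketched as follows; the weight of 0.7 on the second confidence is only an assumed example of giving it the higher weight, and the function name and the "garbage" key are illustrative.

```python
def third_confidences(first_conf, second_conf, w_second=0.7):
    """Fuse the two stages per command word by weighted average (sketch).
    first_conf and second_conf map command words to their first/second confidences;
    a possible "garbage" entry in second_conf is ignored here."""
    return {word: (1.0 - w_second) * first_conf.get(word, 0.0) + w_second * c
            for word, c in second_conf.items() if word != "garbage"}
```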
It can be understood that, because electronic devices that need to detect instructions in voice data generally have low hardware configurations in terms of the CPU (central processing unit), memory, flash memory and the like, there are relatively strict requirements on the resource occupation of each function. In the present application, command word detection in voice data is mainly performed by the trained primary detection network and secondary detection network, whose network structures are relatively simple, so that the resource occupation on the electronic device is relatively small while the command word detection performance can be effectively improved. In contrast, recognizing the full content of the received voice data based on general speech recognition technology requires a larger-scale acoustic model and language model to achieve a good recognition effect, that is, a good recognition effect can be achieved only by occupying more device resources.
How command word detection on voice data is implemented through secondary verification is explained below with an example; please refer to fig. 6, which is a framework diagram of a data processing method provided by an embodiment of the present application. As shown in fig. 6, the flow of the entire data processing method may be abstracted into a first-level verification (shown as 601 in fig. 6) and a second-level verification (shown as 602 in fig. 6). Voice data is input into the first-level verification, which may specifically include determining, by the trained primary detection network, the first confidence of each command word based on the audio features of the voice data in the target time window corresponding to the current voice frame, and then performing threshold judgment to determine the hit first command word. The characteristic time window is then determined based on the first command word, and the audio features of the voice data in the characteristic time window associated with the current voice frame are obtained. The audio features corresponding to the characteristic time window are input into the trained secondary detection network to obtain the second confidence of each command word, so that the second command word hit by the characteristic time window is determined through threshold judgment; through this secondary verification of the voice data, the accuracy of command word detection can be improved.
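Putting the two stages together, a highly simplified sketch of the flow in fig. 6 could look as follows; `primary` and `secondary` are assumed callables wrapping the trained networks and returning a mapping from command words to confidences, `word_frame_lengths` and `margin` stand in for the command word length and the target preset value, and all names are illustrative rather than part of the embodiment.

```python
def detect_command_word(stream_features, current_idx, K, word_frame_lengths, margin,
                        primary, secondary, first_threshold, second_threshold):
    """Two-stage detection sketch for one current voice frame."""
    # First-level verification (601): target time window of K frames ending at the current frame.
    target_window = stream_features[current_idx - K + 1:current_idx + 1]
    first_conf = primary(target_window)
    hits = {w: c for w, c in first_conf.items() if c >= first_threshold}
    if not hits:
        return None                              # no hit first command word, move to the next frame
    first_word = max(hits, key=hits.get)
    # Second-level verification (602): characteristic time window associated with the
    # current frame, sized from the length of the hit first command word plus a margin.
    first_number = word_frame_lengths[first_word] + margin
    feature_window = stream_features[current_idx - first_number:current_idx + 1]
    second_conf = {w: c for w, c in secondary(feature_window).items() if w != "garbage"}
    best = max(second_conf, key=second_conf.get)
    return best if second_conf[best] >= second_threshold else None
```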
The embodiment of the application provides a data processing scheme, which can determine a first command word hit by voice data of a target time window in a command word set according to audio features corresponding to the voice data of K voice frames in the target time window, which is equivalent to preliminarily determining the command word contained in the continuously input voice data, and further determine a feature time window based on feature information of the first command word, such as command word length, so that the audio features corresponding to the voice data of a plurality of voice frames in the feature time window respectively determine a second command word hit by the voice data of the feature time window in the command word set, which is equivalent to determining a new feature time window, and performing secondary verification on whether the command word is contained in the continuously input voice data or not. Alternatively, after the second command word is detected, the operation indicated by the second command word may be performed. Therefore, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to carry out secondary verification on whether the voice data contains the command word, so that the accuracy of detecting the command word of the voice data can be improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. Alternatively, the data processing apparatus may be disposed in the electronic device. As shown in fig. 7, the data processing apparatus described in the present embodiment may include:
an obtaining unit 701, configured to determine a target time window corresponding to a current speech frame, and obtain audio features corresponding to speech data of K speech frames in the target time window, where K is a positive integer;
a processing unit 702, configured to determine, according to audio features corresponding to the voice data of the K voice frames, a first command word hit by the voice data of the target time window in a command word set;
the processing unit 702 is further configured to determine a characteristic time window associated with the current speech frame based on the command word length of the first command word, and obtain audio characteristics corresponding to speech data of a plurality of speech frames in the characteristic time window;
the processing unit 702 is further configured to determine, based on the audio features respectively corresponding to the voice data of the multiple voice frames in the characteristic time window, a second command word hit by the voice data of the characteristic time window in the command word set.
In one implementation, the set of command words includes at least one command word, each command word having a plurality of syllables; the processing unit 702 is specifically configured to:
determining the probability that the K voice frames respectively correspond to each syllable output unit in a syllable output unit set according to the audio characteristics respectively corresponding to the voice data of the K voice frames; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
determining a first confidence coefficient corresponding to the voice data of the target time window and each command word according to the probability that the K voice frames correspond to each syllable output unit respectively;
if command words with the first confidence degree larger than or equal to a first threshold value exist in the command word set, determining the command words with the first confidence degree larger than or equal to the first threshold value as the first command words hit by the voice data of the target time window in the command word set.
In one implementation, any one command word in the command word set is represented as a target command word; the processing unit 702 is specifically configured to:
determining a syllable output unit corresponding to each syllable of the target command word as a target syllable output unit to obtain a plurality of target syllable output units corresponding to the target command word;
determining the probability that the K voice frames respectively correspond to each target syllable output unit from the probability that the K voice frames respectively correspond to each syllable output unit to obtain K candidate probabilities respectively corresponding to each target syllable output unit;
and determining the maximum candidate probability corresponding to each target syllable output unit from the K candidate probabilities corresponding to each target syllable output unit, and determining the first confidence coefficient corresponding to the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit.
In one implementation, the set of command words includes at least one command word; the processing unit 702 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics corresponding to the voice data of the voice frames in the characteristic time window;
and if command words with the second confidence degree larger than or equal to a second threshold exist in the command word set, determining the command words with the second confidence degree larger than or equal to the second threshold and the maximum second confidence degree as the second command words hit by the voice data of the characteristic time window in the command word set.
In an implementation manner, the processing unit 702 is specifically configured to:
determining a first number according to the command word length of the first command word and a target preset value;
determining a characteristic time window associated with the current speech frame based on the first number of speech frames preceding the current speech frame and a second number of speech frames following the current speech frame.
In one implementation, the first command word is determined by a trained primary detection network, and the processing unit 702 is further configured to:
acquiring first sample voice data, wherein the first sample voice data carries a syllable output unit label;
calling an initial primary detection network, and determining a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data;
and training based on the predicted syllable output unit and the syllable output unit label which respectively correspond to the voice data of each voice frame in the first sample voice data to obtain the trained primary detection network.
In one implementation, the second command word is determined by a trained secondary detection network, and the processing unit 702 is further configured to:
acquiring second sample voice data, wherein the second sample voice data carries a command word label;
calling a secondary detection network to determine a prediction command word corresponding to the second sample voice data;
and training based on the predicted command words and the command word labels to obtain the trained secondary detection network.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device described in this embodiment includes: a processor 801 and a memory 802. Optionally, the electronic device may further include a network interface or a power supply module. The processor 801 and the memory 802 can exchange data with each other.
The processor 801 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The network interface may include an input device, such as a control panel, a microphone, or a receiver, and/or an output device, such as a display screen or a transmitter, which is not limited herein.
The memory 802, which may include a read-only memory and a random access memory, provides program instructions and data to the processor 801. A portion of the memory 802 may also include a non-volatile random access memory. The processor 801 is configured to perform the following when calling the program instructions:
determining a target time window corresponding to a current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, wherein K is a positive integer;
determining a first command word hit by the voice data of the target time window in a command word set according to the audio characteristics corresponding to the voice data of the K voice frames respectively;
determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics corresponding to the voice data of a plurality of voice frames in the characteristic time window;
and determining second command words hit by the voice data of the characteristic time window in the command word set based on the audio characteristics corresponding to the voice data of the plurality of voice frames in the characteristic time window respectively.
In one implementation, the set of command words includes at least one command word, each command word having a plurality of syllables; the processor 801 is specifically configured to:
determining the probability that the K voice frames respectively correspond to each syllable output unit in a syllable output unit set according to the audio characteristics respectively corresponding to the voice data of the K voice frames; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
determining a first confidence coefficient corresponding to the voice data of the target time window and each command word according to the probability that the K voice frames respectively correspond to each syllable output unit;
if command words with the first confidence degree larger than or equal to a first threshold value exist in the command word set, determining the command words with the first confidence degree larger than or equal to the first threshold value as the first command words hit by the voice data of the target time window in the command word set.
In one implementation, any one command word in the command word set is represented as a target command word; the processor 801 is specifically configured to:
determining a syllable output unit corresponding to each syllable of the target command word as a target syllable output unit to obtain a plurality of target syllable output units corresponding to the target command word;
determining the probability that the K voice frames respectively correspond to each target syllable output unit from the probability that the K voice frames respectively correspond to each syllable output unit to obtain K candidate probabilities respectively corresponding to each target syllable output unit;
and determining the maximum candidate probability corresponding to each target syllable output unit from the K candidate probabilities corresponding to each target syllable output unit, and determining the first confidence coefficient corresponding to the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit.
In one implementation, the set of command words includes at least one command word; the processor 801 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics corresponding to the voice data of the voice frames in the characteristic time window;
and if the command word with the second confidence coefficient larger than or equal to a second threshold exists in the command word set, determining the command word with the second confidence coefficient larger than or equal to the second threshold and the maximum second confidence coefficient as the second command word hit by the voice data of the characteristic time window in the command word set.
In one implementation, the processor 801 is specifically configured to:
determining a first number according to the command word length of the first command word and a target preset value;
determining a characteristic time window associated with the current speech frame based on the first number of speech frames preceding the current speech frame and a second number of speech frames following the current speech frame.
In one implementation, the first command word is determined by a trained primary detection network, and the processor 801 is further configured to:
acquiring first sample voice data, wherein the first sample voice data carries a syllable output unit label;
calling an initial primary detection network, and determining a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data;
and training based on the predicted syllable output unit and the syllable output unit label respectively corresponding to the voice data of each voice frame in the first sample voice data, to obtain the trained primary detection network.
In one implementation, the second command word is determined by a trained secondary detection network, and the processor 801 is further configured to:
acquiring second sample voice data, wherein the second sample voice data carries a command word label;
calling a secondary detection network to determine a prediction command word corresponding to the second sample voice data;
and training based on the predicted command words and the command word labels to obtain the trained secondary detection network.
Optionally, the program instructions may also implement other steps of the method in the above embodiments when executed by the processor, and details are not described here.
The present application further provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the above method, such as performing the above method performed by an electronic device, which is not described herein in detail.
Optionally, the storage medium, such as a computer-readable storage medium, referred to herein may be non-volatile or volatile.
Alternatively, the computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function, and the like, and the storage data area may store data created according to the use of a blockchain node, and the like. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, and an encryption algorithm. A blockchain (Blockchain) is essentially a decentralized database, a series of data blocks associated by cryptographic methods, where each data block contains information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Embodiments of the present application also provide a computer program product or a computer program, which includes computer instructions that, when executed by a processor, implement some or all of the steps of the above-described method. The computer instructions are stored in a computer readable storage medium, for example. The computer instructions are read by a processor of a computer device (i.e., the electronic device) from a computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps performed in the embodiments of the methods described above. For example, the computer device may be a terminal, or may be a server.
The data processing method, the data processing apparatus, the electronic device, and the storage medium provided in the embodiments of the present application are described in detail above. A specific example is used herein to explain the principle and the implementation of the present application, and the description of the above embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation to the present application.