Disclosure of Invention
The invention solves the technical problem of how to efficiently and accurately identify illegal voices.
To solve the foregoing technical problem, an embodiment of the present invention provides a speech recognition method, including: extracting an emotion feature vector from a set of voice data and converting the set of voice data into text data; training an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score; calculating the emotion score of voice data to be detected based on the voice data to be detected and the emotion detection model; and determining whether the voice data to be detected has a violation risk based on the emotion score.
Optionally, the determining whether the voice data to be detected has the violation risk based on the emotion score includes: when the emotion score is higher than a preset threshold, determining that the voice data to be detected has a violation risk.
Optionally, the speech recognition method further includes: marking the voice data to be detected that has a violation risk.
Optionally, the training to obtain the emotion detection model based on the emotion feature vector and the text data includes: training with a neural network algorithm based on the emotion feature vector and the text data to obtain the emotion detection model.
Optionally, the training to obtain the emotion detection model based on the emotion feature vector and the text data includes: training with a logistic regression algorithm based on the emotion feature vector and the text data to obtain the emotion detection model.
Optionally, the emotion feature vector is used to represent an emotion type, and the emotion type is selected from: happiness, sadness, anger, fear, disgust.
Optionally, the converting the set of voice data into text data includes: converting the voice data into the text data by using a speech-to-text technology.
Optionally, the voice data includes voice data of a first role and voice data of a second role, and the extracting an emotion feature vector from a set of voice data and converting the set of voice data into text data includes: distinguishing the voice data of the first role from the voice data of the second role in the set of voice data; extracting emotion feature vectors of the voice data of the first role and the voice data of the second role, respectively; and converting the voice data of the first role and the voice data of the second role into text data, respectively.
In order to solve the above technical problem, an embodiment of the present invention further provides a speech recognition apparatus, including: an extraction module, configured to extract an emotion feature vector from a set of voice data and convert the set of voice data into text data; a training module, configured to train an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score; a calculation module, configured to calculate the emotion score of voice data to be detected based on the voice data to be detected and the emotion detection model; and a judging module, configured to determine whether the voice data to be detected has a violation risk based on the emotion score.
To solve the above technical problem, an embodiment of the present invention further provides a storage medium having computer instructions stored thereon, where the computer instructions, when executed, perform the steps of the above method.
In order to solve the above technical problem, an embodiment of the present invention further provides a computing device, including a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, performs the steps of the above method.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a speech recognition method, which includes the following steps: extracting an emotion feature vector from a set of voice data and converting the set of voice data into text data; training an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score; calculating the emotion score of voice data to be detected based on the voice data to be detected and the emotion detection model; and determining whether the voice data to be detected has a violation risk based on the emotion score. According to the embodiment of the invention, the emotion feature vector extracted from the voice data and the text data are used as input data, and the emotion detection model is obtained through training. Because a large amount of voice data can be used as training input, statistical advantages can be leveraged, and an emotion detection model with high accuracy can be obtained. Since the voice data to be detected is evaluated by this highly accurate emotion detection model, the detection of voice data can be completed more efficiently and accurately, and the detection rate of illegal voices is improved. Furthermore, the embodiment of the invention is suitable for mass voice detection and can expand voice detection scenarios.
Further, the training to obtain the emotion detection model based on the emotion feature vector and the text data includes: training with a neural network algorithm based on the emotion feature vector and the text data to obtain the emotion detection model. According to the embodiment of the invention, a neural network model is adopted as the emotion detection model, and an emotion detection model with higher accuracy can be trained by leveraging the advantages of neural networks, thereby further improving the detection rate of illegal voices.
Detailed Description
As described in the background, the prior art uses a manual sampling inspection method to search for illegal voices, which is inefficient.
The inventor of the present application has found that, in the prior art, the following steps may also be adopted to determine whether voice data is illegal voice: first, the voice data to be detected is converted into text data, and the emotion feature vector of the voice data to be detected is extracted; second, voice features are determined according to the emotion feature vector, and the converted text data is searched for preset keywords; then, whether the voice data is illegal voice data is comprehensively determined by combining the voice features and the preset keywords.
However, when this prior-art scheme is used to analyze each telephone recording in a large number of recording files, the statistical commonalities of illegal voice data cannot be captured, and the accuracy is low.
The embodiment of the invention provides a speech recognition method, which includes the following steps: extracting an emotion feature vector from a set of voice data and converting the set of voice data into text data; training an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score; calculating the emotion score of voice data to be detected based on the voice data to be detected and the emotion detection model; and determining whether the voice data to be detected has a violation risk based on the emotion score.
According to the embodiment of the invention, the emotion feature vector extracted from the voice data and the text data are used as input data, and the emotion detection model is obtained through training. Because a large amount of voice data can be used as training input, statistical advantages can be leveraged, and an emotion detection model with high accuracy can be obtained. Since the voice data to be detected is evaluated by this highly accurate emotion detection model, the detection of voice data can be completed more efficiently and accurately, and the detection rate of illegal voices is improved. Furthermore, the embodiment of the invention is suitable for mass voice detection and can expand voice detection scenarios.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present invention. The speech recognition method may be performed by a computing device, such as a server, a personal terminal, or the like.
Specifically, the speech recognition method may include the steps of:
Step S101, extracting an emotion feature vector from a set of voice data, and converting the set of voice data into text data;
Step S102, training to obtain an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score;
Step S103, calculating the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model;
Step S104, determining whether the voice data to be detected has a violation risk based on the emotion score.
More specifically, each recording file of the call center can be used as one voice data, so that a huge amount of voice data can be obtained.
In step S101, at least a portion of the mass voice data may be treated as a set of voice data. An emotion feature vector is extracted from each piece of voice data in the set, thereby obtaining a plurality of emotion feature vectors.
The emotion feature vector can be used to represent or describe an emotion type, and the emotion type can be happiness, sadness, anger, fear, or disgust.
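For illustration only, the emotion feature vector could be organized as a score per emotion type, for example as a probability-like distribution over the five emotion types listed above. The following sketch is a hypothetical layout and is not mandated by the embodiment; the order of the dimensions and the numerical values are assumptions.

```python
import numpy as np

# Hypothetical layout: one score per emotion type, in a fixed order.
EMOTION_TYPES = ["happiness", "sadness", "anger", "fear", "disgust"]

# Example emotion feature vector for one utterance: the acoustic emotion
# model is assumed to output a probability-like score for each emotion type.
emotion_vector = np.array([0.05, 0.10, 0.70, 0.10, 0.05])

# The dominant emotion of the utterance under this assumed layout:
dominant = EMOTION_TYPES[int(np.argmax(emotion_vector))]
print(dominant)  # -> "anger"
```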
Those skilled in the art understand that each piece of voice data may generally contain voice output by a plurality of roles. For example, a voice recording made at a call center will typically include voices output by two roles, e.g., by a customer service agent and a customer, respectively.
Taking the case where the voice data includes voice data of two roles as an example, the voice data may include voice data of a first role and voice data of a second role. In this case, the voice data of the first role and the voice data of the second role may first be distinguished from each other.
In a specific implementation, the voice data of the customer service agent and the voice data of the customer can be distinguished in advance in the recorded voice data. For example, the customer service agent outputs voice data at a first frequency, and the customer outputs voice data at a second frequency different from the first frequency. As another example, the distinction may be made by keywords or common phrases of the different roles.
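As one hedged illustration of separating the two roles, the sketch below assumes (hypothetically, not stated above) that the call center records the agent and the customer on separate stereo channels; real deployments may instead rely on the frequency-based or keyword-based distinction described above.

```python
import numpy as np
from scipy.io import wavfile

def split_roles(path):
    """Split a stereo call recording into per-role mono signals.

    Assumed layout (illustrative only): channel 0 carries the customer
    service agent and channel 1 carries the customer. Other role-separation
    techniques, such as pitch/frequency features or keywords, may be used
    instead, as described in the embodiment.
    """
    rate, data = wavfile.read(path)  # data has shape (samples, 2) for stereo files
    if data.ndim != 2 or data.shape[1] != 2:
        raise ValueError("expected a stereo recording with one role per channel")
    agent_audio = data[:, 0].astype(np.float32)
    customer_audio = data[:, 1].astype(np.float32)
    return rate, agent_audio, customer_audio
```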
Thereafter, emotion feature vectors may be extracted from the voice data of the first role and the voice data of the second role, respectively.
Further, each piece of voice data may be converted into text data. In one embodiment, an Automatic Speech Recognition (ASR) technique may be used to convert each piece of voice data into text data, so as to obtain a plurality of pieces of text data.
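As a hedged illustration of this conversion, the snippet below uses the open-source `speech_recognition` package and a generic cloud recognizer purely as an example; the embodiment does not prescribe a particular ASR engine, and the language code is an assumption.

```python
import speech_recognition as sr

def transcribe(wav_path, language="zh-CN"):
    """Convert one voice recording into text using a generic ASR engine.

    The engine and language code here are illustrative assumptions; any
    speech-to-text technology may be used, as noted in the embodiment.
    """
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the entire recording
    return recognizer.recognize_google(audio, language=language)
```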
Take as an example the case where the voice data includes voice data of a first role and voice data of a second role. In a specific implementation, the voice data of the first role and the voice data of the second role can be distinguished in advance. Then, the voice data of the first role and the voice data of the second role are respectively converted into text data.
In step S102, an emotion detection model may be trained based on the emotion feature vectors and the text data. Voice data to be detected can then be taken as the input of the emotion detection model, which outputs the emotion score of the voice data to be detected.
In one embodiment, the emotion detection model may be obtained by training with a neural network algorithm based on the emotion feature vector and the text data. Preferably, the neural network algorithm may be a Long Short-Term Memory (LSTM) network algorithm. LSTM is a type of recurrent neural network.
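A minimal sketch of such a model is given below, assuming Keras, assuming (hypothetically) that the transcript has already been tokenized into integer sequences, and assuming a five-dimensional emotion feature vector per call; the architecture and hyperparameters are illustrative only and are not the specific model of the embodiment.

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000   # assumed tokenizer vocabulary size
MAX_LEN = 200        # assumed maximum number of tokens per call
EMO_DIM = 5          # one dimension per emotion type (assumed layout)

# Text branch: token ids -> embedding -> LSTM summary vector.
text_in = layers.Input(shape=(MAX_LEN,), name="text_tokens")
x = layers.Embedding(VOCAB_SIZE, 64, mask_zero=True)(text_in)
x = layers.LSTM(64)(x)

# Acoustic-emotion branch: the emotion feature vector extracted in step S101.
emo_in = layers.Input(shape=(EMO_DIM,), name="emotion_vector")

# Fuse both branches and output an emotion score in [0, 1].
h = layers.concatenate([x, emo_in])
h = layers.Dense(32, activation="relu")(h)
score = layers.Dense(1, activation="sigmoid", name="emotion_score")(h)

model = Model(inputs=[text_in, emo_in], outputs=score)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit([token_ids, emotion_vectors], labels, epochs=...)  # training call
```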
In another embodiment, the emotion feature vector and the text data may be used as input data of a Logistic Regression (LR) algorithm, and the emotion detection model is obtained through training.
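The logistic regression variant can be sketched analogously, e.g. with scikit-learn, by concatenating bag-of-words features of the text with the emotion feature vector; the feature representation and labeling scheme below are assumptions for illustration, not details fixed by the embodiment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_lr_emotion_model(texts, emotion_vectors, labels):
    """Train a logistic-regression emotion detection model.

    texts           : list of call transcripts from step S101
    emotion_vectors : array of shape (n_calls, 5) from step S101
    labels          : 1 for calls known to contain violations, 0 otherwise
                      (an assumed labeling scheme for training)
    """
    vectorizer = TfidfVectorizer(max_features=5000)
    text_features = vectorizer.fit_transform(texts).toarray()
    features = np.hstack([text_features, np.asarray(emotion_vectors)])
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)
    return vectorizer, model

# Later, model.predict_proba(features)[:, 1] can serve as the emotion score.
```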
In a specific implementation, the voice data and the text data of each role can be input to the emotion detection model together to train the emotion detection model. For example, the voice data and text data of each role are marked so as to distinguish the voice data and text data of different roles.
In step S103, an emotion score of the voice data to be detected is calculated based on the voice data to be detected and the emotion detection model. The voice data to be detected can be voice data within a preset time period or a recorded voice file. The voice data to be detected is input to the emotion detection model, and the emotion score of the voice data to be detected is calculated by the emotion detection model.
In one embodiment, the voice data to be detected is a voice file, and the voice file includes voice data of a first role and voice data of a second role. It is assumed that the voice data of the first role is the voice data of the customer service agent, and the voice data of the second role is the voice data of the customer. After the roles in the voice file are distinguished and marked, the voice file may be input to the emotion detection model, which outputs an emotion score for the first role (e.g., the customer service agent).
It should be noted that the voice data of the second role provides context that helps the emotion detection model calculate the emotion score of the first role.
In step S104, it may be determined whether the voice data to be detected has a violation risk based on the emotion score. Taking the call-center scenario as an example, a violation may refer to the customer service agent using aggressive, abusive, or similar language during a conversation with a customer.
In a specific implementation, a preset threshold may be set for the emotion score output by the emotion detection model, and this preset threshold is used to determine whether the voice data to be detected has a violation risk.
If the emotion score is not higher than the preset threshold, it may be determined that the voice data to be detected has no violation risk.
If the emotion score is higher than the preset threshold, it may be determined that the voice data to be detected has a violation risk. Further, the voice data to be detected that has a violation risk can be labeled.
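Steps S103 and S104 can be summarized by the following sketch, where `emotion_model` stands for whichever trained model is used (here assumed to expose a scikit-learn-style `predict_proba`), `features` is a single-row array built from the call to be detected, and the threshold value is an arbitrary illustrative number rather than one fixed by the embodiment.

```python
def detect_violation(emotion_model, features, threshold=0.8):
    """Score one call to be detected and mark it if the score is too high.

    `features` : model input for one call (text features plus emotion
                 feature vector), shaped (1, n_features)
    `threshold`: the preset threshold; 0.8 is only an example value
    """
    emotion_score = float(emotion_model.predict_proba(features)[:, 1][0])
    has_violation_risk = emotion_score > threshold
    return {
        "emotion_score": emotion_score,
        "violation_risk": has_violation_risk,  # if True, label for manual review
    }
```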
In practical applications, the tagged voice data may be further confirmed manually for review.
Fig. 2 is a flowchart illustrating a speech recognition method in a typical scenario according to an embodiment of the present invention. As shown in Fig. 2, in a typical scenario, a recording file recorded by a call center may be used as voice data, and after the emotion detection model is obtained, the emotion detection model is used to determine whether a recording file has a violation risk.
Specifically, first, operation S201 may be performed to acquire voice data, for example, to acquire a recording file of a call center.
Next, operation S202 may be performed to convert the voice data into text data. Specifically, ASR technology can be used to obtain the text content corresponding to each recording file and to distinguish the two conversational roles, i.e., the customer service agent and the customer.
Again, operation S203 may be performed to extract an emotion feature vector from the voice data. Specifically, an acoustic emotion model in the related art may be used to determine which of the five emotion types, i.e., happiness, sadness, anger, fear, and disgust, the emotion of each of the two conversational roles belongs to, and to output the corresponding emotion feature vector.
Further, operation S204 may be performed to train the emotion detection model. Specifically, the text content and the emotion feature vector can be used as the training input of the emotion detection model, and the emotion detection model is obtained by training with a neural network algorithm or a logistic regression algorithm.
Thereafter, operations S205 and S206 may be performed to input the voice file to be detected to the emotion detection model and calculate an emotion score. Specifically, a voice file to be detected is input to the emotion detection model, and an emotion score is output.
Further, if the emotion score output by the emotion detection model exceeds the preset threshold, the recording file may be tagged (not shown).
Further, the tagged recording may be provided to a human reviewer for further confirmation. The preset threshold may be determined comprehensively according to the available review manpower and accuracy-related indices (not shown).
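One possible way (not prescribed by the embodiment) to derive the preset threshold from the review-manpower constraint mentioned above is to choose the score percentile at which the expected number of flagged recordings per day matches the review capacity; the sketch below assumes a representative batch of historical emotion scores is available.

```python
import numpy as np

def choose_threshold(scores, daily_review_capacity, daily_call_volume):
    """Pick a preset threshold from historical emotion scores.

    scores                : emotion scores of a representative batch of calls
    daily_review_capacity : number of recordings reviewers can recheck per day
    daily_call_volume     : number of recordings produced per day

    The fraction of calls that can be reviewed determines which percentile
    of the score distribution becomes the threshold.
    """
    review_fraction = min(1.0, daily_review_capacity / daily_call_volume)
    return float(np.quantile(np.asarray(scores), 1.0 - review_fraction))
```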
Therefore, the embodiment of the invention makes full use of mass voice data for training, so as to obtain a training model (namely the emotion detection model) with higher accuracy; the training model is suitable for mass voice detection, can complete the detection of voice data efficiently and accurately, and improves the detection rate of illegal voices.
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus 3 may implement the method solutions shown in Fig. 1 and Fig. 2 and may be executed by a computing device.
Specifically, the speech recognition apparatus 3 may include: an extraction module 31, configured to extract an emotion feature vector from a set of voice data and convert the set of voice data into text data; a training module 32, configured to train an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score; a calculation module 33, configured to calculate the emotion score of voice data to be detected based on the voice data to be detected and the emotion detection model; and a judging module 34, configured to determine whether the voice data to be detected has a violation risk based on the emotion score.
In a specific implementation, the determining module 34 may include: a determining submodule 341, configured to determine that the voice data to be detected has a violation risk when the emotion score is higher than a preset threshold.
In a specific implementation, the speech recognition apparatus 3 may further include: and the marking module 35 is used for marking the voice data to be detected with violation risk.
In one embodiment, the training module 32 may include: a first training submodule 321, configured to obtain the emotion detection model by training with a neural network algorithm based on the emotion feature vector and the text data.
In another embodiment, the training module 32 may include: a second training submodule 322, configured to obtain the emotion detection model by training with a logistic regression algorithm based on the emotion feature vector and the text data.
In a specific implementation, the emotional feature vector may be used to represent a type of emotion, which may be selected from: happiness, sadness, anger, fear, disgust.
In a specific implementation, the extraction module 31 may include: a conversion sub-module 311, configured to convert the voice data into the text data by using a speech-to-text technology.
In a specific implementation, the voice data may include voice data of a first role and voice data of a second role, and the extraction module 31 may include: a distinguishing sub-module 312, configured to distinguish the voice data of the first role from the voice data of the second role in the set of voice data; and an extracting sub-module 313, configured to extract emotion feature vectors of the voice data of the first role and the voice data of the second role, respectively, and convert the voice data of the first role and the voice data of the second role into text data, respectively.
For more details of the operation principle and the operation mode of the speech recognition apparatus 3, reference may be made to the related descriptions of Fig. 1 and Fig. 2, which are not repeated here.
Further, the embodiment of the present invention also discloses a storage medium, on which computer instructions are stored, and when the computer instructions are executed, the technical solutions of the methods in the embodiments shown in Fig. 1 and Fig. 2 are performed. Preferably, the storage medium may include a computer-readable storage medium such as a non-volatile memory or a non-transitory memory. The storage medium may include a ROM, a RAM, a magnetic disk, an optical disk, or the like.
Further, the embodiment of the present invention also discloses a computing device, including a memory and a processor, where the memory stores computer instructions capable of running on the processor, and the processor, when executing the computer instructions, performs the technical solutions of the methods described in the embodiments shown in Fig. 1 and Fig. 2.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.