Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. To this end, the present application provides an abnormal language detection method, an electronic device and a storage medium, which can improve the accuracy of abnormal language detection.
The abnormal language detection method according to the embodiment of the first aspect of the present application comprises:
acquiring target audio and target text, wherein the target audio and the target text are used for reflecting the same target language content;
inputting the target audio into a pre-trained audio encoder for audio feature extraction to obtain audio feature information;
inputting the target text into a pre-trained text encoder for text feature extraction to obtain text feature information;
integrating the audio feature information and the text feature information to obtain fusion embedded information;
and inputting the fusion embedded information into a pre-trained modal fusion decoder to perform abnormal language detection to obtain an abnormal language detection result, wherein the abnormal language detection result is used for indicating that the target language content reaches a language anomaly benchmark or for indicating that the target language content does not reach the language anomaly benchmark.
According to some embodiments of the present application, before the target audio is input into the pre-trained audio encoder for audio feature extraction to obtain the audio feature information, the method further includes performing joint pre-training on the audio encoder, the text encoder and the modal fusion decoder, which specifically includes:
acquiring a training data set, wherein the training data set comprises a plurality of training language samples, and each training language sample is configured with a corresponding training label; the training language sample comprises training audio and training text representing the same training language content, and the training label is used for identifying whether the training language content reaches the language anomaly benchmark;
inputting the training audio into the original audio encoder, inputting the training text into the original text encoder, and inputting the training label into the original modal fusion decoder to perform iterative training;
and when the iterative training meets a first preset condition, obtaining the pre-trained audio encoder, the pre-trained text encoder and the pre-trained modal fusion decoder.
According to some embodiments of the present application, the inputting of the training audio into the original audio encoder, the training text into the original text encoder, and the training label into the original modal fusion decoder to perform iterative training includes:
in each round of iterative training, performing audio feature extraction on the training audio based on the audio encoder to obtain training audio information;
extracting text features of the training text based on the text encoder to obtain training text information;
integrating the training audio information and the training text information to generate training fusion information, and inputting the training fusion information into the modal fusion decoder to perform abnormal language detection so as to obtain a detection result of the present round;
and after the detection result of the present round is obtained, updating the audio encoder, the text encoder and the modal fusion decoder based on the detection result of the present round and the training label.
According to some embodiments of the present application, the obtaining the target audio and the target text includes:
acquiring the target audio of the target language content;
and performing voice recognition processing on the target audio to obtain a target text corresponding to the target language content.
According to some embodiments of the present application, the inputting of the fusion embedded information into the pre-trained modal fusion decoder to perform abnormal language detection to obtain an abnormal language detection result includes:
inputting the fusion embedded information into a pre-trained modal fusion decoder for decoding processing to obtain decoding classification parameters corresponding to the fusion embedded information;
and detecting whether the target language content reaches the language anomaly benchmark based on the decoding classification parameters to obtain the abnormal language detection result.
According to some embodiments of the present application, the abnormal language detection result includes a first detection result or a second detection result, where the first detection result is used to indicate that the target language content reaches the language anomaly benchmark, and the second detection result is used to indicate that the target language content does not reach the language anomaly benchmark;
the detecting whether the target language content reaches the language anomaly benchmark based on the decoding classification parameters to obtain the abnormal language detection result includes the following steps:
determining abnormal language content in the target language content based on the decoding classification parameters;
when the abnormal language content meets a second preset condition, determining that the target language content reaches the language anomaly benchmark, and determining the abnormal language detection result as the first detection result;
and when the abnormal language content does not meet the second preset condition, determining that the target language content does not reach the language anomaly benchmark, and determining the abnormal language detection result as the second detection result.
According to some embodiments of the present application, the inputting of the target audio into the pre-trained audio encoder for audio feature extraction to obtain the audio feature information includes:
inputting the target audio into a multi-head attention module of a pre-trained audio encoder for encoding processing to generate an audio query matrix, an audio key matrix and an audio value matrix;
and performing a scaled dot-product attention operation on the audio query matrix, the audio key matrix and the audio value matrix to obtain the audio feature information.
According to some embodiments of the present application, the inputting of the target text into the pre-trained text encoder for text feature extraction to obtain the text feature information includes:
inputting the target text into a multi-head attention module of the pre-trained text encoder for encoding processing to generate a text query matrix, a text key matrix and a text value matrix;
and performing a scaled dot-product attention operation on the text query matrix, the text key matrix and the text value matrix to obtain the text feature information.
In a second aspect, an embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the abnormal language detection method according to any one of the embodiments of the first aspect of the present application is implemented when the processor executes the computer program.
In a third aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the abnormal language detection method according to any one of the embodiments of the first aspect of the present application.
The abnormal language detection method, the electronic device and the storage medium according to the embodiments of the present application have at least the following beneficial effects:
according to the abnormal language detection method, target audio and target text are acquired first, where the target audio and the target text are used for reflecting the same target language content; the target audio is input into a pre-trained audio encoder for audio feature extraction to obtain audio feature information, and the target text is input into a pre-trained text encoder for text feature extraction to obtain text feature information; further, the audio feature information and the text feature information are integrated to obtain fusion embedded information; still further, the fusion embedded information is input into a pre-trained modal fusion decoder to perform abnormal language detection, so as to obtain an abnormal language detection result, where the abnormal language detection result is used for indicating that the target language content reaches a language anomaly benchmark or for indicating that it does not. In this way, the accuracy of abnormal language detection can be improved. In particular, in the finance and insurance fields, the method facilitates the screening of bad information such as sensitive words and inelegant words in business materials.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
In the description of the present application, "several" means one or more, and "a plurality" means two or more; greater than, less than, exceeding, etc. are understood to exclude the stated number, while above, below, within, etc. are understood to include it. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.
In the description of the present application, it should be understood that the direction or positional relationship indicated with respect to the description of the orientation, such as up, down, left, right, front, rear, etc., is based on the direction or positional relationship shown in the drawings, is merely for convenience of describing the present application and simplifying the description, and does not indicate or imply that the apparatus or element to be referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific content of the technical solution. In addition, the following description of specific steps does not represent limitations on the order of steps or logic performed, and the order of steps and logic performed between steps should be understood and appreciated with reference to what is described in the embodiments.
Abnormal language detection refers to a technology for identifying and filtering bad information, such as sensitive words and inelegant words, in text or speech. Its importance lies in maintaining the health and order of the network environment and protecting users' interests. In the finance and insurance fields, business materials in text form may come from many different groups of people, some of whom are unfamiliar with standardized text expression, so the materials they provide often contain bad information such as sensitive words and inelegant words; the demand for identifying and filtering abnormal language in these fields is therefore high.
In the related art, abnormal language detection methods mainly include rule-based methods and machine-learning-based methods. First, rule-based methods match abnormal language in text against a sensitive-word lexicon or regular expressions; they are simple to implement, but the lexicon must be maintained manually, cannot adapt to newly emerging abnormal language, and cannot handle semantically implicit or substituted variants, so detection accuracy is low. Second, machine-learning-based methods automatically learn abnormal language features in text through a trained model and classify or label newly input text; however, they typically operate on the text modality alone and thus miss complementary cues carried in speech. Therefore, how to improve the accuracy of abnormal language detection remains a major problem to be solved in the industry.
The present application aims to solve at least one of the technical problems existing in the prior art. To this end, the present application provides an abnormal language detection method, an electronic device and a storage medium, which can improve the accuracy of abnormal language detection.
The following is a further description based on the accompanying drawings.
Referring to fig. 1, the abnormal language detection method according to the embodiment of the first aspect of the present application may include, but is not limited to, steps S101 to S105 described below.
Step S101, acquiring target audio and target text, wherein the target audio and the target text are used for reflecting the same target language content;
step S102, inputting target audio into a pre-trained audio encoder for audio feature extraction to obtain audio feature information;
step S103, inputting the target text into a pre-trained text encoder for text feature extraction to obtain text feature information;
step S104, integrating the audio feature information and the text feature information to obtain fusion embedded information;
step S105, inputting the fusion embedded information into a pre-trained modal fusion decoder to perform abnormal language detection, so as to obtain an abnormal language detection result, wherein the abnormal language detection result is used for indicating that the target language content reaches a language anomaly benchmark or is used for indicating that the target language content does not reach the language anomaly benchmark.
Through the abnormal language detection method shown in steps S101 to S105, the target audio and the target text are acquired first, where the target audio and the target text are used for reflecting the same target language content; the target audio is input into a pre-trained audio encoder for audio feature extraction to obtain audio feature information, and the target text is input into a pre-trained text encoder for text feature extraction to obtain text feature information; further, the audio feature information and the text feature information are integrated to obtain fusion embedded information; still further, the fusion embedded information is input into a pre-trained modal fusion decoder to perform abnormal language detection, so as to obtain an abnormal language detection result indicating whether the target language content reaches the language anomaly benchmark. In this way, the accuracy of abnormal language detection can be improved. In particular, in the finance and insurance fields, the method facilitates the screening of bad information such as sensitive words and inelegant words in business materials.
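For concreteness, the following is a minimal sketch of the inference flow of steps S101 to S105, assuming PyTorch-style modules; the class name, the tensor inputs and the concatenation-based integration in step S104 are illustrative assumptions, not the application's prescribed implementation.

```python
import torch
import torch.nn as nn

class AbnormalLanguageDetector(nn.Module):
    """Illustrative wrapper around the three pre-trained components."""

    def __init__(self, audio_encoder: nn.Module, text_encoder: nn.Module,
                 modal_fusion_decoder: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder                # pre-trained audio encoder
        self.text_encoder = text_encoder                  # pre-trained text encoder
        self.modal_fusion_decoder = modal_fusion_decoder  # pre-trained decoder

    def forward(self, target_audio: torch.Tensor,
                target_text: torch.Tensor) -> torch.Tensor:
        # target_text is assumed to be already tokenized into a tensor.
        audio_feature_info = self.audio_encoder(target_audio)  # step S102
        text_feature_info = self.text_encoder(target_text)     # step S103
        # Step S104: integrate the two modalities into fusion embedded info;
        # feature-dimension concatenation is one simple illustrative choice.
        fusion_embedded = torch.cat([audio_feature_info, text_feature_info],
                                    dim=-1)
        # Step S105: abnormal language detection on the fused embedding.
        return self.modal_fusion_decoder(fusion_embedded)
```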
In step S101 of some embodiments, target audio and target text are acquired, where the target audio and the target text are used for reflecting the same target language content. It should be noted that the target language content refers to the language content that is the target of abnormal language detection; the target audio is the audio carrier of the target language content, and the target text is its text carrier. The voice features in the target audio and the text features in the target text can complement each other, making up for the feature deficiencies of a single language-content carrier and thereby improving the accuracy and robustness of abnormal language detection.
Referring to fig. 2, step S101 may include, but is not limited to, steps S201 to S202 described below, according to some embodiments of the present application.
Step S201, obtaining target audio of target language content;
step S202, performing voice recognition processing on the target audio to obtain target text corresponding to the target language content.
Through the embodiments shown in steps S201 to S202, the target audio of the target language content may be acquired first, and the target audio is then subjected to speech recognition processing to obtain the target text corresponding to the target language content. In some embodiments, the target audio may be captured and then converted into the corresponding target text by Automatic Speech Recognition (ASR). It should be noted that speech recognition aims at converting the lexical content of human speech into computer-readable input, such as keys, binary codes or character sequences. This is unlike speaker recognition and speaker verification, which attempt to identify or verify the speaker producing the speech rather than the lexical content it contains.
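As a hedged illustration of step S202, the sketch below converts a recorded audio file into text using the third-party speech_recognition package; the package choice, the file name and the Google Web Speech backend are illustrative assumptions only.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
# "target_language.wav" is an illustrative file holding the target audio.
with sr.AudioFile("target_language.wav") as source:
    target_audio = recognizer.record(source)  # read the whole file

# Convert the lexical content of the speech into the target text (ASR).
target_text = recognizer.recognize_google(target_audio)
print(target_text)
```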
In other embodiments, the target text may be acquired first, and speech generation processing may then be performed on the target text to generate the target audio corresponding to it. It should be noted that speech generation in the embodiments of the present application refers to the process of generating speech from text, and there are various ways to implement it: for example, mapping text to speech with a statistical pronunciation model based on a speech feature library; or dividing the text into units, generating speech for each unit according to rules, and assembling the units into complete sentences; or using neural network techniques to convert text to speech by directly outputting the corresponding acoustic feature sequence from a speech feature library and the sequence of characters in the input text; or, with deep-learning-based text-to-speech, using a massive text-speech corpus and a deep learning algorithm to extract text features and map them to speech.
It should be appreciated that there are various ways to obtain the target audio and the target text, which may include, but are not limited to, the specific examples set forth above.
Referring to fig. 3, according to some embodiments of the present application, the abnormal language detection method further includes joint pre-training of the audio encoder, the text encoder and the modal fusion decoder before step S102, which may specifically include, but is not limited to, steps S301 to S303 described below.
Step S301, a training data set is obtained, wherein the training data set comprises a plurality of training language samples, and each training language sample is configured with a corresponding training label; the training language sample comprises training audio and training text which represent the same training language content, and the training label is used for identifying whether the training language content reaches a language anomaly benchmark;
step S302, inputting training audio into an original audio encoder, inputting training text into an original text encoder, and inputting training labels into an original modal fusion decoder for iterative training;
step S303, when the iterative training meets the first preset condition, obtaining a pre-trained audio encoder, a pre-trained text encoder and a pre-trained modal fusion decoder.
In step S301 of some embodiments, a training data set is obtained, where the training data set includes a plurality of training language samples, each configured with a corresponding training label; each training language sample includes training audio and training text representing the same training language content, and the training label identifies whether that training language content reaches the language anomaly benchmark. Since the embodiments of the present application aim to pre-train the audio encoder, the text encoder and the modal fusion decoder so as to improve, respectively, the audio feature extraction capability of the audio encoder, the text feature extraction capability of the text encoder and the anomaly detection capability of the modal fusion decoder, a training data set is required for training these capabilities. It is emphasized that the language anomaly benchmark is a criterion defining when the abnormal language in language content exceeds an acceptable level; when the abnormal language detection result indicates that the target language content reaches the language anomaly benchmark, the abnormal language in the target language content exceeds that level.
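A minimal sketch of how one training language sample and its training label might be represented follows; the field names and types are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrainingLanguageSample:
    training_audio: bytes  # audio carrier of the training language content
    training_text: str     # text carrier of the same training language content
    # Training label: 1 if the training language content reaches the
    # language anomaly benchmark, 0 otherwise.
    training_label: int

sample = TrainingLanguageSample(b"\x00\x01", "example utterance", 0)
```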
In steps S302 to S303 of some embodiments, the training audio is input into the original audio encoder, the training text into the original text encoder, and the training label into the original modal fusion decoder for iterative training; when the iterative training meets the first preset condition, the pre-trained audio encoder, the pre-trained text encoder and the pre-trained modal fusion decoder are obtained. After the training data set is obtained, the audio encoder, the text encoder and the modal fusion decoder may be iteratively trained based on the training language samples and the training labels. The iterative training aims to improve the audio feature extraction capability of the audio encoder, the text feature extraction capability of the text encoder and the anomaly detection capability of the modal fusion decoder. According to some exemplary embodiments of the present application, the model parameters of the three components are continuously updated during iterative training, so that the audio encoder acquires an increasingly strong audio feature extraction capability, the text encoder an increasingly strong text feature extraction capability, and the modal fusion decoder an increasingly strong anomaly detection capability. When the three components meet the first preset condition in the iterative training, the pre-trained audio encoder, text encoder and modal fusion decoder can be obtained. Meeting the first preset condition means that, at the current iteration round, the three components have reached the pre-training expectation and the accuracy of abnormal language detection has reached a level applicable to real scenarios.
Through the embodiments of the present application shown in steps S301 to S303, pre-training of the audio encoder, the text encoder and the modal fusion decoder can be implemented, improving the audio feature extraction capability of the audio encoder, the text feature extraction capability of the text encoder and the anomaly detection capability of the modal fusion decoder, and correspondingly improving the accuracy of detecting abnormal information from the target audio and the target text.
Referring to fig. 4, step S302 may include, but is not limited to, steps S401 to S404 described below, according to some embodiments of the present application.
Step S401, in each round of iterative training, audio feature extraction is carried out on training audio based on an audio encoder to obtain training audio information;
step S402, text feature extraction is carried out on training texts based on a text encoder to obtain training text information;
step S403, integrating the training audio information and the training text information to generate training fusion information, and inputting the training fusion information into a fusion mode decoder for abnormal language detection to obtain a detection result of the round;
step S404, after the detection result of the present round is obtained, the audio encoder, the text encoder and the fusion mode decoder are updated based on the detection result of the present round and the training label.
According to steps S401 to S404 of some embodiments of the present application, in each round of iterative training, audio feature extraction is performed on the training audio based on the audio encoder to obtain training audio information, and text feature extraction is performed on the training text based on the text encoder to obtain training text information; further, the training audio information and the training text information are integrated to generate training fusion information, which is input into the modal fusion decoder for abnormal language detection to obtain the detection result of the present round; still further, after the detection result of the present round is obtained, the audio encoder, the text encoder and the modal fusion decoder are updated based on the detection result of the present round and the training label. It is emphasized that the audio encoder encodes an audio file and extracts from it feature vectors characterizing sound semantics, the text encoder encodes a text file and extracts from it feature vectors characterizing text semantics, and the modal fusion decoder performs abnormal language detection on the fusion embedded vector combining the audio and text modalities so as to determine the abnormal language detection result. Therefore, so that the audio encoder, the text encoder and the modal fusion decoder can cooperate in practical applications to achieve the abnormal language detection effect, during their joint pre-training, in each iteration round the extracted training audio information and training text information are integrated to generate training fusion information, the training fusion information is input into the modal fusion decoder for abnormal language detection to obtain the detection result of the present round, and the three components are then updated based on that detection result and the training label. Specifically, the loss function value of the current iteration round may be calculated from the detection result of the present round and the training label, and the audio encoder, the text encoder and the modal fusion decoder may then be updated based on that loss function value.
In the process of model training, there is a difference between the predicted value (i.e., the detection result of the present round) and the actual value (i.e., the training label); "loss" denotes the penalty the model incurs for failing to produce the expected result. The loss function determines the performance of the model by comparing its predicted output with the expected output, and indicates the direction of optimization. If the deviation between the predicted and expected outputs is large, the loss function outputs a large value; if the deviation is small or the values are nearly identical, the output value is very low. Thus, during iterative training, the model parameters are continuously adjusted under a suitable loss function as the model is trained on the data set, until they reach appropriate values, which may be the values that minimize the loss function or make it converge.
Through the embodiments of the present application shown in steps S401 to S404, pre-training of the audio encoder, the text encoder and the modal fusion decoder can be carried out in each iteration round, so that the audio feature extraction capability of the audio encoder, the text feature extraction capability of the text encoder and the anomaly detection capability of the modal fusion decoder are gradually improved, and the accuracy of detecting abnormal information from the target audio and the target text gradually improves accordingly.
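The following hedged sketch puts steps S301 to S303 and S401 to S404 together as one possible joint pre-training loop, assuming PyTorch; the Adam optimizer, the binary cross-entropy loss, the concatenation-based integration and the loss-convergence form of the first preset condition are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def joint_pretrain(audio_encoder, text_encoder, modal_fusion_decoder,
                   dataloader, max_rounds=100, tol=1e-4):
    params = (list(audio_encoder.parameters())
              + list(text_encoder.parameters())
              + list(modal_fusion_decoder.parameters()))
    optimizer = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.BCEWithLogitsLoss()  # compares round result with label
    prev_loss = float("inf")
    for _ in range(max_rounds):
        total_loss = 0.0
        # Each batch: training audio, tokenized training text, training label.
        for train_audio, train_text, train_label in dataloader:
            audio_info = audio_encoder(train_audio)             # step S401
            text_info = text_encoder(train_text)                # step S402
            fused = torch.cat([audio_info, text_info], dim=-1)  # step S403
            logits = modal_fusion_decoder(fused)
            loss = criterion(logits, train_label.float())       # step S404
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        # First preset condition (illustrative): the loss has converged.
        if abs(prev_loss - total_loss) < tol:
            break
        prev_loss = total_loss
    return audio_encoder, text_encoder, modal_fusion_decoder
```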
In step S102 of some embodiments, the target audio is input to a pre-trained audio encoder for audio feature extraction, so as to obtain audio feature information. It should be noted that the audio encoder is used to encode an audio file and extract therefrom feature vectors for characterizing sound semantics. In some embodiments, the audio data set and the corresponding audio training tag may be used to pretrain some audio processing models, and after the models converge, the encoder in the audio processing model is taken out for extracting audio features of the target audio to obtain audio feature information.
In some more specific embodiments, Mel-frequency cepstral coefficient (Mel Frequency Cepstrum Coefficient, MFCC) features may be extracted from the Hz spectrum of the training audio in the audio dataset; the Bert pre-training model is then pre-trained for audio processing capability by the MLM (Masked Language Modeling) method, and after the Bert pre-training model converges, the Encoder in it is taken out as the audio encoder in the embodiments of the present application, for extracting the audio features of the target audio to obtain the audio feature information. The Mel frequency scale is based on the auditory characteristics of the human ear and has a nonlinear correspondence with the Hz frequency; Mel-frequency cepstral coefficients use this relationship to compute cepstral features from the Hz spectrum.
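A hedged sketch of the MFCC extraction step follows, assuming the third-party librosa package; the file name, sampling rate and number of coefficients are illustrative assumptions.

```python
import librosa

# Load one training audio file; 16 kHz is an illustrative sampling rate.
waveform, sample_rate = librosa.load("training_audio.wav", sr=16000)

# Compute Mel-frequency cepstral coefficients from the waveform's spectrum.
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```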
Referring to fig. 5, step S102 may include, but is not limited to, steps S501 to S502 described below, according to some embodiments of the present application.
Step S501, inputting target audio into a multi-head attention module of a pre-trained audio encoder for encoding processing to generate an audio query matrix, an audio key matrix and an audio value matrix;
step S502, performing scaling dot product attention operation on the audio query matrix, the audio key matrix and the audio value matrix to obtain audio characteristic information.
As described above, the audio encoder in the embodiments of the present application may be the Encoder taken from a converged Bert pre-training model that was pre-trained on MFCC features by the MLM method. When this Bert-derived audio encoder is used to extract audio features from the target audio, the multi-head attention mechanism of the Bert model may be utilized.
It should be noted that the Bert model is a common machine learning model for natural language processing, which uses a bidirectional Transformer encoder to learn deep representations of text. There are three different attention mechanisms in the Transformer: self-attention in the encoder, self-attention in the decoder, and encoder-decoder attention. Self-attention in the encoder allows the model to capture the relationship of each position in the input sequence to all other positions, thereby generating better vector representations. Self-attention in the decoder allows the model to capture the relationship of each position in the output sequence to the preceding positions, thereby generating better predictions; to avoid seeing future information, the decoder uses a masking mechanism. Encoder-decoder attention allows the model to use the output of the encoder to generate better decoder outputs.
Bidirectional Encoder Representations from Transformers (Bert) uses extensive pre-training on unlabeled data to obtain representations containing the semantic information of the text or audio spectrum. The input is the original vector of each word in the text or audio spectrum; this vector may be randomly initialized or pre-trained with word2vec. The vector input to the model is the sum of three parts: a word vector (Token Embedding), a text or audio spectrum segment vector (Segment Embedding), and a position vector (Position Embedding); the three vectors are added together and fed into the Bert model. It should be noted that the word vector is an ordinary word embedding, the segment vector uses its embedded information to separate different contexts, and the position vector is learned.
It should be noted that the Bert model can perform a wide variety of tasks, and its inputs and outputs differ from task to task. For example:
single text or audio spectrum classification: such as article emotion analysis. The Bert model adds a [ CLS ] symbol in front of the input text or audio frequency spectrum, and takes the output vector corresponding to the symbol as the semantic representation of the final whole article. Compared with other text or audio frequency spectrum words, the [ CLS ] vector can more fairly fuse semantic information sentences of each word to classify the text or audio frequency spectrum: such as a question and answer system. At this time, the input vector is added with [ CLS ] and [ SEP ] for dividing sentences;
and (3) sequence labeling: the output vector corresponding to each word of the task is a label for the word and can be understood as classification;
two pre-training modes of the Bert pre-training model: first, masked LM (Masked Language Modeling) refers to a given sentence, then, erasing several words from it, requiring that the missing word be determined based on context. The image is understood to be "complete filling". The original author randomly selects 15% of words in a sentence for prediction. For the erased vocabulary, step S80% is replaced by [ MASK ], step S10% is replaced by other words, and step S10% remains unchanged. The reason for this is that the model does not know if the vocabulary of the corresponding location is the correct answer at the actual prediction, which forces the model to rely more on the context information for prediction; second, next Sentence Prediction, which is to say, given two sentences, judges whether the second sentence is next to the first sentence.
In the attention mechanism (Attention), the query vector (Query Vector), key vector (Key Vector) and value vector (Value Vector) are three basic vector representations, used respectively to describe the input sequence, compute similarities, and output weighted information. Specifically:
query vector: generally refers to the representation of the input sequence that the model is currently looking at or processing, typically the last hidden state obtained by the encoder or the current decoder state. The query vector represents information that the model is currently focusing on and can be considered as a problem or what needs to be judged.
Key vector: used to compute the similarity between the query vector and each element of the rest of the input sequence, analogous to the "keys" of the "key-value pairs" in a hash table. It is also generated by the encoder, though possibly differently from the query vector, and can be viewed as the information the model retrieves or looks up against.
Value vector: corresponds to a key vector and carries the content of the input-sequence element represented by that key, weighted by its importance in the model. It is also generated by the encoder, though possibly differently from the query and key vectors, and can be seen as supplying the answer or content for a particular key.
When attention coefficients are calculated by the scaled dot-product attention operation (Scaled Dot-Product Attention), the value vectors are weighted according to the similarity between the key vectors and the query vector, and the weighted value vector is obtained as the output result.
In summary, the query vector, key vector and value vector are three important elements of the attention mechanism; they respectively provide the query information, the retrieval index over the input sequence, and the weighted output information, and the attention (lookup) result is obtained by computing similarities.
In steps S501 to S502 of some embodiments, the target audio is input into the multi-head attention module of the pre-trained audio encoder for encoding processing to generate an audio query matrix, an audio key matrix and an audio value matrix, and a scaled dot-product attention operation is then performed on the audio query matrix, the audio key matrix and the audio value matrix to obtain the audio feature information. It should be noted that the input of the multi-head attention mechanism in the audio encoder is the target audio: after the target audio is input into the pre-trained audio encoder, the multi-head attention module encodes it to obtain the audio query matrix, audio key matrix and audio value matrix corresponding to the target audio, and the scaled dot-product attention operation on these three matrices then yields the audio feature information.
In some more specific embodiments, performing the scaled dot-product attention operation on the audio query matrix Q_s, the audio key matrix K_s and the audio value matrix V_s can be expressed as:

Attention(Q_s, K_s, V_s) = softmax(Q_s K_s^T / √d_k) V_s

where d_k is the dimension of the audio query matrix Q_s and the audio key matrix K_s, and softmax is the normalization function.
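A hedged NumPy sketch of this scaled dot-product attention operation follows; the matrix shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]                               # dimension of query/key vectors
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # similarity, scaled
    return softmax(scores) @ V                      # weighted value vectors

Q_s = np.random.randn(4, 8)  # audio query matrix (4 positions, dim 8)
K_s = np.random.randn(4, 8)  # audio key matrix
V_s = np.random.randn(4, 8)  # audio value matrix
audio_feature_info = scaled_dot_product_attention(Q_s, K_s, V_s)
```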
Through the embodiments of the present application shown in steps S501 to S502, the Bert pre-training model may be used in the process of extracting the audio features of the target audio, and the multi-head attention mechanism of the Bert model may be used to extract the audio feature information; this improves the quality of the audio feature information and in turn helps improve the accuracy of abnormal language detection.
In step S103 of some embodiments, the target text is input into a pre-trained text encoder to perform text feature extraction, so as to obtain text feature information. It should be noted that, the text encoder is used for encoding the text file and extracting the feature vector for representing the text semantics from the text file. In some embodiments, a text data set and corresponding text training labels may be used to pretrain some text processing models, and after the models converge, an encoder in the text processing model is taken out for extracting text features of the target text, so as to obtain text feature information.
In some more specific embodiments, the Bert pre-training model can be pre-trained for text processing capability by the Masked LM (Masked Language Modeling) method, and after the Bert pre-training model converges, the Encoder in it is taken out as the text encoder in the embodiments of the present application, for extracting the text features of the target text to obtain the text feature information.
Referring to fig. 6, step S103 may include, but is not limited to, steps S601 to S602 described below, according to some embodiments of the present application.
Step S601, inputting a target text into a multi-head attention module of a pre-trained text encoder for encoding processing to generate a text query matrix, a text key matrix and a text value matrix;
step S602, performing scaling dot product attention operation on the text query matrix, the text key matrix and the text value matrix to obtain text characteristic information.
In steps S601 to S602 of some embodiments, the target text is input into the multi-head attention module of the pre-trained text encoder for encoding processing to generate a text query matrix, a text key matrix and a text value matrix, and a scaled dot-product attention operation is then performed on the text query matrix, the text key matrix and the text value matrix to obtain the text feature information. It should be noted that the input of the multi-head attention mechanism in the text encoder is the target text: after the target text is input into the pre-trained text encoder, the multi-head attention module encodes it to obtain the text query matrix, text key matrix and text value matrix corresponding to the target text, and the scaled dot-product attention operation on these three matrices then yields the text feature information.
In some more specific embodiments, performing the scaled dot-product attention operation on the text query matrix Q_t, the text key matrix K_t and the text value matrix V_t can be expressed as:

Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t

where d_k is the dimension of the text query matrix Q_t and the text key matrix K_t, and softmax is the normalization function.
Through the embodiments of the present application shown in steps S601 to S602, the Bert pre-training model may be used in the process of extracting the text features of the target text, and the multi-head attention mechanism of the Bert model may be used to extract the text feature information; this improves the quality of the text feature information and in turn helps improve the accuracy of abnormal language detection.
It should be appreciated that there are various ways to implement inputting the target audio into the pre-trained audio encoder for audio feature extraction to obtain the audio feature information, and inputting the target text into the pre-trained text encoder for text feature extraction to obtain the text feature information; these may include, but are not limited to, the specific examples set forth above.
In step S104 of some embodiments, the audio feature information and the text feature information are integrated to obtain fusion embedded information. It should be noted that, in order to let the voice features in the target audio and the text features in the target text complement each other and make up for the feature deficiencies of a single language-content carrier, thereby improving the accuracy and robustness of abnormal language detection, the audio feature information and the text feature information need to be integrated into fusion embedded information; abnormal language detection is then performed on the fusion embedded information in the subsequent steps, which yields a better detection effect and improves the accuracy of abnormal language detection.
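One simple way to perform the integration of step S104 is sketched below; sequence-axis concatenation is an illustrative assumption (feature-dimension concatenation, used in the earlier sketches, is another option), and the tensor shapes are illustrative.

```python
import torch

audio_feature_info = torch.randn(1, 32, 256)  # (batch, audio frames, dim)
text_feature_info = torch.randn(1, 24, 256)   # (batch, text tokens, dim)

# Concatenate along the sequence axis so the modal fusion decoder can
# attend across audio and text positions within a single sequence.
fusion_embedded = torch.cat([audio_feature_info, text_feature_info], dim=1)
print(fusion_embedded.shape)  # torch.Size([1, 56, 256])
```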
In step S105 of some embodiments, the fusion embedded information is input into the pre-trained modal fusion decoder to perform abnormal language detection, so as to obtain an abnormal language detection result, where the abnormal language detection result is used to indicate that the target language content reaches the language anomaly benchmark or that it does not. It should be noted that the modal fusion decoder performs abnormal language detection on the fusion embedded vector combining the audio and text features, so as to determine the abnormal language detection result. The language anomaly benchmark is a criterion defining when the proportion of abnormal language in language content exceeds an acceptable level; when the abnormal language detection result indicates that the target language content reaches the language anomaly benchmark, the proportion of abnormal language in the target language content is excessive, and the target audio or target text containing that content can be screened out or filtered.
Referring to fig. 7, step S105 may include, but is not limited to, steps S701 to S702 described below, according to some embodiments of the present application.
Step S701, inputting the fusion embedded information into a pre-trained modal fusion decoder for decoding processing to obtain decoding classification parameters corresponding to the fusion embedded information;
step S702, detecting whether the target language content reaches the language anomaly benchmark based on the decoding classification parameters to obtain an abnormal language detection result.
In steps S701 to S702 of some embodiments, the fusion embedded information is input into the pre-trained modal fusion decoder for decoding processing to obtain the decoding classification parameters corresponding to the fusion embedded information, and whether the target language content reaches the language anomaly benchmark is then detected based on the decoding classification parameters to obtain the abnormal language detection result. It should be noted that the decoding classification parameter reflects the proportion of abnormal language in the target language content. The Bert model may add a [CLS] symbol (i.e., the CLS token) before the input text or audio spectrum and use the output vector corresponding to that symbol as the semantic representation of the entire target language content; compared with the other words in the text or audio spectrum, the [CLS] vector can more impartially fuse the semantic information of each word. Therefore, after the fusion embedded information is input into the pre-trained modal fusion decoder for decoding processing, the [CLS] vector produced by the decoder can be determined as the decoding classification parameter corresponding to the fusion embedded information. In this way, a reliable abnormal language detection result is obtained from the fusion embedded information.
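The application does not fix how the decoding classification parameter is computed from the [CLS] vector; the sketch below assumes the [CLS] position is first in the decoder output and uses an illustrative linear head with a sigmoid to turn it into a proportion-like scalar.

```python
import torch
import torch.nn as nn

decoder_output = torch.randn(1, 57, 256)  # [CLS] prepended to the fused sequence
cls_vector = decoder_output[:, 0, :]      # semantic representation of the content

classifier = nn.Linear(256, 1)            # illustrative classification head
# Decoding classification parameter: here interpreted as the estimated
# proportion of abnormal language in the target language content.
decoding_classification_param = torch.sigmoid(classifier(cls_vector))
```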
Referring to fig. 8, according to some embodiments of the present application, the abnormal language detection result includes a first detection result for indicating that the target language content reaches the language anomaly benchmark or a second detection result for indicating that the target language content does not reach the language anomaly benchmark. It should be noted that step S702 may include, but is not limited to, steps S801 to S803 described below.
Step S801, determining abnormal language content in target language content based on decoding classification parameters;
step S802, when the abnormal language content meets a second preset condition, determining that the target language content reaches the language anomaly benchmark, and determining the abnormal language detection result as the first detection result;
step S803, when the abnormal language content does not meet the second preset condition, determining that the target language content does not reach the language anomaly benchmark, and determining the abnormal language detection result as the second detection result.
In steps S801 to S803 of some embodiments, the abnormal language content in the target language content is determined based on the decoding classification parameters; when the abnormal language content meets the second preset condition, it is determined that the target language content reaches the language anomaly benchmark and the abnormal language detection result is determined as the first detection result; when the abnormal language content does not meet the second preset condition, it is determined that the target language content does not reach the language anomaly benchmark and the abnormal language detection result is determined as the second detection result. It is emphasized that, since the decoding classification parameter reflects the proportion of abnormal language in the target language content, the abnormal language content in the target language content can be determined based on the decoding classification parameters. When the abnormal language content meets the second preset condition, the abnormal language is excessive, and the target language content can be determined to reach the language anomaly benchmark; when it does not meet the second preset condition, the abnormal language is not excessive, and the target language content can be determined not to reach the language anomaly benchmark. In this way, whether the target language content represented by the target audio or the target text contains excessive abnormal language can be detected from the abnormal language detection result, improving the accuracy of abnormal language detection.
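Steps S801 to S803 can be sketched as a simple threshold test, assuming the second preset condition is that the abnormal-language proportion exceeds a fixed value; the threshold is an illustrative assumption.

```python
ABNORMAL_RATIO_THRESHOLD = 0.1  # illustrative second preset condition

def classify_detection_result(decoding_classification_param: float) -> str:
    if decoding_classification_param >= ABNORMAL_RATIO_THRESHOLD:
        # Target language content reaches the language anomaly benchmark.
        return "first detection result"
    # Target language content does not reach the language anomaly benchmark.
    return "second detection result"

print(classify_detection_result(0.25))  # -> "first detection result"
```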
Fig. 9 shows an electronic device 900 provided in an embodiment of the present application. The electronic device 900 includes: a processor 901, a memory 902, and a computer program stored in the memory 902 and executable on the processor 901, the computer program being operative to perform the abnormal language detection method described above.
The processor 901 and the memory 902 may be connected by a bus or other means.
The memory 902, as a non-transitory computer readable storage medium, may be used to store a non-transitory software program and a non-transitory computer executable program, such as the abnormal language detection method described in the embodiments of the present application. The processor 901 implements the abnormal language detection method described above by running a non-transitory software program and instructions stored in the memory 902.
The memory 902 may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created during execution of the abnormal language detection method described above. In addition, the memory 902 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 902 optionally includes memory located remotely from the processor 901, and such remote memory may be connected to the electronic device 900 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the above-described abnormal language detection method are stored in the memory 902, and when executed by the one or more processors 901, the above-described abnormal language detection method is performed, for example, the method steps S101 to S105 in fig. 1, the method steps S201 to S202 in fig. 2, the method steps S301 to S303 in fig. 3, the method steps S401 to S404 in fig. 4, the method steps S501 to S502 in fig. 5, the method steps S601 to S602 in fig. 6, the method steps S701 to S702 in fig. 7, and the method steps S801 to S803 in fig. 8.
The embodiment of the application also provides a computer readable storage medium, which stores computer executable instructions for executing the abnormal language detection method.
In an embodiment, the computer-readable storage medium stores computer-executable instructions that are executed by one or more control processors, for example, to perform method steps S101 through S105 in fig. 1, method steps S201 through S202 in fig. 2, method steps S301 through S303 in fig. 3, method steps S401 through S404 in fig. 4, method steps S501 through S502 in fig. 5, method steps S601 through S602 in fig. 6, method steps S701 through S702 in fig. 7, and method steps S801 through S803 in fig. 8.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. It should also be appreciated that the various embodiments provided in the embodiments of the present application may be combined arbitrarily to achieve different technical effects.
While the preferred embodiments of the present application have been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit and scope of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.