CN114036260A - Determination method, device, equipment, storage medium and program product of sensitive words - Google Patents

Determination method, device, equipment, storage medium and program product of sensitive words

Info

Publication number
CN114036260A
CN114036260A
Authority
CN
China
Prior art keywords
word
words
sensitive
candidate
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111319256.5A
Other languages
Chinese (zh)
Other versions
CN114036260B (en)
Inventor
李聪健
刘海东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baiguoyuan Network Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Network Technology Co Ltd
Priority to CN202111319256.5A
Publication of CN114036260A
Application granted
Publication of CN114036260B
Legal status: Active
Anticipated expiration

Abstract

The embodiments of this application disclose a method, apparatus, device, storage medium, and program product for determining sensitive words, belonging to the field of artificial intelligence. The method comprises the following steps: training a word vector extraction model based on a corpus text and the subwords of each word in the corpus text, wherein the corpus text is phonographic text (text written in a phonetic script); performing feature extraction on candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words; performing feature extraction on sensitive word original words through the word vector extraction model to obtain sensitive word vectors, wherein a sensitive word original word is composed of at least one word; and determining candidate sensitive words in the candidate words based on the sensitive word vectors and the candidate word vectors, wherein the candidate sensitive words comprise at least one of the sensitive word original words or the sensitive word deformed words. With the scheme provided by the embodiments of this application, the recognition rate of deformed sensitive words in phonographic-script scenarios is improved, and the masking of sensitive words is therefore more effective.

Description

Method, device, equipment, storage medium and program product for determining sensitive words
Technical Field
Embodiments of this application relate to the field of artificial intelligence, and in particular to a method, apparatus, device, storage medium, and program product for determining sensitive words.
Background
To maintain a healthy internet environment, sensitive words are masked in internet products such as websites, forums, and applications.
In the related art, sensitive word recognition is usually performed against a preset sensitive word vocabulary, and the recognized sensitive words are then masked. For example, when a comment consists of word 1, word 2, and word 3, the comment is masked if word 1 belongs to the sensitive word vocabulary.
However, for phonographic scripts, malicious users may deform sensitive words to evade masking, so vocabulary-based masking performs poorly. For example, in an English scenario, an English sensitive word may be deformed by reordering its letters, omitting some of its letters, and so on.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment, a storage medium and a program product for determining sensitive words. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for determining a sensitive word, where the method includes:
training a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, wherein the corpus text is a phonogram text;
performing feature extraction on candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words, wherein the candidate words are composed of at least one word;
performing feature extraction on the sensitive word original word through the word vector extraction model to obtain a sensitive word vector, wherein the sensitive word original word is composed of at least one word;
and determining candidate sensitive words in the candidate words based on the sensitive word vectors and the candidate word vectors, wherein the candidate sensitive words comprise at least one of the sensitive word original words or the sensitive word deformed words, and the sensitive word deformed words are obtained by deforming the sensitive word original words.
In another aspect, an embodiment of the present application provides an apparatus for determining a sensitive word, where the apparatus includes:
the training module is used for training a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, wherein the corpus text is a phonogram text;
the first extraction module is used for extracting the characteristics of candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words, wherein the candidate words are composed of at least one word;
the second extraction module is used for extracting the characteristics of the sensitive word original words through the word vector extraction model to obtain sensitive word vectors, wherein the sensitive word original words are composed of at least one word;
and the determining module is used for determining a candidate sensitive word in the candidate words based on the sensitive word vector and the candidate word vector, wherein the candidate sensitive word comprises at least one of the sensitive word original word or the sensitive word deformed word, and the sensitive word deformed word is obtained by deforming the sensitive word original word.
In another aspect, embodiments of the present application provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for determining a sensitive word according to the above aspect.
In another aspect, embodiments of the present application provide a computer-readable storage medium, where at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the method for determining a sensitive word as provided in various aspects of the present application.
In another aspect, the present application provides a computer program product, which includes computer instructions, and when executed by a processor, the computer instructions implement the method for determining a sensitive word according to the foregoing aspect.
In the embodiments of this application, when the word vector extraction model is trained, the subwords of each word in the corpus text are used as training samples in addition to the corpus text itself, which improves the recognition capability of the word vector extraction model for deformed words; after the trained word vector extraction model performs feature extraction on the candidate words in the corpus text and on the sensitive word original words, sensitive word recognition can be performed based on the word vectors corresponding to each.
Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application, and a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a method of determining sensitive words provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a sensitive word determination process shown in one exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a method of determining sensitive words provided by another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a word vector determination process output by an exemplary embodiment;
FIG. 6 is a diagram illustrating an implementation of a sensitive word determination process, according to an illustrative embodiment of the present application;
fig. 7 is a block diagram illustrating a structure of a sensitive word determining apparatus according to an exemplary embodiment of the present application.
Detailed Description
Fig. 1 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application, which includes a terminal 110 and a server 120.
The terminal 110 is an electronic device used to set up sensitive word recognition tasks; the electronic device may be a smartphone, a tablet computer, a personal computer, a personal workstation, or the like. In Fig. 1, the terminal 110 is illustrated as a personal computer, but the configuration is not limited thereto.
Optionally, the terminal 110 is configured to issue a sensitive word recognition task to the server 120. The task may include the sensitive word original words and the sensitive word length of the sensitive words to be recognized, where the sensitive word original words may be issued in the form of a vocabulary. Illustratively, a sensitive word recognition task issued by the terminal includes a sensitive word original word vocabulary and a sensitive word length of 2, meaning that sensitive words composed of two words are to be mined.
The server 120 is a server for performing the sensitive word recognition task; it may be a single server or a server group consisting of multiple servers, and it may be a physical server or a cloud server, which is not limited in the embodiments of this application.
Optionally, the server 120 is a backend server of a website, configured to perform sensitive word recognition on website content; or a backend server of an application, configured to perform sensitive word recognition on text content published by users in the application; or a backend server of a forum, configured to perform sensitive word recognition on posts or comments published by users in the forum. The embodiments of this application do not limit the specific type of server.
Optionally, the server 120 collects corpus text in advance so as to train a word vector extraction model based on it, where the word vector extraction model is used to output the word vectors corresponding to input text. After receiving the sensitive word recognition task issued by the terminal 110, the server 120 performs feature extraction on the corpus text and the sensitive word original words through the word vector extraction model to obtain the corresponding word vectors, and then recognizes candidate sensitive words in the corpus text based on those word vectors.
In a possible implementation, the server 120 masks the corpus text containing the candidate sensitive words, or feeds the identified candidate sensitive words back to the terminal 110, where the candidate sensitive words are further confirmed, for example by manual review.
It should be noted that the foregoing description takes the case of the server training the word vector extraction model and performing sensitive word recognition as an example; in other possible embodiments, the word vector extraction model may be trained by the terminal, which then performs the sensitive word recognition, and this is not limited in the embodiments. For convenience of description, the following embodiments describe the method of determining sensitive words as performed by a computer device.
Referring to fig. 2, a flowchart of a method for determining a sensitive word provided by an exemplary embodiment of the present application is shown, which may include the following steps.
Step 201, training a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, wherein the corpus text is a phonogram text.
In one possible implementation, a computer device first gathers corpus text for training the word vector extraction model. The corpus text may be phonographic text captured from the network; it may be English text, French text, or other text using Latin letters, Russian text using Slavic letters, Arabic text using Arabic letters, and so on. For convenience of description, the following examples treat the phonographic text as English text, but this is not a limitation.
Optionally, the corpus text includes compliant corpus text and non-compliant corpus text, where the non-compliant corpus text contains at least one of a sensitive word original word or a sensitive word deformed word, and the compliant corpus text contains neither.
In the embodiments of this application, the word vector extraction model extracts features from input text and outputs the word vectors corresponding to the words in the text. To improve the recognition of deformed words, the subwords of each word in the corpus text are used as training samples in addition to the corpus text itself when the model is trained; correspondingly, the word vector of a word in the corpus text is computed from the vector of the word and the vectors of its subwords. That is, the subword features are fused into the word vector, improving the word vector's ability to express the subwords within a word.
A subword is composed of at least two consecutive letters of a word. For example, the 3-letter subwords of the word learning include lea, ear, arn, rni, nin, and ing.
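As an illustrative sketch (the function name is hypothetical, not from the patent), the contiguous-subword extraction just described can be expressed as:

```python
def char_ngrams(word, n=3):
    """Return all contiguous n-letter subwords of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# For the word "learning", the 3-letter subwords are:
# lea, ear, arn, rni, nin, ing
```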
Step 202, performing feature extraction on candidate words corresponding to the corpus text through a word vector extraction model to obtain candidate word vectors corresponding to the candidate words, wherein the candidate words are composed of at least one word.
In a possible implementation manner, after completing the training of the word vector extraction model and receiving the sensitive word recognition task, the computer device performs feature extraction on the corpus text through the word vector extraction model. Optionally, the computer device determines candidate words in the corpus text based on the sensitive word recognition task, and further performs feature extraction on the candidate words by using a word vector extraction model to obtain candidate word vectors corresponding to the candidate words.
Optionally, a candidate word is a single word or a phrase consisting of at least two words. Correspondingly, when the candidate word is a single word, the candidate word vector is the word vector of that word; when the candidate word is a phrase, the candidate word vector is the word vector of the phrase. The determination of phrase word vectors is described in detail in the following embodiments.
And 203, performing feature extraction on the sensitive word original word through a word vector extraction model to obtain a sensitive word vector, wherein the sensitive word original word is composed of at least one word.
In a possible implementation manner, after completing the training of the word vector extraction model and receiving the sensitive word recognition task, the computer device performs feature extraction on the sensitive word original words through the word vector extraction model. The sensitive word original words may be contained in a vocabulary indicated by the sensitive word recognition task, and the task instructs the device to identify, in the corpus text, the sensitive word original words as well as the sensitive word deformed words obtained by deforming them.
Optionally, a sensitive word original word is a single word or a phrase composed of at least two words. Correspondingly, when the sensitive word original word is a single word, the sensitive word vector is the word vector of that word; when it is a phrase, the sensitive word vector is the word vector of the phrase.
It should be noted that there is no strict execution order between steps 202 and 203; that is, steps 202 and 203 may be executed sequentially or concurrently, which is not limited in this embodiment.
And 204, determining candidate sensitive words in the candidate words based on the sensitive word vectors and the candidate word vectors, wherein the candidate sensitive words comprise at least one of the sensitive word original words or the sensitive word deformed words, and the sensitive word deformed words are obtained by deforming the sensitive word original words.
Further, the computer device screens candidate sensitive words from the candidate words based on the sensitive word vectors and the candidate word vectors; a candidate sensitive word may be identical to a sensitive word original word, or may be a sensitive word deformed word obtained by deforming one. A sensitive word original word may be deformed by letter substitution, letter reordering, letter omission, and so on.
In one possible implementation, the computer device determines the candidate sensitive words by calculating the vector similarity between the sensitive word vectors and the candidate word vectors.
Taking English as an example, when the sensitive word original words include "stupid jerk", the determined candidate sensitive words may include "stupid jxxk" and the like.
Schematically, the flow of determining the sensitive word is shown in fig. 3. The computer equipment firstly trains a word vector extraction model 33 based on a corpus text 31 and sub-words 32 of words in the corpus text, then determines a candidate word 34 from the corpus text 31, inputs the candidate word 34 into the word vector extraction model 33 for feature extraction to obtain a candidate word vector 35, inputs an original sensitive word 36 into the word vector extraction model 33 for feature extraction to obtain a sensitive word vector 37, and then determines a candidate sensitive word 38 corresponding to the original sensitive word from the candidate word 34 based on the candidate word vector 35 and the sensitive word vector 37.
In summary, in the embodiments of this application, when the word vector extraction model is trained, the subwords of each word in the corpus text are used as training samples in addition to the corpus text itself, which improves the recognition capability of the word vector extraction model for deformed words; after the trained word vector extraction model performs feature extraction on the candidate words in the corpus text and on the sensitive word original words, sensitive word recognition can be performed based on the word vectors corresponding to each.
Referring to fig. 4, a flowchart of a method for determining a sensitive word provided in another exemplary embodiment of the present application is shown, which may include the following steps.
Step 401, performing n-gram word segmentation on each word in the corpus text to obtain subwords, where n is an integer greater than or equal to 2.
In one possible implementation, before performing model training with a corpus text, the computer device first performs n-gram segmentation (character-level segmentation) on each word in the corpus text to obtain the subwords corresponding to the word, where each subword is composed of n characters (letters or symbols).
N may be a default value or a custom value, which is not limited in this embodiment.
Optionally, the computer device segments each word in at least one way, so that the word vector extraction model can learn subword features of different granularities during the subsequent training. For example, both 2-gram and 3-gram segmentation may be performed on a word.
Optionally, before n-gram segmentation, the computer device marks the beginning and end of a word by adding special symbols before and after it, which can be "<" and ">". In one illustrative example, 3-gram segmentation of the word jxxk (marked as <jxxk>) yields the subwords "<jx", "jxx", "xxk", and "xk>".
Step 402, generating a word sequence based on words and subwords in the corpus text.
After word segmentation is completed, the computer device generates the word sequence corresponding to the corpus text, for subsequent model training, from each word in the corpus text and the subwords corresponding to each word.
In one illustrative example, when the corpus text is "stupid jxxk", the computer device generates the word sequence {stupid, jxxk, "<st", "stu", "tup", "upi", "pid", "id>", "<jx", "jxx", "xxk", "xk>"}.
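Steps 401 and 402 can be sketched as follows (function names are illustrative, not from the patent): words are boundary-marked, segmented into character n-grams, and concatenated with the original words into one sequence.

```python
def boundary_ngrams(word, n=3):
    """Character n-grams of a word wrapped in the "<"/">" boundary markers."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def word_sequence(text, n=3):
    """Word sequence for training: the words of the text plus their subwords."""
    words = text.split()
    subwords = [sw for w in words for sw in boundary_ngrams(w, n)]
    return words + subwords

# word_sequence("stupid jxxk") →
# ['stupid', 'jxxk', '<st', 'stu', 'tup', 'upi', 'pid', 'id>',
#  '<jx', 'jxx', 'xxk', 'xk>']
```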
Step 403, training a word vector extraction model based on the word sequence and the context relationship of the words in the corpus text.
Because words in the same corpus text are correlated (for example, certain words tend to appear together), the computer device can train the word vector extraction model in an unsupervised manner based on the context of the words in the corpus text.
Optionally, the computer device predicts the word vectors of a word's context words based on the word vector of that word; alternatively, the computer device predicts the word vector of the word lying between context words based on the word vectors of those context words.
The word vector of a word is obtained by superposing the vector corresponding to the word itself and the vectors of all its subwords. As shown in Fig. 5, for the word jxxk, its word vector may be represented as V_{<jxxk>} + V_{<jx} + V_{jxx} + V_{xxk} + V_{xk>}.
In one possible implementation, the computer device trains the word vector extraction model through a skip-gram algorithm based on the word sequence and the context of the words in the corpus text, wherein the skip-gram algorithm is used for performing context prediction according to the central words.
For example, when the corpus text is composed of word A, word B, word C, and word D, model training based on context may predict word A and word C from word B (the central word), and word B and word D from word C (the central word).
In another possible implementation, the computer device trains the word vector extraction model through a CBOW (Continuous Bag-of-Words) algorithm based on the word sequence and the context relationships of words in the corpus text, where the CBOW algorithm performs central word prediction from the context.
For example, when the corpus text is composed of word A, word B, word C, and word D, model training based on context may predict word B (the central word) from word A and word C, and word C (the central word) from word B and word D.
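The two training-pair constructions can be sketched as follows (a simplified illustration with hypothetical names, not the patent's implementation; real training would feed these pairs into a softmax-based objective):

```python
def skipgram_pairs(words, window=1):
    """(center, context) pairs: skip-gram predicts context from the center word."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

def cbow_pairs(words, window=1):
    """(context, center) pairs: CBOW predicts the center word from its context."""
    pairs = []
    for i, center in enumerate(words):
        context = [words[j]
                   for j in range(max(0, i - window), min(len(words), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

# For ["A", "B", "C", "D"] with window 1: skip-gram predicts A and C from B,
# and B and D from C; CBOW predicts B from A and C, and C from B and D.
```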
Optionally, when the skip-gram algorithm or the CBOW algorithm is used for model training, the computer device may train with hierarchical softmax or negative sampling, among other techniques, which is not limited in this embodiment.
In some embodiments, the word vector extraction model trained by the computer device may be a FastText word vector model. Of course, other word vector extraction models capable of outputting word vectors that include subword features may be adopted; the embodiments of this application are not limited in this regard.
In an illustrative example, the parameters from the input layer to the hidden layer of the word vector extraction model form a matrix W of size N × V, where V is the size of the word list (i.e., the dimension of the one-hot vectors corresponding to the words and subwords in the input word sequence) and N is the dimension of the word vectors to be generated, equal to the number of nodes in the hidden layer (the first layer). The matrix W is adjusted in each training iteration; owing to the one-hot property, each adjustment modifies only the row corresponding to the 1-valued entry of the one-hot vector.
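The one-hot property mentioned above can be illustrated briefly (a toy sketch with hypothetical names, not the patent's implementation): multiplying a one-hot vector by a matrix simply selects one row, so each training update touches only that row.

```python
def one_hot_matmul(one_hot, W):
    """x @ W for a one-hot x reduces to selecting the row at the hot index."""
    hot_index = one_hot.index(1)
    return W[hot_index]

# With W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]] and one-hot input [0, 1, 0],
# the hidden activation is W's row 1: [0.3, 0.4].
```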
After completing the training of the word vector extraction model through steps 401 to 403, the computer device performs feature extraction on the candidate words through steps 404 to 406 and on the sensitive word original words through steps 407 to 408. It should be noted that steps 404 to 406 and steps 407 to 408 have no strict ordering; this embodiment describes them as executed concurrently.
Step 404, obtaining the target word number of the candidate sensitive words, where the target word number is the number of words contained in a candidate sensitive word.
In one possible implementation, the sensitive word recognition task includes a target word number, which indicates the number of words contained in the candidate sensitive words to be mined. For example, to mine sensitive words composed of a single word or of two words, the target word number may be set to 1 or 2, respectively.
Optionally, the target word number may be a default value or a custom value, which is not limited in this embodiment.
Step 405, performing word segmentation on the corpus text based on the target word number to obtain at least one candidate word, wherein each candidate word is composed of the target number of words and the words in a candidate word are contiguous in the corpus text.
Further, the computer device performs word segmentation on the corpus text based on the target word number, obtaining candidate words composed of that number of words. In one possible implementation, the computer device performs word-level n-gram segmentation on the corpus text based on the target word number to obtain the candidate words.
Schematically, as shown in Fig. 6, when the target word number is 2, that is, when phrases composed of two words are to be mined, the computer device performs 2-gram segmentation on the corpus sample 61 ("you stupid jxxk"), obtaining a first candidate word 611 ("you stupid") and a second candidate word 612 ("stupid jxxk").
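Step 405's word-level segmentation can be sketched as follows (the function name is illustrative, not from the patent):

```python
def candidate_phrases(text, num_words=2):
    """Contiguous candidate phrases of `num_words` words (word-level n-grams)."""
    words = text.split()
    return [" ".join(words[i:i + num_words])
            for i in range(len(words) - num_words + 1)]

# candidate_phrases("you stupid jxxk", 2) → ["you stupid", "stupid jxxk"]
```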
And step 406, performing feature extraction on the candidate words corresponding to the corpus text through a word vector extraction model to obtain candidate word vectors corresponding to the candidate words.
For each candidate word, the computer device inputs the candidate word into the word vector extraction model, performs feature extraction on each word in the candidate word to obtain the word vector corresponding to each word, and then determines the candidate word vector based on those word vectors.
In one possible implementation, when the candidate word is a single word, the candidate word vector is the word vector of that single word; when the candidate word consists of at least two words, the candidate word vector is the superposition of the word vectors of those words.
Illustratively, as shown in Fig. 6, the computer device inputs the first candidate word 611 and the second candidate word 612 into the word vector extraction model 62, obtaining a first candidate word vector 631 corresponding to the first candidate word 611 and a second candidate word vector 632 corresponding to the second candidate word 612, where the first candidate word vector 631 is V_{you} + V_{stupid} and the second candidate word vector 632 is V_{stupid} + V_{jxxk}.
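The superposition of word vectors into a phrase vector, as used for both candidate words and sensitive word original words, amounts to an element-wise sum (a sketch with an illustrative name and toy vectors):

```python
def superpose(vectors):
    """Element-wise sum of the word vectors making up a phrase."""
    return [sum(components) for components in zip(*vectors)]

# With toy 2-dimensional vectors V_you = [1, 0] and V_stupid = [0, 2],
# the phrase vector for "you stupid" is [1, 2].
```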
And 407, extracting the characteristics of each word in the original word of the sensitive word through the word vector extraction model to obtain the word vector of each word.
In a possible implementation manner, the computer device inputs the sensitive word original words into the word vector extraction model, and the model performs feature extraction on each word in the sensitive word original words to obtain the word vectors. It should be noted that when feature extraction is performed on a sensitive word original word, no word segmentation of it is required.
Step 408, determining the sensitive word vector based on the word vectors of the individual words in the sensitive word original word.
In one possible implementation, the computer device superimposes the word vectors of the individual words in the sensitive word original word, and determines the resulting vector as the sensitive word vector.
Illustratively, as shown in Fig. 6, the computer device performs feature extraction on a first sensitive word original word 641, a second sensitive word original word 642, and a third sensitive word original word 643 in the sensitive word original word table 64 through the word vector extraction model 62, obtaining a first sensitive word vector 651, a second sensitive word vector 652, and a third sensitive word vector 653, respectively.
Step 409, for each sensitive word and word vector, calculating the vector similarity between the sensitive word and word vector and each candidate word and word vector.
Through the above steps, the computer device obtains a first word vector set containing the candidate word vectors corresponding to the candidate words, and a second word vector set containing the sensitive word vectors corresponding to the sensitive word original words. Further, the computer device determines candidate sensitive words among the candidate words by calculating the vector similarity between the vectors.
In one possible implementation, the computer device determines the vector similarity between the candidate word vector and the sensitive word vector by calculating the vector distance between them, where a smaller vector distance indicates a higher vector similarity and, accordingly, a higher probability that the candidate word belongs to the sensitive words. The vector distance between the candidate word vector and the sensitive word vector may be a Euclidean distance, a cosine distance, or the like, which is not limited in the embodiments of the present application.
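The two distance measures mentioned above can be computed as follows; this is a generic sketch, not tied to any particular embodiment, and the function names are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With cosine distance defined as 1 minus cosine similarity, a smaller distance likewise corresponds to a higher similarity, consistent with the relationship described above.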
Also, to reduce the time consumed in mining sensitive words, in one possible implementation, the computer device performs the vector similarity search using the Faiss (Facebook AI Similarity Search) framework.
Illustratively, as shown in fig. 6, the computer device calculates a first vector similarity between the first candidate word vector 631 and the first sensitive word vector 651, a second vector similarity between the first candidate word vector 631 and the second sensitive word vector 652, a third vector similarity between the first candidate word vector 631 and the third sensitive word vector 653, a fourth vector similarity between the second candidate word vector 632 and the first sensitive word vector 651, a fifth vector similarity between the second candidate word vector 632 and the second sensitive word vector 652, and a sixth vector similarity between the second candidate word vector 632 and the third sensitive word vector 653.
In some embodiments, in order to improve the sensitive word recognition efficiency and reduce the amount of calculation, the computer device only calculates the vector similarity between candidate words and sensitive word original words that contain the same number of words. For example, the computer device calculates the vector similarity between candidate words composed of three words and sensitive word original words composed of three words, and between candidate words composed of two words and sensitive word original words composed of two words.
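Restricting comparisons to equal word counts can be sketched by first bucketing the candidate words by length, so that each sensitive word original word is only compared against its own bucket. The helper name `bucket_by_length` and the tuple representation of candidate words are assumptions for illustration.

```python
from collections import defaultdict

def bucket_by_length(words):
    """Group words by word count so similarity is only computed within a bucket."""
    buckets = defaultdict(list)
    for word in words:
        buckets[len(word)].append(word)
    return buckets

# Candidate words represented as tuples of their component words.
candidates = bucket_by_length([("you", "stupid"), ("jxxk",), ("stupid", "jxxk")])
```

A two-word sensitive word original word would then be compared only against `candidates[2]`, skipping all one-word and three-word candidates.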
Step 410, determining the first k candidate words whose vector similarity is greater than the similarity threshold as candidate sensitive words, where k is a positive integer.
In a possible implementation manner, the computer device sorts the candidate words in descending order of the calculated vector similarity, and determines, among the candidate words whose similarity is greater than the similarity threshold, the top k as the candidate sensitive words corresponding to the sensitive word original word.
For example, the computer device determines the top 10 candidate words whose similarity to the sensitive word original word is higher than 0.85 as the candidate sensitive words. Illustratively, as shown in fig. 6, the computer device determines the second candidate word 612 as a sensitive word deformed word corresponding to the second sensitive word original word 642 based on the fifth vector similarity.
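Steps 409 and 410 together amount to a thresholded top-k selection, which can be sketched as follows; the threshold 0.85 and k = 10 mirror the example above, and the function and variable names are illustrative rather than part of the embodiments.

```python
def top_k_candidates(similarities, threshold=0.85, k=10):
    """similarities: list of (candidate_word, similarity) pairs for one
    sensitive word original word. Keep pairs above the threshold, sort them
    in descending order of similarity, and return the first k words."""
    kept = [(word, sim) for word, sim in similarities if sim > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in kept[:k]]
```

Candidates below the threshold are dropped even when fewer than k candidates remain, so the result may contain fewer than k words.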
In this embodiment, the computer device performs n-gram segmentation on the words in the corpus text to obtain a plurality of sub-words, and performs model training using word sequences formed by the words and the sub-words of the corpus text as training samples, so that the model can learn sub-word features during training, improving the word vector extraction model's ability to represent sub-word features.
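The n-gram segmentation of a word into sub-words can be sketched in the FastText style, where '<' and '>' mark word boundaries; the boundary markers and parameter defaults here are illustrative assumptions, not taken from the embodiments.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of a word, with '<' and '>' marking its boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

subwords = char_ngrams("you")
```

Because a deformed word such as "st0pid" shares many character n-grams with "stupid", a model trained on these sub-words can place the two near each other in vector space.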
In addition, during training, a skip-gram or CBOW algorithm is applied based on the context of words in the corpus text, so that unsupervised model training is realized without labeling the corpus text, improving the model training efficiency.
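The difference between the two algorithms can be illustrated by the training samples they draw from a word sequence: skip-gram predicts each context word from the central word, while CBOW predicts the central word from its context. A minimal sketch with illustrative function names:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts context from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_samples(tokens, window=2):
    """(context_list, center) samples: CBOW predicts the center from its context."""
    samples = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        samples.append((ctx, center))
    return samples
```

Both sample generators need only the raw word sequence, which is why no manual labeling of the corpus text is required.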
Meanwhile, candidate words composed of a target number of words are obtained by segmenting the corpus text, and feature extraction is performed on these candidate words, so that both single-word sensitive words and multi-word sensitive phrases can be mined, improving the comprehensiveness of sensitive word mining.
Referring to fig. 7, a block diagram of a device for determining a sensitive word according to an exemplary embodiment of the present application is shown. The device includes:
a training module 701, configured to train a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, where the corpus text is a phonogram text;
a first extraction module 702, configured to perform feature extraction on candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words, where each candidate word is composed of at least one word;
a second extraction module 703, configured to perform feature extraction on the sensitive word original word through the word vector extraction model to obtain a sensitive word vector, where the sensitive word original word is composed of at least one word;
a determining module 704, configured to determine a candidate sensitive word in the candidate words based on the sensitive word vector and the candidate word vectors, where the candidate sensitive word includes at least one of the sensitive word original word or a sensitive word deformed word, and the sensitive word deformed word is obtained by deforming the sensitive word original word.
Optionally, the training module 701 includes:
the word segmentation unit is used for performing n-gram word segmentation on each word in the corpus text to obtain the sub-words, wherein n is an integer greater than or equal to 2;
the generating unit is used for generating a word sequence based on the words and the sub-words in the corpus text;
and the training unit is used for training the word vector extraction model based on the word sequence and the context relationship of the words in the corpus text.
Optionally, the training unit is configured to:
training the word vector extraction model through a skip-gram algorithm based on the word sequence and the context relationship of words in the corpus text, wherein the skip-gram algorithm is used for carrying out context prediction according to the central words;
or,
and training the word vector extraction model through a CBOW algorithm based on the word sequence and the context relationship of the words in the corpus text, wherein the CBOW algorithm is used for predicting the central words according to the context.
Optionally, the word vector extraction model is a FastText word vector extraction model.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring the number of target words of the candidate sensitive words, wherein the number of the target words is the number of words contained in the candidate sensitive words;
and the word segmentation module is used for performing word segmentation processing on the corpus text based on the target word quantity to obtain at least one candidate word, wherein the candidate word is composed of words of the target word quantity, and the words in the candidate word are continuous in the corpus text.
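The word segmentation module's pass over the corpus can be sketched as collecting every run of target-word-quantity consecutive words from the tokenized text; the function name and tuple representation are illustrative assumptions.

```python
def candidate_words(tokens, target_word_count):
    """Every run of `target_word_count` consecutive words in the corpus text."""
    return [tuple(tokens[i:i + target_word_count])
            for i in range(len(tokens) - target_word_count + 1)]

windows = candidate_words(["you", "are", "stupid"], 2)
```

Running this for each required target word quantity (1, 2, 3, ...) yields both single-word candidates and multi-word candidate phrases.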
Optionally, thesecond extraction module 703 includes:
the extraction unit is used for extracting the characteristics of each word in the sensitive word original word through the word vector extraction model to obtain the word vector of each word;
a first determining unit, configured to determine the sensitive word vector based on the word vectors of the respective words in the sensitive word original word.
Optionally, the determining module includes:
the calculation unit is used for calculating, for each sensitive word vector, the vector similarity between the sensitive word vector and each candidate word vector;
a second determining unit, configured to determine, as the candidate sensitive word, the first k candidate words whose vector similarity is greater than a similarity threshold, where k is a positive integer.
In summary, in the embodiment of the present application, when the word vector extraction model is trained, the sub-words of each word in the corpus text are used as training samples in addition to the corpus text itself, which improves the recognition capability of the word vector extraction model for deformed words. After feature extraction is performed on the candidate words in the corpus text and on the sensitive word original words using the trained word vector extraction model, sensitive word recognition can be performed based on the word vectors corresponding to the candidate words and the sensitive word original words, respectively.
In an exemplary embodiment, the present application further provides a computer device, which includes a processor and a memory, where the memory stores a computer program, and the computer program is loaded by the processor and executed to implement the method for determining a sensitive word as provided in the above embodiments.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for determining a sensitive word according to the above embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the method for determining the sensitive word provided in the various alternative implementations of the above aspects.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

CN202111319256.5A | Priority date: 2021-11-09 | Filing date: 2021-11-09 | Sensitive word determination method, device, equipment, storage medium and program product | Active | Granted as CN114036260B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111319256.5A | 2021-11-09 | 2021-11-09 | Sensitive word determination method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111319256.5A | 2021-11-09 | 2021-11-09 | Sensitive word determination method, device, equipment, storage medium and program product

Publications (2)

Publication Number | Publication Date
CN114036260A (en) | 2022-02-11
CN114036260B (en) | 2025-10-03

Family

ID=80137089

Family Applications (1)

Application Number | Priority Date | Filing Date
CN202111319256.5A (Active, granted as CN114036260B (en)) | 2021-11-09 | 2021-11-09

Country Status (1)

Country | Link
CN (1) | CN114036260B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114971854A (en)* | 2022-05-30 | 2022-08-30 | Bank of China Co., Ltd. | Transaction information processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101452446A (en)* | 2007-12-07 | 2009-06-10 | Toshiba Corporation | Target language word deforming method and device
WO2011150730A1 (en)* | 2010-05-31 | 2011-12-08 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for mixed input in English and another kind of language
CN113094459A (en)* | 2021-04-21 | 2021-07-09 | Map Technology Review Center, Ministry of Natural Resources | Map checking method and device
CN113486227A (en)* | 2021-07-01 | 2021-10-08 | Harbin University of Science and Technology | Shopping platform commodity spam comment identification method based on deep learning
CN113536800A (en)* | 2020-04-13 | 2021-10-22 | Beijing Kingsoft Digital Entertainment Technology Co., Ltd. | A word vector representation method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN YU et al.: "Design and Implementation of the Context-Based Adaptive Filtering System for Sensitive Words", Recent Trends in Intelligent Computing, Communication and Devices, 2 October 2019 (2019-10-02), pages 317-326 *
ZHENG Xuru: "Research on Data Desensitization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology Series, 15 February 2021 (2021-02-15), pages 138-2548 *


Also Published As

Publication number | Publication date
CN114036260B (en) | 2025-10-03

Similar Documents

Publication | Publication Date | Title
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US10504010B2 (en) Systems and methods for fast novel visual concept learning from sentence descriptions of images
CN111985229B (en) Sequence labeling method and device and computer equipment
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN111291195B (en) Data processing method, device, terminal and readable storage medium
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN113254654B (en) Model training, text recognition method, apparatus, equipment and medium
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN110163181A (en) Sign Language Recognition Method and device
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN113255331B (en) Text error correction method, device and storage medium
CN116432646A (en) Training method of pre-training language model, entity information identification method and device
CN111767714B (en) Text smoothness determination method, device, equipment and medium
CN113918031B (en) System and method for Chinese punctuation recovery using sub-character information
CN110298041B (en) Junk text filtering method and device, electronic equipment and storage medium
CN112257452A (en) Emotion recognition model training method, device, equipment and storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN112926308A (en) Method, apparatus, device, storage medium and program product for matching text
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
CN116384403A (en) A Scene Graph Based Multimodal Social Media Named Entity Recognition Method
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN115526318B (en) Knowledge extraction method, device, electronic device and storage medium

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
