CN114036260A - Determination method, device, equipment, storage medium and program product of sensitive words - Google Patents

Determination method, device, equipment, storage medium and program product of sensitive words

Info

Publication number
CN114036260A
CN114036260A
Authority
CN
China
Prior art keywords
word
words
sensitive
candidate
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111319256.5A
Other languages
Chinese (zh)
Other versions
CN114036260B (en)
Inventor
李聪健
刘海东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baiguoyuan Network Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Network Technology Co Ltd
Priority to CN202111319256.5A
Publication of CN114036260A
Application granted
Publication of CN114036260B
Legal status: Active
Anticipated expiration

Abstract

The embodiments of this application disclose a method, apparatus, device, storage medium, and program product for determining sensitive words, belonging to the field of artificial intelligence. The method comprises the following steps: training a word vector extraction model based on a corpus text and the subwords of each word in the corpus text, wherein the corpus text is phonographic text (text written in a phonetic script); performing feature extraction on candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words; performing feature extraction on sensitive word original words through the word vector extraction model to obtain sensitive word vectors, wherein a sensitive word original word is composed of at least one word; and determining candidate sensitive words in the candidate words based on the sensitive word vectors and the candidate word vectors, wherein the candidate sensitive words comprise at least one of the sensitive word original words or the sensitive word deformed words. With the scheme provided by the embodiments of this application, the recognition rate of deformed sensitive words in phonographic-script scenarios is improved, and the masking of sensitive words is therefore more effective.

Description

Method, device, equipment, storage medium and program product for determining sensitive words
Technical Field
Embodiments of this application relate to the field of artificial intelligence, and in particular to a method, apparatus, device, storage medium, and program product for determining sensitive words.
Background
To maintain a healthy internet environment, sensitive words are masked in internet products such as websites, forums, and applications.
In the related art, sensitive word recognition is usually performed against a preset sensitive word vocabulary, and the recognized sensitive words are then masked. For example, when a comment consists of word 1, word 2, and word 3, the comment is masked if word 1 belongs to the sensitive word vocabulary.
However, for phonographic scripts, malicious users may deform sensitive words to evade masking, so vocabulary-based masking performs poorly. For example, in an English scenario, an English sensitive word may be deformed by reordering its letters, omitting some of its letters, and so on.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment, a storage medium and a program product for determining sensitive words. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for determining a sensitive word, where the method includes:
training a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, wherein the corpus text is a phonogram text;
performing feature extraction on candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words, wherein the candidate words are composed of at least one word;
performing feature extraction on the sensitive word original word through the word vector extraction model to obtain a sensitive word vector, wherein the sensitive word original word is composed of at least one word;
and determining candidate sensitive words in the candidate words based on the sensitive word vectors and the candidate word vectors, wherein the candidate sensitive words comprise at least one of the sensitive word original words or the sensitive word deformed words, and the sensitive word deformed words are obtained by deforming the sensitive word original words.
In another aspect, an embodiment of the present application provides an apparatus for determining a sensitive word, where the apparatus includes:
the training module is used for training a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, wherein the corpus text is a phonogram text;
the first extraction module is used for extracting the characteristics of candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words, wherein the candidate words are composed of at least one word;
the second extraction module is used for extracting the characteristics of the sensitive word original words through the word vector extraction model to obtain sensitive word vectors, wherein the sensitive word original words are composed of at least one word;
and the determining module is used for determining a candidate sensitive word in the candidate words based on the sensitive word vector and the candidate word vector, wherein the candidate sensitive word comprises at least one of the sensitive word original word or the sensitive word deformed word, and the sensitive word deformed word is obtained by deforming the sensitive word original word.
In another aspect, embodiments of the present application provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for determining a sensitive word according to the above aspect.
In another aspect, embodiments of the present application provide a computer-readable storage medium, where at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the method for determining a sensitive word as provided in various aspects of the present application.
In another aspect, the present application provides a computer program product, which includes computer instructions, and when executed by a processor, the computer instructions implement the method for determining a sensitive word according to the foregoing aspect.
In the embodiments of this application, when the word vector extraction model is trained, the subwords of each word in the corpus text are used as training samples in addition to the corpus text itself, which improves the recognition capability of the word vector extraction model for deformed words; after the trained word vector extraction model performs feature extraction on the candidate words in the corpus text and on the sensitive word original words, sensitive word recognition can be performed based on the word vectors corresponding to each.
Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application, and a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a method of determining sensitive words provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a sensitive word determination process shown in one exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a method of determining sensitive words provided by another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a word vector determination process output by an exemplary embodiment;
FIG. 6 is a diagram illustrating an implementation of a sensitive word determination process, according to an illustrative embodiment of the present application;
fig. 7 is a block diagram illustrating a structure of a sensitive word determining apparatus according to an exemplary embodiment of the present application.
Detailed Description
Fig. 1 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application, which includes a terminal 110 and a server 120.
The terminal 110 is an electronic device used to set up sensitive word recognition tasks; the electronic device may be a smartphone, a tablet computer, a personal computer, a personal workstation, or the like. In Fig. 1, the terminal 110 is illustrated as a personal computer, but the configuration is not limited thereto.
Optionally, the terminal 110 is configured to issue a sensitive word recognition task to the server 120. The task may include the sensitive word original words and the sensitive word length of the sensitive words to be recognized, where the sensitive word original words may be issued in the form of a vocabulary. Illustratively, a sensitive word recognition task issued by the terminal includes a sensitive word original word vocabulary and a sensitive word length of 2, meaning that sensitive words composed of two words are to be mined.
The server 120 is a server for performing the sensitive word recognition task; it may be a single server or a server group consisting of multiple servers, and it may be a physical server or a cloud server, which is not limited in the embodiments of this application.
Optionally, the server 120 is a backend server of a website, configured to perform sensitive word recognition on website content; or a backend server of an application, configured to perform sensitive word recognition on text content published by users in the application; or a backend server of a forum, configured to perform sensitive word recognition on posts or comments published by users in the forum. The embodiments of this application do not limit the specific type of server.
Optionally, the server 120 collects corpus text in advance so as to train a word vector extraction model based on it, where the word vector extraction model is used to output the word vectors corresponding to input text. After receiving the sensitive word recognition task issued by the terminal 110, the server 120 performs feature extraction on the corpus text and the sensitive word original words through the word vector extraction model to obtain the corresponding word vectors, and then recognizes candidate sensitive words in the corpus text based on those word vectors.
In a possible implementation, the server 120 masks the corpus text containing the candidate sensitive words, or feeds the identified candidate sensitive words back to the terminal 110, where the candidate sensitive words are further confirmed, for example by manual review.
It should be noted that the foregoing description takes the case of the server training the word vector extraction model and performing sensitive word recognition as an example; in other possible embodiments, the word vector extraction model may be trained by the terminal, which then performs the sensitive word recognition, and this is not limited in the embodiments. For convenience of description, the following embodiments describe the method of determining sensitive words as performed by a computer device.
Referring to fig. 2, a flowchart of a method for determining a sensitive word provided by an exemplary embodiment of the present application is shown, which may include the following steps.
Step 201, training a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, wherein the corpus text is a phonogram text.
In one possible implementation, a computer device first gathers corpus text for training the word vector extraction model. The corpus text may be phonographic text captured from the network; it may be English text, French text, or other text using Latin letters, Russian text using Slavic letters, Arabic text using Arabic letters, and so on. For convenience of description, the following examples treat the phonographic text as English text, but this is not a limitation.
Optionally, the corpus text includes compliant corpus text and non-compliant corpus text, where the non-compliant corpus text contains at least one of a sensitive word original word or a sensitive word deformed word, and the compliant corpus text contains neither.
In the embodiments of this application, the word vector extraction model extracts features from input text and outputs the word vectors corresponding to the words in the text. To improve the recognition of deformed words, the subwords of each word in the corpus text are used as training samples in addition to the corpus text itself when the model is trained; correspondingly, the word vector of a word in the corpus text is computed from the vector of the word and the vectors of its subwords. That is, the subword features are fused into the word vector, improving the word vector's ability to express the subwords within a word.
A subword is composed of at least two consecutive letters of a word. For example, the 3-letter subwords of the word learning include lea, ear, arn, rni, nin, and ing.
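As an illustrative sketch (the function name is hypothetical, not from the patent), the contiguous-subword extraction just described can be expressed as:

```python
def char_ngrams(word, n=3):
    """Return all contiguous n-letter subwords of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# For the word "learning", the 3-letter subwords are:
# lea, ear, arn, rni, nin, ing
```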
Step 202, performing feature extraction on candidate words corresponding to the corpus text through a word vector extraction model to obtain candidate word vectors corresponding to the candidate words, wherein the candidate words are composed of at least one word.
In a possible implementation manner, after completing the training of the word vector extraction model and receiving the sensitive word recognition task, the computer device performs feature extraction on the corpus text through the word vector extraction model. Optionally, the computer device determines candidate words in the corpus text based on the sensitive word recognition task, and further performs feature extraction on the candidate words by using a word vector extraction model to obtain candidate word vectors corresponding to the candidate words.
Optionally, a candidate word is a single word or a phrase consisting of at least two words. Correspondingly, when the candidate word is a single word, the candidate word vector is the word vector of that word; when the candidate word is a phrase, the candidate word vector is the word vector of the phrase. The determination of phrase word vectors is described in detail in the following embodiments.
And 203, performing feature extraction on the sensitive word original word through a word vector extraction model to obtain a sensitive word vector, wherein the sensitive word original word is composed of at least one word.
In a possible implementation manner, after completing the training of the word vector extraction model and receiving the sensitive word recognition task, the computer device performs feature extraction on the sensitive word original words through the word vector extraction model. The sensitive word original words may be contained in a vocabulary indicated by the sensitive word recognition task, and the task instructs the device to identify, in the corpus text, the sensitive word original words as well as the sensitive word deformed words obtained by deforming them.
Optionally, a sensitive word original word is a single word or a phrase composed of at least two words. Correspondingly, when the sensitive word original word is a single word, the sensitive word vector is the word vector of that word; when it is a phrase, the sensitive word vector is the word vector of the phrase.
It should be noted that there is no strict execution order between steps 202 and 203; that is, steps 202 and 203 may be executed sequentially or concurrently, which is not limited in this embodiment.
And 204, determining candidate sensitive words in the candidate words based on the sensitive word vectors and the candidate word vectors, wherein the candidate sensitive words comprise at least one of the sensitive word original words or the sensitive word deformed words, and the sensitive word deformed words are obtained by deforming the sensitive word original words.
Further, the computer device screens candidate sensitive words from the candidate words based on the sensitive word vectors and the candidate word vectors; a candidate sensitive word may be identical to a sensitive word original word, or may be a sensitive word deformed word obtained by deforming one. A sensitive word original word may be deformed by letter substitution, letter reordering, letter omission, and so on.
In one possible implementation, the computer device determines the candidate sensitive words by calculating the vector similarity between the sensitive word vectors and the candidate word vectors.
Taking English as an example, when the sensitive word original words include "stupid jerk", the determined candidate sensitive words may include "stupid jxxk" and the like.
Schematically, the flow of determining the sensitive word is shown in fig. 3. The computer equipment firstly trains a word vector extraction model 33 based on a corpus text 31 and sub-words 32 of words in the corpus text, then determines a candidate word 34 from the corpus text 31, inputs the candidate word 34 into the word vector extraction model 33 for feature extraction to obtain a candidate word vector 35, inputs an original sensitive word 36 into the word vector extraction model 33 for feature extraction to obtain a sensitive word vector 37, and then determines a candidate sensitive word 38 corresponding to the original sensitive word from the candidate word 34 based on the candidate word vector 35 and the sensitive word vector 37.
In summary, in the embodiments of this application, when the word vector extraction model is trained, the subwords of each word in the corpus text are used as training samples in addition to the corpus text itself, which improves the recognition capability of the word vector extraction model for deformed words; after the trained word vector extraction model performs feature extraction on the candidate words in the corpus text and on the sensitive word original words, sensitive word recognition can be performed based on the word vectors corresponding to each.
Referring to fig. 4, a flowchart of a method for determining a sensitive word provided in another exemplary embodiment of the present application is shown, which may include the following steps.
Step 401, performing n-gram word segmentation on each word in the corpus text to obtain subwords, where n is an integer greater than or equal to 2.
In one possible implementation, before performing model training with a corpus text, the computer device first performs n-gram segmentation (character-level segmentation) on each word in the corpus text to obtain the subwords corresponding to the word, where each subword is composed of n characters (letters or symbols).
N may be a default value or a custom value, which is not limited in this embodiment.
Optionally, the computer device segments each word in at least one way, so that the word vector extraction model can learn subword features of different granularities during the subsequent training. For example, both 2-gram and 3-gram segmentation may be performed on a word.
Optionally, before n-gram segmentation, the computer device marks the beginning and end of a word by adding special symbols before and after it, which can be "<" and ">". In one illustrative example, 3-gram segmentation of the word jxxk (marked as <jxxk>) yields the subwords "<jx", "jxx", "xxk", and "xk>".
Step 402, generating a word sequence based on words and subwords in the corpus text.
After word segmentation is completed, the computer device generates the word sequence corresponding to the corpus text, for subsequent model training, from each word in the corpus text and the subwords corresponding to each word.
In one illustrative example, when the corpus text is "stupid jxxk", the computer device generates the word sequence {stupid, jxxk, "<st", "stu", "tup", "upi", "pid", "id>", "<jx", "jxx", "xxk", "xk>"}.
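Steps 401 and 402 can be sketched as follows (function names are illustrative, not from the patent): words are boundary-marked, segmented into character n-grams, and concatenated with the original words into one sequence.

```python
def boundary_ngrams(word, n=3):
    """Character n-grams of a word wrapped in the "<"/">" boundary markers."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def word_sequence(text, n=3):
    """Word sequence for training: the words of the text plus their subwords."""
    words = text.split()
    subwords = [sw for w in words for sw in boundary_ngrams(w, n)]
    return words + subwords

# word_sequence("stupid jxxk") →
# ['stupid', 'jxxk', '<st', 'stu', 'tup', 'upi', 'pid', 'id>',
#  '<jx', 'jxx', 'xxk', 'xk>']
```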
Step 403, training a word vector extraction model based on the word sequence and the context relationship of the words in the corpus text.
Because words in the same corpus text are correlated (for example, certain words tend to appear together), the computer device can train the word vector extraction model in an unsupervised manner based on the context of the words in the corpus text.
Optionally, the computer device predicts the word vectors of a word's context words based on the word vector of that word; alternatively, the computer device predicts the word vector of the word lying between context words based on the word vectors of those context words.
The word vector of a word is obtained by superposing the vector corresponding to the word itself and the vectors of all its subwords. As shown in Fig. 5, for the word jxxk, its word vector may be represented as V_{<jxxk>} + V_{<jx} + V_{jxx} + V_{xxk} + V_{xk>}.
In one possible implementation, the computer device trains the word vector extraction model through a skip-gram algorithm based on the word sequence and the context of the words in the corpus text, wherein the skip-gram algorithm is used for performing context prediction according to the central words.
For example, when the corpus text is composed of word A, word B, word C, and word D, model training based on context may predict word A and word C from word B (the central word), and word B and word D from word C (the central word).
In another possible implementation, the computer device trains the word vector extraction model through a CBOW (Continuous Bag-of-Words) algorithm based on the word sequence and the context relationships of words in the corpus text, where the CBOW algorithm performs central word prediction from the context.
For example, when the corpus text is composed of word A, word B, word C, and word D, model training based on context may predict word B (the central word) from word A and word C, and word C (the central word) from word B and word D.
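The two training-pair constructions can be sketched as follows (a simplified illustration with hypothetical names, not the patent's implementation; real training would feed these pairs into a softmax-based objective):

```python
def skipgram_pairs(words, window=1):
    """(center, context) pairs: skip-gram predicts context from the center word."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

def cbow_pairs(words, window=1):
    """(context, center) pairs: CBOW predicts the center word from its context."""
    pairs = []
    for i, center in enumerate(words):
        context = [words[j]
                   for j in range(max(0, i - window), min(len(words), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

# For ["A", "B", "C", "D"] with window 1: skip-gram predicts A and C from B,
# and B and D from C; CBOW predicts B from A and C, and C from B and D.
```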
Optionally, when the skip-gram algorithm or the CBOW algorithm is used for model training, the computer device may train with hierarchical softmax or negative sampling, among other techniques, which is not limited in this embodiment.
In some embodiments, the word vector extraction model trained by the computer device may be a FastText word vector model. Of course, other word vector extraction models capable of outputting word vectors that include subword features may be adopted; the embodiments of this application are not limited in this regard.
In an illustrative example, the parameters from the input layer to the hidden layer of the word vector extraction model form a matrix W of size N × V, where V is the size of the word list (i.e., the dimension of the one-hot vectors corresponding to the words and subwords in the input word sequence) and N is the dimension of the word vectors to be generated, equal to the number of nodes in the hidden layer (the first layer). The matrix W is adjusted in each training iteration; owing to the one-hot property, each adjustment modifies only the row corresponding to the 1-valued entry of the one-hot vector.
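The one-hot property mentioned above can be illustrated briefly (a toy sketch with hypothetical names, not the patent's implementation): multiplying a one-hot vector by a matrix simply selects one row, so each training update touches only that row.

```python
def one_hot_matmul(one_hot, W):
    """x @ W for a one-hot x reduces to selecting the row at the hot index."""
    hot_index = one_hot.index(1)
    return W[hot_index]

# With W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]] and one-hot input [0, 1, 0],
# the hidden activation is W's row 1: [0.3, 0.4].
```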
After completing the training of the word vector extraction model through steps 401 to 403, the computer device performs feature extraction on the candidate words through steps 404 to 406 and on the sensitive word original words through steps 407 to 408. It should be noted that steps 404 to 406 and steps 407 to 408 have no strict ordering; this embodiment describes them as executed concurrently.
Step 404, obtaining the target word number of the candidate sensitive words, where the target word number is the number of words contained in a candidate sensitive word.
In one possible implementation, the sensitive word recognition task includes a target word number, which indicates the number of words contained in the candidate sensitive words to be mined. For example, to mine sensitive words composed of a single word or of two words, the target word number may be set to 1 or 2, respectively.
Optionally, the target word number may be a default value or a custom value, which is not limited in this embodiment.
Step 405, performing word segmentation on the corpus text based on the target word number to obtain at least one candidate word, wherein each candidate word is composed of the target number of words and the words in a candidate word are contiguous in the corpus text.
Further, the computer device performs word segmentation on the corpus text based on the target word number, obtaining candidate words composed of that number of words. In one possible implementation, the computer device performs word-level n-gram segmentation on the corpus text based on the target word number to obtain the candidate words.
Schematically, as shown in Fig. 6, when the target word number is 2, that is, when phrases composed of two words are to be mined, the computer device performs 2-gram segmentation on the corpus sample 61 ("you stupid jxxk"), obtaining a first candidate word 611 ("you stupid") and a second candidate word 612 ("stupid jxxk").
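Step 405's word-level segmentation can be sketched as follows (the function name is illustrative, not from the patent):

```python
def candidate_phrases(text, num_words=2):
    """Contiguous candidate phrases of `num_words` words (word-level n-grams)."""
    words = text.split()
    return [" ".join(words[i:i + num_words])
            for i in range(len(words) - num_words + 1)]

# candidate_phrases("you stupid jxxk", 2) → ["you stupid", "stupid jxxk"]
```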
And step 406, performing feature extraction on the candidate words corresponding to the corpus text through a word vector extraction model to obtain candidate word vectors corresponding to the candidate words.
For each candidate word, the computer device inputs the candidate word into the word vector extraction model, performs feature extraction on each word in the candidate word to obtain the word vector corresponding to each word, and then determines the candidate word vector based on those word vectors.
In one possible implementation, when the candidate word is a single word, the candidate word vector is the word vector of that single word; when the candidate word consists of at least two words, the candidate word vector is the superposition of the word vectors of those words.
Illustratively, as shown in Fig. 6, the computer device inputs the first candidate word 611 and the second candidate word 612 into the word vector extraction model 62, obtaining a first candidate word vector 631 corresponding to the first candidate word 611 and a second candidate word vector 632 corresponding to the second candidate word 612, where the first candidate word vector 631 is V_{you} + V_{stupid} and the second candidate word vector 632 is V_{stupid} + V_{jxxk}.
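The superposition of word vectors into a phrase vector, as used for both candidate words and sensitive word original words, amounts to an element-wise sum (a sketch with an illustrative name and toy vectors):

```python
def superpose(vectors):
    """Element-wise sum of the word vectors making up a phrase."""
    return [sum(components) for components in zip(*vectors)]

# With toy 2-dimensional vectors V_you = [1, 0] and V_stupid = [0, 2],
# the phrase vector for "you stupid" is [1, 2].
```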
And 407, extracting the characteristics of each word in the original word of the sensitive word through the word vector extraction model to obtain the word vector of each word.
In a possible implementation manner, the computer device inputs the sensitive word original words into the word vector extraction model, and the model performs feature extraction on each word in the sensitive word original words to obtain the word vectors. It should be noted that when feature extraction is performed on a sensitive word original word, no word segmentation of it is required.
Step 408, determining the sensitive word vector based on the word vectors of the individual words in the sensitive word original word.
In one possible implementation, the computer device superimposes the word vectors of the individual words in the sensitive word original word, and determines the resulting vector as the sensitive word vector.
Illustratively, as shown in Fig. 6, the computer device performs feature extraction on a first sensitive word original word 641, a second sensitive word original word 642, and a third sensitive word original word 643 in the sensitive word original word table 64 through the word vector extraction model 62, obtaining a first sensitive word vector 651, a second sensitive word vector 652, and a third sensitive word vector 653, respectively.
Step 409, for each sensitive word and word vector, calculating the vector similarity between the sensitive word and word vector and each candidate word and word vector.
Through the above steps, the computer device obtains a first word vector set containing the candidate word vectors corresponding to the candidate words, and a second word vector set containing the sensitive word vectors corresponding to the sensitive word original words. Further, the computer device determines candidate sensitive words among the candidate words by calculating the vector similarity between the vectors.
In one possible implementation, the computer device determines the vector similarity between the candidate word vector and the sensitive word vector by calculating the vector distance between them, where a smaller vector distance indicates a higher vector similarity and, accordingly, a higher probability that the candidate word belongs to the sensitive words. The vector distance between the candidate word vector and the sensitive word vector may be a Euclidean distance, a cosine distance, or the like, which is not limited in the embodiments of the present application.
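The two distance measures mentioned above can be computed as follows; this is a generic sketch, not tied to any particular embodiment, and the function names are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With cosine distance defined as 1 minus cosine similarity, a smaller distance likewise corresponds to a higher similarity, consistent with the relationship described above.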
Also, to reduce the time consumed in mining sensitive words, in one possible implementation, the computer device performs the vector similarity search using the Faiss (Facebook AI Similarity Search) framework.
Illustratively, as shown in fig. 6, the computer device calculates a first vector similarity between the first candidate word vector 631 and the first sensitive word vector 651, a second vector similarity between the first candidate word vector 631 and the second sensitive word vector 652, a third vector similarity between the first candidate word vector 631 and the third sensitive word vector 653, a fourth vector similarity between the second candidate word vector 632 and the first sensitive word vector 651, a fifth vector similarity between the second candidate word vector 632 and the second sensitive word vector 652, and a sixth vector similarity between the second candidate word vector 632 and the third sensitive word vector 653.
In some embodiments, in order to improve the sensitive word recognition efficiency and reduce the amount of calculation, the computer device only calculates the vector similarity between candidate words and sensitive word original words that contain the same number of words. For example, the computer device calculates the vector similarity between candidate words composed of three words and sensitive word original words composed of three words, and between candidate words composed of two words and sensitive word original words composed of two words.
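Restricting comparisons to equal word counts can be sketched by first bucketing the candidate words by length, so that each sensitive word original word is only compared against its own bucket. The helper name `bucket_by_length` and the tuple representation of candidate words are assumptions for illustration.

```python
from collections import defaultdict

def bucket_by_length(words):
    """Group words by word count so similarity is only computed within a bucket."""
    buckets = defaultdict(list)
    for word in words:
        buckets[len(word)].append(word)
    return buckets

# Candidate words represented as tuples of their component words.
candidates = bucket_by_length([("you", "stupid"), ("jxxk",), ("stupid", "jxxk")])
```

A two-word sensitive word original word would then be compared only against `candidates[2]`, skipping all one-word and three-word candidates.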
Step 410, determining the first k candidate words whose vector similarity is greater than the similarity threshold as candidate sensitive words, where k is a positive integer.
In a possible implementation manner, the computer device sorts the candidate words in descending order of the calculated vector similarity, and determines, among the candidate words whose similarity is greater than the similarity threshold, the top k as the candidate sensitive words corresponding to the sensitive word original word.
For example, the computer device determines the top 10 candidate words whose similarity to the sensitive word original word is higher than 0.85 as the candidate sensitive words. Illustratively, as shown in fig. 6, the computer device determines the second candidate word 612 as a sensitive word deformed word corresponding to the second sensitive word original word 642 based on the fifth vector similarity.
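Steps 409 and 410 together amount to a thresholded top-k selection, which can be sketched as follows; the threshold 0.85 and k = 10 mirror the example above, and the function and variable names are illustrative rather than part of the embodiments.

```python
def top_k_candidates(similarities, threshold=0.85, k=10):
    """similarities: list of (candidate_word, similarity) pairs for one
    sensitive word original word. Keep pairs above the threshold, sort them
    in descending order of similarity, and return the first k words."""
    kept = [(word, sim) for word, sim in similarities if sim > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in kept[:k]]
```

Candidates below the threshold are dropped even when fewer than k candidates remain, so the result may contain fewer than k words.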
In this embodiment, the computer device performs n-gram segmentation on the words in the corpus text to obtain a plurality of sub-words, and performs model training using word sequences formed by the words and the sub-words of the corpus text as training samples, so that the model can learn sub-word features during training, improving the word vector extraction model's ability to represent sub-word features.
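The n-gram segmentation of a word into sub-words can be sketched in the FastText style, where '<' and '>' mark word boundaries; the boundary markers and parameter defaults here are illustrative assumptions, not taken from the embodiments.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of a word, with '<' and '>' marking its boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

subwords = char_ngrams("you")
```

Because a deformed word such as "st0pid" shares many character n-grams with "stupid", a model trained on these sub-words can place the two near each other in vector space.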
In addition, during training, a skip-gram or CBOW algorithm is applied based on the context of words in the corpus text, so that unsupervised model training is realized without labeling the corpus text, improving the model training efficiency.
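The difference between the two algorithms can be illustrated by the training samples they draw from a word sequence: skip-gram predicts each context word from the central word, while CBOW predicts the central word from its context. A minimal sketch with illustrative function names:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts context from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_samples(tokens, window=2):
    """(context_list, center) samples: CBOW predicts the center from its context."""
    samples = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        samples.append((ctx, center))
    return samples
```

Both sample generators need only the raw word sequence, which is why no manual labeling of the corpus text is required.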
Meanwhile, candidate words composed of a target number of words are obtained by segmenting the corpus text, and feature extraction is performed on these candidate words, so that both single-word sensitive words and multi-word sensitive phrases can be mined, improving the comprehensiveness of sensitive word mining.
Referring to fig. 7, a block diagram of a device for determining a sensitive word according to an exemplary embodiment of the present application is shown. The device includes:
a training module 701, configured to train a word vector extraction model based on a corpus text and sub-words of each word in the corpus text, where the corpus text is a phonogram text;
a first extraction module 702, configured to perform feature extraction on candidate words corresponding to the corpus text through the word vector extraction model to obtain candidate word vectors corresponding to the candidate words, where each candidate word is composed of at least one word;
a second extraction module 703, configured to perform feature extraction on the sensitive word original word through the word vector extraction model to obtain a sensitive word vector, where the sensitive word original word is composed of at least one word;
a determining module 704, configured to determine a candidate sensitive word in the candidate words based on the sensitive word vector and the candidate word vectors, where the candidate sensitive word includes at least one of the sensitive word original word or a sensitive word deformed word, and the sensitive word deformed word is obtained by deforming the sensitive word original word.
Optionally, the training module 701 includes:
the word segmentation unit is used for performing n-gram word segmentation on each word in the corpus text to obtain the sub-words, wherein n is an integer greater than or equal to 2;
the generating unit is used for generating a word sequence based on the words and the sub-words in the corpus text;
and the training unit is used for training the word vector extraction model based on the word sequence and the context relationship of the words in the corpus text.
Optionally, the training unit is configured to:
training the word vector extraction model through a skip-gram algorithm based on the word sequence and the context relationship of words in the corpus text, wherein the skip-gram algorithm is used for carrying out context prediction according to the central words;
or,
and training the word vector extraction model through a CBOW algorithm based on the word sequence and the context relationship of the words in the corpus text, wherein the CBOW algorithm is used for predicting the central words according to the context.
Optionally, the word vector extraction model is a FastText word vector extraction model.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring the number of target words of the candidate sensitive words, wherein the number of the target words is the number of words contained in the candidate sensitive words;
and the word segmentation module is used for performing word segmentation processing on the corpus text based on the target word quantity to obtain at least one candidate word, wherein the candidate word is composed of words of the target word quantity, and the words in the candidate word are continuous in the corpus text.
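The word segmentation module's pass over the corpus can be sketched as collecting every run of target-word-quantity consecutive words from the tokenized text; the function name and tuple representation are illustrative assumptions.

```python
def candidate_words(tokens, target_word_count):
    """Every run of `target_word_count` consecutive words in the corpus text."""
    return [tuple(tokens[i:i + target_word_count])
            for i in range(len(tokens) - target_word_count + 1)]

windows = candidate_words(["you", "are", "stupid"], 2)
```

Running this for each required target word quantity (1, 2, 3, ...) yields both single-word candidates and multi-word candidate phrases.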
Optionally, thesecond extraction module 703 includes:
the extraction unit is used for extracting the characteristics of each word in the sensitive word original word through the word vector extraction model to obtain the word vector of each word;
a first determining unit, configured to determine the sensitive word vector based on the word vectors of the respective words in the sensitive word original word.
Optionally, the determining module includes:
the calculation unit is used for calculating, for each sensitive word vector, the vector similarity between the sensitive word vector and each candidate word vector;
a second determining unit, configured to determine, as the candidate sensitive word, the first k candidate words whose vector similarity is greater than a similarity threshold, where k is a positive integer.
In summary, in the embodiment of the present application, when the word vector extraction model is trained, the sub-words of each word in the corpus text are used as training samples in addition to the corpus text itself, which improves the recognition capability of the word vector extraction model for deformed words. After feature extraction is performed on the candidate words in the corpus text and on the sensitive word original words using the trained word vector extraction model, sensitive word recognition can be performed based on the word vectors corresponding to the candidate words and the sensitive word original words, respectively.
In an exemplary embodiment, the present application further provides a computer device, which includes a processor and a memory, where the memory stores a computer program, and the computer program is loaded by the processor and executed to implement the method for determining a sensitive word as provided in the above embodiments.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for determining a sensitive word according to the above embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the method for determining the sensitive word provided in the various alternative implementations of the above aspects.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

CN202111319256.5A | Priority date: 2021-11-09 | Filing date: 2021-11-09 | Sensitive word determination method, device, equipment, storage medium and program product | Active | Granted as CN114036260B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111319256.5A | 2021-11-09 | 2021-11-09 | Sensitive word determination method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111319256.5A | 2021-11-09 | 2021-11-09 | Sensitive word determination method, device, equipment, storage medium and program product

Publications (2)

Publication Number | Publication Date
CN114036260A (en) | 2022-02-11
CN114036260B (en) | 2025-10-03

Family

ID=80137089

Family Applications (1)

Application Number | Priority Date | Filing Date
CN202111319256.5A (Active, granted as CN114036260B (en)) | 2021-11-09 | 2021-11-09

Country Status (1)

Country | Link
CN (1) | CN114036260B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114971854A (en)* | 2022-05-30 | 2022-08-30 | Bank of China Co., Ltd. | Transaction information processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101452446A (en)* | 2007-12-07 | 2009-06-10 | Toshiba Corporation | Target language word deforming method and device
WO2011150730A1 (en)* | 2010-05-31 | 2011-12-08 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for mixed input in English and another kind of language
CN113094459A (en)* | 2021-04-21 | 2021-07-09 | Map Technology Review Center, Ministry of Natural Resources | Map checking method and device
CN113486227A (en)* | 2021-07-01 | 2021-10-08 | Harbin University of Science and Technology | Shopping platform commodity spam comment identification method based on deep learning
CN113536800A (en)* | 2020-04-13 | 2021-10-22 | Beijing Kingsoft Digital Entertainment Technology Co., Ltd. | A word vector representation method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN YU et al.: "Design and Implementation of the Context-Based Adaptive Filtering System for Sensitive Words", Recent Trends in Intelligent Computing, Communication and Devices, 2 October 2019 (2019-10-02), pages 317-326 *
ZHENG Xuru: "Research on Data Desensitization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology Series, 15 February 2021 (2021-02-15), pages 138-2548 *


Also Published As

Publication number | Publication date
CN114036260B (en) | 2025-10-03

Similar Documents

Publication | Publication Date | Title
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US10504010B2 (en) Systems and methods for fast novel visual concept learning from sentence descriptions of images
CN111985229B (en) Sequence labeling method and device and computer equipment
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN111291195B (en) Data processing method, device, terminal and readable storage medium
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN113254654B (en) Model training, text recognition method, apparatus, equipment and medium
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN110163181A (en) Sign Language Recognition Method and device
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN113255331B (en) Text error correction method, device and storage medium
CN116432646A (en) Training method of pre-training language model, entity information identification method and device
CN111767714B (en) Text smoothness determination method, device, equipment and medium
CN113918031B (en) System and method for Chinese punctuation recovery using sub-character information
CN110298041B (en) Junk text filtering method and device, electronic equipment and storage medium
CN112257452A (en) Emotion recognition model training method, device, equipment and storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN112926308A (en) Method, apparatus, device, storage medium and program product for matching text
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
CN116384403A (en) A Scene Graph Based Multimodal Social Media Named Entity Recognition Method
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN115526318B (en) Knowledge extraction method, device, electronic device and storage medium

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
