Summary of the invention
The purpose of the invention is to overcome the shortcomings of the above-mentioned background art by providing a barrage subject extraction method, medium, device, and system based on an N-gram model that can accurately extract barrage subjects.
The present invention provides a barrage subject extraction method based on an N-gram model, comprising the following steps:
Data preparation: extract barrage data;
Construct barrage features: extract the features corresponding to words that express certain specific intents and add them to a custom dictionary; add words with no practical meaning to a custom stop-word dictionary;
Data preprocessing: remove data whose "barrage content" field is empty; remove punctuation marks from the "barrage content" field;
Represent barrage content as word vectors using the N-gram model: represent the preprocessed barrage content with an N-gram model, which assumes that the probability of a word appearing in a sentence depends only on the N-1 words before it, N being a positive integer; cut each barrage into a group of word vectors, segmenting each barrage according to the word-formation rules in the custom dictionary and filtering useless words according to the custom stop-word dictionary.
Based on the above technical solution, the value of N is 2, i.e., each word is related only to the one word before it.
Based on the above technical solution, in the N-gram model, the probability of a sentence occurring is given by: p = p(w_1) · ∏_{i=2}^{m} p(w_i w_{i-1}), where p denotes the probability of the sentence, w_1 denotes the word in the 1st position of the sentence, ∏ denotes the cumulative product, m denotes the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) denotes the probability that the words in the i-th and (i-1)-th positions occur together. Existing phrases are recombined according to this formula: two adjacent words are merged to generate a new phrase.
Based on the above technical solution, the dimension of the word vectors is 600,000, i.e., each barrage is represented as a 600,000-dimension vector in which each position corresponds to one word, yielding the final barrage subject representation.
The present invention also provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the above method is realized.
The present invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program that runs on the processor; the processor realizes the above method when executing the computer program.
The present invention also provides a barrage subject extraction system based on an N-gram model, the system comprising a data preparation unit, a barrage feature construction unit, a data preprocessing unit, an N-gram model representation unit, and a cutting unit, in which:
The data preparation unit is used to extract barrage data;
The barrage feature construction unit is used to extract the features corresponding to words that express certain specific intents and add them to a custom dictionary, and to add words with no practical meaning to a custom stop-word dictionary;
The data preprocessing unit is used to remove data whose "barrage content" field is empty and to remove punctuation marks from the "barrage content" field;
The N-gram model representation unit is used to represent the preprocessed barrage content with the N-gram model, which assumes that the probability of a word appearing in a sentence depends only on the N-1 words before it, N being a positive integer;
The cutting unit is used to cut each barrage into a group of word vectors, segmenting each barrage according to the word-formation rules in the custom dictionary and filtering useless words according to the custom stop-word dictionary.
Based on the above technical solution, the value of N is 2, i.e., each word is related only to the one word before it.
Based on the above technical solution, in the N-gram model, the probability of a sentence occurring is given by: p = p(w_1) · ∏_{i=2}^{m} p(w_i w_{i-1}), where p denotes the probability of the sentence, w_1 denotes the word in the 1st position of the sentence, ∏ denotes the cumulative product, m denotes the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) denotes the probability that the words in the i-th and (i-1)-th positions occur together. Existing phrases are recombined according to this formula: two adjacent words are merged to generate a new phrase.
Based on the above technical solution, the dimension of the word vectors is 600,000, i.e., each barrage is represented as a 600,000-dimension vector in which each position corresponds to one word, yielding the final barrage subject representation.
Compared with the prior art, the advantages of the present invention are as follows:
(1) The present invention extracts barrage data; extracts the features corresponding to words that express certain specific intents and adds them to a custom dictionary; adds words with no practical meaning to a custom stop-word dictionary; preprocesses the data by removing records whose "barrage content" field is empty and stripping punctuation marks from the "barrage content" field; represents the preprocessed barrage content with an N-gram model, which assumes that the probability of a word appearing in a sentence depends only on the N-1 words before it (N a positive integer); and cuts each barrage into a group of word vectors, segmenting according to the word-formation rules in the custom dictionary and filtering useless words according to the custom stop-word dictionary. Because the N-gram representation overcomes the bag-of-words model's disadvantage of ignoring context, the barrage representation is more accurate, and the barrage subject can therefore be extracted accurately.
(2) In the N-gram model of the present invention, the value of N is 2, i.e., each word is related only to the one word before it. The present invention improves on the original 2-gram model, which reduces computational complexity.
(3) The present invention fuses the N-gram model with manually constructed features, can accurately extract the main information of a single barrage or a single room, and realizes a vectorized representation of the barrage subject.
(4) With the continuous development of live-streaming platform services, platforms have accumulated not only many active users but also a large amount of text data. By deeply mining the text content of a live-streaming platform, one can learn the content similarity between users and rooms and between rooms and rooms, thereby improving the effect of the platform's personalized recommendations.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Figure 1, an embodiment of the present invention provides a barrage subject extraction method based on an N-gram model, comprising the following steps:
S1, data preparation: extract barrage data;
S2, barrage feature construction: extract the features corresponding to words that express certain specific intents and add them to a custom dictionary; add words with no practical meaning to a custom stop-word dictionary;
S3, data preprocessing: remove data whose "barrage content" field is empty; remove punctuation marks from the "barrage content" field;
S4, represent barrage content as word vectors using the N-gram model: represent the preprocessed barrage content with an N-gram model, which assumes that the probability of a word appearing in a sentence depends only on the N-1 words before it, N being a positive integer; cut each barrage into a group of word vectors, segmenting each barrage according to the word-formation rules in the custom dictionary and filtering useless words according to the custom stop-word dictionary.
An embodiment of the present invention also provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the above barrage subject extraction method based on an N-gram model is realized. It should be noted that the storage medium includes USB flash drives, removable hard disks, ROM (Read-Only Memory), RAM (Random Access Memory), magnetic disks, optical disks, and other media that can store program code.
As shown in Figure 2, an embodiment of the present invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program that runs on the processor; the processor realizes the above barrage subject extraction method based on an N-gram model when executing the computer program.
An embodiment of the present invention also provides a barrage subject extraction system based on an N-gram model, the system comprising a data preparation unit, a barrage feature construction unit, a data preprocessing unit, an N-gram model representation unit, and a cutting unit, in which:
The data preparation unit is used to extract barrage data;
The barrage feature construction unit is used to extract the features corresponding to words that express certain specific intents and add them to a custom dictionary, and to add words with no practical meaning to a custom stop-word dictionary;
The data preprocessing unit is used to remove data whose "barrage content" field is empty and to remove punctuation marks from the "barrage content" field;
The N-gram model representation unit is used to represent the preprocessed barrage content with the N-gram model, which assumes that the probability of a word appearing in a sentence depends only on the N-1 words before it, N being a positive integer;
The cutting unit is used to cut each barrage into a group of word vectors, segmenting each barrage according to the word-formation rules in the custom dictionary and filtering useless words according to the custom stop-word dictionary.
The N-gram model is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in context: when a continuous string of pinyin without spaces, of strokes, or of digits representing letters or strokes needs to be converted into a string of Chinese characters (i.e., a sentence), it can compute the sentence with the highest probability and thus realize automatic conversion into Chinese characters without manual selection by the user, avoiding the homophone problem of many Chinese characters sharing the same pinyin (or stroke string or digit string).
The N-gram model is based on the assumption that the appearance of a word is related only to the N-1 words before it and unrelated to all other words, so the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting, in the corpus, the number of times N words occur together.
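These counts can be turned into probability estimates by simple relative frequency. The sketch below (function and variable names are illustrative, not from the source) shows the idea for N = 2:

```python
from collections import Counter

def estimate_bigram_probs(corpus):
    """Estimate p(w_i w_{i-1}) as the relative frequency of each
    adjacent word pair over all adjacent pairs in the corpus."""
    pair_counts = Counter()
    total_pairs = 0
    for sentence in corpus:                 # each sentence: a list of words
        for prev, cur in zip(sentence, sentence[1:]):
            pair_counts[(prev, cur)] += 1
            total_pairs += 1
    return {pair: n / total_pairs for pair, n in pair_counts.items()}
```

A real system would estimate these frequencies over the platform's full barrage corpus; the structure of the computation is the same.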
In the embodiment of the present invention, the value of N is 2, i.e., each word is related only to the one word before it.
In the N-gram model, the probability of a sentence occurring is given by: p = p(w_1) · ∏_{i=2}^{m} p(w_i w_{i-1}), where p denotes the probability of the sentence, w_1 denotes the word in the 1st position of the sentence, ∏ denotes the cumulative product, m denotes the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) denotes the probability that the words in the i-th and (i-1)-th positions occur together. Existing phrases are recombined according to this formula: two adjacent words are merged to generate a new phrase.
The dimension of the word vectors is 600,000, i.e., each barrage is represented as a 600,000-dimension vector in which each position corresponds to one word, yielding the final barrage subject representation.
A text representation based purely on the bag-of-words model ignores the relationship between individual words and their context; by contrast, the embodiment of the present invention accounts for the influence of context through the N-gram model, making barrage subject extraction more accurate. On the other hand, using the N-gram model naively makes the text representation overly complex, so the embodiment of the present invention improves on the N-gram model.
An example is given below.
Model data source: the most recent month of the platform's barrage data is used as the data source.
Modeling steps:
(1) Data preparation: extract the most recent month of barrage data; the data mainly contain the field "barrage content", and the data format is [barrage content];
(2) Barrage feature construction: barrages contain many proprietary terms with platform characteristics, such as "water friend", which refers to a streamer's fans.
Custom proper nouns and verbs are defined and added to the custom dictionary. For example, after the word "water friend" is added to the custom dictionary, subsequent segmentation can cut "water friend" as one word instead of cutting it into the two words "water" and "friend".
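A minimal way to honor such a custom dictionary during segmentation is greedy forward maximum matching. The sketch below is illustrative, not the platform's actual tokenizer, and the dictionary contents are assumptions:

```python
def segment(text, custom_dict, max_word_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches; otherwise emit a single character."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + n]
            if candidate in custom_dict:
                words.append(candidate)
                i += n
                break
        else:
            words.append(text[i])  # no dictionary match: single character
            i += 1
    return words
```

With a two-character platform word such as "water friend" in the dictionary, the matcher keeps it whole; without the dictionary entry, the same span falls back to single-character pieces.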
Feature extraction: cheer vocabulary such as "666" is replaced with the feature "cheer"; a digit string matching mobile-phone characteristics, such as "13617258349", is replaced with the feature "mobile contact"; a character string such as "QQ324567865" is replaced with the feature "QQ contact". In this way, all of the platform's accumulated platform-specific words that express certain specific intents are converted into a corresponding feature.
This kind of feature extraction effectively reduces the feature dimension; for example, all QQ numbers are converted into the single feature "QQ contact".
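This replacement step can be sketched with regular expressions. The patterns below (a run of 6s, an 11-digit mobile number, "QQ" followed by digits) are illustrative assumptions, not the platform's actual rules:

```python
import re

# Each rule: (compiled pattern, feature token that replaces the match).
FEATURE_RULES = [
    (re.compile(r"\b1[3-9]\d{9}\b"), "mobile contact"),  # 11-digit mobile number
    (re.compile(r"QQ\d{5,}"), "QQ contact"),             # "QQ" + a QQ number
    (re.compile(r"6{3,}"), "cheer"),                     # cheer strings like "666"
]

def normalize_features(text):
    """Replace platform-specific strings with their feature tokens."""
    for pattern, feature in FEATURE_RULES:
        text = pattern.sub(feature, text)
    return text
```

The phone pattern is applied before the cheer pattern so a "666" inside a longer digit string is not rewritten first.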
(3) data prediction:
Data preprocessing is done on the basis of step (2): remove data whose "barrage content" field is empty; remove the punctuation marks contained in the "barrage content" field.
(4) Represent barrage content as word vectors using the N-gram model:
Custom dictionary: based on the platform's content, a custom dictionary containing all of the platform's specific vocabulary is compiled manually; the accuracy of the custom dictionary affects the extraction accuracy of the barrage subject content.
Custom stop-word dictionary: stop words are words with little practical meaning compared with other words; such words can be removed before content analysis. Whether a given word counts as a stop word varies from person to person and from scene to scene; the embodiment of the present invention uses the stop-word dictionary that the live-streaming platform itself has accumulated.
Word cutting and bag-of-words representation of barrages: the preprocessed barrages are cut into groups of word vectors; each barrage is first segmented according to the word-formation rules in the custom dictionary, while useless words are filtered according to the custom stop-word dictionary.
For example, suppose the barrage content is "the streamer played naked wolf really well today" and the stop-word list contains "today" and similar function words. After word cutting, the barrage becomes ["streamer", "naked wolf", "play", "so good", "so good"]. To record the word order between words after cutting, the barrage is expressed here as key-value pairs ["streamer": 1, "naked wolf": 2, "play": 3, "so good": 4, "so good": 5], where the key is the word and the value is its position in the sentence.
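The stop-word filtering and the key-value word-order record can be sketched as follows (a list of (word, position) pairs is used rather than a plain dict because the same word may appear more than once):

```python
def cut_and_index(tokens, stop_words):
    """Drop stop words, then pair each surviving word with its
    1-based position so the word order is preserved."""
    kept = [t for t in tokens if t not in stop_words]
    return [(word, pos) for pos, word in enumerate(kept, start=1)]
```

For the example above, cut_and_index(["streamer", "today", "naked wolf", "play", "so good", "so good"], {"today"}) yields [("streamer", 1), ("naked wolf", 2), ("play", 3), ("so good", 4), ("so good", 5)].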
The N-gram model means that the occurrence probability of a word in a sentence is related to the N-1 words before it; here the value of N is 2, i.e., each word is related only to the one word before it.
In the N-gram model, the probability of a sentence occurring is expressed as: p = p(w_1) · ∏_{i=2}^{m} p(w_i | w_{i-1}),
where p(w_i | w_{i-1}) denotes the conditional probability that the i-th word occurs given that the word in the (i-1)-th position has occurred.
According to the N-gram model, computing the probability of a sentence requires computing, in turn, the conditional probability p(w_i | w_{i-1}) of each word given the one word it depends on.
By the definition of conditional probability (the Bayes formula): p(w_i | w_{i-1}) = p(w_i w_{i-1}) / p(w_{i-1}).
The exact computation of p(w_i | w_{i-1}) is complex, so the embodiment of the present invention innovatively simplifies the original formula to: p = p(w_1) · ∏_{i=2}^{m} p(w_i w_{i-1}),
where p(w_i w_{i-1}) denotes the probability that the words in the i-th and (i-1)-th positions occur together, p denotes the probability of the sentence, w_1 denotes the word in the 1st position of the sentence, ∏ denotes the cumulative product, m denotes the number of words in the sentence, w_i denotes the i-th word, and m and i are positive integers.
That is, the embodiment of the present invention reduces computing the conditional probability of the i-th word given the word in the (i-1)-th position to computing the joint probability p(w_i w_{i-1}) that the words in the i-th and (i-1)-th positions occur together, which clearly reduces the workload and complexity of the computation.
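Under this simplification the sentence probability is p(w_1) times a product of pairwise joint probabilities, which can be computed with simple table lookups. A sketch, assuming the lookup tables come from corpus counts:

```python
def sentence_probability(words, p_word, p_pair):
    """p = p(w_1) * prod over i = 2..m of p(w_i w_{i-1}),
    where p_word maps a word to its probability p(w) and
    p_pair maps an adjacent pair to its joint probability."""
    if not words:
        return 0.0
    prob = p_word.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= p_pair.get((prev, cur), 0.0)
    return prob
```

Unseen words or pairs get probability 0 here; a production system would apply some smoothing instead.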
On the basis of step (3), existing phrases are recombined according to the simplified N-gram rule: two adjacent words are merged to generate a new phrase, so ["streamer", "naked wolf", "play", "so good", "so good"] becomes ["streamer naked wolf", "naked wolf play", "play so good", "so good so good"].
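The adjacent-word merge is a one-line transformation; a sketch:

```python
def merge_adjacent(tokens):
    """Merge each pair of adjacent tokens into a new 2-gram phrase."""
    return [tokens[i] + " " + tokens[i + 1] for i in range(len(tokens) - 1)]
```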
The barrage representation of the embodiment of the present invention, based on the 2-gram model, overcomes the bag-of-words model's disadvantage of ignoring context, so the barrage representation is more accurate; in addition, the improvement made on the original 2-gram model reduces computational complexity in practice.
Hash mapping of words: the dimension of the word vectors is set here to 600,000 (based on heuristic experience), i.e., each barrage is represented as a 600,000-dimension vector V in which each position corresponds to one word. For example, "streamer naked wolf" is mapped to position V(0), "naked wolf play" to V(1), "play so good" to V(2), and "so good so good" to V(3) (in reality the mapping is random over the 600,000 positions; for convenience of description the words are mapped to the first 4 positions). The value at each position is the number of times the word occurs, so ["streamer naked wolf": 1, "naked wolf play": 1, "play so good": 1, "so good so good": 1] is converted into the word vector (1, 1, 1, 1, 0, 0, ..., 0), where the remaining 599,996 positions are zero (because the word vector has a fixed length of 600,000). This yields the final barrage subject representation. Once the subject representation of single barrages is obtained, it can lay the technical groundwork for spam barrage recognition.
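The hash mapping can be sketched with a stable hash into the fixed 600,000-dimension space; a dict of index → count stands in for the mostly-zero vector, and the choice of MD5 as the hash function is an assumption (the text only requires some random-looking mapping):

```python
import hashlib

def hash_vectorize(phrase_counts, dim=600_000):
    """Map each phrase to a position in a dim-sized vector via a stable
    hash; the value at each position is the phrase's occurrence count."""
    vec = {}
    for phrase, count in phrase_counts.items():
        digest = hashlib.md5(phrase.encode("utf-8")).hexdigest()
        idx = int(digest, 16) % dim          # position in [0, dim)
        vec[idx] = vec.get(idx, 0) + count   # hash collisions simply add up
    return vec
```

A deterministic hash is used so the same phrase always lands on the same position across runs; Python's built-in hash() is salted per process and would not give a stable mapping.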
It should be understood that the division into the above functional modules, used when describing how the system provided in the embodiment of the present invention communicates between modules, is only an example; in practical applications the above functions can be allocated to different functional modules as needed, i.e., the internal structure of the system can be divided into different functional modules to complete all or part of the functions described above.
Furthermore, the present invention is not limited to the above embodiments; for those skilled in the art, several improvements and modifications can be made without departing from the principles of the invention, and these improvements and modifications are also considered to fall within the protection scope of the present invention. Content not described in detail in this specification belongs to the prior art well known to those skilled in the field.