Disclosure of Invention
In order to solve the above problems, the present application provides a news entity analysis method and apparatus based on unsupervised learning.
In a first aspect, the present application provides a news entity analysis method based on unsupervised learning, including:
performing word segmentation on each piece of news data in a plurality of pieces of news data to be processed respectively, and labeling a plurality of entities contained in each piece of news after word segmentation to obtain a labeling result;
constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, wherein the distributed representation information is identified as entity vectors;
and according to the distributed representation information of the entities, carrying out clustering analysis on the entities to obtain a clustering result.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
according to the topic to which each piece of news data belongs, the probability of each entity in the plurality of entities appearing in the topic to which the plurality of pieces of news data belong is counted, and the topic distribution of each entity in the plurality of entities in the plurality of pieces of news data is obtained.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented by an average distance of the topic distribution of each entity.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
and searching according to the seed information to obtain the hidden information related to the plurality of entities in the plurality of news data.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
constructing relationships between the plurality of entities according to the distributed representations of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of pieces of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
In a second aspect, the present application provides a news entity analysis apparatus based on unsupervised learning, including:
the labeling module is used for respectively performing word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and labeling a plurality of entities contained in each piece of news after the word segmentation, so as to obtain a labeling result;
the acquisition module is used for constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, and the distributed representation information is marked as entity vectors;
and the clustering module is used for carrying out clustering analysis on the entities according to the distributed representation information of the entities so as to obtain a clustering result.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
the news topic acquisition module is used for carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
and the topic distribution acquisition module is used for counting the probability of each entity in the plurality of entities appearing in the topics to which the plurality of news data belong according to the topic to which each piece of news data belongs, so as to obtain the topic distribution of each entity in the plurality of news data.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
and the entity clustering analysis module is used for determining the clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, and the clustering effect is represented by the average distance of the topic distribution of each entity.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
and the seed information searching module is used for searching according to the seed information to obtain the hidden information related to the entities in the news data.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
the community discovery module is used for constructing the relationship among the plurality of entities according to the distributed representation of the plurality of entities and the co-occurrence times of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
In the method and the apparatus, the idea of distributed representation is introduced into the processing of news entities: the distributed representation of an entity is obtained from the context around the entity's position in the news, and the clustering result of the entities is obtained by clustering analysis of the entities.
Detailed Description
The technical solution provided by the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is an application schematic diagram of a technical scheme provided in an embodiment of the present application. Referring to fig. 1, after obtaining news data, according to the technical solution, similarity between entities in the news data and a community relationship between the entities can be found, and seed information search can be performed to find information related to the entities.
Fig. 2 is a schematic method diagram of a technical solution provided in an embodiment of the present application. As shown in fig. 2, the method for analyzing news entities in unsupervised learning provided by the present application includes:
S201: respectively performing word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and labeling a plurality of entities contained in each piece of news after word segmentation to obtain a labeling result.
In some possible embodiments, before word segmentation is performed on each piece of news data in the plurality of pieces of news data to be processed, the obtained pieces of news data are preprocessed to obtain a plurality of news texts on which the word segmentation is performed. The news data and the news texts correspond one to one; performing word segmentation on a piece of news data thus specifically means performing word segmentation on its corresponding news text. The news data can be obtained by crawling mainstream news websites with a web crawler.
The preprocessing mainly includes the following steps: converting the Chinese encodings used by the news texts into utf-8; removing illegal characters from the news texts with regular expressions, keeping only Chinese characters, English words, digits, and common punctuation marks; number conversion, i.e. uniformly converting all Arabic numerals in the news texts into the standard simplified-Chinese written form; punctuation conversion, i.e. uniformly converting half-width characters in the news texts into the corresponding full-width characters; and news-text size screening, i.e. deleting noise files whose word count, after punctuation is removed, is smaller than a preset number. It will be appreciated that the specific preprocessing rules depend on the actual news texts; for example, the preset number may be 6 or some other value. The news data together constitute a corpus.
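The preprocessing steps above can be sketched as follows. This is a minimal illustration: the retained character classes, the half-width-to-full-width mapping, and the helper name `preprocess` are assumptions chosen for the example rather than the embodiment's exact rules.

```python
import re

# Characters to keep: Chinese characters, English letters, digits, common punctuation.
KEEP = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9\uFF0C\uFF0E\uFF1F\uFF01\uFF1B\uFF1A,.?!;: ]")

# Half-width -> full-width punctuation conversion (a small assumed subset).
HALF_TO_FULL = {",": "\uFF0C", "?": "\uFF1F", "!": "\uFF01", ";": "\uFF1B", ":": "\uFF1A"}

def preprocess(text: str, min_len: int = 6):
    """Clean one news text; return None for noise files shorter than min_len."""
    text = KEEP.sub("", text)                              # drop illegal characters
    text = "".join(HALF_TO_FULL.get(c, c) for c in text)   # half- to full-width
    # size screening: count characters with punctuation and spaces removed
    stripped = re.sub(r"[\uFF0C\uFF0E\uFF1F\uFF01\uFF1B\uFF1A,.?!;: ]", "", text)
    if len(stripped) < min_len:
        return None
    return text
```

A text shorter than the preset number is treated as noise and dropped; otherwise the cleaned text is returned for segmentation.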
Specifically, a pre-trained model for Chinese word segmentation and named entity recognition can be downloaded and loaded, and word segmentation is performed on the news text corresponding to each piece of news data in the plurality of pieces of news data. The segmentation results are checked, and a dictionary is created for words in the news texts that are easily mis-segmented and loaded at the segmentation stage, thereby improving segmentation precision. Named entity recognition, i.e. entity labeling, is then performed on the segmented news texts to mark the corresponding entities in the text, and the entity labeling result produced by this first named-entity pass is stored. The segmented and labeled news texts are then searched and matched: when a word is found to match a recognized entity but has not been labeled as an entity, it is re-labeled in the text. In this way, the labeling result is obtained.
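The second labeling pass described above, re-tagging segmented words that match an already-recognized entity but were not labeled, can be sketched as follows. The token/tag lists and the `ENT`/`O` tag scheme are illustrative assumptions, not a format prescribed by the embodiment.

```python
def relabel(tokens, tags, recognized_entities):
    """Second pass: any token that equals a recognized entity but is
    tagged 'O' (not an entity) is re-labeled as an entity mention."""
    new_tags = []
    for token, tag in zip(tokens, tags):
        if tag == "O" and token in recognized_entities:
            new_tags.append("ENT")   # re-label the missed mention
        else:
            new_tags.append(tag)
    return new_tags

# Toy example: the second "ABCD" was missed by the first pass.
tokens = ["ABCD", "visited", "Paris", "and", "ABCD", "returned"]
tags   = ["ENT",  "O",       "ENT",   "O",   "O",    "O"]
entities = {t for t, g in zip(tokens, tags) if g == "ENT"}  # first-pass entity set
fixed = relabel(tokens, tags, entities)
```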
S202: and constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, wherein the distributed representation information is identified as entity vectors.
In a more specific example, based on the labeling result after Chinese word segmentation and entity labeling, a skip-gram model and a GloVe model are trained to obtain the distributed representation of each entity in the plurality of entities, that is, the entity vector of the entity.
The skip-gram model is the word-vector representation model used in word2vec: given a central word $w_i$, it predicts the possible context $[w_{i-k},\ldots,w_{i-1}]$ and $[w_{i+1},\ldots,w_{i+k}]$. The probability that a context word $w_c$ occurs is denoted $P(w_c \mid w_i)$. In this scheme, a softmax function is used to compute the probability that word $w_c$ occurs given $w_i$:

$$P(w_c \mid w_i) = \frac{\exp\left(w_c^{\top} w_i\right)}{\sum_{j=1}^{|V|} \exp\left(w_j^{\top} w_i\right)}$$

where $P$ denotes the probability that word $w_c$ occurs, $w_j$ denotes the word vector of word $j$ in the vocabulary, $V$ denotes the total vocabulary, $w_c^{\top}$ denotes the transpose of the vector representation of word $w_c$, and $w_i$ denotes the given central word.
The GloVe model is likewise a word-vectorization model, designed so that the vectors capture as much semantic and syntactic information as possible. In addition, GloVe introduces global information by building a word co-occurrence matrix over the whole corpus. Its optimization objective, with a weighting function introduced, is expressed as:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})\left(w_i^{\top} w_j + b_i + b_j - \log X_{ij}\right)^2$$

where $J$ denotes the loss function to be minimized, $V$ denotes the total vocabulary, $X_{ij}$ denotes the number of co-occurrences of word $i$ and word $j$, $f(X_{ij})$ denotes the weight function, $w_i^{\top}$ denotes the transpose of the word vector of word $i$, $w_j$ denotes the word vector of word $j$, and $b_i$ and $b_j$ denote the word bias parameters, which are obtained through training.
Thus, the entity vector of each entity can be obtained through the skip-gram model and the GloVe model.
In some cases, the entity vectors can be used to find entity results that are mis-recognized because of erroneous word segmentation: such erroneous entities tend to have similar entity vectors, since they share similar contexts and appear in the same texts. The scheme therefore uses the entity vectors produced by model training to judge whether different segmentation results refer to the same entity. When the character string of one entity recognition result is a substring of another entity recognition result, the two results are considered the same entity if either of the following rules is satisfied:
rule 1. cosine similarity of two entities > 0.80.
Rule 2. cosine similarity of the two entities > 0.75, and the two occur simultaneously in at least one news text.
Here the cosine similarity is expressed as:

$$\mathrm{sim}(w_i, w_j) = \frac{w_i^{\top} w_j}{\|w_i\|\,\|w_j\|}$$

where $w_i$ and $w_j$ respectively denote the entity vectors of the two entities. The set of results assumed to be the same entity can be adjusted by changing the cosine-similarity thresholds.
For example, suppose a news text contains a person name of four characters, "ABCD". During word segmentation it may be split into "ABCD", "ABC", "AB", and so on, so the segmentation affects the generated entities. Alternatively, a news text may contain a person name and a place name that are both written "AA", which also affects the generated entities. The two rules above are therefore used to judge whether such results are the same entity.
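Rules 1 and 2 can be sketched as follows; the toy vectors, the co-occurrence flag, and the helper names are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two entity vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_entity(name_a, name_b, vec_a, vec_b, co_occur: bool) -> bool:
    """Apply the two merging rules when one surface string contains the other."""
    if name_a not in name_b and name_b not in name_a:
        return False                       # substring precondition not met
    sim = cosine(vec_a, vec_b)
    if sim > 0.80:                         # rule 1
        return True
    return sim > 0.75 and co_occur         # rule 2

v = np.array([1.0, 2.0, 3.0])
```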
S203: and according to the distributed representation information of the entities, carrying out clustering analysis on the entities to obtain a clustering result.
In a more specific example, according to the distributed representation information of the entities obtained through the skip-gram model, namely, the entity vectors of the entities, the cosine similarity between the entity vectors is used as a distance measurement function to perform k-means clustering, so as to obtain the clustering result of the entities.
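The clustering step can be sketched as follows. Because cosine similarity on L2-normalized vectors orders points the same way as Euclidean distance, a minimal version normalizes the entity vectors and assigns each to the most-similar center (a spherical k-means); the toy data, farthest-point initialization, and k = 2 are assumptions of the example.

```python
import numpy as np

def cosine_kmeans(X, k, iters=10):
    """k-means under cosine distance: L2-normalize rows, farthest-point init,
    then repeatedly assign each point to its most-similar center."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = [X[0]]                                   # farthest-point initialization
    for _ in range(1, k):
        sims = np.max(np.stack([X @ c for c in centers]), axis=0)
        centers.append(X[int(np.argmin(sims))])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)      # nearest center by cosine similarity
        for j in range(k):
            if np.any(labels == j):
                c = X[labels == j].mean(axis=0)
                centers[j] = c / np.linalg.norm(c)     # keep centers on the unit sphere
    return labels

X = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
labels = cosine_kmeans(X, k=2)
```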
Fig. 3 is a schematic method diagram of a technical solution provided in another embodiment of the present application. As shown in fig. 3, in some possible embodiments, the method for analyzing news entities for unsupervised learning may further include the following steps:
S204: performing topic clustering according to the labeling result corresponding to each piece of news data, to obtain a topic clustering result and the topic to which each piece of news data belongs.
Specifically, the labeling result corresponding to each piece of news data is clustered with a Latent Dirichlet Allocation (LDA) topic clustering model, the topic distribution of each piece of news data is obtained by Gibbs sampling, and the topic with the largest probability is then assigned to each piece of news data as the topic to which it belongs.
The LDA model is a three-layer Bayesian network: each piece of news data and each topic satisfy Dirichlet distributions, with parameters $\alpha$ and $\beta$ respectively, as prior distributions; a topic is generated for each word position in a news text according to the topic distribution of that piece of news data, the word at each position is generated according to the word distribution of its topic, and finally an order-independent combination of words is obtained. The generation probability is expressed as:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

where $p$ denotes probability, $\theta$ denotes the distribution of topics, $N$ denotes the number of words, $z$ denotes the topics, $w$ denotes the words, and $\alpha$ and $\beta$ denote the hyper-parameters of LDA.
Thus, the probability of generating the words of each news text is expressed as:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

where $p$ denotes probability, $w$ denotes the words, $\alpha$ and $\beta$ denote the hyper-parameters of LDA, $\theta$ denotes the distribution of topics, $N$ denotes the number of words, and $z$ denotes the topics.
Then, the model parameters are estimated with Gibbs sampling; the optimization goal is to maximize the likelihood of generating the texts in the corpus. After multiple iterations, the samples from the burn-in period are discarded to obtain the parameter estimates.
The specific algorithm input is a sequence of text words, for example $w = \{w_1, w_2, \ldots, w_M\}$. The input parameters are the hyper-parameters $\alpha$ and $\beta$ and the number of topics $K$. The output is a sequence of text topics, for example $z = \{z_1, z_2, \ldots, z_M\}$, together with the counts of the posterior probability distribution $p(z \mid w, \alpha, \beta)$ of each single text and an estimate of the parameter $\theta$.
The algorithm is represented as follows:
S2041: initialize the elements $n_{mk}$ and $n_{kv}$ of all count matrices and the elements $n_k$ and $n_m$ of the count vectors to 0, where $n_{mk}$ denotes the number of words belonging to the $k$-th topic in the text corresponding to the $m$-th piece of news data, and $n_{kv}$ denotes the number of times word $v$ belongs to the $k$-th topic.
S2042: for every word $w_{mn}$ in the $m$-th text, sample a topic and increment the corresponding counts, including $n_{mk}$ and $n_{kv}$.
S2043: the following operations are cyclically carried out until the combustion period is entered:
S2043a: if the current word is the $v$-th word and its topic is the $k$-th topic, decrement the corresponding counts;
S2043b: sample a new topic according to the full conditional distribution:

$$p(z_{mn} = k \mid z_{\neg mn}, w) \propto (n_{mk} + \alpha)\,\frac{n_{kv} + \beta}{n_k + |V|\beta}$$

S2043c: increment the corresponding counts again.
S2044: using the resulting sample count, a parameter θ can be calculated, as follows:
in some possible embodiments, the tagging result corresponding to each news data in the plurality of news data may include a plurality of topic-independent interfering words. In this case, it is necessary to remove the invalid information first.
For example, a labeling result corresponding to a piece of news contains many high-frequency common words such as "of", "hello", and the like, and the scheme adopts a method of removing stop words.
S205: according to the topic to which each piece of news data belongs, the probability of each entity in the plurality of entities appearing in the topic to which the plurality of pieces of news data belong is counted, and the topic distribution of each entity in the plurality of entities in the plurality of pieces of news data is obtained.
Specifically, by counting the frequency with which each entity occurs in the news of each topic, the distribution of the entities over the news topics can be obtained; the entity topic distribution is represented as a vector of length M.
For example, suppose there is an entity A that appears in 4 news items, of which one belongs to topic $T_0$ and three belong to topic $T_{M-1}$. The topic distribution vector of entity A can then be expressed as $(1/4, 0, \ldots, 0, 3/4)$.
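The counting of an entity's topic distribution can be sketched as follows, reproducing the example of entity A appearing in four news items; the toy number of topics is an assumption.

```python
def entity_topic_distribution(appearances, num_topics):
    """appearances: list of topic ids of the news items the entity occurs in.
    Returns the length-M topic distribution vector of the entity."""
    counts = [0] * num_topics
    for topic in appearances:
        counts[topic] += 1
    total = sum(counts)
    return [c / total for c in counts]

M = 5                                               # toy number of topics
dist = entity_topic_distribution([0, M - 1, M - 1, M - 1], M)   # entity A's 4 appearances
```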
Fig. 4 is a schematic method diagram of a technical solution provided in another embodiment of the present application, please refer to fig. 4, in some possible embodiments, the method for analyzing news entities in unsupervised learning may further include:
S206: determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented by an average distance of the topic distribution of each entity.
Specifically, according to the obtained topic distribution of the entity and the clustering result of the entity, the clustering effect of the entity can be determined, and the effectiveness of the distributed representation (i.e. entity vector) of the entity is represented by the clustering effect.
The topic distribution of an entity within a cluster is denoted $[x_1, x_2, \ldots, x_M]$, and topic similarity is measured by the average pairwise distance between these distributions within the cluster:

$$\mathrm{dist} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left\| x^{(i)} - x^{(j)} \right\|$$

where dist denotes the average distance, $n$ denotes the number of entities in the cluster, $x^{(i)}$ denotes the topic distribution of the $i$-th entity, and M denotes the number of topics.
The smaller the dist value, the higher the topic similarity. In addition, the topic occupying the largest proportion in a clustering result is taken as the topic of that entity cluster; the larger this proportion, the more concentrated the topics of the clustering result.
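The average-distance evaluation can be sketched as follows; using Euclidean distance between the length-M topic vectors is an assumption, since the embodiment does not fix the metric.

```python
import math

def average_pairwise_distance(distributions):
    """Average Euclidean distance between the topic distributions of the
    entities in one cluster; smaller values mean higher topic similarity."""
    n = len(distributions)
    total, pairs = 0.0, 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += math.dist(distributions[i], distributions[j])
            pairs += 1
    return total / pairs

# A tight cluster (similar topic vectors) versus a loose one.
tight = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1]]
loose = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
```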
In some possible embodiments, referring to fig. 4, the method for analyzing news entities in unsupervised learning may further include:
S207: searching according to the seed information to obtain the hidden information related to the plurality of entities in the plurality of news data.
Specifically, the seed information is the information used in retrieval. For example, if information related to entity A is desired, entity A is the seed information. Using the entity vectors generated by the skip-gram model, the vector representation of the input seed is compared against the lexicon, the Top-K most similar entities are retrieved, and the parts with higher similarity are returned as the content associated with the input.
Some information may be hidden in the text and remain undiscovered. To find as much hidden related information as possible, the designed algorithm proceeds as follows:
S2071: searching all possible recognition results found for the seed entity in the entity merging stage, and taking all of the possible recognition forms as candidates for the input search.
For example, when searching for an entity named "ABCD", possible related forms such as "ABCD", "ABC" and "AB" can all be used as seeds, so that the search recognition results of all related entities are obtained; these recognition results are the hidden information.
S2072: judging each element in the search candidates, if the element is an entity, extracting words with similar Top50 in the model dictionary and the similarity degree larger than a threshold value of 0.80, and adding the words into the keyword list.
S2073: all entities are found from the keyword list, and the entities in the vocabulary of each entity similarity Top30 are searched.
S2074: searching the intersection of the word of the new entity Top50 and the existing keyword list, and adding the new entity Top50 into the list if the number of the intersection is more than 12; if the number of words in the keyword list is not enough 24, the number of overlaps may be more than half of the number of list words.
S2075: the operations of S2073 and S2074 are repeated until the elements in the list are no longer changed.
The specific threshold values can be tuned according to the results on different corpora: lowering a threshold increases recall, and raising it increases precision.
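Steps S2072–S2075 can be sketched as the following expansion loop. The model's Top-50/Top-30 similarity lookups are replaced here by precomputed toy neighbor tables, and the table contents and helper names are assumptions for illustration.

```python
def expand_keywords(seeds, top_words, top_entities, is_entity, max_rounds=10):
    """Iterative keyword-list expansion (S2072-S2075), with the model's
    similarity lookups replaced by precomputed neighbor tables."""
    keywords = set()
    for s in seeds:                         # S2072: seed entities contribute their top words
        if is_entity(s):
            keywords |= set(top_words.get(s, []))
    for _ in range(max_rounds):             # S2075: repeat until the list stops changing
        before = set(keywords)
        for e in [w for w in keywords if is_entity(w)]:   # S2073: entities in the list
            for cand in top_entities.get(e, []):
                overlap = keywords & set(top_words.get(cand, []))
                # S2074: required overlap is 12, or half the list when it is short
                need = 12 if len(keywords) >= 24 else len(keywords) // 2
                if len(overlap) > need:
                    keywords |= set(top_words.get(cand, []))
        if keywords == before:
            break
    return keywords

# Toy tables: seed "A" pulls in entity "B", whose neighbor "C" shares enough words.
top_words = {"A": ["B", "w1", "w2"], "C": ["w1", "w2", "w5"]}
top_entities = {"B": ["C"]}
is_entity = lambda w: w in {"A", "B", "C"}
kw = expand_keywords(["A"], top_words, top_entities, is_entity)
```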
In some possible embodiments, referring to fig. 4, the method for analyzing news entities in unsupervised learning may further include:
S208: constructing relationships between the plurality of entities based on the distributed representations of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
Specifically, the relationship between entities is constructed by using an entity distributed representation (namely an entity vector) generated by a glove model and the co-occurrence times of the entities in the news text.
Illustratively, the relationship between two entities may be constructed by the following formula:
$$\mathrm{rel}'(w_1, w_2) = \min\left(\log_2(\mathrm{co\_times} + 1) \cdot \mathrm{sim}(w_1, w_2) / 2,\ \xi_2\right)$$

where $w_1$ and $w_2$ denote the two entities, and co_times denotes the number of co-occurrences of entities $w_1$ and $w_2$. The more often two entities co-occur and the more similar their entity vectors, the stronger the association between them. When the similarity exceeds $\xi_1$, the entities are considered to be associated; $\xi_2$ denotes the maximum value of the entity relationship; and $\xi_3$ is used to remove weak associations, reducing unnecessary edges and making the results easier to analyze.
The threshold is adjusted according to specific needs, the smaller the threshold is, the larger the number of entity relationships considered is, but if the threshold is too small, a large amount of noise is introduced, and if the threshold is too large, the number of relationships considered is likely to be too small to lose information.
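The relationship formula can be sketched as follows; the concrete values of the thresholds $\xi_1$ and $\xi_2$ are assumptions of the example, since the embodiment leaves them to tuning.

```python
import math

def rel(w1_vec, w2_vec, co_times, xi1=0.3, xi2=3.0):
    """rel'(w1, w2) = min(log2(co_times + 1) * sim(w1, w2) / 2, xi2),
    returning 0 when similarity does not exceed xi1 (no association)."""
    dot = sum(a * b for a, b in zip(w1_vec, w2_vec))
    norm1 = math.sqrt(sum(a * a for a in w1_vec))
    norm2 = math.sqrt(sum(b * b for b in w2_vec))
    sim = dot / (norm1 * norm2)            # cosine similarity of the entity vectors
    if sim <= xi1:
        return 0.0                         # below xi1: no association, no edge
    return min(math.log2(co_times + 1) * sim / 2, xi2)
```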
After the construction of the relationships between the entities is completed, a community discovery task is performed on the entity relationship graph using the Louvain algorithm. A community is a group of entities with relatively close latent relations, mined from the graph structure formed by the entities and the associations between them; it can be used to discover groups hidden in the texts. The Louvain algorithm optimizes the modularity of the community division with a greedy strategy. Modularity is an index that measures the quality of a community division when the true division is unknown, and can be used to measure the degree of association within communities. It is expressed as:
$$Q = \frac{1}{2m} \sum_{v,w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)$$

where $Q$ denotes the modularity, $m$ denotes the sum of the edge weights in the network, $v$ and $w$ are any two nodes in the entity relationship graph, $A_{vw}$ denotes the connection weight between the two nodes, $k_v$ and $k_w$ denote the degrees of nodes $v$ and $w$, and $\delta(c_v, c_w)$ indicates whether $v$ and $w$ belong to the same community: 1 if they do, 0 otherwise.
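The modularity formula can be checked with a small sketch; the two-triangle graph is a toy example whose natural division into the two triangles has Q = 0.5.

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum_{v,w} [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w)
    for an undirected weighted adjacency matrix `adj`."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)   # each undirected edge counted twice = 2m
    degree = [sum(row) for row in adj]
    q = 0.0
    for v in range(n):
        for w in range(n):
            if communities[v] == communities[w]:   # delta(c_v, c_w)
                q += adj[v][w] - degree[v] * degree[w] / two_m
    return q / two_m

# Two triangles with no edges between them.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q = modularity(adj, [0, 0, 0, 1, 1, 1])
```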
In some possible embodiments, the Louvain algorithm may also be optimized, and the optimization may be divided into two steps:
1) assign each node to its own community and compute the current modularity; then, for each node, compute the modularity gain of merging it into a neighbor's community, perform the merge with the largest modularity gain, and repeat several times until the modularity no longer increases;
2) contract each new community into a single node and repeat the operations of step 1) until convergence.
The effectiveness of the community discovery task can be demonstrated by computing the relationship strength of entity pairs drawn from the same community in the division result and comparing it with that of entity pairs drawn at random.
Fig. 5 is a schematic diagram of an apparatus according to the technical solution provided in the example of the present application. As shown in fig. 5, the unsupervised learning-based news entity analyzing apparatus may include:
a labeling module 501, configured to perform word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and label a plurality of entities included in each piece of news after the word segmentation to obtain a labeling result;
an obtaining module 502, configured to construct a distributed representation model based on the labeling result, to obtain distributed representation information of the multiple entities, where the distributed representation information is identified as entity vectors;
the clustering module 503 is configured to perform clustering analysis on the multiple entities according to the distributed representation information of the multiple entities to obtain a clustering result.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
a news topic acquisition module 504, configured to perform topic clustering according to the labeling result corresponding to each piece of news data in the plurality of pieces of news data, to obtain a topic clustering result and the topic to which each piece of news data belongs;
a topic distribution obtaining module 505, configured to count, according to the topic to which each piece of news data belongs, the probability that each entity in the plurality of entities appears in the topics to which the pieces of news data belong, so as to obtain the topic distribution of each entity in the plurality of pieces of news data.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
an entity cluster analysis module 506, configured to determine the clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, where the clustering effect is represented by the average distance of the topic distribution of each entity.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
the seed information searching module 507 is configured to search according to seed information to obtain the hidden information related to the plurality of entities in the plurality of pieces of news data.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
a community discovery module 508, used for constructing the relationship among the plurality of entities according to the distributed representation of the plurality of entities and the co-occurrence times of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.