Disclosure of Invention
To solve the above problems, the present application provides a news entity analysis method and device based on unsupervised learning.
In a first aspect, the present application provides a news entity analysis method based on unsupervised learning, including:
Performing word segmentation on each piece of news data in a plurality of pieces of news data to be processed, and labeling the entities contained in each segmented piece of news to obtain labeling results;
Constructing a distributed representation model based on the labeling results to obtain distributed representation information of the plurality of entities, wherein the distributed representation information is referred to as entity vectors;
and carrying out cluster analysis on the plurality of entities according to their distributed representation information to obtain a clustering result.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
Performing topic clustering according to the labeling results corresponding to each piece of news data, so as to obtain topic clustering results and the topic to which each piece of news data belongs;
and counting, according to the topic to which each piece of news data belongs, the probability with which each of the plurality of entities occurs under the topics of the plurality of pieces of news data, so as to obtain the topic distribution of each entity over the plurality of pieces of news data.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
Determining the clustering effect of the clustering result from the topic distribution of each of the plurality of entities together with the clustering result, wherein the clustering effect is characterized by the average distance between the topic distributions of the entities.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
Searching according to seed information to obtain implicit information related to the entities in the plurality of pieces of news data.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
Constructing relationships between the plurality of entities according to their distributed representations and the number of times they co-occur in the plurality of pieces of news data; and discovering, with a community discovery algorithm, the community structure present in the relationships between the plurality of entities.
In a second aspect, the present application provides a news entity analysis device based on unsupervised learning, including:
a labeling module, configured to perform word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and to label the entities contained in each segmented piece of news to obtain labeling results;
an acquisition module, configured to construct a distributed representation model based on the labeling results to obtain distributed representation information of the plurality of entities, the distributed representation information being referred to as entity vectors;
and a clustering module, configured to perform cluster analysis on the plurality of entities according to their distributed representation information to obtain a clustering result.
Preferably, the news entity analysis device based on unsupervised learning further includes:
a news topic acquisition module, configured to perform topic clustering according to the labeling results corresponding to each piece of news data, so as to obtain topic clustering results and the topic to which each piece of news data belongs;
and a topic distribution acquisition module, configured to count, according to the topic to which each piece of news data belongs, the probability with which each of the plurality of entities occurs under the topics, so as to obtain the topic distribution of each entity over the plurality of pieces of news data.
Preferably, the news entity analysis device based on unsupervised learning further includes:
an entity cluster analysis module, configured to determine the clustering effect of the clustering result from the topic distribution of each of the plurality of entities together with the clustering result, wherein the clustering effect is characterized by the average distance between the topic distributions of the entities.
Preferably, the news entity analysis device based on unsupervised learning further includes:
a seed information search module, configured to search according to seed information to obtain implicit information related to the entities in the plurality of pieces of news data.
Preferably, the news entity analysis device based on unsupervised learning further includes:
a community discovery module, configured to construct relationships between the plurality of entities according to their distributed representations and the number of times they co-occur in the plurality of pieces of news data, and to discover, with a community discovery algorithm, the community structure present in those relationships.
The present application introduces the idea of distributed representation into the processing of news entities: the distributed representation of each entity is obtained from the context surrounding the positions where the entity appears in the news, and a clustering result for the entities is then obtained by cluster analysis over these representations.
Detailed Description
The technical scheme provided by the invention is further described in detail below with reference to the accompanying drawings and the embodiments.
Fig. 1 is a schematic diagram of an application of the technical solution provided in the embodiment of the present application. Referring to fig. 1, after news data is obtained, the present solution can discover similarities among the entities in the news data and the community relations of those entities, and can perform seed information search to find information related to the entities.
Fig. 2 is a schematic diagram of a method according to the technical solution provided in the embodiment of the present application. As shown in fig. 2, the news entity analysis method based on unsupervised learning provided by the present application comprises the following steps:
S201, performing word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and labeling the entities contained in each segmented piece of news to obtain labeling results.
In some possible embodiments, before word segmentation is performed on each piece of news data to be processed, the acquired pieces of news data are preprocessed to obtain a plurality of news texts for segmentation. Word segmentation of a piece of news data is, specifically, performed on the news text corresponding to that piece of news data. The news data itself may be obtained by crawling mainstream news websites with a web crawler.
The preprocessing mainly comprises: converting the Chinese character encoding of the news texts to utf-8; removing illegal characters from the news texts with regular expressions, retaining only Chinese characters, English words, digits, and common punctuation marks; normalizing digits, i.e. converting all Arabic numerals in a news text uniformly into the simplified-Chinese standard writing; normalizing punctuation, i.e. converting half-width characters in a news text uniformly into the corresponding full-width characters; and filtering the news texts by size, deleting as noise any file whose word count, after punctuation is removed, is below a preset number. It will be appreciated that the specific preprocessing rules depend on the actual news texts; for example, the preset number may be 6 or some other value. The preprocessed news data together form a corpus.
Specifically, a pretrained Chinese word segmentation and named entity recognition model can be downloaded, loaded, and used to segment the news text corresponding to each piece of news data. The segmentation results are then checked: a dictionary of words that tend to be segmented incorrectly in the news texts is created and loaded at the segmentation stage to improve segmentation accuracy. Named entity recognition, i.e. entity labeling, is then performed on the segmented news texts to mark the corresponding entities in the text, and the entity labeling results of this first pass are stored. Finally, the segmented and labeled news texts are searched again: whenever a word matches a recognized entity but has not been labeled as one, it is relabeled in the text. The labeling results are thus obtained.
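The final re-labeling pass can be sketched in a few lines of Python. The token sequence, the "ENT"/"O" label scheme, and the entity set below are hypothetical illustrations, not the embodiment's actual model output:

```python
def relabel(tokens, labels, known_entities):
    """Re-mark tokens that match an already-recognized entity
    but were left unlabeled by the first NER pass."""
    out = list(labels)
    for i, tok in enumerate(tokens):
        if tok in known_entities and out[i] == "O":
            out[i] = "ENT"
    return out

tokens = ["ABCD", "said", "that", "ABCD", "will"]
labels = ["ENT", "O", "O", "O", "O"]   # second occurrence missed by NER
print(relabel(tokens, labels, {"ABCD"}))
```

The pass only upgrades unlabeled matches; it never overwrites an existing label.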
S202, constructing a distributed representation model based on the labeling results to obtain distributed representation information of the plurality of entities, wherein the distributed representation information is referred to as entity vectors.
In a more specific example, a skip-gram model and a GloVe model are trained on the labeling results produced by Chinese word segmentation and entity labeling, so as to obtain the distributed representation of each of the plurality of entities, that is, its entity vector.
The skip-gram model is one of the models behind word2vec, a word vector representation model in which, given a central word wi, the model predicts its possible context [wi-k, …, wi-1], wi, [wi+1, …, wi+k]. The probability that a context word wc occurs given wi is written P(wc | wi) and, using the softmax function, is expressed as:

P(wc | wi) = exp(wcᵀ · wi) / Σ_{j=1..V} exp(wjᵀ · wi)

where P is the probability that the word wc occurs, wj is the word vector of word j in the vocabulary, V is the vocabulary size, wcᵀ is the transpose of the vector representation of wc, and wi is the given central word.
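For a toy vocabulary with made-up two-dimensional vectors, the softmax above can be evaluated directly:

```python
import math

def skipgram_prob(w_c, w_i, vocab_vectors):
    """P(w_c | w_i) under the skip-gram softmax:
    exp(w_c . w_i) / sum_j exp(w_j . w_i)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    denom = sum(math.exp(dot(v, w_i)) for v in vocab_vectors.values())
    return math.exp(dot(vocab_vectors[w_c], w_i)) / denom

# illustrative 2-d vectors for a three-word vocabulary
vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
probs = {w: skipgram_prob(w, vecs["a"], vecs) for w in vecs}
print(probs)
```

As expected, the probabilities sum to 1, and the word whose vector is closest to the central word receives the highest probability.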
The GloVe model is likewise a vectorized word representation model, designed so that the vectors contain as much semantic and syntactic information as possible. In addition, the GloVe model introduces global information by building a word co-occurrence matrix over the whole corpus. After a weighting function is introduced, the objective to be optimized is expressed as:

J = Σ_{i,j=1..V} f(Xij) · (wiᵀ · w̃j + bi + b̃j − log Xij)²

where J is the loss function to be minimized, V is the vocabulary size, Xij is the number of times word i and word j co-occur, f(Xij) is the weighting function, wiᵀ is the transpose of the vector of word i, w̃j is the vector of word j, and bi and b̃j are bias parameters, which are learned during training.
Thus, the entity vector of each entity can be obtained through the skip-gram model and the GloVe model.
In some cases, the entity vectors reveal incorrectly recognized entities caused by segmentation errors: entities produced by an erroneous segmentation tend to have similar entity vectors, because their contexts are similar, and they are likely to occur in the same piece of text. The present invention therefore uses the entity vectors produced by model training to judge whether differently segmented strings denote the same entity. When the character string of one entity recognition result is a substring of another, the two results are considered the same entity if either of the following two rules is satisfied:
Rule 1: the cosine similarity of the two entities is greater than 0.80.
Rule 2: the cosine similarity of the two entities is greater than 0.75 and the two appear simultaneously in at least one news text.
Here, the cosine similarity is expressed as:

sim(wi, wj) = (wi · wj) / (‖wi‖ · ‖wj‖)

where wi and wj are the entity vectors of the two entities. The judgment of whether two entities are identical can be adjusted by changing the cosine similarity thresholds.
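The two merging rules can be sketched as follows. The vectors and document sets are hypothetical; the substring test and the 0.80 / 0.75 thresholds follow the text:

```python
import math

def cosine(a, b):
    """Cosine similarity: (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def same_entity(e1, e2, vectors, doc_sets):
    """Apply the two merging rules when one surface string is a
    substring of the other (rule 1: sim > 0.80; rule 2: sim > 0.75
    plus co-occurrence in at least one news text)."""
    if e1 not in e2 and e2 not in e1:
        return False
    sim = cosine(vectors[e1], vectors[e2])
    if sim > 0.80:                       # rule 1
        return True
    co_occur = bool(doc_sets[e1] & doc_sets[e2])
    return sim > 0.75 and co_occur       # rule 2

vectors = {"ABCD": [0.9, 0.1], "ABC": [0.88, 0.14]}  # illustrative vectors
docs = {"ABCD": {1, 3}, "ABC": {3}}                  # news-text ids
print(same_entity("ABCD", "ABC", vectors, docs))
```

Raising the thresholds merges fewer segmentation variants; lowering them merges more, as noted in the text.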
For example, suppose a four-character name "ABCD" is an entity in the news text. During word segmentation it may variously be split as "ABCD", "ABC", "AB", and so on, so segmentation can affect the entities that are generated. Alternatively, a news text may contain a person's name and a place name that are written identically, say "AA", which likewise affects the generated entities. The two rules above are therefore used to judge whether such results denote the same entity.
S203, carrying out cluster analysis on the plurality of entities according to their distributed representation information to obtain a clustering result.
In a more specific example, k-means clustering is performed on the distributed representation information obtained from the skip-gram model, that is, on the entity vectors, using the cosine similarity between entity vectors as the distance measure, so as to obtain a clustering result for the entities.
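Cosine-based k-means is equivalent to ordinary k-means on length-normalized vectors, since on unit vectors squared Euclidean distance is a monotone function of cosine distance. A compact sketch with made-up vectors and a fixed iteration count:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def kmeans_cosine(vectors, k, iters=50, seed=0):
    """K-means on unit-normalized vectors, so squared Euclidean
    distance orders points exactly like cosine distance."""
    pts = [normalize(v) for v in vectors]
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(pts, k)]
    assign = [0] * len(pts)
    for _ in range(iters):
        # assignment step: nearest center
        for i, p in enumerate(pts):
            assign[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
        # update step: renormalized mean of each cluster
        for c in range(k):
            members = [pts[i] for i in range(len(pts)) if assign[i] == c]
            if members:
                centers[c] = normalize(
                    [sum(col) / len(members) for col in zip(*members)])
    return assign

vecs = [[1, 0.1], [0.9, 0.2], [0.1, 1], [0.2, 0.9]]  # two obvious groups
print(kmeans_cosine(vecs, 2))
```

The two direction-aligned pairs end up in the same clusters regardless of which label each cluster receives.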
Fig. 3 is a schematic diagram of a method according to another embodiment of the present application. As shown in fig. 3, in some possible embodiments, the news entity analysis method based on unsupervised learning may further include the following steps:
S204, performing topic clustering according to the labeling results corresponding to each piece of news data, so as to obtain topic clustering results and the topic to which each piece of news data belongs.
Specifically, the labeling results corresponding to each piece of news data are clustered with a Latent Dirichlet Allocation (LDA) topic model, the topic distribution of each piece of news data is obtained by Gibbs sampling, and the topic with the highest probability is then assigned to each piece of news data as the topic to which it belongs.
The LDA model is a three-layer Bayesian network. It takes Dirichlet distributions with parameters α and β as the priors over, respectively, the topic distribution of each piece of news data and the word distribution of each topic; the topic of each word position in a news text is generated from the text's topic distribution, the word at each position is generated from the topic of that position, and the result is an order-independent combination of words. The generation probability is expressed as:

p(θ, z, w | α, β) = p(θ | α) · Π_{n=1..N} p(zn | θ) · p(wn | zn, β)

where p denotes probability, θ the topic distribution, N the number of words, z the topics, w the words, and α and β the hyperparameters of LDA.
Thus, the probability of generating the words of each news text is expressed as:

p(w | α, β) = ∫ p(θ | α) · Π_{n=1..N} Σ_{zn} p(zn | θ) · p(wn | zn, β) dθ

where p denotes probability, w the words, α and β the hyperparameters of LDA, θ the topic distribution, N the number of words, and z the topics.
The model parameters are then estimated with Gibbs sampling, the optimization target being to maximize the likelihood of generating the corpus. After a number of iterations, the samples from the burn-in period are discarded to obtain the parameter estimates.
The concrete input to the algorithm is a sequence of text words, for example w = {w1, w2, …, wM}, together with the hyperparameters α and β and the number of topics K. The output is the corresponding sequence of text topics, for example z = {z1, z2, …, zM}, drawn from the posterior probability distribution p(z | w, α, β), together with the estimate of the parameter θ.
The algorithm is expressed as follows:
S2041, initialize all elements nmk and nkv of the count matrices, and all elements nk and nm of the count vectors, to 0. Here nmk is the number of words assigned to the k-th topic in the text corresponding to the m-th piece of news data, nkv is the number of times word v is assigned to the k-th topic, and nk and nm are the corresponding totals per topic and per text.
S2042, for every word wmn in the m-th text, sample a topic and increment the text and topic counts, updating nmk and nkv accordingly.
S2043, repeat the following operations until the burn-in period ends:
S2043a, if the current word is the v-th word and its topic is the k-th topic, decrement the counts;
S2043b, resample the topic from the full conditional distribution, given by the following formula:

p(zi = k | z−i, w) ∝ (nkv + β) / (nk + V·β) · (nmk + α) / (nm + K·α)
S2043c, increment the new counts.
S2044, using the resulting sample counts, the parameter θ can be calculated as follows:

θmk = (nmk + α) / (nm + K·α)
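Given the count matrices from the sampler, the estimate of θ is a one-liner per element; the counts below are hypothetical:

```python
def estimate_theta(n_mk, n_m, alpha, K):
    """theta_mk = (n_mk + alpha) / (n_m + K * alpha): the smoothed
    per-document topic proportions recovered from Gibbs counts."""
    return [[(n_mk[m][k] + alpha) / (n_m[m] + K * alpha) for k in range(K)]
            for m in range(len(n_m))]

n_mk = [[3, 1], [0, 4]]   # words in doc m assigned to topic k (toy counts)
n_m = [4, 4]              # total words per doc
theta = estimate_theta(n_mk, n_m, alpha=0.5, K=2)
print(theta)
```

Each row is a proper distribution over the K topics, and the Dirichlet prior α keeps unseen topics at nonzero probability.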
In some possible embodiments, the labeling results corresponding to the pieces of news data may include many interference words irrelevant to any topic. In this case, such invalid information must be removed first.
For example, when the labeling result corresponding to a news item includes high-frequency common words such as "hello", the present solution removes them with a stop-word list.
S205, counting, according to the topic to which each piece of news data belongs, the probability with which each of the plurality of entities occurs under the topics of the plurality of pieces of news data, and obtaining the topic distribution of each entity over the plurality of pieces of news data.
Specifically, from the frequency with which each entity appears in the news of each topic, the distribution of that entity's occurrences over the news topics can be obtained; this entity topic distribution is represented by a vector of length M, where M is the number of topics.
For example, suppose an entity A appears in 4 news items, one belonging to topic T0 and three belonging to topic TM-1. The topic distribution vector of entity A can then be expressed as (0.25, 0, …, 0, 0.75).
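Counting an entity's topic occurrences and normalizing reproduces the example above (topics indexed 0..M−1, here with M = 4):

```python
from collections import Counter

def entity_topic_distribution(appearances, num_topics):
    """Normalized frequency of the topics of the news items an
    entity appears in: a vector of length num_topics."""
    counts = Counter(appearances)
    total = len(appearances)
    return [counts[t] / total for t in range(num_topics)]

# entity A appears in 4 news items: one with topic 0, three with topic M-1
dist_a = entity_topic_distribution([0, 3, 3, 3], num_topics=4)
print(dist_a)
```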
Fig. 4 is a schematic diagram of a method according to another embodiment of the present application. Referring to fig. 4, in some possible embodiments, the news entity analysis method based on unsupervised learning may further include:
S206, determining the clustering effect of the clustering result from the topic distribution of each of the plurality of entities together with the clustering result, wherein the clustering effect is characterized by the average distance between the topic distributions of the entities.
Specifically, from the obtained topic distributions of the entities and the clustering result of the entities, the clustering effect can be determined; the clustering effect in turn reflects the effectiveness of the distributed representation (i.e. the entity vectors) of the entities.
Denote the topic distributions of the entities within one cluster by [x1, x2, …, xM]. The topical similarity within the cluster is measured by the average pairwise distance of these distributions:

dist = 2 / (M·(M−1)) · Σ_{i<j} d(xi, xj)

where dist is the average distance, d(xi, xj) the distance between two topic distributions, and M the number of topic distributions in the cluster.
The smaller the dist value, the higher the topical similarity. In addition, the topic with the largest share in a clustering result is taken as the topic of that entity cluster; the larger its share, the more concentrated the topics of the clustering result are.
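A minimal sketch of the intra-cluster average-distance measure, using Euclidean distance between hypothetical topic distributions:

```python
import math

def avg_pairwise_distance(distributions):
    """Average Euclidean distance over all pairs of entity topic
    distributions in one cluster; smaller = more topically coherent."""
    n = len(distributions)
    if n < 2:
        return 0.0
    total = sum(
        math.dist(distributions[i], distributions[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)

tight = [[1, 0], [0.9, 0.1], [0.95, 0.05]]  # topically coherent cluster
loose = [[1, 0], [0, 1], [0.5, 0.5]]        # topically mixed cluster
print(avg_pairwise_distance(tight), avg_pairwise_distance(loose))
```

A well-formed clustering of entity vectors should produce clusters closer to the first case than to the second.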
In some possible embodiments, referring to fig. 4, the news entity analysis method based on unsupervised learning may further include:
S207, searching according to seed information to obtain hidden information related to the entities in the plurality of pieces of news data.
Specifically, the seed information is the information used in retrieval. For example, if information related to an entity A is desired, entity A is the seed information. Using the entity vectors generated by the skip-gram model, the vector representation of the input seed is looked up, the top-K most similar entities in the vocabulary are retrieved, and those with sufficiently high similarity are kept as content associated with the input.
Some information is not stated explicitly in the text. To find such hidden related information as far as possible, the designed algorithm proceeds as follows:
S2071, look up all possible recognition results found for the seed entity during the entity merging stage, and take every possible recognized form as a candidate for the input search.
For example, when an entity named "ABCD" is searched, the information possibly related to "ABCD", "ABC", "AB", and so on is used as seeds for the search, yielding the search recognition results of all related entities; these results constitute the implicit information.
S2072, examine each element among the search candidates; if it is an entity, extract the words among its Top-50 most similar in the model dictionary whose similarity exceeds the threshold 0.80, and add them to the keyword list.
S2073, find all the entities in the keyword list, and look up, for each such entity, its Top-30 most similar words in the vocabulary.
S2074, for each newly found entity, compute the intersection of its Top-50 similar words with the existing keyword list; add the entity's words to the list if the overlap is larger than 12, or, when the keyword list contains fewer than 24 words, if the overlap is larger than half the number of words in the list.
S2075, repeat the operations of S2073 and S2074 until the elements of the list no longer change.
The specific threshold values can be tuned on the results for different corpora: lower the thresholds to increase recall, raise them to increase precision.
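Steps S2071–S2075 can be sketched as an iterative list expansion. The similarity lists, the entity test, and the toy data below are hypothetical stand-ins for the model dictionary and the Top-50/Top-30 lookups:

```python
def expand_keywords(seed_entities, top_words, is_entity, overlap_min=12):
    """Grow a keyword list from seed entities. top_words[e] stands in
    for e's most-similar word list; an entity already on the list may
    contribute its similar words when they overlap the current list
    enough (the 12 / half-of-list rule of S2074)."""
    keywords = []
    for e in seed_entities:                       # S2071/S2072: seed the list
        for w in top_words.get(e, []):
            if w not in keywords:
                keywords.append(w)
    changed = True
    while changed:                                # S2075: iterate to fixpoint
        changed = False
        for w in list(keywords):                  # S2073: entities on the list
            if not is_entity(w) or w not in top_words:
                continue
            overlap = len(set(top_words[w]) & set(keywords))
            threshold = overlap_min if len(keywords) >= 24 else len(keywords) / 2
            if overlap > threshold:               # S2074: admit its words
                for new in top_words[w]:
                    if new not in keywords:
                        keywords.append(new)
                        changed = True
    return keywords

top_words = {"A": ["B", "C", "x1"], "B": ["C", "x1", "x2"]}
print(expand_keywords(["A"], top_words, lambda w: w in top_words))
```

The loop terminates because the list only grows and the vocabulary is finite.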
In some possible embodiments, referring to fig. 4, the news entity analysis method based on unsupervised learning may further include:
S208, constructing relationships among the plurality of entities according to their distributed representations and the number of times they co-occur in the plurality of pieces of news data, and discovering, with a community discovery algorithm, the community structure present in those relationships.
Specifically, the relationships between entities are constructed using the entity distributed representations (i.e. entity vectors) generated by the GloVe model and the number of co-occurrences of the entities in the news texts.
By way of example, the relationship between two entities may be constructed by the following formula:
rel'(w1, w2) = min(log2(co_times + 1) · sim(w1, w2) / 2, ξ2)
where w1 and w2 denote the two entities and co_times denotes the number of times entities w1 and w2 co-occur. The more often two entities co-occur and the more similar their entity vectors, the stronger the association between them. The entities are considered associated only when their similarity exceeds ξ1; ξ2 is the maximum value of the entity relationship; and ξ3 is used to prune some weak associations, removing unnecessary edges so that the results are easier to analyze.
These thresholds are adjusted to specific needs: the smaller a threshold, the more entity relationships are retained, but too small a value introduces a large amount of noise, while too large a value retains too few relationships and loses information.
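A sketch of the relationship formula, treating the similarity and co-occurrence count as given and using illustrative values for the thresholds ξ1 and ξ2:

```python
import math

def relation(sim, co_times, xi1=0.5, xi2=2.0):
    """rel'(w1, w2) = min(log2(co_times + 1) * sim / 2, xi2),
    with no association (0) when similarity does not exceed xi1.
    The xi1/xi2 values are illustrative; the text tunes them per corpus."""
    if sim <= xi1:
        return 0.0
    return min(math.log2(co_times + 1) * sim / 2, xi2)

print(relation(0.9, 7))      # frequent co-occurrence, similar vectors
print(relation(0.3, 7))      # similarity below xi1: no edge
print(relation(0.99, 1000))  # capped at xi2
```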
After the relationships among the entities have been constructed, the Louvain algorithm is used to perform community discovery on the entity relationship graph. Communities are groups with potentially close relations in the graph structure formed by the entities and the associations among them, and can be used to find hidden groups in the texts. The Louvain algorithm greedily optimizes the modularity of the community division. Modularity is an index that measures the quality of a community division when the true division is unknown, and can be used to measure the degree of association within communities; it is expressed by the formula:
Q = (1 / 2m) · Σ_{v,w} [Avw − kv·kw / (2m)] · δ(cv, cw)

where Q is the modularity, m the sum of the connection weights in the network, v and w any two nodes in the entity relationship graph, Avw the connection weight between the two nodes, kv the degree of node v, kw the degree of node w, and δ(cv, cw) indicates whether v and w belong to the same community: it is 1 if they do and 0 otherwise.
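The modularity Q can be computed directly from an edge-weight dictionary; the small two-community graph below is illustrative:

```python
def modularity(edges, community):
    """Q = (1/2m) * sum_{v,w} [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w)
    for an undirected weighted graph given as {(v, w): weight}."""
    # build the symmetric adjacency and weighted degrees
    A = {}
    for (v, w), wt in edges.items():
        A[(v, w)] = A.get((v, w), 0) + wt
        A[(w, v)] = A.get((w, v), 0) + wt
    nodes = {v for v, _ in A}
    k = {v: sum(A.get((v, u), 0) for u in nodes) for v in nodes}
    two_m = sum(k.values())
    Q = 0.0
    for v in nodes:
        for u in nodes:
            if community[v] == community[u]:
                Q += A.get((v, u), 0) - k[v] * k[u] / two_m
    return Q / two_m

# two triangles joined by one bridge edge c-d
edges = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 1, ("c", "d"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("f", "d"): 1}
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
all_one = {n: 0 for n in good}
print(modularity(edges, good), modularity(edges, all_one))
```

Putting everything in one community gives Q = 0, while the natural two-triangle split scores higher, which is exactly the gap the Louvain algorithm climbs.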
In some possible embodiments, the Louvain algorithm may also be optimized; it can be divided into two steps:
1) Assign each node its own community and compute the modularity; then, for each node, compute the modularity gain of merging it into a neighbor's community, and perform the merge with the largest modularity gain; repeat until the modularity no longer increases;
2) Collapse each new community into a single node and repeat the operations in 1) until convergence.
The effectiveness of the community discovery task can be illustrated by comparing the association strength of entity pairs drawn pairwise from the same community of the division result with that of randomly drawn entity pairs.
Fig. 5 is a schematic diagram of a device according to the technical solution provided in the embodiment of the present application. As shown in fig. 5, the news entity analysis device based on unsupervised learning may include:
a labeling module 501, configured to perform word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and to label the entities contained in each segmented piece of news to obtain labeling results;
an obtaining module 502, configured to construct a distributed representation model based on the labeling results, so as to obtain distributed representation information of the plurality of entities, the distributed representation information being referred to as entity vectors;
and a clustering module 503, configured to perform cluster analysis on the plurality of entities according to their distributed representation information to obtain a clustering result.
In some possible embodiments, the news entity analysis device based on unsupervised learning further includes:
a news topic obtaining module 504, configured to perform topic clustering according to the labeling results corresponding to each of the plurality of pieces of news data, so as to obtain topic clustering results and the topic to which each piece of news data belongs;
and a topic distribution obtaining module 505, configured to calculate, according to the topic to which each piece of news data belongs, the probability with which each entity occurs under the topics, so as to obtain the topic distribution of each entity over the plurality of news items.
In some possible embodiments, the news entity analysis device based on unsupervised learning further includes:
an entity cluster analysis module 506, configured to determine the clustering effect of the clustering result from the topic distributions of the plurality of entities together with the clustering result, where the clustering effect is characterized by the average distance between the topic distributions in the topic clustering result.
In some possible embodiments, the news entity analysis device based on unsupervised learning further includes:
a seed information search module 507, configured to search according to seed information to obtain implicit information related to the entities in the plurality of pieces of news data.
In some possible embodiments, the news entity analysis device based on unsupervised learning further includes:
a community discovery module 508, configured to construct relationships between the entities according to their distributed representations and their co-occurrence counts in the news data, and to discover, with a community discovery algorithm, the community structure present in the relationships between the entities.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a(n)" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the embodiments illustrates the general principles of the invention and is not intended to limit the invention to the particular embodiments disclosed or otherwise restrict its scope; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within the scope of the invention.