Disclosure of Invention
In order to solve the above problems, the present application provides a news entity analysis method and apparatus based on unsupervised learning.
In a first aspect, the present application provides a news entity analysis method based on unsupervised learning, including:
performing word segmentation on each piece of news data in a plurality of pieces of news data to be processed respectively, and labeling a plurality of entities contained in each piece of news after word segmentation to obtain a labeling result;
constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, wherein the distributed representation information is identified as entity vectors;
and according to the distributed representation information of the entities, carrying out clustering analysis on the entities to obtain a clustering result.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
according to the topic to which each piece of news data belongs, the probability of each entity in the plurality of entities appearing in the topic to which the plurality of pieces of news data belong is counted, and the topic distribution of each entity in the plurality of entities in the plurality of pieces of news data is obtained.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented by an average distance of the topic distribution of each entity.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
and searching according to the seed information to obtain the hidden information related to the plurality of entities in the plurality of news data.
Preferably, the unsupervised learning-based news entity analysis method further comprises:
constructing relationships between the plurality of entities according to the distributed representations of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of pieces of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
In a second aspect, the present application provides a news entity analysis apparatus based on unsupervised learning, including:
the labeling module is used for respectively performing word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and labeling a plurality of entities contained in each piece of news after the word segmentation, so as to obtain a labeling result;
the acquisition module is used for constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, and the distributed representation information is marked as entity vectors;
and the clustering module is used for carrying out clustering analysis on the entities according to the distributed representation information of the entities so as to obtain a clustering result.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
the news topic acquisition module is used for carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
and the topic distribution acquisition module is used for counting the probability of each entity in the plurality of entities appearing in the topics to which the plurality of news data belong according to the topic to which each piece of news data belongs, so as to obtain the topic distribution of each entity in the plurality of news data.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
and the entity clustering analysis module is used for determining the clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, and the clustering effect is represented by the average distance of the topic distribution of each entity.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
and the seed information searching module is used for searching according to the seed information to obtain the hidden information related to the entities in the news data.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
the community discovery module is used for constructing the relationship among the plurality of entities according to the distributed representation of the plurality of entities and the co-occurrence times of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
In the method and the apparatus, the idea of distributed representation is introduced into the processing of news entities: the distributed representation of an entity is obtained from the context around the entity's position in the news, and the clustering result of the entities is obtained by clustering analysis of the entities.
Detailed Description
The technical solution provided by the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is an application schematic diagram of a technical scheme provided in an embodiment of the present application. Referring to fig. 1, after obtaining news data, according to the technical solution, similarity between entities in the news data and a community relationship between the entities can be found, and seed information search can be performed to find information related to the entities.
Fig. 2 is a schematic method diagram of a technical solution provided in an embodiment of the present application. As shown in fig. 2, the method for analyzing news entities in unsupervised learning provided by the present application includes:
S201: respectively performing word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and labeling a plurality of entities contained in each piece of news after word segmentation to obtain a labeling result.
In some possible embodiments, before word segmentation is performed on each piece of news data in the plurality of pieces of news data to be processed, the obtained pieces of news data are preprocessed to obtain a plurality of news texts on which the word segmentation is performed. The news data and the news texts correspond one to one; performing word segmentation on a piece of news data thus specifically means performing word segmentation on its corresponding news text. The news data can be obtained by crawling mainstream news websites with a web crawler.
The preprocessing mainly includes the following steps: converting the Chinese encodings used by the news texts into utf-8; removing illegal characters from the news texts with regular expressions, keeping only Chinese characters, English words, digits, and common punctuation marks; number conversion, i.e. uniformly converting all Arabic numerals in the news texts into the standard simplified-Chinese written form; punctuation conversion, i.e. uniformly converting half-width characters in the news texts into the corresponding full-width characters; and news-text size screening, i.e. deleting noise files whose word count, after punctuation is removed, is smaller than a preset number. It will be appreciated that the specific preprocessing rules depend on the actual news texts; for example, the preset number may be 6 or some other value. The news data together constitute a corpus.
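The preprocessing steps above can be sketched as follows. This is a minimal illustration: the retained character classes, the half-width-to-full-width mapping, and the helper name `preprocess` are assumptions chosen for the example rather than the embodiment's exact rules.

```python
import re

# Characters to keep: Chinese characters, English letters, digits, common punctuation.
KEEP = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9\uFF0C\uFF0E\uFF1F\uFF01\uFF1B\uFF1A,.?!;: ]")

# Half-width -> full-width punctuation conversion (a small assumed subset).
HALF_TO_FULL = {",": "\uFF0C", "?": "\uFF1F", "!": "\uFF01", ";": "\uFF1B", ":": "\uFF1A"}

def preprocess(text: str, min_len: int = 6):
    """Clean one news text; return None for noise files shorter than min_len."""
    text = KEEP.sub("", text)                              # drop illegal characters
    text = "".join(HALF_TO_FULL.get(c, c) for c in text)   # half- to full-width
    # size screening: count characters with punctuation and spaces removed
    stripped = re.sub(r"[\uFF0C\uFF0E\uFF1F\uFF01\uFF1B\uFF1A,.?!;: ]", "", text)
    if len(stripped) < min_len:
        return None
    return text
```

A text shorter than the preset number is treated as noise and dropped; otherwise the cleaned text is returned for segmentation.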
Specifically, a pre-trained model for Chinese word segmentation and named entity recognition can be downloaded and loaded, and word segmentation is performed on the news text corresponding to each piece of news data in the plurality of pieces of news data. The segmentation results are checked, and a dictionary is created for words in the news texts that are easily mis-segmented and loaded at the segmentation stage, thereby improving segmentation precision. Named entity recognition, i.e. entity labeling, is then performed on the segmented news texts to mark the corresponding entities in the text, and the entity labeling result produced by this first named-entity pass is stored. The segmented and labeled news texts are then searched and matched: when a word is found to match a recognized entity but has not been labeled as an entity, it is re-labeled in the text. In this way, the labeling result is obtained.
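The second labeling pass described above, re-tagging segmented words that match an already-recognized entity but were not labeled, can be sketched as follows. The token/tag lists and the `ENT`/`O` tag scheme are illustrative assumptions, not a format prescribed by the embodiment.

```python
def relabel(tokens, tags, recognized_entities):
    """Second pass: any token that equals a recognized entity but is
    tagged 'O' (not an entity) is re-labeled as an entity mention."""
    new_tags = []
    for token, tag in zip(tokens, tags):
        if tag == "O" and token in recognized_entities:
            new_tags.append("ENT")   # re-label the missed mention
        else:
            new_tags.append(tag)
    return new_tags

# Toy example: the second "ABCD" was missed by the first pass.
tokens = ["ABCD", "visited", "Paris", "and", "ABCD", "returned"]
tags   = ["ENT",  "O",       "ENT",   "O",   "O",    "O"]
entities = {t for t, g in zip(tokens, tags) if g == "ENT"}  # first-pass entity set
fixed = relabel(tokens, tags, entities)
```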
S202: and constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, wherein the distributed representation information is identified as entity vectors.
In a more specific example, based on the labeling result after Chinese word segmentation and entity labeling, a skip-gram model and a GloVe model are trained to obtain the distributed representation of each entity in the plurality of entities, that is, the entity vector of the entity.
The skip-gram model is the word-vector representation model used in word2vec: given a central word $w_i$, it predicts the possible context $[w_{i-k},\ldots,w_{i-1}]$ and $[w_{i+1},\ldots,w_{i+k}]$. The probability that a context word $w_c$ occurs is denoted $P(w_c \mid w_i)$. In this scheme, a softmax function is used to compute the probability that word $w_c$ occurs given $w_i$:

$$P(w_c \mid w_i) = \frac{\exp\left(w_c^{\top} w_i\right)}{\sum_{j=1}^{|V|} \exp\left(w_j^{\top} w_i\right)}$$

where $P$ denotes the probability that word $w_c$ occurs, $w_j$ denotes the word vector of word $j$ in the vocabulary, $V$ denotes the total vocabulary, $w_c^{\top}$ denotes the transpose of the vector representation of word $w_c$, and $w_i$ denotes the given central word.
The GloVe model is likewise a word-vectorization model, designed so that the vectors capture as much semantic and syntactic information as possible. In addition, GloVe introduces global information by building a word co-occurrence matrix over the whole corpus. Its optimization objective, with a weighting function introduced, is expressed as:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})\left(w_i^{\top} w_j + b_i + b_j - \log X_{ij}\right)^2$$

where $J$ denotes the loss function to be minimized, $V$ denotes the total vocabulary, $X_{ij}$ denotes the number of co-occurrences of word $i$ and word $j$, $f(X_{ij})$ denotes the weight function, $w_i^{\top}$ denotes the transpose of the word vector of word $i$, $w_j$ denotes the word vector of word $j$, and $b_i$ and $b_j$ denote the word bias parameters, which are obtained through training.
Thus, the entity vector of each entity can be obtained through the skip-gram model and the GloVe model.
In some cases, the entity vectors can be used to find entity results that are mis-recognized because of erroneous word segmentation: such erroneous entities tend to have similar entity vectors, since they share similar contexts and appear in the same texts. The scheme therefore uses the entity vectors produced by model training to judge whether different segmentation results refer to the same entity. When the character string of one entity recognition result is a substring of another entity recognition result, the two results are considered the same entity if either of the following rules is satisfied:
rule 1. cosine similarity of two entities > 0.80.
Rule 2. cosine similarity of the two entities > 0.75, and the two occur simultaneously in at least one news text.
Here the cosine similarity is expressed as:

$$\mathrm{sim}(w_i, w_j) = \frac{w_i^{\top} w_j}{\|w_i\|\,\|w_j\|}$$

where $w_i$ and $w_j$ respectively denote the entity vectors of the two entities. The set of results assumed to be the same entity can be adjusted by changing the cosine-similarity thresholds.
For example, suppose a news text contains a person name of four characters, "ABCD". During word segmentation it may be split into "ABCD", "ABC", "AB", and so on, so the segmentation affects the generated entities. Alternatively, a news text may contain a person name and a place name that are both written "AA", which also affects the generated entities. The two rules above are therefore used to judge whether such results are the same entity.
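Rules 1 and 2 can be sketched as follows; the toy vectors, the co-occurrence flag, and the helper names are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two entity vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_entity(name_a, name_b, vec_a, vec_b, co_occur: bool) -> bool:
    """Apply the two merging rules when one surface string contains the other."""
    if name_a not in name_b and name_b not in name_a:
        return False                       # substring precondition not met
    sim = cosine(vec_a, vec_b)
    if sim > 0.80:                         # rule 1
        return True
    return sim > 0.75 and co_occur         # rule 2

v = np.array([1.0, 2.0, 3.0])
```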
S203: and according to the distributed representation information of the entities, carrying out clustering analysis on the entities to obtain a clustering result.
In a more specific example, according to the distributed representation information of the entities obtained through the skip-gram model, namely, the entity vectors of the entities, the cosine similarity between the entity vectors is used as a distance measurement function to perform k-means clustering, so as to obtain the clustering result of the entities.
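The clustering step can be sketched as follows. Because cosine similarity on L2-normalized vectors orders points the same way as Euclidean distance, a minimal version normalizes the entity vectors and assigns each to the most-similar center (a spherical k-means); the toy data, farthest-point initialization, and k = 2 are assumptions of the example.

```python
import numpy as np

def cosine_kmeans(X, k, iters=10):
    """k-means under cosine distance: L2-normalize rows, farthest-point init,
    then repeatedly assign each point to its most-similar center."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = [X[0]]                                   # farthest-point initialization
    for _ in range(1, k):
        sims = np.max(np.stack([X @ c for c in centers]), axis=0)
        centers.append(X[int(np.argmin(sims))])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)      # nearest center by cosine similarity
        for j in range(k):
            if np.any(labels == j):
                c = X[labels == j].mean(axis=0)
                centers[j] = c / np.linalg.norm(c)     # keep centers on the unit sphere
    return labels

X = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
labels = cosine_kmeans(X, k=2)
```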
Fig. 3 is a schematic method diagram of a technical solution provided in another embodiment of the present application. As shown in fig. 3, in some possible embodiments, the method for analyzing news entities for unsupervised learning may further include the following steps:
S204: performing topic clustering according to the labeling result corresponding to each piece of news data, to obtain a topic clustering result and the topic to which each piece of news data belongs.
Specifically, the labeling result corresponding to each piece of news data is clustered with a Latent Dirichlet Allocation (LDA) topic clustering model, the topic distribution of each piece of news data is obtained by Gibbs sampling, and the topic with the largest probability is then assigned to each piece of news data as the topic to which it belongs.
The LDA model is a three-layer Bayesian network: each piece of news data and each topic satisfy Dirichlet distributions, with parameters $\alpha$ and $\beta$ respectively, as prior distributions; a topic is generated for each word position in a news text according to the topic distribution of that piece of news data, the word at each position is generated according to the word distribution of its topic, and finally an order-independent combination of words is obtained. The generation probability is expressed as:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

where $p$ denotes probability, $\theta$ denotes the distribution of topics, $N$ denotes the number of words, $z$ denotes the topics, $w$ denotes the words, and $\alpha$ and $\beta$ denote the hyper-parameters of LDA.
Thus, the probability of generating the words of each news text is expressed as:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

where $p$ denotes probability, $w$ denotes the words, $\alpha$ and $\beta$ denote the hyper-parameters of LDA, $\theta$ denotes the distribution of topics, $N$ denotes the number of words, and $z$ denotes the topics.
Then, the model parameters are estimated with Gibbs sampling; the optimization goal is to maximize the likelihood of generating the texts in the corpus. After multiple iterations, the samples from the burn-in period are discarded to obtain the parameter estimates.
The specific algorithm input is a sequence of text words, for example $w = \{w_1, w_2, \ldots, w_M\}$. The input parameters are the hyper-parameters $\alpha$ and $\beta$ and the number of topics $K$. The output is a sequence of text topics, for example $z = \{z_1, z_2, \ldots, z_M\}$, together with the counts of the posterior probability distribution $p(z \mid w, \alpha, \beta)$ of each single text and an estimate of the parameter $\theta$.
The algorithm is represented as follows:
S2041: initialize the elements $n_{mk}$ and $n_{kv}$ of all count matrices and the elements $n_k$ and $n_m$ of the count vectors to 0, where $n_{mk}$ denotes the number of words belonging to the $k$-th topic in the text corresponding to the $m$-th piece of news data, and $n_{kv}$ denotes the number of times word $v$ belongs to the $k$-th topic.
S2042: for every word $w_{mn}$ in the $m$-th text, sample a topic and increment the corresponding counts, including $n_{mk}$ and $n_{kv}$.
S2043: the following operations are cyclically carried out until the combustion period is entered:
S2043a: if the current word is the $v$-th word and its topic is the $k$-th topic, decrement the corresponding counts;
S2043b: sample a new topic according to the full conditional distribution:

$$p(z_{mn} = k \mid z_{\neg mn}, w) \propto (n_{mk} + \alpha)\,\frac{n_{kv} + \beta}{n_k + |V|\beta}$$

S2043c: increment the corresponding counts again.
S2044: using the resulting sample count, a parameter θ can be calculated, as follows:
in some possible embodiments, the tagging result corresponding to each news data in the plurality of news data may include a plurality of topic-independent interfering words. In this case, it is necessary to remove the invalid information first.
For example, a labeling result corresponding to a piece of news contains many high-frequency common words such as "of", "hello", and the like, and the scheme adopts a method of removing stop words.
S205: according to the topic to which each piece of news data belongs, the probability of each entity in the plurality of entities appearing in the topic to which the plurality of pieces of news data belong is counted, and the topic distribution of each entity in the plurality of entities in the plurality of pieces of news data is obtained.
Specifically, by counting the frequency with which each entity occurs in the news of each topic, the distribution of the entities over the news topics can be obtained; the entity topic distribution is represented as a vector of length M.
For example, suppose there is an entity A that appears in 4 news items, of which one belongs to topic $T_0$ and three belong to topic $T_{M-1}$. The topic distribution vector of entity A can then be expressed as $(1/4, 0, \ldots, 0, 3/4)$.
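The counting of an entity's topic distribution can be sketched as follows, reproducing the example of entity A appearing in four news items; the toy number of topics is an assumption.

```python
def entity_topic_distribution(appearances, num_topics):
    """appearances: list of topic ids of the news items the entity occurs in.
    Returns the length-M topic distribution vector of the entity."""
    counts = [0] * num_topics
    for topic in appearances:
        counts[topic] += 1
    total = sum(counts)
    return [c / total for c in counts]

M = 5                                               # toy number of topics
dist = entity_topic_distribution([0, M - 1, M - 1, M - 1], M)   # entity A's 4 appearances
```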
Fig. 4 is a schematic method diagram of a technical solution provided in another embodiment of the present application, please refer to fig. 4, in some possible embodiments, the method for analyzing news entities in unsupervised learning may further include:
S206: determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented by an average distance of the topic distribution of each entity.
Specifically, according to the obtained topic distribution of the entity and the clustering result of the entity, the clustering effect of the entity can be determined, and the effectiveness of the distributed representation (i.e. entity vector) of the entity is represented by the clustering effect.
The topic distribution of an entity within a cluster is denoted $[x_1, x_2, \ldots, x_M]$, and topic similarity is measured by the average pairwise distance between these distributions within the cluster:

$$\mathrm{dist} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left\| x^{(i)} - x^{(j)} \right\|$$

where dist denotes the average distance, $n$ denotes the number of entities in the cluster, $x^{(i)}$ denotes the topic distribution of the $i$-th entity, and M denotes the number of topics.
The smaller the dist value, the higher the topic similarity. In addition, the topic occupying the largest proportion in a clustering result is taken as the topic of that entity cluster; the larger this proportion, the more concentrated the topics of the clustering result.
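The average-distance evaluation can be sketched as follows; using Euclidean distance between the length-M topic vectors is an assumption, since the embodiment does not fix the metric.

```python
import math

def average_pairwise_distance(distributions):
    """Average Euclidean distance between the topic distributions of the
    entities in one cluster; smaller values mean higher topic similarity."""
    n = len(distributions)
    total, pairs = 0.0, 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += math.dist(distributions[i], distributions[j])
            pairs += 1
    return total / pairs

# A tight cluster (similar topic vectors) versus a loose one.
tight = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1]]
loose = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
```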
In some possible embodiments, referring to fig. 4, the method for analyzing news entities in unsupervised learning may further include:
S207: searching according to the seed information to obtain the hidden information related to the plurality of entities in the plurality of news data.
Specifically, the seed information is the information used in retrieval. For example, if information related to entity A is desired, entity A is the seed information. Using the entity vectors generated by the skip-gram model, the vector representation of the input seed is compared against the lexicon, the Top-K most similar entities are retrieved, and the parts with higher similarity are returned as the content associated with the input.
Some information may be hidden in the text and remain undiscovered. To find as much hidden related information as possible, the designed algorithm proceeds as follows:
S2071: searching all possible recognition results found for the seed entity in the entity merging stage, and taking all of the possible recognition forms as candidates for the input search.
For example, when searching for an entity named "ABCD", possible related forms such as "ABCD", "ABC" and "AB" can all be used as seeds, so that the search recognition results of all related entities are obtained; these recognition results are the hidden information.
S2072: judging each element in the search candidates, if the element is an entity, extracting words with similar Top50 in the model dictionary and the similarity degree larger than a threshold value of 0.80, and adding the words into the keyword list.
S2073: all entities are found from the keyword list, and the entities in the vocabulary of each entity similarity Top30 are searched.
S2074: searching the intersection of the word of the new entity Top50 and the existing keyword list, and adding the new entity Top50 into the list if the number of the intersection is more than 12; if the number of words in the keyword list is not enough 24, the number of overlaps may be more than half of the number of list words.
S2075: the operations of S2073 and S2074 are repeated until the elements in the list are no longer changed.
The specific threshold values can be tuned according to the results on different corpora: lowering a threshold increases recall, and raising it increases precision.
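Steps S2072–S2075 can be sketched as the following expansion loop. The model's Top-50/Top-30 similarity lookups are replaced here by precomputed toy neighbor tables, and the table contents and helper names are assumptions for illustration.

```python
def expand_keywords(seeds, top_words, top_entities, is_entity, max_rounds=10):
    """Iterative keyword-list expansion (S2072-S2075), with the model's
    similarity lookups replaced by precomputed neighbor tables."""
    keywords = set()
    for s in seeds:                         # S2072: seed entities contribute their top words
        if is_entity(s):
            keywords |= set(top_words.get(s, []))
    for _ in range(max_rounds):             # S2075: repeat until the list stops changing
        before = set(keywords)
        for e in [w for w in keywords if is_entity(w)]:   # S2073: entities in the list
            for cand in top_entities.get(e, []):
                overlap = keywords & set(top_words.get(cand, []))
                # S2074: required overlap is 12, or half the list when it is short
                need = 12 if len(keywords) >= 24 else len(keywords) // 2
                if len(overlap) > need:
                    keywords |= set(top_words.get(cand, []))
        if keywords == before:
            break
    return keywords

# Toy tables: seed "A" pulls in entity "B", whose neighbor "C" shares enough words.
top_words = {"A": ["B", "w1", "w2"], "C": ["w1", "w2", "w5"]}
top_entities = {"B": ["C"]}
is_entity = lambda w: w in {"A", "B", "C"}
kw = expand_keywords(["A"], top_words, top_entities, is_entity)
```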
In some possible embodiments, referring to fig. 4, the method for analyzing news entities in unsupervised learning may further include:
S208: constructing relationships between the plurality of entities based on the distributed representations of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
Specifically, the relationship between entities is constructed by using an entity distributed representation (namely an entity vector) generated by a glove model and the co-occurrence times of the entities in the news text.
Illustratively, the relationship between two entities may be constructed by the following formula:
$$\mathrm{rel}'(w_1, w_2) = \min\left(\log_2(\mathrm{co\_times} + 1) \cdot \mathrm{sim}(w_1, w_2) / 2,\ \xi_2\right)$$

where $w_1$ and $w_2$ denote the two entities, and co_times denotes the number of co-occurrences of entities $w_1$ and $w_2$. The more often two entities co-occur and the more similar their entity vectors, the stronger the association between them. When the similarity exceeds $\xi_1$, the entities are considered to be associated; $\xi_2$ denotes the maximum value of the entity relationship; and $\xi_3$ is used to remove weak associations, reducing unnecessary edges and making the results easier to analyze.
The threshold is adjusted according to specific needs, the smaller the threshold is, the larger the number of entity relationships considered is, but if the threshold is too small, a large amount of noise is introduced, and if the threshold is too large, the number of relationships considered is likely to be too small to lose information.
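The relationship formula can be sketched as follows; the concrete values of the thresholds $\xi_1$ and $\xi_2$ are assumptions of the example, since the embodiment leaves them to tuning.

```python
import math

def rel(w1_vec, w2_vec, co_times, xi1=0.3, xi2=3.0):
    """rel'(w1, w2) = min(log2(co_times + 1) * sim(w1, w2) / 2, xi2),
    returning 0 when similarity does not exceed xi1 (no association)."""
    dot = sum(a * b for a, b in zip(w1_vec, w2_vec))
    norm1 = math.sqrt(sum(a * a for a in w1_vec))
    norm2 = math.sqrt(sum(b * b for b in w2_vec))
    sim = dot / (norm1 * norm2)            # cosine similarity of the entity vectors
    if sim <= xi1:
        return 0.0                         # below xi1: no association, no edge
    return min(math.log2(co_times + 1) * sim / 2, xi2)
```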
After the construction of the relationships between the entities is completed, a community discovery task is performed on the entity relationship graph using the Louvain algorithm. A community is a group of entities with relatively close latent relations, mined from the graph structure formed by the entities and the associations between them; it can be used to discover groups hidden in the texts. The Louvain algorithm optimizes the modularity of the community division with a greedy strategy. Modularity is an index that measures the quality of a community division when the true division is unknown, and can be used to measure the degree of association within communities. It is expressed as:
$$Q = \frac{1}{2m} \sum_{v,w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)$$

where $Q$ denotes the modularity, $m$ denotes the sum of the edge weights in the network, $v$ and $w$ are any two nodes in the entity relationship graph, $A_{vw}$ denotes the connection weight between the two nodes, $k_v$ and $k_w$ denote the degrees of nodes $v$ and $w$, and $\delta(c_v, c_w)$ indicates whether $v$ and $w$ belong to the same community: 1 if they do, 0 otherwise.
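The modularity formula can be checked with a small sketch; the two-triangle graph is a toy example whose natural division into the two triangles has Q = 0.5.

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum_{v,w} [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w)
    for an undirected weighted adjacency matrix `adj`."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)   # each undirected edge counted twice = 2m
    degree = [sum(row) for row in adj]
    q = 0.0
    for v in range(n):
        for w in range(n):
            if communities[v] == communities[w]:   # delta(c_v, c_w)
                q += adj[v][w] - degree[v] * degree[w] / two_m
    return q / two_m

# Two triangles with no edges between them.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q = modularity(adj, [0, 0, 0, 1, 1, 1])
```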
In some possible embodiments, the Louvain algorithm may also be optimized, and the optimization may be divided into two steps:
1) assign each node to its own community and compute the current modularity; then, for each node, compute the modularity gain of merging it into a neighbor's community, perform the merge with the largest modularity gain, and repeat several times until the modularity no longer increases;
2) contract each new community into a single node and repeat the operations of step 1) until convergence.
The effectiveness of the community discovery task can be demonstrated by computing the relationship strength of entity pairs drawn from the same community in the division result and comparing it with that of entity pairs drawn at random.
Fig. 5 is a schematic diagram of an apparatus according to the technical solution provided in the example of the present application. As shown in fig. 5, the unsupervised learning-based news entity analyzing apparatus may include:
a labeling module 501, configured to perform word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and label a plurality of entities included in each piece of news after the word segmentation to obtain a labeling result;
an obtaining module 502, configured to construct a distributed representation model based on the labeling result, to obtain distributed representation information of the multiple entities, where the distributed representation information is identified as entity vectors;
the clustering module 503 is configured to perform clustering analysis on the multiple entities according to the distributed representation information of the multiple entities to obtain a clustering result.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
a news topic acquisition module 504, configured to perform topic clustering according to the labeling result corresponding to each piece of news data in the plurality of pieces of news data, to obtain a topic clustering result and the topic to which each piece of news data belongs;
a topic distribution obtaining module 505, configured to count, according to the topic to which each piece of news data belongs, the probability that each entity in the plurality of entities appears in the topics to which the pieces of news data belong, so as to obtain the topic distribution of each entity in the plurality of pieces of news data.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
an entity cluster analysis module 506, configured to determine the clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, where the clustering effect is represented by the average distance of the topic distribution of each entity.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
the seed information searching module 507 is configured to search according to seed information to obtain the hidden information related to the plurality of entities in the plurality of pieces of news data.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
a community discovery module 508, used for constructing the relationship among the plurality of entities according to the distributed representation of the plurality of entities and the co-occurrence times of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.