CN113420112A - News entity analysis method and device based on unsupervised learning - Google Patents


Info

Publication number
CN113420112A
CN113420112A (application number CN202110685518.3A)
Authority
CN
China
Prior art keywords
entities
news
entity
topic
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110685518.3A
Other languages
Chinese (zh)
Other versions
CN113420112B (en)
Inventor
周军
张震
杨家豪
沈亮
张鹏远
王立强
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN202110685518.3A
Publication of CN113420112A
Application granted
Publication of CN113420112B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese


The invention relates to a news entity analysis method and device based on unsupervised learning. The method includes: performing word segmentation on each piece of news data among multiple pieces of news data to be processed, and labeling the entities contained in each segmented piece of news to obtain labeling results; constructing a distributed representation model based on the labeling results to obtain distributed representation information of the entities, the distributed representation information being identified as entity vectors; and performing cluster analysis on the entities according to their distributed representation information to obtain clustering results. The present application introduces the idea of distributed representations into the processing of news entities: the distributed representation of each entity is obtained from the context in which the entity appears, and the clustering result of the entities is obtained through cluster analysis.


Description

News entity analysis method and device based on unsupervised learning
Technical Field
The application relates to the field of text information mining, in particular to a news entity analysis method and device based on unsupervised learning.
Background
News is an important source of open-source information. Because news is easy to obtain, spreads widely, and is timely, news analysis has long been a hotspot of text analysis and mining, and there is already a large body of research on news text analysis.
Entities in news can be extracted with existing tools for Chinese named entity recognition, and this line of work is mature. However, because the entities involved are of many types and cover a wide range, and corresponding dictionary entries do not necessarily exist, large-scale labeling is difficult. Most existing work on news analysis requires labeled information, and according to our investigation there is no unsupervised method that directly models and analyzes the entities in news.
Disclosure of Invention
In order to solve the above problems, the present application provides a news entity analysis method and apparatus based on unsupervised learning.
In a first aspect, the present application provides a news entity analysis method based on unsupervised learning, including:
performing word segmentation on each piece of news data in a plurality of pieces of news data to be processed respectively, and labeling a plurality of entities contained in each piece of news after word segmentation to obtain a labeling result;
constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, wherein the distributed representation information is identified as entity vectors;
and according to the distributed representation information of the entities, carrying out clustering analysis on the entities to obtain a clustering result.
Preferably, the unsupervised-learning-based news entity analysis method further comprises:
carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
counting, according to the topic to which each piece of news data belongs, the probability that each of the plurality of entities appears in each of the topics, so as to obtain the topic distribution of each of the plurality of entities over the plurality of pieces of news data.
Preferably, the unsupervised-learning-based news entity analysis method further comprises:
determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented by an average distance of the topic distribution of each entity.
Preferably, the unsupervised-learning-based news entity analysis method further comprises:
and searching according to the seed information to obtain the hidden information related to the plurality of entities in the plurality of news data.
Preferably, the unsupervised-learning-based news entity analysis method further comprises:
constructing relationships between the plurality of entities according to the distributed representations of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of pieces of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
In a second aspect, the present application provides a news entity analysis apparatus based on unsupervised learning, including:
the marking module is used for respectively carrying out word segmentation on each piece of news data in the plurality of pieces of news data to be processed and marking a plurality of entities contained in each piece of news after the word segmentation so as to obtain a marking result;
the acquisition module is used for constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, and the distributed representation information is marked as entity vectors;
and the clustering module is used for carrying out clustering analysis on the entities according to the distributed representation information of the entities so as to obtain a clustering result.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
the news topic acquisition module is used for carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
and the topic distribution acquisition module is used for counting the probability of each entity in the plurality of entities appearing in the topics to which the plurality of news data belong according to the topic to which each piece of news data belongs, so as to obtain the topic distribution of each entity in the plurality of news data.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
and the entity clustering analysis module is used for determining the clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, and the clustering effect is represented by the average distance of the topic distribution of each entity.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
and the seed information searching module is used for searching according to the seed information to obtain the hidden information related to the entities in the news data.
Preferably, the unsupervised learning-based news entity analyzing apparatus further includes:
the community discovery module is used for constructing the relationship among the plurality of entities according to the distributed representation of the plurality of entities and the co-occurrence times of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
According to the method and the device, a distributed idea is introduced into the processing of the news entity, the distributed representation of the entity is obtained through the context of the position of the news entity, and the clustering result of the entity is obtained through clustering analysis of the entity.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic application diagram of the technical solution provided in the embodiment of the present application;
FIG. 2 is a schematic illustration of a method according to an embodiment of the present application;
FIG. 3 is a schematic illustration of a method according to another embodiment of the present application;
FIG. 4 is a schematic illustration of a method according to another embodiment of the present application;
fig. 5 is a schematic diagram of an apparatus according to the technical solution provided in the example of the present application.
Detailed Description
The technical solution provided by the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is an application schematic diagram of a technical scheme provided in an embodiment of the present application. Referring to fig. 1, after news data is obtained, the technical solution can discover the similarity between entities in the news data and the community relationships between those entities, and can perform a seed-information search to find information related to the entities.
Fig. 2 is a schematic method diagram of a technical solution provided in an embodiment of the present application. As shown in fig. 2, the method for analyzing news entities in unsupervised learning provided by the present application includes:
s201: and respectively carrying out word segmentation on each piece of news data in the plurality of pieces of news data to be processed, and labeling a plurality of entities contained in each piece of news after word segmentation to obtain a labeling result.
In some possible embodiments, before word segmentation is performed, the plurality of pieces of acquired news data are preprocessed to obtain a plurality of news texts for word segmentation. News data and news texts correspond one to one; performing word segmentation on a piece of news data specifically means performing word segmentation on its corresponding news text. News data can be acquired by crawling mainstream news websites with a web crawler.
The preprocessing mainly comprises the following steps: converting the Chinese character encoding of the news texts to UTF-8; removing illegal characters from the news texts with regular expressions, keeping only Chinese characters, English words, digits, and common punctuation marks; digit conversion, i.e., uniformly converting all Arabic numerals in the news texts into the simplified Chinese standard writing; punctuation conversion, i.e., uniformly converting half-width characters in the news texts into the corresponding full-width characters; and text-size screening, i.e., deleting noise files whose word count, after punctuation is removed, is smaller than a preset number. It will be appreciated that the specific preprocessing rules depend on the actual news texts; for example, the preset number may be 6 or some other value. The news data together constitute a corpus.
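The cleaning steps above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the character classes kept, the punctuation map, and the length threshold of 6 are assumptions, and the Arabic-numeral conversion step is omitted.

```python
import re

# half-width -> full-width punctuation map (illustrative subset)
HALF_TO_FULL = {",": "，", "?": "？", "!": "！", ":": "：", ";": "；"}

def preprocess(text, min_len=6):
    # keep only Chinese characters, English letters, digits and common punctuation
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9，。？！：；,.?!:;\s]", "", text)
    # convert half-width punctuation to the corresponding full-width characters
    text = "".join(HALF_TO_FULL.get(ch, ch) for ch in text)
    # size screening: drop noise texts shorter than min_len once punctuation is removed
    core = re.sub(r"[，。？！：；,.?!:;\s]", "", text)
    return text if len(core) >= min_len else None
```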
Specifically, a pre-trained model for Chinese word segmentation and named entity recognition can be downloaded and loaded, and word segmentation is performed on the news text corresponding to each piece of news data. The segmentation results are checked, and a dictionary is created for words that are easily mis-segmented and loaded at the segmentation stage, improving segmentation precision. Named entity recognition, i.e., entity labeling, is then performed on the segmented news text to mark the entities in the text, and the entity labeling result of this first pass is stored. The segmented, labeled news text is then searched: when a word is found to match a recognized entity but is not labeled as one, it is re-labeled in the text. In this way, the labeling result is obtained.
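The re-labeling pass described above — tagging segmented words that match an already-recognized entity but were missed by NER — can be sketched as follows; the token/tag data structures and the function name are hypothetical.

```python
def relabel(tokens, tags, known_entities):
    """tokens: segmented words; tags: parallel labels ('O' = untagged);
    known_entities: word -> entity type, from the first NER pass."""
    new_tags = list(tags)
    for i, tok in enumerate(tokens):
        if new_tags[i] == "O" and tok in known_entities:
            new_tags[i] = known_entities[tok]  # re-label the missed mention
    return new_tags
```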
S202: and constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, wherein the distributed representation information is identified as entity vectors.
In a more specific example, based on the labeling result after Chinese word segmentation and entity labeling, a skip-gram model and a GloVe model are trained to obtain the distributed representation of each of the plurality of entities, that is, its entity vector.
The skip-gram model is one of the architectures of word2vec. It is a word-vector representation model in which a given central word $w_i$ is used to predict its possible context $[w_{i-k}, \ldots, w_{i-1}], [w_{i+1}, \ldots, w_{i+k}]$. The probability that a word $w_c$ occurs is denoted $P(w_c \mid w_i)$; in this scheme a softmax function is adopted, so that given $w_i$, the probability that word $w_c$ occurs is expressed as

$$P(w_c \mid w_i) = \frac{\exp(w_c^\top w_i)}{\sum_{j=1}^{|V|} \exp(w_j^\top w_i)}$$

where $P$ is the probability that word $w_c$ occurs, $w_j$ is the word vector of word $j$ in the vocabulary, $V$ is the full vocabulary, $w_c^\top$ is the transpose of the vector representation of word $w_c$, and $w_i$ is the given central word.
The GloVe model is also a word-vectorization model, designed so that the vectors carry as much semantic and syntactic information as possible. In addition, the GloVe model introduces global information by building a word co-occurrence matrix over the whole corpus. Its optimization objective, with a weighting function introduced, is

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})\left(w_i^\top w_j + b_i + b_j - \log X_{ij}\right)^2$$

where $J$ is the loss function to be minimized, $V$ is the full vocabulary, $X_{ij}$ is the number of co-occurrences of word $i$ and word $j$, $f(X_{ij})$ is the weighting function, $w_i^\top$ is the transpose of the vector of word $i$, $w_j$ is the word vector of word $j$, and $b_i$ and $b_j$ are word bias parameters obtained through training.
In this way, the entity vector of each entity can be obtained through the skip-gram and GloVe models.
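As a numeric illustration of the skip-gram softmax above, the following computes $P(w_c \mid w_i)$ for a toy three-word vocabulary with made-up 2-dimensional vectors (training itself is out of scope here):

```python
import math

def softmax_prob(center_vec, context_vec, vocab_vecs):
    # P(w_c | w_i) = exp(w_c . w_i) / sum_j exp(w_j . w_i)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    denom = sum(math.exp(dot(v, center_vec)) for v in vocab_vecs)
    return math.exp(dot(context_vec, center_vec)) / denom

# toy vocabulary of three 2-d word vectors (made up)
vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p = [softmax_prob(vocab[0], v, vocab) for v in vocab]
```

The probabilities over the whole vocabulary sum to 1, as the softmax guarantees.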
In some cases, entity-vector results can be used to find entity recognition errors caused by wrong segmentation: erroneous entities produced by wrong segmentation often have similar entity vectors, because their contexts are similar and they necessarily appear in the same texts. The method therefore uses the entity vectors obtained from model training to judge whether different segmentations denote the same entity. When the character string of one entity recognition result is a substring of another entity recognition result, the two results are considered the same entity if either of the following rules holds:

Rule 1: the cosine similarity of the two entities is greater than 0.80.

Rule 2: the cosine similarity of the two entities is greater than 0.75 and the two appear simultaneously in at least one news text.

Here the cosine similarity is

$$\mathrm{sim}(w_i, w_j) = \frac{w_i \cdot w_j}{\lVert w_i \rVert\,\lVert w_j \rVert}$$

where $w_i$ and $w_j$ are the entity vectors of the two entities. The entity-merging behavior can be adjusted by changing the cosine-similarity thresholds.
For example, if a news text contains a person entity whose name is the four-character string "ABCD", word segmentation may split it into "ABCD", "ABC", "AB", and so on, so segmentation affects the generated entities. Or a news text may contain a person entity and a place entity that are both called "AA", which also affects the generated entities. The two rules above are therefore used to judge whether such results are the same entity.
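The two merging rules can be sketched as a predicate over a candidate pair. The substring precondition and the 0.80/0.75 thresholds follow the text; the function signature and the toy vectors are assumptions.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def same_entity(s1, v1, s2, v2, co_occur_in_one_text):
    if s1 not in s2 and s2 not in s1:   # substring precondition
        return False
    sim = cos_sim(v1, v2)
    # rule 1: sim > 0.80; rule 2: sim > 0.75 and co-occurrence in one text
    return sim > 0.80 or (sim > 0.75 and co_occur_in_one_text)
```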
S203: and according to the distributed representation information of the entities, carrying out clustering analysis on the entities to obtain a clustering result.
In a more specific example, k-means clustering is performed on the entity vectors obtained from the skip-gram model, using cosine similarity between entity vectors as the distance measure, to obtain the clustering result of the entities.
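A minimal k-means with cosine distance, as used in this step, might look like the following; initializing centers with the first k points and running a fixed number of iterations are simplifications of this sketch, not choices stated in the patent.

```python
import math

def cos_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)   # cosine distance = 1 - cosine similarity

def kmeans_cosine(vecs, k, iters=10):
    centers = [list(v) for v in vecs[:k]]           # simplistic init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:                              # assign each vector to nearest center
            j = min(range(k), key=lambda c: cos_dist(v, centers[c]))
            clusters[j].append(v)
        for j, cl in enumerate(clusters):           # recompute centers as cluster means
            if cl:
                centers[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return [min(range(k), key=lambda c: cos_dist(v, centers[c])) for v in vecs]
```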
Fig. 3 is a schematic method diagram of a technical solution provided in another embodiment of the present application. As shown in fig. 3, in some possible embodiments, the method for analyzing news entities for unsupervised learning may further include the following steps:
s204: and carrying out topic clustering according to the respective corresponding labeling result of each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs.
Specifically, a Latent Dirichlet Allocation (LDA) topic model is applied to the labeling result corresponding to each piece of news data; the topic distribution of each piece of news data is obtained by Gibbs sampling, and the topic with the highest probability is then assigned to each piece of news data as the topic to which it belongs.
The LDA model is a three-layer Bayesian network. The per-document topic distribution and the per-topic word distribution follow Dirichlet priors with parameters α and β respectively. For each piece of news data, a topic is generated for each word position from the document's topic distribution, and the word at each position is generated from the word distribution of its topic, finally yielding an order-independent combination of words. The generation probability of the positions of one document is expressed as

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

where $p$ is the probability, $\theta$ is the distribution of topics, $N$ is the number of words, $z$ is the topic, $w$ is the word, and $\alpha$ and $\beta$ are the LDA hyperparameters.

Thus, the probability of generating the words of each news text is expressed as

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta$$

with the same symbols as above.
Then the model parameters are estimated with Gibbs sampling; the optimization goal is to maximize the generation likelihood of the texts in the corpus. After multiple iterations, the samples from the burn-in period are discarded to obtain the parameter estimates.
The input of the algorithm is the word sequence of each text, for example $w = \{w_1, w_2, \ldots, w_M\}$, together with the hyperparameters α and β and the number of topics K. The output is the corresponding sequence of topic assignments, for example $z = \{z_1, z_2, \ldots, z_M\}$, counts approximating the posterior distribution $p(z \mid w, \alpha, \beta)$, and an estimate of the parameter θ.
The algorithm is represented as follows:

S2041: Initialize all elements $n_{mk}$ and $n_{kv}$ of the count matrices and all elements $n_k$ and $n_m$ of the count vectors to 0, where $n_{mk}$ is the number of words belonging to the k-th topic in the text corresponding to the m-th piece of news data, and $n_{kv}$ is the number of times word $v$ belongs to the k-th topic.

S2042: For every word $w_{mn}$ in the m-th text, sample an initial topic, and increment the corresponding text and topic counts $n_{mk}$ and $n_{kv}$ (together with $n_m$ and $n_k$).

S2043: The following operations are performed iteratively until the burn-in period has passed:

S2043a: if the current word is the v-th word and its topic is the k-th topic, decrement the corresponding counts;

S2043b: sample a new topic according to the full conditional distribution

$$p(z_i = k \mid z_{\neg i}, w) \propto \frac{n_{kv} + \beta_v}{\sum_{v=1}^{V}(n_{kv} + \beta_v)} \cdot \frac{n_{mk} + \alpha_k}{\sum_{k=1}^{K}(n_{mk} + \alpha_k)}$$

S2043c: increment the counts for the newly sampled topic and word.

S2044: Using the resulting sample counts, the parameter θ can be calculated as

$$\theta_{mk} = \frac{n_{mk} + \alpha_k}{\sum_{k=1}^{K}(n_{mk} + \alpha_k)}$$
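A worked example of the θ estimate in S2044, assuming a symmetric prior α: given the topic counts $n_{mk}$ of one text, the smoothed topic proportions are computed as follows (the counts and α value are made up).

```python
def estimate_theta(n_mk, alpha):
    # theta_mk = (n_mk + alpha) / sum_k (n_mk + alpha), symmetric alpha assumed
    denom = sum(n_mk) + len(n_mk) * alpha
    return [(n + alpha) / denom for n in n_mk]

theta = estimate_theta([3, 1, 0], alpha=0.5)   # made-up counts for one text
```

Even a topic with zero observed words receives a small nonzero probability from the prior, which is the point of the α smoothing.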
in some possible embodiments, the tagging result corresponding to each news data in the plurality of news data may include a plurality of topic-independent interfering words. In this case, it is necessary to remove the invalid information first.
For example, a labeling result corresponding to a piece of news contains many high-frequency common words such as "of", "hello", and the like, and the scheme adopts a method of removing stop words.
S205: according to the topic to which each piece of news data belongs, the probability of each entity in the plurality of entities appearing in the topic to which the plurality of pieces of news data belong is counted, and the topic distribution of each entity in the plurality of entities in the plurality of pieces of news data is obtained.
Specifically, by counting the frequency with which every entity appears in the news of each topic, the distribution of every entity over news topics can be obtained; the entity topic distribution is represented by a vector of length M, where M is the number of topics.
For example, suppose an entity A appears in 4 news items, of which one belongs to topic $T_0$ and three belong to topic $T_{M-1}$. The topic distribution vector of entity A can then be expressed as

$$\left(\tfrac{1}{4},\, 0,\, \ldots,\, 0,\, \tfrac{3}{4}\right)$$
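The counting in this step can be sketched as follows, reproducing the entity-A example above (the news ids and topic ids are illustrative):

```python
from collections import Counter

def entity_topic_distribution(news_topics, entity_news_ids, num_topics):
    """news_topics: news id -> topic index; entity_news_ids: ids of the
    news items the entity appears in."""
    counts = Counter(news_topics[n] for n in entity_news_ids)
    total = len(entity_news_ids)
    return [counts.get(t, 0) / total for t in range(num_topics)]

# entity A of the example: one news item of topic 0, three of the last topic
dist = entity_topic_distribution({0: 0, 1: 3, 2: 3, 3: 3}, [0, 1, 2, 3], 4)
```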
Fig. 4 is a schematic method diagram of a technical solution provided in another embodiment of the present application, please refer to fig. 4, in some possible embodiments, the method for analyzing news entities in unsupervised learning may further include:
s206: determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented by an average distance of the topic distribution of each entity.
Specifically, according to the obtained topic distribution of the entity and the clustering result of the entity, the clustering effect of the entity can be determined, and the effectiveness of the distributed representation (i.e. entity vector) of the entity is represented by the clustering effect.
The topic distributions of the entities within one cluster are denoted $[x_1, x_2, \ldots, x_n]$. Topic similarity is measured by the average intra-class distance

$$\mathrm{dist} = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d(x_i, x_j)$$

where dist is the average distance, $n$ is the number of entities in the cluster, $d$ is a distance between topic-distribution vectors, and M is the number of topics (the length of each $x_i$).
The smaller the dist value, the higher the topic similarity. In addition, the topic occupying the largest proportion in a cluster is taken as the topic of that entity cluster; the larger that proportion, the more concentrated the topics of the clustering result.
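A sketch of the intra-cluster average distance; the patent does not name the distance function between topic-distribution vectors, so Euclidean distance is assumed here.

```python
import math
from itertools import combinations

def avg_intra_distance(xs):
    """xs: topic-distribution vectors of the entities in one cluster."""
    pairs = list(combinations(xs, 2))
    if not pairs:                      # a singleton cluster has distance 0
        return 0.0
    d = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return sum(d(a, b) for a, b in pairs) / len(pairs)
```

A cluster of entities with near-identical topic distributions scores lower (better) than one mixing unrelated topics.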
In some possible embodiments, referring to fig. 4, the method for analyzing news entities in unsupervised learning may further include:
s207: and searching according to the seed information to obtain the hidden information related to the plurality of entities in the plurality of news data.
Specifically, the seed information is the information used in retrieval. For example, if information related to entity A is desired, entity A is the seed information. Using the entity vectors generated by the skip-gram model, the vector representation of the input seed is compared against the vocabulary, the top-K most similar entities are retrieved, and the portion with higher similarity is kept as the content associated with the input.
Some information may be hidden in the text and not found directly. To discover as much hidden related information as possible, the designed algorithm proceeds as follows:

S2071: Collect all possible recognition results of the seed entity found in the entity-merging stage, and take all of these recognition variants as candidates for the input search.

For example, when searching for an entity named "ABCD", the possibly related strings "ABCD", "ABC", "AB", etc. are used as seeds to obtain the search recognition results of all related entities; these results are the hidden information.

S2072: For each element of the search candidates, if it is an entity, extract the top-50 most similar words in the model dictionary whose similarity exceeds the threshold 0.80, and add them to the keyword list.

S2073: Find all entities in the keyword list, and look up the top-30 most similar words of each such entity in the vocabulary.

S2074: Compute the intersection of a new entity's top-50 words with the existing keyword list, and add the new entity's top-50 words to the list if the intersection contains more than 12 words; if the keyword list contains fewer than 24 words, the overlap is instead required to exceed half the number of words in the list.

S2075: Repeat the operations of S2073 and S2074 until the elements of the list no longer change.

The specific threshold values can be tuned according to the results on different corpora: lower the thresholds to increase recall, and raise them to increase precision.
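The expansion loop S2072–S2075 can be sketched as follows. `similar_words` stands in for the model's top-N similarity lookup; only the overlap rule (more than 12 words, or more than half the list when the list has fewer than 24 words) is taken from the text, everything else is a toy stand-in.

```python
def expand_keywords(seeds, similar_words, max_rounds=10):
    keywords = list(seeds)
    for _ in range(max_rounds):
        added = False
        for ent in list(keywords):
            for cand in similar_words.get(ent, []):
                if cand in keywords:
                    continue
                overlap = len(set(similar_words.get(cand, [])) & set(keywords))
                # >12 overlaps normally; more than half the list if it has <24 words
                threshold = 12 if len(keywords) >= 24 else len(keywords) / 2
                if overlap > threshold:
                    keywords.append(cand)
                    added = True
        if not added:        # S2075: stop once the list no longer changes
            break
    return keywords
```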
In some possible embodiments, referring to fig. 4, the method for analyzing news entities in unsupervised learning may further include:
s208: constructing relationships between the plurality of entities based on the distributed representations of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of news data. Community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
Specifically, the relationships between entities are constructed using the entity distributed representations (i.e., entity vectors) generated by the GloVe model and the co-occurrence counts of the entities in the news texts.
Illustratively, the relationship between two entities may be constructed by the following formulas:

$$\mathrm{sim}(w_1, w_2) = \begin{cases} \cos(w_1, w_2), & \cos(w_1, w_2) > \xi_1 \\ 0, & \text{otherwise} \end{cases}$$

$$\mathrm{rel}'(w_1, w_2) = \min\left(\log_2(\mathrm{co\_times} + 1) \cdot \mathrm{sim}(w_1, w_2)/2,\; \xi_2\right)$$

$$\mathrm{rel}(w_1, w_2) = \begin{cases} \mathrm{rel}'(w_1, w_2), & \mathrm{rel}'(w_1, w_2) > \xi_3 \\ 0, & \text{otherwise} \end{cases}$$

where $w_1$ and $w_2$ are the two entities and co_times is the number of co-occurrences of entities $w_1$ and $w_2$. The more often two entities co-occur and the more similar their entity vectors, the stronger their association. When the similarity is above $\xi_1$, the entities are considered to be associated; $\xi_2$ is the maximum value of the entity relationship; $\xi_3$ removes weak associations so as to prune unnecessary edges and ease analysis of the results.
The thresholds are adjusted according to specific needs: the smaller the thresholds, the more entity relationships are kept, but too small a threshold introduces a large amount of noise, while too large a threshold keeps too few relationships and loses information.
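A numeric sketch of the relation-strength computation, assuming the thresholding described above (no association below $\xi_1$, strength capped at $\xi_2$, weak edges below $\xi_3$ dropped); the ξ values used here are made up for illustration.

```python
import math

def relation(sim, co_times, xi1=0.3, xi2=4.0, xi3=0.5):
    if sim <= xi1:                 # below xi1: entities not associated
        return 0.0
    rel = min(math.log2(co_times + 1) * sim / 2, xi2)   # capped at xi2
    return rel if rel > xi3 else 0.0                    # drop weak edges
```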
After the relationships between entities have been constructed, a community discovery task is performed on the entity relationship graph using the Louvain algorithm. A community is a group with relatively close latent relationships mined from the graph structure formed by the entities and their associations, and can be used to discover groups hidden in the texts. The Louvain algorithm optimizes the modularity of the community division with a greedy strategy. Modularity is an index that measures the quality of a community division without knowing the true division and can measure the degree of association within communities; it is expressed as

$$Q = \frac{1}{2m} \sum_{v,w} \left[A_{vw} - \frac{k_v k_w}{2m}\right] \delta(c_v, c_w)$$

where $Q$ is the modularity, $m$ is the sum of the connection weights in the network, $v$ and $w$ are any two nodes in the entity relationship graph, $A_{vw}$ is the connection weight between the two nodes, $k_v$ and $k_w$ are the degrees of nodes $v$ and $w$, and $\delta(c_v, c_w)$ indicates whether $v$ and $w$ belong to the same community (1 if they do, 0 otherwise).
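The modularity formula can be checked on a toy graph: two disconnected edges with the partition {0, 1} | {2, 3} give Q = 0.5, a standard worked case (unweighted edges assumed here).

```python
def modularity(edges, community):
    m = len(edges)                       # total edge weight (unweighted)
    deg = {}
    for v, w in edges:                   # node degrees k_v
        deg[v] = deg.get(v, 0) + 1
        deg[w] = deg.get(w, 0) + 1
    q = 0.0
    for v in deg:
        for w in deg:
            a_vw = 1.0 if (v, w) in edges or (w, v) in edges else 0.0
            if community[v] == community[w]:      # delta(c_v, c_w)
                q += a_vw - deg[v] * deg[w] / (2 * m)
    return q / (2 * m)

Q = modularity([(0, 1), (2, 3)], {0: 0, 1: 0, 2: 1, 3: 1})
```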
In some possible embodiments, the Louvain algorithm may also be optimized, and the optimization may be divided into two steps:
1) assign each node to its own community and compute the current modularity; for each node, compute the modularity gain of merging it with the community of each of its neighbors, merge it where the gain is largest, and repeat several times until the modularity no longer increases;

2) collapse each new community into a single node and repeat the operations of step 1) until convergence.
By extracting pairs of entities from the same community of the community-division result and comparing their relationship strength with that of randomly paired entities, the effectiveness of the community discovery task can be demonstrated.
Fig. 5 is a schematic diagram of an apparatus according to an embodiment of the present application. As shown in fig. 5, the unsupervised learning-based news entity analysis apparatus may include:
a labeling module 501, configured to perform word segmentation on each piece of news data to be processed, and label a plurality of entities contained in each piece of news after the word segmentation, to obtain a labeling result;
an obtaining module 502, configured to construct a distributed representation model based on the labeling result, to obtain distributed representation information of the plurality of entities, where the distributed representation information is denoted as entity vectors;
a clustering module 503, configured to perform clustering analysis on the plurality of entities according to the distributed representation information of the plurality of entities, to obtain a clustering result.
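As one possible instantiation of the clustering module, the entity vectors can be grouped with a simple k-means procedure. The following minimal sketch (toy data, hypothetical parameters) stands in for whatever clustering algorithm a given embodiment actually uses.

```python
# Minimal k-means over entity vectors (tuples of floats).
import math, random

def kmeans(vectors, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(vectors, k)          # init centers from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:                        # assign to nearest center
            i = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[i].append(v)
        centers = [                              # recompute centroids
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters, centers

# two well-separated groups of toy entity vectors
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
clusters, centers = kmeans(vectors, k=2)
```

On this data the two tight groups are separated into one cluster each, which is the behavior the clustering module relies on for entity vectors.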
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
a news topic acquisition module 504, configured to perform topic clustering according to the labeling result corresponding to each piece of news data in the plurality of pieces of news data, and acquire the topic to which each piece of news data belongs;
a topic distribution obtaining module 505, configured to count, according to the topic to which each piece of news data belongs, the probability that each entity in the plurality of entities appears in each topic, so as to obtain the topic distribution of each entity over the plurality of pieces of news data.
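The counting performed by the topic distribution obtaining module can be sketched as follows, assuming each news item already carries a topic label and its labeled entities; the data below is an illustrative toy example.

```python
# entity -> normalized distribution over the topics it appears in
from collections import Counter, defaultdict

news = [  # hypothetical labeled data: (topic, entities in the article)
    ("finance", ["bank", "stock"]),
    ("finance", ["bank"]),
    ("sports",  ["team", "bank"]),
]

counts = defaultdict(Counter)           # entity -> topic -> occurrences
for topic, entities in news:
    for ent in set(entities):           # count one appearance per article
        counts[ent][topic] += 1

topic_dist = {ent: {t: c / sum(tc.values()) for t, c in tc.items()}
              for ent, tc in counts.items()}
```

Here "bank" appears in two finance articles and one sports article, giving the distribution {finance: 2/3, sports: 1/3}.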
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
an entity cluster analysis module 506, configured to determine the clustering effect of the clustering result according to the topic distributions of the plurality of entities and the clustering result, where the clustering effect is represented by the average distance between the topic distributions of the entities within each cluster.
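A minimal sketch of that average-distance score is given below, using Euclidean distance between topic distributions represented as dicts; the distributions and cluster assignments are illustrative only.

```python
# Score a clustering by the average pairwise distance between the topic
# distributions of entities in the same cluster (smaller = tighter clusters).
import itertools, math

def l2(p, q):
    topics = set(p) | set(q)
    return math.sqrt(sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in topics))

topic_dist = {
    "bank":  {"finance": 1.0},
    "stock": {"finance": 1.0},
    "team":  {"sports": 1.0},
}
clusters = [["bank", "stock"], ["team"]]

pair_dists = [l2(topic_dist[u], topic_dist[v])
              for c in clusters
              for u, v in itertools.combinations(c, 2)]
avg_dist = sum(pair_dists) / len(pair_dists) if pair_dists else 0.0
```

Entities with identical topic distributions placed in the same cluster give an average distance of 0, the best possible score under this measure.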
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
a seed information searching module 507, configured to search according to seed information to obtain implicit information related to the plurality of entities in the plurality of pieces of news data.
In some possible embodiments, the unsupervised learning-based news entity analysis apparatus further includes:
a community discovery module 508, configured to construct relationships between the plurality of entities according to the distributed representations of the plurality of entities and the numbers of co-occurrences of the plurality of entities in the plurality of pieces of news data, and to discover, using a community discovery algorithm, community structures that exist in the relationships between the plurality of entities.
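The co-occurrence counting used when building the entity relations can be sketched as follows; the combination with entity-vector similarity is left out, and the articles are toy data.

```python
# Count how often each entity pair co-occurs in the same news item,
# producing weighted edges for the entity relationship graph.
from collections import Counter
from itertools import combinations

articles = [  # entities labeled per news item (illustrative)
    ["bank", "stock", "team"],
    ["bank", "stock"],
    ["team", "coach"],
]

cooc = Counter()
for ents in articles:
    for u, v in combinations(sorted(set(ents)), 2):
        cooc[(u, v)] += 1

# edges weighted by co-occurrence count; an embodiment could additionally
# mix in the similarity of the two entity vectors
edges = [(u, v, w) for (u, v), w in cooc.items()]
```

These weighted edges are the input on which the community discovery algorithm described above operates.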
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A news entity analysis method based on unsupervised learning is characterized by comprising the following steps:
performing word segmentation on each piece of news data in a plurality of pieces of news data to be processed respectively, and labeling a plurality of entities contained in each piece of news after word segmentation to obtain a labeling result;
constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, wherein the distributed representation information is identified as entity vectors;
and according to the distributed representation information of the entities, carrying out clustering analysis on the entities to obtain a clustering result.
2. The method of claim 1, further comprising:
carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
according to the topic to which each piece of news data belongs, the probability of each entity in the plurality of entities appearing in the topic to which the plurality of pieces of news data belong is counted, and the topic distribution of each entity in the plurality of entities in the plurality of pieces of news data is obtained.
3. The method of claim 2, further comprising:
determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented by an average distance of the topic distribution of each entity.
4. The method of claim 1, further comprising:
and searching according to the seed information to obtain the hidden information related to the plurality of entities in the plurality of news data.
5. The method of claim 1, further comprising:
constructing relationships between the plurality of entities according to the distributed representations of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of pieces of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.
6. A news entity analysis apparatus based on unsupervised learning, comprising:
the marking module is used for respectively carrying out word segmentation on each piece of news data in the plurality of pieces of news data to be processed and marking a plurality of entities contained in each piece of news after the word segmentation so as to obtain a marking result;
the acquisition module is used for constructing a distributed representation model based on the labeling result to obtain distributed representation information of the entities, and the distributed representation information is marked as entity vectors;
and the clustering module is used for carrying out clustering analysis on the entities according to the distributed representation information of the entities so as to obtain a clustering result.
7. The apparatus of claim 6, further comprising:
the news topic acquisition module is used for carrying out topic clustering according to the labeling result corresponding to each piece of news data to obtain a topic clustering result and obtain the topic to which each piece of news data belongs;
and the topic distribution acquisition module is used for counting the probability of each entity in the plurality of entities appearing in the topics to which the plurality of news data belong according to the topic to which each piece of news data belongs, so as to obtain the topic distribution of each entity in the plurality of news data.
8. The apparatus of claim 6, further comprising:
and the entity clustering analysis module is used for determining the clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, and the clustering effect is represented by the average distance of the topic distribution of each entity.
9. The apparatus of claim 6, further comprising:
and the seed information searching module is used for searching according to the seed information to obtain the hidden information related to the entities in the news data.
10. The apparatus of claim 6, further comprising:
the community discovery module is used for constructing the relationship among the plurality of entities according to the distributed representation of the plurality of entities and the co-occurrence times of the plurality of entities in the plurality of news data; community structures that exist in the relationships between the plurality of entities are discovered using a community discovery algorithm.

Publications (2)

Publication Number and Publication Date:
CN113420112A: 2021-09-21 (application publication)
CN113420112B: 2025-02-18 (grant publication)

Family ID: 77789464


Citations (8)

* Cited by examiner, † Cited by third party
CN104881401A*, priority 2015-05-27, published 2015-09-02, Dalian University of Technology: Patent literature clustering method
CN108334628A*, priority 2018-02-23, published 2018-07-27, Beijing Dongrun Huanneng Technology Co., Ltd.: Method, apparatus, device and storage medium for media event clustering
CN108829799A*, priority 2018-06-05, published 2018-11-16, People's Public Security University of China: Text similarity computing method and system based on improved LDA topic model
CN109726394A*, priority 2018-12-18, published 2019-05-07, University of Electronic Science and Technology of China: Short text topic clustering method based on fusion BTM model
CN109739978A*, priority 2018-12-11, published 2019-05-10, Zhongke Hengyun Co., Ltd.: Text clustering method, text clustering device and terminal device
CN109766437A*, priority 2018-12-07, published 2019-05-17, Zhongke Hengyun Co., Ltd.: Text clustering method, text clustering device and terminal device
CN110297988A*, priority 2019-07-06, published 2019-10-01, Sichuan University: Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN111708880A*, priority 2020-05-12, published 2020-09-25, Beijing Mininglamp Software System Co., Ltd.: System and method for identifying class clusters




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
