CN113420112B - A news entity analysis method and device based on unsupervised learning - Google Patents

A news entity analysis method and device based on unsupervised learning

Info

Publication number
CN113420112B
CN113420112B (application CN202110685518.3A)
Authority
CN
China
Prior art keywords
entities
entity
news
news data
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110685518.3A
Other languages
Chinese (zh)
Other versions
CN113420112A (en)
Inventor
周军
张震
杨家豪
沈亮
张鹏远
王立强
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN202110685518.3A
Publication of CN113420112A
Application granted
Publication of CN113420112B
Legal status: Active

Abstract

Translated from Chinese

The present invention relates to a news entity analysis method and device based on unsupervised learning. The method comprises: performing word segmentation processing on each of the multiple news data to be processed, and labeling the multiple entities contained in each news after the word segmentation processing to obtain labeling results; constructing a distributed representation model based on the labeling results to obtain distributed representation information of the multiple entities, and the distributed representation information is identified as an entity vector; and performing cluster analysis on the multiple entities according to the distributed representation information of the multiple entities to obtain clustering results. The present application introduces the idea of distribution into the processing of news entities, obtains the distributed representation of entities through the context of the location of the news entities, and obtains the clustering results of entities through cluster analysis of the entities.

Description

News entity analysis method and device based on unsupervised learning
Technical Field
The application relates to the field of text information mining, in particular to a news entity analysis method and device based on unsupervised learning.
Background
News is an important source of open-source information: it is easy to acquire, spreads widely, and is timely. News analysis has therefore long been a hot spot in text analysis and mining, and many studies target news text analysis.
Entities in news can be extracted with mature tools for Chinese named entity recognition. However, because the entities involved are diverse in kind, cover a wide range, and do not necessarily have associated terminology, large-scale labeling is difficult. Most existing work on news analysis requires labeled information, and no unsupervised method applied directly to modeling and analyzing entities in news has been reported.
Disclosure of Invention
In order to solve the problems, the application provides a news entity analysis method and device based on unsupervised learning.
In a first aspect, the present application provides a news entity analysis method based on unsupervised learning, including:
Respectively performing word segmentation on each piece of news data in a plurality of pieces of news data to be processed, and labeling a plurality of entities contained in each piece of news after the word segmentation to obtain labeling results;
Constructing a distributed representation model based on the labeling result to obtain distributed representation information of the plurality of entities, wherein the distributed representation information is marked as an entity vector;
and carrying out cluster analysis on the entities according to the distributed representation information of the entities to obtain a cluster result.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
Performing topic clustering according to the labeling results corresponding to each piece of news data to obtain topic clustering results and obtaining topics to which each piece of news data belongs;
and counting the occurrence probability of each entity in the plurality of entities in the topics of the plurality of news data according to the topics of each news data, and obtaining the topic distribution of each entity in the plurality of news data.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
And determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented through the average distance of the topic distribution of each entity.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
Searching according to the seed information to obtain implicit information related to the entities in the news data.
Preferably, the news entity analysis method based on unsupervised learning further comprises:
Constructing a relationship between the plurality of entities according to the distributed representation of the plurality of entities and the number of co-occurrences of the plurality of entities in the plurality of pieces of news data; a community structure that exists in a relationship between the plurality of entities is discovered using a community discovery algorithm.
In a second aspect, the present application provides a news entity analysis device based on unsupervised learning, including:
The marking module is used for respectively carrying out word segmentation on each piece of news data in the pieces of news data to be processed, and marking a plurality of entities contained in each piece of news after the word segmentation so as to obtain marking results;
The acquisition module is used for constructing a distributed representation model based on the labeling result to obtain distributed representation information of the plurality of entities, and the distributed representation information is marked as an entity vector;
and the clustering module is used for carrying out clustering analysis on the entities according to the distributed representation information of the entities so as to obtain a clustering result.
Preferably, the news entity analyzing apparatus based on the unsupervised learning further includes:
The news topic acquisition module is used for carrying out topic clustering according to the labeling results corresponding to each piece of news data to obtain topic clustering results and obtaining topics to which each piece of news data belongs;
the topic distribution acquisition module is used for counting the probability of each entity in the plurality of entities in the topics of the plurality of news data according to the topics of each news data, so as to obtain the topic distribution of each entity in the plurality of news data.
Preferably, the news entity analyzing apparatus based on the unsupervised learning further includes:
And the entity cluster analysis module is used for determining a cluster effect of the cluster result through the topic distribution of each entity in the plurality of entities and the cluster result, wherein the cluster effect is represented through the average distance of the topic distribution of each entity.
Preferably, the news entity analyzing apparatus based on the unsupervised learning further includes:
and the seed information searching module is used for searching according to the seed information to obtain implicit information related to the entities in the news data.
Preferably, the news entity analyzing apparatus based on the unsupervised learning further includes:
The community discovery module is used for constructing a relation among the entities according to the distributed representation of the entities and the co-occurrence times of the entities in the news data, and discovering a community structure existing in the relation among the entities by using a community discovery algorithm.
The application introduces the distributed ideas into the processing of news entities, obtains the distributed representation of the entities through the context of the positions of the news entities, and obtains the clustering result of the entities through the clustering analysis of the entities.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present description, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is an application schematic diagram of a technical solution provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a method according to the technical solution provided in the embodiment of the present application;
FIG. 3 is a schematic diagram of a method according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a method according to another embodiment of the present application;
fig. 5 is a schematic diagram of an apparatus according to the technical solution provided in the embodiment of the present application.
Detailed Description
The technical scheme provided by the invention is further described in detail below with reference to the accompanying drawings and the embodiments.
Fig. 1 is an application schematic diagram of the technical solution provided in the embodiment of the present application. Referring to fig. 1, after obtaining news data, through the technical scheme, similarities among entities in the news data and community relations of the entities can be found, seed information search can be performed, and information related to the entities can be found.
Fig. 2 is a schematic diagram of the method of the technical scheme provided in the embodiment of the present application. As shown in fig. 2, the news entity analysis method based on unsupervised learning provided by the application comprises the following steps:
S201, word segmentation is carried out on each piece of news data in the pieces of news data to be processed, and a plurality of entities contained in each piece of news after the word segmentation are marked to obtain marked results.
In some possible embodiments, before word segmentation is performed on each piece of news data in the pieces of news data to be processed, the acquired pieces of news data are preprocessed to obtain a plurality of news texts for segmentation; that is, word segmentation is performed on the news text corresponding to each piece of news data. The news data may be acquired by crawling mainstream news websites with a web crawler.
The preprocessing mainly comprises: converting the Chinese character encoding of the texts of the plurality of pieces of news data into utf-8; removing illegal characters from the news texts with regular expressions, retaining only Chinese characters, English words, digits, and common punctuation marks; normalizing numbers, uniformly converting all Arabic numerals in the news text files into the standard simplified Chinese written form; normalizing punctuation, uniformly converting half-width characters in the news texts into the corresponding full-width characters; and screening the news texts by size, deleting noise files whose word count, after punctuation is removed, is smaller than a preset number. It will be appreciated that the specific preprocessing rules depend on the actual news text; for example, the predetermined number may be 6 or some other value. The news data form a corpus.
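As a concrete illustration, the preprocessing steps above can be sketched as follows; the function name, the regular expressions, the per-digit number conversion, and the punctuation table are illustrative assumptions rather than the exact rules of the embodiment:

```python
import re

def preprocess(text: str, min_words: int = 6):
    """Clean one news text per the rules above; returns None for noise files."""
    # Keep only Chinese characters, English words, digits, and common punctuation.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9，。！？、；：“”‘’（）,.!?;:()\s]", "", text)
    # Naive per-digit conversion of Arabic numerals to simplified Chinese.
    text = text.translate(str.maketrans("0123456789", "零一二三四五六七八九"))
    # Convert half-width ASCII punctuation to the corresponding full-width form.
    half_to_full = {",": "，", "!": "！", "?": "？", ";": "；", ":": "："}
    for h, f in half_to_full.items():
        text = text.replace(h, f)
    # Size screening: drop texts whose character count, with punctuation and
    # whitespace removed, falls below the preset threshold.
    content = re.sub(r"[，。！？、；：“”‘’（）.()\s]", "", text)
    return text if len(content) >= min_words else None
```

A cleaned text is returned for further segmentation; noise files yield `None`.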
Specifically, a pre-trained Chinese word segmentation and named entity recognition model can be downloaded, loaded, and used to segment the news text corresponding to each piece of news data. The segmentation results are checked, a dictionary is created for words in the news texts that are easily segmented incorrectly, and the dictionary is loaded in the segmentation stage to improve segmentation precision. Named entity recognition, i.e., entity labeling, is then performed on the segmented news texts, marking the corresponding entities in the text. The entity labels produced by this first pass of named entity labeling are stored. The segmented and labeled news texts are then searched and matched: when a word is found that matches a recognized entity but is not labeled as an entity, it is re-labeled in the text. The labeling result is thus obtained.
S202, constructing a distributed representation model based on the labeling result to obtain distributed representation information of the plurality of entities, wherein the distributed representation information is identified as an entity vector.
In a more specific example, based on the labeling results after Chinese word segmentation and entity labeling, a skip-gram model and a GloVe model are trained to obtain the distributed representation of each entity in the plurality of entities, that is, its entity vector.
The skip-gram model is the word2vec variant in which the possible context words [wi−k, …, wi−1], [wi+1, …, wi+k] are predicted from a given center word wi. Using the softmax function, the probability of occurrence of a context word wc given wi is expressed as:

P(wc | wi) = exp(wc^T · wi) / Σ_{j=1}^{|V|} exp(wj^T · wi)

where P represents the probability of occurrence of the word wc, wj represents the word vector of word j in the vocabulary, V represents the total vocabulary, wc^T represents the transpose of the vector of wc, and wi represents the given center word.
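The softmax expression above can be checked numerically with a toy vocabulary; the three word vectors below are illustrative:

```python
import math

def skipgram_prob(center, context, vectors):
    """P(w_c | w_i): softmax over dot products, as in the equation above."""
    wi = vectors[center]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    denom = sum(math.exp(dot(wj, wi)) for wj in vectors.values())
    return math.exp(dot(vectors[context], wi)) / denom

# Illustrative 2-dimensional vectors: "首都" (capital) is close to "北京"
# (Beijing), "香蕉" (banana) is not.
vecs = {"北京": [0.9, 0.1], "首都": [0.8, 0.2], "香蕉": [-0.7, 0.3]}
p = skipgram_prob("北京", "首都", vecs)
```

Summing the probability over the whole vocabulary yields 1, and semantically related words receive higher probability than unrelated ones.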
The GloVe model is also a vectorized representation model of words that aims to make the vectors contain as much semantic and grammatical information as possible. In addition, the GloVe model introduces global information by building a word co-occurrence matrix over the whole corpus. With the weighting function introduced, the optimization objective is expressed as:

J = Σ_{i,j=1}^{|V|} f(Xij) · (wi^T · wj + bi + bj − log Xij)²

where J represents the loss function to be minimized, V represents the total vocabulary, Xij represents the number of times word i and word j co-occur, f(Xij) represents the weight function, wi^T represents the transpose of the vector of word i, wj represents the vector of word j, and bi and bj represent the bias parameters. The word bias parameters are obtained through training.
Thus, the entity vector of each entity can be obtained through the skip-gram and GloVe models.
In some cases, entity vectors can reveal entity recognition errors caused by incorrect segmentation: entities produced by wrong segmentations tend to have similar entity vectors, because their contexts are similar and they are likely to occur in the same piece of text. The invention uses the entity vectors produced by model training to judge whether different segmentations refer to the same entity. When the character string of one entity recognition result is a substring of another, the two results are considered the same entity if either of the following two rules is satisfied:
Rule 1: the cosine similarity of the two entities is > 0.80.
Rule 2: the cosine similarity of the two entities is > 0.75 and the two appear in at least one news text at the same time.
where cosine similarity is expressed as:

sim(wi, wj) = (wi · wj) / (||wi|| · ||wj||)

and wi and wj represent the entity vectors of the two entities. The decision on whether entities are identical may be adjusted by changing the cosine similarity thresholds.
For example, suppose a four-character name "ABCD" is an entity in a news text. During word segmentation it may be split into "ABCD", "ABC", "AB", and so on, so segmentation affects the generated entities. Or a news text may contain a person name and a place name that are both abbreviated as "AA", which likewise affects the generated entities. The two rules above therefore decide whether such results are the same entity.
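The two merging rules can be sketched as follows; the substring precondition comes from the text above, and the vectors and co-occurrence table used in testing are illustrative:

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two vectors, as in the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def same_entity(e1, e2, vec, cooccur):
    """Apply Rule 1 / Rule 2; only candidates where one recognition result
    is a substring of the other are considered."""
    if e1 not in e2 and e2 not in e1:
        return False
    sim = cos_sim(vec[e1], vec[e2])
    if sim > 0.80:                                              # Rule 1
        return True
    return sim > 0.75 and cooccur.get(frozenset((e1, e2)), 0) > 0  # Rule 2
```

`cooccur` maps an unordered entity pair to the number of news texts in which both appear.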
S203, carrying out cluster analysis on the entities according to the distributed representation information of the entities to obtain a cluster result.
In a more specific example, according to the distributed representation information of the entities obtained through the skip-gram model, that is, the entity vectors of the entities, k-means clustering is performed with cosine similarity between entity vectors as the distance metric, obtaining the clustering result of the entities.
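A minimal sketch of k-means with cosine similarity as the distance measure: normalizing all vectors to unit length makes maximizing the dot product equivalent to minimizing cosine distance. The implementation below is a simplified illustration, not the embodiment's exact procedure:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def kmeans_cosine(vectors, k, iters=50, seed=0):
    """k-means on unit-normalized vectors; assignment by maximum dot product."""
    random.seed(seed)
    pts = [normalize(v) for v in vectors]
    centers = random.sample(pts, k)
    labels = [0] * len(pts)
    for _ in range(iters):
        # Assign each point to the center with the largest cosine similarity.
        labels = [max(range(k),
                      key=lambda c: sum(a * b for a, b in zip(p, centers[c])))
                  for p in pts]
        # Recompute each center as the normalized mean of its members.
        for c in range(k):
            members = [p for p, l in zip(pts, labels) if l == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[c] = normalize(mean)
    return labels
```

On entity vectors, points whose contexts were similar end up in the same cluster.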
Fig. 3 is a schematic diagram of a method according to another embodiment of the present application. As shown in fig. 3, in some possible embodiments, the news entity analysis method based on unsupervised learning may further include the following steps:
And S204, performing topic clustering according to the labeling results corresponding to each piece of news data to obtain topic clustering results and obtaining topics to which each piece of news data belongs.
Specifically, the labeling results corresponding to each piece of news data are clustered with a Latent Dirichlet Allocation (LDA) topic model, the topic distribution of each piece of news data is obtained by Gibbs sampling, and the topic with the highest probability is then assigned to each piece of news data as the topic to which it belongs.
The LDA model is a three-layer Bayesian network. It places Dirichlet priors with parameters α and β on the topic distribution of each piece of news data and on the word distribution of each topic respectively, generates the topic of each word position in a news text from the document's topic distribution, and generates the word at each position from that topic's word distribution, finally yielding an order-independent combination of words. The generation probability is expressed as:

p(θ, z, w | α, β) = p(θ | α) · Π_{n=1}^{N} p(zn | θ) · p(wn | zn, β)

where p represents probability, θ represents the topic distribution, N represents the number of words, z represents the topics, w represents the words, and α and β represent the hyperparameters of LDA.
Thus, the probability of the words in each news text is expressed as:

p(w | α, β) = ∫ p(θ | α) · Π_{n=1}^{N} Σ_{zn} p(zn | θ) · p(wn | zn, β) dθ

where p represents probability, w represents the words, α and β represent the hyperparameters of LDA, θ represents the topic distribution, N represents the number of words, and z represents the topics.
Then, the model parameters are estimated with Gibbs sampling; the optimization target is to maximize the generation likelihood of the corpus. After multiple iterations, the samples from the burn-in period are discarded to obtain the parameter estimates.
The algorithm input is a text word sequence, e.g. w = {w1, w2, …, wM}; the input parameters are the hyperparameters α and β and the number of topics K. The output is the text topic sequence, e.g. z = {z1, z2, …, zM}, together with the posterior probability distribution p(z | w, α, β) and the estimate of the parameter θ.
The algorithm is expressed as follows:
S2041: initialize all elements nmk and nkv of the count matrices and all elements nk and nm of the count vectors to 0. Here nmk represents the number of words in the text of the m-th news item that belong to the k-th topic, and nkv represents the number of times word v belongs to the k-th topic.
S2042: for all words wmn in the m-th text, sample a topic and increment the corresponding text and topic counts nmk and nkv.
S2043: repeat the following operations until after the burn-in period:
S2043a: if the current word is the v-th word and its topic is the k-th topic, decrement the counts;
S2043b: sample a new topic according to the full conditional distribution:

p(zi = k | z−i, w) ∝ (nmk + αk) · (nkv + βv) / (nk + Σv βv)

S2043c: increment the counts for the newly sampled topic.
S2044: using the obtained sample counts, the parameter θ can be calculated as:

θmk = (nmk + αk) / Σ_{k=1}^{K} (nmk + αk)
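Steps S2041 to S2044 can be sketched as a minimal collapsed Gibbs sampler; the symmetric hyperparameter values and the fixed iteration count standing in for the burn-in test are illustrative assumptions:

```python
import random

def lda_gibbs(texts, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA following S2041-S2044."""
    random.seed(seed)
    vocab = sorted({w for t in texts for w in t})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    # S2041: count matrices and vectors initialized to zero.
    nmk = [[0] * K for _ in texts]     # words of topic k in text m
    nkv = [[0] * V for _ in range(K)]  # word v assigned to topic k
    nk = [0] * K
    z = []
    # S2042: sample an initial topic for every word, updating the counts.
    for m, t in enumerate(texts):
        zm = []
        for w in t:
            k = random.randrange(K)
            zm.append(k)
            nmk[m][k] += 1; nkv[k][wid[w]] += 1; nk[k] += 1
        z.append(zm)
    # S2043: decrement, sample from the full conditional, increment.
    for _ in range(iters):
        for m, t in enumerate(texts):
            for n, w in enumerate(t):
                k = z[m][n]; v = wid[w]
                nmk[m][k] -= 1; nkv[k][v] -= 1; nk[k] -= 1       # S2043a
                weights = [(nmk[m][j] + alpha) * (nkv[j][v] + beta)
                           / (nk[j] + V * beta) for j in range(K)]  # S2043b
                k = random.choices(range(K), weights=weights)[0]
                z[m][n] = k
                nmk[m][k] += 1; nkv[k][v] += 1; nk[k] += 1       # S2043c
    # S2044: estimate theta from the sample counts.
    return [[(nmk[m][k] + alpha) / (len(texts[m]) + K * alpha)
             for k in range(K)] for m in range(len(texts))]
```

Each row of the returned θ is the topic distribution of one text and sums to 1.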
In some possible embodiments, the labeling result corresponding to each piece of news data may include many interfering words irrelevant to any topic. In this case, the invalid information must be removed first.
For example, if the labeling result corresponding to a news item includes many high-frequency common words such as "hello", the scheme removes them as stop words.
And S205, counting the occurrence probability of each entity in the plurality of entities in the topics of the plurality of news data according to the topic to which each piece of news data belongs, obtaining the topic distribution of each entity over the plurality of news data.
Specifically, by counting the occurrence frequency of each entity in the news of each topic, the distribution of the entities over the news topics can be obtained; the entity topic distribution is represented by a vector of length M.
For example, suppose an entity a appears in 4 news items, one belonging to topic T0 and three belonging to topic TM−1. The topic distribution vector of entity a can then be expressed as [1/4, 0, …, 0, 3/4].
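The counting described above can be sketched as follows; the data layout (one topic id per news item and one entity set per news item) is an illustrative assumption:

```python
from collections import Counter

def entity_topic_distribution(entity, news_topics, news_entities, M):
    """Length-M vector: fraction of the entity's news occurrences per topic.

    news_topics[i]   -> topic id of the i-th news item
    news_entities[i] -> set of entities appearing in the i-th news item
    """
    topics = [news_topics[i] for i, ents in enumerate(news_entities)
              if entity in ents]
    counts = Counter(topics)
    return [counts.get(t, 0) / len(topics) for t in range(M)]
```

With the example above (entity a in 4 news items: one of topic T0, three of topic TM−1), the function returns [0.25, 0, …, 0, 0.75].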
Fig. 4 is a schematic diagram of a method according to another embodiment of the present application. Referring to fig. 4, in some possible embodiments, the news entity analysis method based on unsupervised learning may further include:
S206, determining a clustering effect of the clustering result through the topic distribution of each entity in the plurality of entities and the clustering result, wherein the clustering effect is represented through the average distance of the topic distribution of each entity.
Specifically, according to the obtained topic distribution of the entity and the clustering result of the entity, the clustering effect of the entity can be determined, and the effectiveness of the distributed representation (i.e. entity vector) of the entity is represented by the clustering effect.
The topic distributions of the entities in a cluster are denoted [x1, x2, …, xn], each an M-dimensional vector over the topics. The similarity of the topics is measured by the average pairwise distance within the class:

dist = 2 / (n(n−1)) · Σ_{i<j} ||xi − xj||

where dist represents the average distance and M represents the number of topics (the dimension of each xi).
The smaller the dist value, the higher the topic similarity. In addition, the topic with the largest proportion in a cluster is taken as the topic of that entity clustering result; the larger this proportion, the more concentrated the cluster's topics.
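Assuming the intra-class metric is the mean pairwise Euclidean distance between the topic-distribution vectors (one plausible reading of the average-distance measure above), it can be computed as:

```python
import math
from itertools import combinations

def avg_intra_distance(dists):
    """Mean pairwise Euclidean distance between topic-distribution vectors."""
    pairs = list(combinations(dists, 2))
    if not pairs:
        return 0.0
    d = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(d(a, b) for a, b in pairs) / len(pairs)
```

Identical distributions give dist = 0; the more the entities of a cluster spread across topics, the larger dist becomes.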
In some possible embodiments, referring to fig. 4, the news entity analysis method based on unsupervised learning may further include:
And S207, searching according to the seed information to obtain hidden information related to the entities in the news data.
Specifically, the seed information is the information used in retrieval. For example, to obtain information related to entity a, entity a is the seed information. Using the entity vectors generated by the skip-gram model, the vector representation of the input seed is looked up, and among the top-K most similar entities in the vocabulary, those with sufficiently high similarity are kept as the associated content of the input.
Some information does not appear in the text literally. To find such hidden related information as far as possible, the designed algorithm proceeds as follows:
s2071, searching all possible recognition results found by the seed entity in the entity merging stage, and taking all possible recognition conditions as candidates for input search.
For example, when an entity named "ABCD" is searched, the possibly related forms "ABCD", "ABC", "AB", and so on are all used as seeds, yielding the search recognition results of all related entities; these results are the implicit information.
S2072: for each element among the search candidates, if it is an entity, extract the Top-50 most similar words in the model dictionary whose similarity exceeds the threshold 0.80, and add them to the keyword list.
S2073: find all entities in the keyword list, and for each such entity retrieve the entities among its Top-30 most similar words.
S2074: compute the intersection of the new entity's Top-50 words with the existing keyword list. If the overlap is larger than 12 words, add the new entity's words to the list; if the keyword list contains fewer than 24 words, add them instead when the overlap is larger than half the number of words in the list.
S2075: repeat S2073 and S2074 until the elements in the list no longer change.
The specific threshold values can be tuned according to the results on different corpora: lower the thresholds to increase recall, raise them to increase precision.
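Steps S2071 to S2075 can be sketched as the iterative expansion below; the `similar` and `is_entity` callables stand in for the model dictionary and the entity recognizer, and the interpretation of the S2074 overlap rule is an assumption:

```python
def expand_keywords(seed_candidates, similar, is_entity, max_rounds=10):
    """Iterative keyword expansion following S2071-S2075 (simplified sketch).

    similar(word, n) -> top-n (word, score) pairs from the model dictionary
    is_entity(word)  -> whether a word is a recognised entity
    """
    keywords = list(seed_candidates)
    # S2072: seed the keyword list with highly similar words of seed entities.
    for w in seed_candidates:
        if is_entity(w):
            keywords += [s for s, sc in similar(w, 50)
                         if sc > 0.80 and s not in keywords]
    for _ in range(max_rounds):          # S2075: repeat until stable
        changed = False
        # S2073: examine entities reachable from the current keyword list.
        for w in list(keywords):
            for cand, _ in similar(w, 30):
                if not is_entity(cand) or cand in keywords:
                    continue
                # S2074: admit the candidate if its Top-50 neighbourhood
                # overlaps the keyword list strongly enough.
                top50 = {s for s, _ in similar(cand, 50)}
                overlap = len(top50 & set(keywords))
                need = 12 if len(keywords) >= 24 else len(keywords) / 2
                if overlap > need:
                    keywords.append(cand)
                    changed = True
        if not changed:
            break
    return keywords
```

The loop terminates once a full pass adds nothing new, mirroring S2075.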
In some possible embodiments, referring to fig. 4, the news entity analysis method based on unsupervised learning may further include:
and S208, constructing a relation among the entities according to the distributed representation of the entities and the co-occurrence times of the entities in the news data. A community structure that exists in a relationship between the plurality of entities is discovered using a community discovery algorithm.
Specifically, the relationships between entities are constructed using the entity distributed representations (i.e., entity vectors) generated by the GloVe model and the number of co-occurrences of the entities in the news texts.
By way of example, the relationship between two entities may be constructed by the following formula:
rel'(w1,w2)=min(log2(co_times+1)*sim(w1,w2)/2,ξ2)
where w1 and w2 denote the two entities and co_times denotes the number of co-occurrences of w1 and w2. The more often two entities co-occur and the greater the similarity of their entity vectors, the stronger the association between them. The entities are considered associated only when the similarity is higher than ξ1; ξ2 represents the maximum value of the entity relationship; and ξ3 is used to remove weak associations, reducing unnecessary edges and easing analysis of the results.
The thresholds are adjusted according to specific needs: the smaller the threshold, the more entity relations are kept, but too small a threshold introduces a large amount of noise, while too large a threshold keeps too few relations and loses information.
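The relation weight rel' above can be computed directly; the specific values of ξ1 and ξ2 in the sketch are illustrative, not those of the embodiment:

```python
import math

def rel(w1, w2, vec, co_times, xi1=0.3, xi2=2.0):
    """rel'(w1, w2) = min(log2(co_times + 1) * sim(w1, w2) / 2, xi2),
    with similarities below xi1 treated as no association."""
    dot = sum(a * b for a, b in zip(vec[w1], vec[w2]))
    norm = (math.sqrt(sum(a * a for a in vec[w1]))
            * math.sqrt(sum(b * b for b in vec[w2])))
    sim = dot / norm
    if sim < xi1:                # below xi1: entities are not associated
        return 0.0
    return min(math.log2(co_times + 1) * sim / 2, xi2)
```

The weight grows logarithmically with the co-occurrence count and is capped at ξ2, so a few very frequent pairs cannot dominate the graph.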
After the relationships among the entities are constructed, the Louvain algorithm is used to perform community discovery on the entity relation graph. Communities are groups with potentially close relationships in the graph structure formed by entities and their associations, and can be used to find hidden groups in the texts. The Louvain algorithm optimizes the modularity of the community division with a greedy strategy. Modularity is an index that measures the quality of a community division when the true division is unknown, and can be used to measure the degree of association within communities. It is expressed by the formula:
Q = 1/(2m) · Σ_{v,w} [ Avw − kv·kw/(2m) ] · δ(cv, cw)

where Q represents the modularity, m represents the sum of the network connection weights, v and w are any two points in the entity relation graph, Avw represents the connection weight between the two points, kv represents the degree of point v, kw represents the degree of point w, and δ(cv, cw) indicates whether points v and w belong to the same community: 1 if so, 0 if not.
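The modularity Q can be computed from an edge list as follows (undirected weighted graph, no self-loops assumed):

```python
def modularity(edges, community):
    """Q = (1/2m) * sum_{v,w} [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w).

    edges:     iterable of (v, w, weight) for an undirected graph
    community: dict mapping each node to its community id
    """
    A = {}   # symmetric adjacency weights
    k = {}   # weighted degrees
    for v, w, wt in edges:
        A[(v, w)] = A.get((v, w), 0) + wt
        A[(w, v)] = A.get((w, v), 0) + wt
        k[v] = k.get(v, 0) + wt
        k[w] = k.get(w, 0) + wt
    m2 = sum(k.values())         # equals 2m
    q = 0.0
    for v in k:
        for w in k:
            if community[v] == community[w]:
                q += A.get((v, w), 0) - k[v] * k[w] / m2
    return q / m2
```

For two disjoint edges split into their natural components, Q = 0.5; putting everything into one community yields Q = 0, matching the known behaviour of the measure.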
In some possible embodiments, the Louvain algorithm may also be optimized, which may be divided into two steps:
1) Assign each node to its own community and compute the modularity. Then, for each node, compute the modularity gain of merging it into a neighbor's community, perform the merge with the largest gain, and repeat until the modularity no longer increases;
2) Contract each new community into a single node and repeat the operations in 1) until convergence.
The effectiveness of the community discovery task can be illustrated by comparing the association of entity pairs which are extracted from the same community of the community division result in a pairwise combination mode with the association of entity pairs which are randomly extracted.
Fig. 5 is a schematic diagram of an apparatus according to the technical solution provided in the embodiment of the present application. As shown in fig. 5, the news entity analyzing apparatus based on the unsupervised learning may include:
the labeling module 501 is configured to perform word segmentation on each piece of news data in the pieces of news data to be processed, and label a plurality of entities included in each piece of news after the word segmentation to obtain a labeling result;
The obtaining module 502 is configured to construct a distributed representation model based on the labeling result, so as to obtain distributed representation information of the plurality of entities, where the distributed representation information is identified as an entity vector;
and the clustering module 503 is configured to perform cluster analysis on the plurality of entities according to the distributed representation information of the plurality of entities to obtain a clustering result.
In some possible embodiments, the news entity analyzing apparatus based on the unsupervised learning further includes:
the news topic obtaining module 504 is configured to perform topic clustering according to the labeling result corresponding to each of the plurality of pieces of news data, so as to obtain the topic to which each piece of news data belongs;
The topic distribution obtaining module 505 is configured to count, according to the topic to which each piece of news data belongs, the probability that each of the plurality of entities appears in each topic, so as to obtain the topic distribution of each entity over the plurality of pieces of news data.
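The counting step performed by module 505 can be illustrated as follows; the news items, topic labels, and entities are invented for the example:

```python
# Sketch of the topic-distribution step: given the topic each news item was
# assigned to and the entities it mentions, estimate P(topic | entity) by
# simple counting. All data below is invented.
from collections import Counter, defaultdict

news = [
    {"topic": "sports",  "entities": ["FIFA", "Beijing"]},
    {"topic": "sports",  "entities": ["FIFA", "UEFA"]},
    {"topic": "economy", "entities": ["Beijing", "Shanghai"]},
]

counts = defaultdict(Counter)              # entity -> Counter of topic occurrences
for item in news:
    for ent in item["entities"]:
        counts[ent][item["topic"]] += 1

topic_dist = {
    ent: {t: c / sum(tc.values()) for t, c in tc.items()}
    for ent, tc in counts.items()
}
print(topic_dist["Beijing"])               # {'sports': 0.5, 'economy': 0.5}
```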
In some possible embodiments, the news entity analyzing apparatus based on the unsupervised learning further includes:
The entity cluster analysis module 506 is configured to determine the clustering effect of the clustering result according to the topic distributions of the plurality of entities and the clustering result, where the clustering effect is represented by the average distance between the topic distributions of the entities within each cluster of the clustering result.
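A minimal sketch of this cluster-effect measure, assuming Euclidean distance between topic distributions (the entities and distributions below are invented): a lower average intra-cluster distance indicates a more coherent cluster.

```python
# Average pairwise distance between topic distributions within one cluster.
import itertools
import math

def avg_intra_distance(cluster, dists, topics):
    """Mean pairwise Euclidean distance between the topic distributions of the
    entities in one cluster; lower means a more coherent cluster."""
    pairs = list(itertools.combinations(cluster, 2))
    if not pairs:
        return 0.0
    def vec(e):
        return [dists[e].get(t, 0.0) for t in topics]
    total = sum(math.dist(vec(a), vec(b)) for a, b in pairs)
    return total / len(pairs)

topics = ["sports", "economy"]
dists = {                                  # hypothetical topic distributions
    "FIFA":     {"sports": 1.0},
    "UEFA":     {"sports": 0.9, "economy": 0.1},
    "Shanghai": {"economy": 1.0},
}
coherent = avg_intra_distance(["FIFA", "UEFA"], dists, topics)
mixed = avg_intra_distance(["FIFA", "Shanghai"], dists, topics)
print(coherent)   # small: the two football entities have similar distributions
print(mixed)      # large: a football entity grouped with a city entity
```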
In some possible embodiments, the news entity analyzing apparatus based on the unsupervised learning further includes:
and the seed information searching module 507 is configured to search according to seed information to obtain implicit information related to the entities in the plurality of pieces of news data.
In some possible embodiments, the news entity analyzing apparatus based on the unsupervised learning further includes:
the community discovery module 508 is configured to construct a relationship between the entities according to the distributed representations of the entities and the co-occurrence times of the entities in the news data, and discover a community structure existing in the relationship between the entities by using a community discovery algorithm.
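The relation-construction part of module 508 can be sketched by counting, for each pair of entities, the number of news items in which they co-occur; these counts serve as the edge weights of the entity graph that the community discovery algorithm is then run on. The news lists below are invented:

```python
# Build a weighted entity co-occurrence graph from per-news entity lists.
# Edge weight = number of news items in which the two entities co-occur.
from collections import Counter
from itertools import combinations

news_entities = [                          # invented example data
    ["FIFA", "UEFA", "Beijing"],
    ["FIFA", "UEFA"],
    ["Beijing", "Shanghai"],
]

cooc = Counter()
for ents in news_entities:
    for a, b in combinations(sorted(set(ents)), 2):
        cooc[(a, b)] += 1                  # undirected edge, counted once per item

print(cooc[("FIFA", "UEFA")])              # 2: they co-occur in two news items
```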
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the embodiments is intended to illustrate the principles of the invention and is not meant to limit the invention to the particular embodiments disclosed; any modifications, equivalents, improvements, and the like that fall within the spirit and principles of the invention are intended to be included within its scope.

Claims (6)

The method comprises: obtaining a plurality of pieces of news data; performing topic clustering according to the labeling result corresponding to each piece of the news data to obtain a topic clustering result and the topic to which each piece of news data belongs; counting, according to the topic to which each piece of news data belongs, the probability of each entity of the plurality of entities appearing in the topics, to obtain the topic distribution of each entity over the plurality of pieces of news data; and determining the clustering effect of the clustering result according to the topic distribution of each of the plurality of entities and the clustering result, wherein the clustering effect is represented by the average distance between the topic distributions of the entities.
CN202110685518.3A | 2021-06-21 (priority) | 2021-06-21 (filed) | A news entity analysis method and device based on unsupervised learning | Active | CN113420112B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110685518.3A (CN113420112B) | 2021-06-21 | 2021-06-21 | A news entity analysis method and device based on unsupervised learning

Publications (2)

Publication Number | Publication Date
CN113420112A (en) | 2021-09-21
CN113420112B (en) | 2025-02-18

Family

ID=77789464

Family Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110685518.3A (CN113420112B, Active) | 2021-06-21 | 2021-06-21 | A news entity analysis method and device based on unsupervised learning

Country Status (1)

Country | Link
CN | CN113420112B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108829799A (en) * | 2018-06-05 | 2018-11-16 | People's Public Security University of China | Text similarity calculation method and system based on improved LDA topic model
CN111708880A (en) * | 2020-05-12 | 2020-09-25 | Beijing Mininglamp Software System Co., Ltd. | System and method for identifying class clusters

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104881401B (en) * | 2015-05-27 | 2017-10-17 | Dalian University of Technology | A patent document clustering method
CN108334628A (en) * | 2018-02-23 | 2018-07-27 | Beijing Dongrun Huanneng Science & Technology Co., Ltd. | Method, apparatus, device and storage medium for media event clustering
CN109766437A (en) * | 2018-12-07 | 2019-05-17 | Zhongke Hengyun Co., Ltd. | Text clustering method, text clustering device and terminal device
CN109739978A (en) * | 2018-12-11 | 2019-05-10 | Zhongke Hengyun Co., Ltd. | Text clustering method, text clustering device and terminal device
CN109726394A (en) * | 2018-12-18 | 2019-05-07 | University of Electronic Science and Technology of China | Short text topic clustering method based on a fused BTM model
CN110297988B (en) * | 2019-07-06 | 2020-05-01 | Sichuan University | Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm

Also Published As

Publication number | Publication date
CN113420112A (en) | 2021-09-21

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
