CN111723201B

Movatterモバイル変換

Info

Publication number: CN111723201B
Application number: CN201910221823.XA
Authority: CN
Inventors: 李飞
Original assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2025-05-23
Anticipated expiration: 2039-03-22
Also published as: CN111723201A

Abstract

The invention discloses a method and a device for clustering text data, and relates to the technical field of computers. The method comprises the steps of obtaining batched text data, determining a characteristic word set of each text data in the batched text data, determining the weight of each characteristic word in the characteristic word set for the characteristic word set of each text data, sorting the batched text data according to the weight of the characteristic word in the characteristic word set, and carrying out clustering calculation on the batched text data based on the sorting result. According to the method, the sequence of the text clustering is rearranged by utilizing the characteristic weights, the text data with rich information can be clustered to form topic classes preferentially, then the text clustering is carried out according to the clustered topic classes, and the clustering accuracy can be improved.

Description

Method and device for text data clustering

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for text data clustering.

Background

In the text clustering process, whether the subsequent text data belongs to the created cluster is judged based on the previously created cluster. Therefore, if the cluster class is created first from text data containing a small amount of information, the accuracy of the subsequent clusters is affected. In the prior art, when text clustering is calculated, the text is clustered based on the sequence of time dimension, and the inaccuracy of a clustering result is easily caused.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a method and an apparatus for text data clustering, which can rearrange the order of text clustering by using feature weights, and can preferentially cluster text data with abundant information into topic types, and then perform text clustering according to the clustered topic types, so as to improve the clustering accuracy.

To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a method for text data clustering.

The method for clustering text data comprises the steps of obtaining batched text data, determining a characteristic word set of each text data in the batched text data, determining the weight of each characteristic word in the characteristic word set for the characteristic word set of each text data, sorting the batched text data according to the weight of the characteristic word in the characteristic word set, and carrying out clustering calculation on the batched text data based on the sorting result.

Optionally, the step of sorting the batch of text data according to weights of feature words in the feature word sets comprises determining a weight sum of all feature words in each feature word set, and sorting the batch of text data according to the weight sum.

Optionally, based on the sorting result, the step of clustering the batch of text data comprises traversing the batch of text data based on the sorting result and sequentially determining similarity values of the text data and a clustering center of the created topic class; judging whether the similarity value accords with a preset threshold value, if so, adding text data into a corresponding created topic class, updating a clustering center of the created topic class, and if not, creating a topic class.

Optionally, the step of determining the similarity value of each text data and the clustering center of the created topic class comprises the steps of determining the hash value of the feature word of each text data through a hash algorithm, carrying out weighting processing on the hash value according to the weight of each determined feature word to obtain a weighted hash value of the feature word, accumulating the weighted hash values of the feature word of each text data to obtain a serial string representation of the text data, determining the sim-hash value of the text data according to the serial string representation of the text data, and determining the similarity value of the text data and the clustering center of the created topic class according to the sim-hash value of the text data.

Optionally, the similarity value includes a title similarity value and a content similarity value;

the step of judging whether the similarity value accords with a preset threshold value comprises the step of judging whether the content similarity value accords with a second preset threshold value under the condition that the title similarity value does not accord with the first preset threshold value.

Optionally, for each feature word set of the text data, determining the weight of each feature word in the feature word set includes determining a TF-IDF weight value for the feature word in each feature word set by a word frequency-inverse document frequency algorithm.

To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided an apparatus for text data clustering.

The device for clustering text data in the embodiment of the invention comprises the following steps:

The characteristic word determining module is used for acquiring batch text data and determining characteristic word sets of each text data in the batch text data;

The weight determining module is used for determining the weight of each feature word in the feature word set of each text data;

The sequencing module is used for sequencing the batch of text data according to the weight of the feature words in the feature word set;

And the clustering module is used for carrying out clustering calculation on the batch text data based on the sequencing result.

Optionally, the sorting module is further configured to determine a weight sum of all feature words in each feature word set, and sort the batch text data according to the weight sum.

Optionally, the clustering module is further configured to traverse the batch of text data based on the sorting result, sequentially determine a similarity value between the text data and a clustering center of the created topic class, determine whether the similarity value meets a preset threshold, add the text data to the corresponding created topic class and update the clustering center of the created topic class if the similarity value meets the preset threshold, and otherwise, create a new topic class.

Optionally, the clustering module is further configured to determine a hash value of a feature word of each text data through a hash algorithm, perform weighting processing on the hash value according to the determined weight of each feature word to obtain a weighted hash value of the feature word, accumulate the weighted hash values of the feature word of each text data to obtain a serial string representation of the text data, determine a sim-hash value of the text data according to the serial string representation of the text data, and determine a similarity value of the text data and a clustering center of the created topic class according to the sim-hash value of the text data.

Optionally, the clustering module is further configured to determine whether the similarity value meets a preset threshold, where determining that the title similarity value does not meet a first preset threshold, determines whether the content similarity value meets a second preset threshold;

Optionally, the weight determining module is further configured to determine TF-IDF weight values of the feature words in each feature word set through a word frequency-reverse file frequency algorithm.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.

The electronic device comprises one or more processors and a storage device, wherein the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the method for clustering text data.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a computer-readable medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the method for text data clustering of any one of the above.

One embodiment of the invention has the advantages that the text data with rich information can be clustered into topic classes preferentially by rearranging the sequence of the text clusters through the characteristic weights, and then the text clustering is carried out according to the clustered topic classes, thereby improving the clustering accuracy

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main flow of a method for text data clustering in accordance with an embodiment of the present invention;

FIG. 2 is a schematic diagram of a method of hot topic discovery according to an embodiment of the invention;

FIG. 3 is a schematic diagram of clustering computations of text data according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the main modules of an apparatus for text data clustering in accordance with an embodiment of the present invention;

FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

Fig. 6 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram of main flow of a method for text data clustering according to an embodiment of the present invention, and as shown in fig. 1, the method for text data clustering according to an embodiment of the present invention mainly includes:

step S101, acquiring batch text data and determining a characteristic word set of each text data in the batch text data. Through the process, words with relatively strong representative meanings can be screened out from the text data to serve as characteristic words. And in order to reduce the complexity of calculation and improve the text processing speed and efficiency, words to be processed are relatively fewer as much as possible during selection. Therefore, further, the screened initial feature words can be scored, the initial feature words with the later scores can be deleted according to the size of the scores (the size of the screening scores can be set according to service requirements), and a plurality of items with the earlier scores can be selected as the feature words extracted from the text. For a text data, there are typically a plurality of extracted feature words, and the plurality of feature words form a feature word set of the text data.

Step S102, for each characteristic word set of the text data, determining the weight of each characteristic word in the characteristic word set. The weight of a feature word refers to the magnitude of the effect that the feature word plays in certain text data. In the embodiment of the invention, the weight can be calculated through data statistics or through scoring, etc.

And step 103, sequencing the batch text data according to the weights of the feature words in the feature word set. In the embodiment of the invention, the weight sum of all the feature words in each feature word set is determined, and the batch text data is ordered according to the weight sum. Specifically, the feature weights and the text data from high to low are ordered, the information content of the front text data is rich, topic clusters (topic classes) can be formed by preferential clustering, and then text clustering is performed according to the clustered topic clusters, so that the accuracy is higher. The present invention is not limited to the above-described embodiments, as long as the text data is sorted by the weight of the feature words to represent the information content of the text data, for example, the text data is sorted according to the weight product of the feature words in each feature word set.

And step S104, carrying out clustering calculation on the batch text data based on the sequencing result. In the process, based on the sorting result, traversing the batch of text data, and sequentially determining the similarity value of the text data and the clustering center of the created topic class. And judging whether the similarity value accords with a preset threshold value, if so, adding the text data into the corresponding created topic class, updating the clustering center of the created topic class, and if not, creating a topic class.

In the process of determining the similarity value of each text data and the clustering center of the created topic class, the hash value of the feature word of each text data is determined through a hash algorithm, and each feature word is represented through a string of numbers, so that the purpose of dimension reduction is achieved. And then, carrying out weighting processing on the hash value according to the determined weight of each feature word to obtain the weighted hash value of the feature word. And then accumulating the weighted hash values of the feature words of each text data to obtain the serial string representation of the text data. The sim-hash value of the text data is determined from the serial string representation of the text data. And determining the similarity value of the text data and the clustering center of the created topic class according to the sim-hash value of the text data. simhash is a locality sensitive hash. For local sensitivity, it is assumed that two strings have a certain similarity, and the similarity can still be maintained after the hash, which is called as local sensitivity hash, and common hash does not have the attribute. When the similarity value of the text data and the clustering center of the created topic class is determined, the Hamming distance between the sim-hash value of the text data and the sim-hash value of the clustering center of the created topic class can be calculated, and the similarity value of the Hamming distance text data and the clustering center of the created topic class is determined.

The step of judging whether the similarity value accords with a preset threshold value comprises judging whether the content similarity value accords with a second preset threshold value under the condition that the similarity value of the title does not accord with the first preset threshold value.

According to the embodiment of the invention, the sequence of the text clusters is rearranged by utilizing the characteristic weights, for example, the text clusters are ordered from high to low according to the characteristic weights, and the information content of the text data in front is rich. The text data with rich information can be clustered to form topic classes preferentially, and then text clustering is carried out according to the clustered topic classes, so that the clustering accuracy can be improved. Especially, clustering is carried out on streaming data, so that the clustering accuracy can be greatly improved. And setting a clustering center of the topic class, wherein the clustering center represents topics of text data in the topic class. The new text data only needs to be compared with the clustering centers in the topics in a similar way, so that the clustering calculation speed is improved. And when the similarity of the text data is calculated, the Hamming distance between the text data is determined based on the sim-hash value of the text data, so that the calculation complexity is greatly reduced. If the calculated Hamming distance is smaller than the preset threshold, the similarity value among the texts accords with the preset threshold, and the texts can be gathered into a topic class.

Hot topic discovery refers to a technique of automatically discovering topics that are of great interest to people in a certain period of time through various information resources and fusing topic-related content together. Hot topic discovery relies on an incremental text clustering technology, information flows are gathered into limited topic class clusters, the topics in the classes are highly similar, and the similarity between different classes is low, so that mass data fusion is performed. The identification of the hot topics aims at acquiring corresponding topics from the semi-structured massive network data and polymerizing the topics into new hot events for analysis. The hot topic analysis has great practical significance on public opinion analysis, can provide the focus of attention of netizens for the content marketing platform in real time, assist the network public opinion analysis, and timely help the public gateway team of each big company to process crisis events.

Fig. 2 is a schematic diagram of a method of hot topic discovery according to an embodiment of the invention. As shown in fig. 2, the method for discovering hot topics according to the embodiment of the present invention includes:

Step 201, acquiring batch text data according to keyword timing. In order to determine the real-time topics, the data can be crawled at regular time (for example, 15 minutes apart) according to the keywords in webpage news, microblog, weChat public numbers, knowledge and the like so as to obtain batched text data to be clustered, and hot topics can be identified through clustering the obtained text data. The keywords may be one or more, and the keywords may be set according to the service requirement.

And 202, extracting characteristic words of the text data. The extraction of text feature words is to screen words with relatively strong representative meaning from text contents or titles. On the premise of not influencing the theme ideas to be expressed by the text, the words to be processed are relatively fewer as much as possible during selection, so that the calculation complexity can be greatly reduced, and the text processing speed and efficiency are improved. The less a word appears in a document, the less the word affects the document, so in an embodiment of the invention, feature words are extracted by a word frequency method.

Specifically, word segmentation processing may be performed on the text data to obtain an initial feature word. Since parts of speech has a great influence on topic clustering, all words except nouns and verbs in the initial feature words are filtered. And then scoring the filtered initial feature words according to a preset feature evaluation index, sorting according to the magnitude of the scores, and finally selecting a plurality of items with the top scores as feature words of the text (forming a feature word set).

And 203, calculating the weight of the feature words. In the embodiment of the invention, the TF-IDF weight value of the feature words in each feature word set is determined by a word frequency-reverse file frequency (TF-IDF) algorithm, and the weight value determined by the method is more accurate and has higher calculation efficiency. Where TF represents the number of times a feature word appears in text d, and IDF is the inverse document frequency (obtained by dividing the total number of documents by the number of documents containing the word, and taking the logarithm of the quotient obtained) which is inversely proportional to the number of times a word appears in the document set. The TF-IDF weight value is calculated as follows:

w_ij＝tf_ij×idf_ij

w_ij is the weight value of the feature word i in the text j, TF_ij is the TF value of the feature word i in the text j, and IDF_ij is the IDF value of the feature word i in the text j.

And 204, sequencing the batch text data according to the weight sum of the feature words. Because the single-pass algorithm is sensitive to the order of document clustering, the embodiment of the invention adjusts the clustering order of text data. Specifically, as for the feature words in the feature word set of each text data, the TF-IDF value is calculated, the kth text data after normalization processing can be expressed as D_k＝(w_k1d_k1,w_k2d_k2,...w_knd_kn),k＝1、2、…、m,i＝1、2,…、n,m as the number of batch text data, n as the number of feature words of all text data, w_kn as the weight of the normalized feature words, and d_ki as whether the ith word contained in the text data d_k is 1 when the ith word is contained and is not contained as 0. W_kn,w_kn = for normalization processing

Weight (i, d_k) is represented as the TF-IDF weight value of the i-th feature word in the text data d_k.

The weight sum of the feature words of the kth text data is as followsD_ki＝w_kid_ki denotes a weight item of the i-th feature word in the text data D_k. And according to the weight sum of the feature words, m text data are ordered, topic information contained in the front text data is rich, and the topic clusters are formed by preferential clustering.

Step 205, calculating sim-hash value of the text data. Each character string is changed into a string of numbers by a hash algorithm to achieve the purpose of dimension reduction. Then, the weighted number strings are formed according to the weights of the feature term words obtained in the above steps, for example, the hash value of the term ' U.S. "is 100101, the weight is 5, and the weighted calculation is ' 5-5-5 5-5 5'. And accumulating each bit of the weighted digital string calculated by each characteristic word of the text data, changing the accumulated digital string into 01 strings, namely, marking the number larger than 0 as 1 and marking the number smaller than 0 as 0, so as to obtain a sim-hash value of each text data, and improving the accuracy of the calculation of the similarity value of the subsequent text data.

Step 206, calculating Hamming distance between the text data and the clustering center of the created topic class. After obtaining the sim-hash value of each text data, the distance is called hamming distance by comparing how much the sim-hash values of two text data are different from each other by 01. For example, the third, fourth and fifth bits of '10101' and '10010' are different, and the Hamming distance therebetween is 3. When the batch text data is clustered, a threshold value can be preset, and when the Hamming distance is lower than the threshold value (the similarity value accords with the preset threshold value), the batch text data is considered to be similar and is classified as a topic cluster. For determination of the distance center, centroid method or center method can be used. The center method, which means that a topic class is represented by a certain document as a topic center, requires a lot of time to search for a proper center. For the centroid method, the average value of all document vectors in the class is adopted for similarity comparison, so that the efficiency of a single-pass clustering algorithm can be improved.

And step 207, judging the topic class to which the text data belongs according to the calculated Hamming distance.

Fig. 3 is a schematic diagram of clustering calculation of text data according to an embodiment of the present invention, as shown in fig. 3, in the embodiment of the present invention, text data is sequentially acquired based on sorting of batch text, after one text data is acquired, whether the text data is first determined, if yes, it is indicated that no topic class is created at this time, and then the first topic class is established. And sequentially acquiring text data according to the sequencing result and performing clustering calculation. And judging that the acquired text data is not the first text data, and indicating that the created topic class exists. Then a similarity determination is performed on the title in the text data, for example, the first threshold T_e1 is set to 3, and if the hamming distance D_{Title of the book} between the title of the text data and the title of the cluster center of the created topic class is lower than this threshold 3, the text data is considered to be clustered in the topic class. If the Hamming distance between the titles is greater than the threshold 3, the calculation of the Hamming distance between the contents is continued. For example, if the second threshold T_e2 is set to 10 and the hamming distance D_Content between the content of the text data and the content of the cluster center of the created topic class is smaller than this threshold 10, the topic class is classified. If the Hamming distance of the content of the text data to the content of the cluster centers of all the created topic classes is greater than this threshold 10, a new topic cluster is created.

And step 208, outputting all the created topic classes.

Fig. 4 is a schematic diagram of main modules of an apparatus for text data clustering according to an embodiment of the present invention, and as shown in fig. 4, an apparatus 400 for text data clustering according to an embodiment of the present invention includes a feature word determining module 401, a weight determining module 402, a ranking module 403, and a clustering module 404.

The feature word determining module 401 is configured to obtain a batch of text data, and determine a feature word set of each text data in the batch of text data.

The weight determining module 402 is configured to determine a weight of each feature word in the feature word set for each feature word set of the text data. The weight determining module is further used for determining TF-IDF weight values of the feature words in each feature word set through a word frequency-reverse file frequency algorithm.

The sorting module 403 is configured to sort the batch text data according to weights of feature words in the feature word set. The sequencing module is also used for determining the weight sum of all the feature words in each feature word set and sequencing the batch text data according to the weight sum.

The clustering module 404 is configured to perform clustering calculation on the batch text data based on the sorting result. The clustering module is also used for traversing the batch text data based on the sequencing result and sequentially determining the similarity value of the text data and the clustering center of the created topic class, judging whether the similarity value meets a preset threshold, adding the text data into the corresponding created topic class and updating the clustering center of the created topic class if the similarity value meets the preset threshold, and creating the topic class if the similarity value does not meet the preset threshold.

The clustering module is also used for determining the hash value of the characteristic word of each text data through a hash algorithm, carrying out weighting processing on the hash value according to the weight of each determined characteristic word to obtain a weighted hash value of the characteristic word, accumulating the weighted hash values of the characteristic word of each text data to obtain a sequence string representation of the text data, determining the sim-hash value of the text data according to the sequence string representation of the text data, and determining the similarity value of the text data and the clustering center of the created topic class according to the sim-hash value of the text data. When the similarity value of the text data and the clustering center of the created topic class is determined, the Hamming distance between the sim-hash value of the text data and the sim-hash value of the clustering center of the created topic class can be calculated, and the similarity value of the Hamming distance text data and the clustering center of the created topic class is determined.

The clustering module is further configured to determine whether the similarity value meets a preset threshold, where determining that the title similarity value does not meet the first preset threshold, and determining whether the content similarity value meets a second preset threshold. The similarity values include a title similarity value and a content similarity value.

Fig. 5 shows an exemplary system architecture 500 of a method for text data clustering or an apparatus for text data clustering to which embodiments of the invention may be applied.

As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 is used as a medium to provide communication links between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 505 via the network 504 using the terminal devices 501, 502, 503 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 501, 502, 503, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 501, 502, 503 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 505 may be a server providing various services, such as a background management server (by way of example only) providing support for shopping-type websites browsed by users using the terminal devices 501, 502, 503. The background management server can analyze and other data of the received product information inquiry request and feed back the processing result to the terminal equipment.

It should be noted that, the method for text data clustering provided by the embodiment of the present invention is generally performed by the server 505, and accordingly, the device for text data clustering is generally disposed in the server 505.

It should be understood that the number of terminal devices, networks and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 6, there is illustrated a schematic diagram of a computer system 600 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 6 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

Connected to the I/O interface 605 are an input section 606 including a keyboard, a mouse, and the like, an output section 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like, a storage section 608 including a hard disk, and the like, and a communication section 609 including a network interface card such as a LAN card, a modem, and the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 601.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, a processor may be described as including an acquisition feature word determination module, a weight determination module, a ranking module, and a clustering module. The names of these modules do not constitute limitations on the module itself in some cases, and for example, the feature word determining module may also be described as "a module that acquires batch text data, and determines a feature word set for each text data in the batch text data".

As a further aspect, the invention also provides a computer readable medium which may be comprised in the device described in the above embodiments or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to include obtaining a batch of text data and determining a feature word set for each text data in the batch of text data, determining a weight for each feature word in the feature word set for the feature word set of each text data, sorting the batch of text data according to the weights of the feature words in the feature word set, and performing a clustering calculation on the batch of text data based on the sorting result.

According to the embodiment of the invention, the sequence of the text clusters is rearranged by utilizing the characteristic weights, for example, the text clusters are ordered from high to low according to the characteristic weights, and the information content of the text data in front is rich. The text data with rich information can be clustered to form topic classes preferentially, and then text clustering is carried out according to the clustered topic classes, so that the clustering accuracy can be improved.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for text data clustering, comprising:

Acquiring batched text data and determining a characteristic word set of each text data in the batched text data;

For each feature word set of the text data, determining the weight of each feature word in the feature word set;

the weight sum of all feature words in each feature word set is determined, and the batch text data is ordered according to the weight sum;

and carrying out clustering calculation on the batch of text data based on the sequencing result, wherein the weight and the text data from high to low are sequenced, the weight and the text data before are rich in information content, topic classes are formed based on the weight and the text data before clustering, and then text clustering is carried out according to the clustered topic classes.

2. The method of claim 1, wherein the step of clustering the batch of text data based on the ranking result comprises:

traversing the batch of text data based on the sequencing result, and sequentially determining similarity values of the text data and clustering centers of the established topic class;

judging whether the similarity value accords with a preset threshold value, if so, adding text data into a corresponding created topic class, updating a clustering center of the created topic class, and if not, creating a topic class.

3. The method of claim 2, wherein the step of determining a similarity value for each text data to a cluster center of the created topic class comprises:

Determining a hash value of the feature word of each text data through a hash algorithm;

Weighting the hash value of each feature word according to the determined weight of the feature word to obtain a weighted hash value of the feature word;

accumulating the weighted hash values of the feature words of each text data to obtain a sequence string representation of the text data;

Determining sim-hash values of the text data according to the sequence string representation of the text data;

and determining the similarity value of the text data and the clustering center of the created topic class according to the sim-hash value of the text data.

4. The method of claim 2, wherein the similarity values include a title similarity value and a content similarity value;

5. The method of claim 1, wherein for each set of feature words of the text data, determining the weight of each feature word in the set of feature words comprises:

And determining the TF-IDF weight value of the feature words in each feature word set through a word frequency-reverse file frequency algorithm.

6. An apparatus for text data clustering, comprising:

the sequencing module is used for determining the weight sum of all the feature words in each feature word set, and sequencing the batch of text data according to the weight sum;

And the clustering module is used for carrying out clustering calculation on the batch of text data based on the sequencing result, wherein the weight and the text data from high to low are sequenced, the information content contained in the weight and the text data before are rich, topic classes are formed based on the weight and the text data before clustering, and then text clustering is carried out according to the clustered topic classes.

7. The apparatus of claim 6, wherein the clustering module is further configured to traverse the batch of text data based on the ranking result and sequentially determine a similarity value of the text data and a cluster center of a created topic class, determine whether the similarity value meets a preset threshold, add the text data to a corresponding created topic class and update the cluster center of the created topic class if so, and otherwise, create a new topic class.

8. The apparatus of claim 7, wherein the clustering module is further configured to determine a hash value of a feature word of each text data by a hash algorithm, weight the hash value of each feature word according to a weight of the determined feature word to obtain a weighted hash value of the feature word, accumulate the weighted hash values of the feature word of each text data to obtain a serial string representation of the text data, determine a sim-hash value of the text data according to the serial string representation of the text data, and determine a similarity value of the text data to a clustering center of the created topic class according to the sim-hash value of the text data.

9. The apparatus of claim 7, wherein the clustering module is further configured to determine whether the similarity value meets a preset threshold value, including determining whether the content similarity value meets a second preset threshold value if the title similarity value does not meet the first preset threshold value;

10. The apparatus of claim 6, wherein the weight determination module is further configured to determine TF-IDF weight values for the feature words in each feature word set by a word frequency-reverse file frequency algorithm.

11. An electronic device, comprising:

One or more processors;

storage means for storing one or more programs,

When executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-5.

12. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.