Disclosure of Invention
Embodiments of the present invention provide a text classification noise monitoring method, apparatus, device, and computer readable medium to solve or alleviate one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a text classification noise monitoring method, including:
obtaining a similarity distribution of title pairs belonging to the same text category, wherein the similarity of a title pair is the similarity between any two titles belonging to the same text category;
obtaining a confidence interval of the similarity distribution according to a set confidence level;
and determining a noise title pair according to the confidence interval of the similarity distribution.
With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining a similarity distribution of pairs of titles belonging to the same text category includes:
randomly extracting a set number of titles from the text category;
pairing the extracted titles pairwise, and calculating the similarity of each title pair;
and carrying out distribution statistics according to the obtained similarity of the title pairs.
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the similarity of a title pair is calculated according to Formula 1, where T_i and T_j denote the title vectors of the i-th and j-th titles, respectively.
With reference to the first aspect, in a third implementation manner of the first aspect, an embodiment of the present invention further includes:
respectively calculating, for each title appearing in the noise title pairs, the proportion of the number of noise title pairs in which the title appears to the total number of noise title pairs; and when the proportion reaches a set threshold, the title is a noise sample.
With reference to the first aspect, in a fourth implementation manner of the first aspect, an embodiment of the present invention further includes: and calculating the clustering density of the text category according to the title similarity of the same text category.
With reference to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the clustering density of the text category is calculated according to Formula 2, where ξ_H(z) denotes the clustering density of the corpus H in the text category z, N denotes the number of titles belonging to the text category z, T_i denotes the title vector of the i-th title belonging to the text category z, sim(T_i, T_j) denotes the similarity between the title vectors T_i and T_j, and i and j are positive integers.
In a second aspect, an embodiment of the present invention further provides a text classification noise monitoring apparatus, including:
the similarity distribution acquisition module is used for acquiring the similarity distribution of the title pairs belonging to the same text category, wherein the similarity of the title pairs is the similarity of any two titles belonging to the same text category;
the confidence interval acquisition module is used for acquiring the confidence interval of the similarity distribution according to a set confidence level;
and the noise title pair acquisition module is used for determining a noise title pair according to the confidence interval of the similarity distribution.
With reference to the second aspect, in a first implementation manner of the second aspect, the similarity distribution obtaining module includes:
the extraction submodule is used for randomly extracting a set number of titles from the text category;
the calculation submodule is used for pairwise matching the extracted titles and calculating the similarity of each title pair;
and the statistic submodule is used for carrying out distribution statistics according to the obtained similarity of the title pairs.
With reference to the first implementation manner of the second aspect, in a second implementation manner of the second aspect, in the calculation submodule, the similarity of a title pair is calculated according to Formula 1, where T_i and T_j denote the title vectors of the i-th and j-th titles, respectively.
With reference to the second aspect, in a third implementation manner of the second aspect, the embodiment of the present invention further includes:
the noise title judging module is used for calculating, for each title appearing in the noise title pairs, the proportion of the number of noise title pairs in which the title appears to the total number of noise title pairs; and when the proportion reaches a set threshold, the title is a noise sample.
With reference to the second aspect, in a fourth implementation manner of the second aspect, the embodiment of the present invention further includes:
and the clustering density calculating module is used for calculating the clustering density of the text category according to the title similarity of the same text category.
With reference to the fourth implementation manner of the second aspect, in a fifth implementation manner of the second aspect, in the clustering density calculating module, the clustering density of the text category is calculated according to Formula 2, where ξ_H(z) denotes the clustering density of the corpus H in the text category z, N denotes the number of titles belonging to the text category z, T_i denotes the title vector of the i-th title belonging to the text category z, sim(T_i, T_j) denotes the similarity between the title vectors T_i and T_j, and i and j are positive integers.
The functions of the apparatus may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In a third aspect, in a possible design, the structure of the text classification noise monitoring apparatus includes a processor and a memory, where the memory is used to store a program that supports the text classification noise monitoring apparatus to execute the text classification noise monitoring method in the first aspect, and the processor is configured to execute the program stored in the memory. The text classification noise monitoring apparatus may further include a communication interface for the text classification noise monitoring apparatus to communicate with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer readable medium for storing computer software instructions for a text classification noise monitoring apparatus, which includes a program for executing the text classification noise monitoring method according to the first aspect.
According to the embodiments of the present invention, the corresponding similarity distribution can be obtained by calculating the similarities between title pairs, so that noise samples are determined through the confidence interval of the similarity distribution, and the accuracy of noise monitoring can be ensured.
Further, the range of the confidence interval can be adjusted by adjusting the confidence level, and the accuracy of the filtering can be adjusted by adjusting the noise threshold; in general, the larger the threshold is set, the higher the accuracy of the categories to which the filtered texts belong.
In addition, the embodiments of the present invention are suitable for cases with a large number of categories, and can quickly monitor the noise data of each category.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive. The embodiments of the present invention mainly provide a text classification noise monitoring method and apparatus, and the technical solution is described in detail through the following embodiments.
The embodiment of the invention provides a text classification noise monitoring method and a text classification noise monitoring device, which are used for detecting noise of classification training samples. The following describes a specific processing flow and principle of the text classification noise monitoring method and apparatus according to the embodiment of the present invention in detail.
Fig. 1 is a flowchart of a text classification noise monitoring method according to an embodiment of the present invention. The text classification noise monitoring method of the embodiment of the invention can comprise the following steps:
s110: and obtaining similarity distribution of title (title) pairs belonging to the same text category, wherein the similarity of the title pairs is the similarity of any two titles belonging to the same text category.
In order to identify the noise titles belonging to the same text category, in the present embodiment, similarity distribution of the title pairs is obtained by counting the similarity between two titles belonging to the same text category, so as to further perform screening of the noise titles.
As shown in fig. 2, in one embodiment, the step S110 may include:
s111: a set number of titles are randomly drawn in the text category.
The number of titles to be randomly extracted may be set in a certain ratio based on the total number of titles belonging to a certain text category, or may be set to a specific numerical value. For example, assuming that the current text category includes 200 titles, 100 samples may be extracted as the comparison samples, or 60% or 120 samples may be extracted as the comparison samples.
S112: and pairwise matching the extracted titles, and calculating the similarity of each title pair.
In one embodiment, the extracted titles may be paired pairwise. For example, 100 titles are extracted from 200 titles, and then pairwise matching is performed on the 100 titles, and the similarity is calculated respectively, so as to reduce the calculated sample capacity.
In another embodiment, the extracted titles may be paired with the current title respectively. For example, when 200 titles are numbered in order and the similarity between the title with the number 1 and another title is determined, 100 titles may be extracted from the remaining 199 titles as matching targets. Then, similarity calculation is performed on the 200 titles in sequence according to the sequence numbers.
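For illustration only, the two sampling strategies described above can be sketched in Python as follows; the function names, the representation of titles as a list of strings, and the sample sizes are assumptions made for this sketch rather than part of the described embodiments.

import itertools
import random

def sample_pairs_all(titles, sample_size):
    # Strategy 1: randomly draw sample_size titles and pair them pairwise.
    sampled = random.sample(titles, sample_size)
    return list(itertools.combinations(sampled, 2))

def sample_pairs_per_title(titles, partners_per_title):
    # Strategy 2: for each title in turn, randomly draw matching targets from
    # the remaining titles and pair them with the current title.
    pairs = []
    for i, title in enumerate(titles):
        others = titles[:i] + titles[i + 1:]
        partners = random.sample(others, min(partners_per_title, len(others)))
        pairs.extend((title, partner) for partner in partners)
    return pairs

titles = [f"title {k}" for k in range(1, 201)]   # e.g. 200 titles in the category
pairs1 = sample_pairs_all(titles, 100)           # 100 titles paired pairwise
pairs2 = sample_pairs_per_title(titles, 100)     # 100 matching targets per title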
In one embodiment, the similarity of a title pair is calculated according to Formula 1, where T_i and T_j denote the title vectors of the i-th and j-th titles, respectively.
In an implementation, the title vector may be obtained by segmenting a title with a word segmenter, converting the resulting words into vectors with a word2vec (word to vector) model, summing these word vectors to form the title vector of the current title, and finally normalizing the title vector. For example, a 200-dimensional title vector can be obtained by accumulating the word vectors of the segmented words.
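As a sketch of the title vector construction just described, the following assumes a hypothetical word segmenter segment (returning the words of a title) and a word-vector table word_vectors (a dict mapping each word to a 200-dimensional numpy vector, for example obtained from a trained word2vec model); treating the pair similarity as the cosine similarity of the normalized title vectors is likewise an assumption, since Formula 1 itself is not reproduced here.

import numpy as np

def title_vector(title, segment, word_vectors, dim=200):
    # Segment the title, sum the word vectors of the resulting words,
    # and normalize the sum to unit length.
    vec = np.zeros(dim)
    for word in segment(title):
        if word in word_vectors:
            vec += word_vectors[word]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def pair_similarity(title_i, title_j, segment, word_vectors):
    # Assumed form of Formula 1: cosine similarity, which reduces to a dot
    # product because both title vectors are normalized.
    t_i = title_vector(title_i, segment, word_vectors)
    t_j = title_vector(title_j, segment, word_vectors)
    return float(np.dot(t_i, t_j))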
S113: and carrying out distribution statistics according to the obtained similarity of the title pairs.
As shown in fig. 3, which is a schematic diagram of the normal distribution of the header pairs. Based on the obtained data of the similarity of the title pair, a normal distribution diagram can be obtained by using ln (similarity) as the abscissa, wherein the similarity is the similarity of the title pair, and using the frequency of occurrence or frequency (frequency) as the ordinate.
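A minimal sketch of the distribution statistics of S113, assuming the pair similarities are positive so that the logarithm is defined:

import numpy as np

def similarity_distribution(similarities, bins=50):
    # Take ln(similarity) as the abscissa and count the frequency per bin,
    # yielding the approximately normal distribution illustrated in fig. 3.
    log_sims = np.log(np.asarray(similarities, dtype=float))
    counts, bin_edges = np.histogram(log_sims, bins=bins)
    return log_sims, counts, bin_edges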
S120: and acquiring a confidence interval of the similarity distribution according to a set confidence level.
After determining that the similarity of the title pair obeys normal distribution through normalization transformation, a title (title) pair containing a noise sample needs to be obtained and clipped to reduce the influence of the noise sample on the classification performance. The confidence interval of the probability distribution obeyed by the similarity of each title pair is calculated, and according to the characteristic of normal distribution, for a given confidence level 1-alpha, see formula 3:
in formula 3, μ represents the center position of the normal curve, - μα/2And muα/2Representing the position of the lower and upper limit, respectively, of the confidence interval,represents the mean of the abscissa, S represents the variance, and n represents the number of samples (i.e., title versus number)
From equation 3, it can be concluded that the confidence interval with a confidence level of 1- α for μ is equation 4:
at the center of the normal curve denoted by μ, - μα/2And muα/2Representing the position of the lower and upper limit, respectively, of the confidence interval,represents the mean of the abscissa, S represents the variance, and n represents the number of samples (i.e., number of header pairs).
The value of μ_{α/2} can be obtained by looking up a standard normal distribution table, and the confidence interval is thereby determined. In addition, the larger the value of α, the smaller the probability of erroneously detecting noise sample data, while the probability of correct data being mixed into the abnormal data increases; for example, α is 0.95.
S130: and determining a noise title pair according to the confidence interval of the similarity distribution.
The title pairs to the left of the single-sided confidence lower limit are obtained by calculating the confidence interval of the probability distribution to which the similarity of the title pairs obeys. For example, if the lower limit of the calculated confidence interval is-6, then the headline pair having an ln (similarity) between [ -8, -6] is a noise headline pair. The similarity values between these pairs of headers appear less frequently and the similarity values are also smaller. According to the characteristics of the noise sample, the degree of representing the category is weaker, and the similarity between the noise sample and other header pairs is smaller, so that the header pair determining the lower limit of the confidence interval belongs to the noise header pair.
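The confidence interval of S120 and the noise title pair selection of S130 can be sketched as follows, following Formulas 3 and 4 as given above; obtaining μ_{α/2} from scipy's standard normal quantile function and the default significance level of 0.05 are implementation assumptions.

import numpy as np
from scipy.stats import norm

def noise_title_pairs(pairs, similarities, alpha=0.05):
    # Compute the confidence interval of Formula 4 on ln(similarity) and
    # return the title pairs falling below the one-sided lower limit.
    log_sims = np.log(np.asarray(similarities, dtype=float))
    n = len(log_sims)
    mean = log_sims.mean()
    s = log_sims.std(ddof=1)            # sample standard deviation S
    z = norm.ppf(1 - alpha / 2)         # mu_{alpha/2} from the standard normal table
    lower = mean - z * s / np.sqrt(n)   # lower confidence limit
    upper = mean + z * s / np.sqrt(n)   # upper confidence limit
    noisy = [pair for pair, ls in zip(pairs, log_sims) if ls < lower]
    return noisy, (lower, upper)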
After the noise title pairs are obtained, the noise titles can further be extracted from these noise-containing title pairs. As shown in fig. 4, in another embodiment, the text classification noise monitoring method may further include the following step:
S140: For each title appearing in the noise title pairs, calculating the proportion of the number of noise title pairs in which the title appears to the total number of noise title pairs; when the proportion reaches a set threshold, the title is a noise sample.
For example, suppose 10 noise title pairs are obtained, which contain 20 title occurrences. Assume these correspond to 11 distinct titles, namely title 1 to title 11, where title 1 appears 10 times and titles 2 to 11 each appear once. Title 1 therefore appears in 10 of the 10 noise title pairs, a proportion of 100%. It can then be concluded that title 1 has low similarity to the other titles, and title 1 can be determined to be a noise title.
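A sketch of the ratio computation of S140, where noise_pairs is assumed to be the list of noise title pairs obtained in S130 and threshold is the set threshold:

from collections import Counter

def noise_titles(noise_pairs, threshold=0.8):
    # Count, for each title, the number of noise title pairs it appears in,
    # and flag the title as a noise sample when the ratio of that count to
    # the total number of noise title pairs reaches the threshold.
    total = len(noise_pairs)
    counts = Counter(title for pair in noise_pairs for title in pair)
    return [title for title, c in counts.items() if total and c / total >= threshold]

In the example above, title 1 appears in 10 of the 10 noise title pairs, so it is returned for any threshold up to 1.0.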
In addition, after the similarities of the title pairs under the same text category are obtained, the clustering density of the whole category can be calculated to measure the average similarity of the current category. As shown in fig. 5, in another embodiment, the text classification noise monitoring method may further include the following step:
S150: Calculating the clustering density of the text category according to the title similarities of the same text category.
The clustering density of the text category is calculated according to Formula 2, where ξ_H(z) denotes the clustering density of the corpus H in the text category z, N denotes the number of titles belonging to the text category z, T_i denotes the title vector of the i-th title belonging to the text category z, sim(T_i, T_j) denotes the similarity between the title vectors T_i and T_j, and i and j are positive integers.
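Because Formula 2 itself is not reproduced here, the following sketch assumes that the clustering density is the mean similarity over all distinct title pairs of category z, computed on the normalized title vectors; the function name and the matrix-based computation are illustrative assumptions.

import numpy as np

def clustering_density(title_vectors):
    # title_vectors: array of shape (N, dim), one normalized vector per title
    # belonging to category z; the density is taken as the mean pairwise
    # similarity over all distinct pairs (i, j) with i < j.
    vectors = np.asarray(title_vectors, dtype=float)
    n = len(vectors)
    if n < 2:
        return 0.0
    sims = vectors @ vectors.T             # pairwise dot products (cosine similarities)
    upper = sims[np.triu_indices(n, k=1)]  # keep each pair once
    return float(upper.mean())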
According to the embodiments of the present invention, the corresponding similarity distribution can be obtained by calculating the similarities between title pairs, so that noise samples can be determined by setting a threshold, and the accuracy of noise monitoring can be ensured; the higher the threshold is set, the higher the accuracy of the categories to which the filtered texts belong. In addition, the embodiments of the present invention are suitable for cases with a large number of categories and can quickly monitor the noise data of each category.
As shown in fig. 6, an embodiment of the present invention further provides a text classification noise monitoring apparatus, including:
the similaritydistribution obtaining module 110 is configured to obtain similarity distribution of a pair of titles belonging to the same text category, where the similarity of the pair of titles is the similarity of any two titles belonging to the same text category.
A confidence interval obtaining module 120, configured to obtain a confidence interval of the similarity distribution according to a set confidence level.
A noise title pair obtaining module 130, configured to determine a noise title pair according to the confidence interval of the similarity distribution.
As shown in fig. 7, the similarity distribution obtaining module 110 includes:
an extraction submodule 111, configured to randomly extract a set number of titles from the text category;
a calculation submodule 112, configured to pair the extracted titles pairwise and calculate the similarity of each title pair; in the calculation submodule 112, the similarity of a title pair is calculated according to Formula 1, where T_i and T_j denote the title vectors of the i-th and j-th titles, respectively; and
a statistic submodule 113, configured to perform distribution statistics according to the obtained similarities of the title pairs.
As shown in fig. 8, in another embodiment, the text classification noise monitoring apparatus further includes:
a noiseheader judging module 140, configured to calculate a ratio of the number of each header appearing in the noise header pair to the number of all noise header pairs; and when the proportion reaches a set threshold value, the header is a noise sample.
As shown in fig. 9, in another embodiment, the text classification noise monitoring apparatus further includes:
and the clusteringdensity calculating module 150 is configured to calculate the clustering density of the text category according to the title similarity of the same text category.
In the clustering density calculating module 150, the clustering density of the text category is calculated according to Formula 2, where ξ_H(z) denotes the clustering density of the corpus H in the text category z, N denotes the number of titles belonging to the text category z, T_i denotes the title vector of the i-th title belonging to the text category z, sim(T_i, T_j) denotes the similarity between the title vectors T_i and T_j, and i and j are positive integers.
The functions of the modules of the apparatus of this embodiment are similar to the principles of the text classification noise monitoring method of the above embodiment, and therefore are not described again.
In another embodiment, the present invention further provides a text classification noise monitoring apparatus, as shown in fig. 10, including: a memory 510 and a processor 520, the memory 510 having stored therein a computer program executable on the processor 520. The processor 520, when executing the computer program, implements the text classification noise monitoring method in the above embodiments. The number of the memory 510 and the number of the processor 520 may each be one or more.
The apparatus further comprises:
thecommunication interface 530 is used for communicating with an external device to perform data interactive transmission.
Memory 510 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 510, the processor 520, and the communication interface 530 are implemented independently, the memory 510, the processor 520, and the communication interface 530 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 510, the processor 520, and the communication interface 530 are integrated on one chip, the memory 510, the processor 520, and the communication interface 530 may communicate with each other through an internal interface.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The computer readable medium described in embodiments of the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Additionally, the computer-readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
In embodiments of the present invention, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, Radio Frequency (RF), etc., or any suitable combination of the preceding.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
In summary, in the embodiments of the present invention, the corresponding similarity distribution can be obtained by calculating the similarities between title pairs, so that noise samples can be determined by setting a threshold, thereby ensuring the accuracy of noise monitoring; the larger the threshold is set, the higher the accuracy of the categories to which the filtered texts belong. In addition, the embodiments of the present invention are suitable for cases with a large number of categories and can quickly monitor the noise data of each category.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.