CN110717033A - Text classification noise monitoring method, device, equipment and computer readable medium - Google Patents

Text classification noise monitoring method, device, equipment and computer readable medium

Info

Publication number
CN110717033A
Authority
CN
China
Prior art keywords
similarity
text
title
noise
header
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810668995.7A
Other languages
Chinese (zh)
Inventor
田绍伟
姚源林
薛璐影
叶君健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810668995.7A
Publication of CN110717033A


Abstract

The invention provides a text classification noise monitoring method, apparatus, device, and computer readable medium. The method comprises the following steps: obtaining a similarity distribution of title pairs belonging to the same text category, wherein the similarity of a title pair is the similarity of any two titles belonging to the same text category; obtaining a confidence interval of the similarity distribution according to a set confidence level; and determining noise title pairs according to the confidence interval of the similarity distribution. According to the embodiment of the invention, the similarity distribution is obtained by calculating the similarity of the title pairs, and the noise samples are determined from the confidence interval of that distribution, so that the accuracy of noise monitoring can be ensured.

Description

Text classification noise monitoring method, device, equipment and computer readable medium
Technical Field
The invention relates to the technical field of text classification noise monitoring, in particular to a text classification noise monitoring method, a text classification noise monitoring device, text classification noise monitoring equipment and a computer readable medium based on distribution statistics.
Background
The text classification technology is an important basis of information retrieval and text mining and is one of core research contents in the field of artificial intelligence. When a machine learning method is adopted for text classification, a classifier learns the classification knowledge through a training sample labeled in advance according to the classification and forms a feature space, rules capable of effectively classifying are automatically mined out, and then the rules are used for classifying test samples.
Text classification can be roughly divided into two stages: (1) constructing a classifier from a training text set with class labels; and (2) classifying new text with the classifier. The quality of the classifier therefore directly affects the final classification result, and the quality of the classifier in turn depends strongly on the quality of the training text set. Generally speaking, the more accurate the category labels of the training text set and the more comprehensive its content, the higher the quality of the resulting classifier. In practical applications, however, such a comprehensive and accurate training text set is difficult to obtain. The document set of each category often contains a certain number of documents with wrong category labels, that is, documents whose content does not match the labeled category. Such mislabeled documents are referred to as noisy data. Noisy data is a common problem in automatic text classification, especially when the data is large in scale. In real text classification applications, the training data almost inevitably contains noise, and these noise samples have a significant influence on the final classification result, degrading classification accuracy and performance.
Disclosure of Invention
Embodiments of the present invention provide a text classification noise monitoring method, apparatus, device, and computer readable medium to solve or alleviate one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a text classification noise monitoring method, including:
obtaining similarity distribution of title pairs belonging to the same text category, wherein the similarity of the title pairs is the similarity of any two titles belonging to the same text category;
obtaining a confidence interval of the similarity distribution according to a set confidence level;
and determining a noise title pair according to the confidence interval of the similarity distribution.
With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining a similarity distribution of pairs of titles belonging to the same text category includes:
randomly extracting a set number of titles from the text categories;
pairing the extracted titles pairwise, and calculating the similarity of each title pair;
and carrying out distribution statistics according to the obtained similarity of the title pairs.
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the similarity of a title pair is calculated according to formula 1:
[Formula 1: equation image not reproduced]
wherein the two vectors in formula 1 are the title vectors of the ith and jth titles, respectively.
With reference to the first aspect, in a third implementation manner of the first aspect, an embodiment of the present invention further includes:
calculating, for each title, the ratio of the number of noise title pairs in which the title appears to the total number of noise title pairs; and when the ratio reaches a set threshold, determining that the title is a noise sample.
With reference to the first aspect, in a fourth implementation manner of the first aspect, an embodiment of the present invention further includes: and calculating the clustering density of the text category according to the title similarity of the same text category.
With reference to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the clustering density of the text category is calculated according to formula 2:
[Formula 2: equation image not reproduced]
wherein ξ_H(z) represents the clustering density of the corpus H in the text category z; N represents the number of titles belonging to the text category z; the vector indexed by i is the word vector of the ith title belonging to the text category z; the pairwise term represents the similarity between the title vector of the ith title and the title vector of the jth title; and i and j are positive integers.
In a second aspect, an embodiment of the present invention further provides a text classification noise monitoring apparatus, including:
the similarity distribution acquisition module is used for acquiring the similarity distribution of the title pairs belonging to the same text category, wherein the similarity of the title pairs is the similarity of any two titles belonging to the same text category;
the confidence interval acquisition module is used for acquiring the confidence interval of the similarity distribution according to a set confidence level;
and the noise title pair acquisition module is used for determining noise title pairs according to the confidence interval of the similarity distribution.
With reference to the second aspect, in a first implementation manner of the second aspect, the similarity distribution obtaining module includes:
the extraction submodule is used for randomly extracting a set number of titles from the text categories;
the calculation submodule is used for pairwise matching the extracted titles and calculating the similarity of each title pair;
and the statistic submodule is used for carrying out distribution statistics according to the obtained similarity of the title pairs.
With reference to the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the calculation submodule calculates the similarity of a title pair according to formula 1:
[Formula 1: equation image not reproduced]
wherein the two vectors in formula 1 are the title vectors of the ith and jth titles, respectively.
With reference to the second aspect, in a third implementation manner of the second aspect, the embodiment of the present invention further includes:
the noise title judging module is used for calculating, for each title, the ratio of the number of noise title pairs in which the title appears to the total number of noise title pairs; and when the ratio reaches a set threshold, the title is a noise sample.
With reference to the second aspect, in a fourth implementation manner of the second aspect, the embodiment of the present invention further includes:
and the clustering density calculating module is used for calculating the clustering density of the text category according to the title similarity of the same text category.
With reference to the fourth implementation manner of the second aspect, in a fifth implementation manner of the second aspect, the clustering density calculating module calculates the clustering density of the text category according to formula 2:
[Formula 2: equation image not reproduced]
wherein ξ_H(z) represents the clustering density of the corpus H in the text category z; N represents the number of titles belonging to the text category z; the vector indexed by i is the word vector of the ith title belonging to the text category z; the pairwise term represents the similarity between the title vector of the ith title and the title vector of the jth title; and i and j are positive integers.
The functions of the device can be realized by hardware, and can also be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In a third aspect, in a possible design, the structure of the text classification noise monitoring apparatus includes a processor and a memory, where the memory is used to store a program that supports the text classification noise monitoring apparatus to execute the text classification noise monitoring method in the first aspect, and the processor is configured to execute the program stored in the memory. The text classification noise monitoring apparatus may further include a communication interface for the text classification noise monitoring apparatus to communicate with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer readable medium for storing computer software instructions for a text classification noise monitoring apparatus, which includes a program for executing the text classification noise monitoring method according to the first aspect.
According to the embodiment of the invention, the similarity distribution is obtained by calculating the similarity of the title pairs, and the noise samples are determined from the confidence interval of that distribution, so that the accuracy of noise monitoring can be ensured.
Further, the range of the confidence interval can be adjusted by adjusting the confidence level, the accuracy of the filtering can be adjusted by adjusting the noise threshold, and generally, the larger the threshold is set, the higher the accuracy of the category to which the filtered text belongs.
In addition, the embodiment of the invention is suitable for the condition of more classification quantity, and can quickly monitor the noise data of each classification.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a flowchart of a text classification noise monitoring method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the detailed steps of step S110 according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating normal distribution of pairs of titles according to an embodiment of the present invention;
FIG. 4 is a flowchart of a text classification noise monitoring method according to another embodiment of the present invention;
FIG. 5 is a flowchart of a text classification noise monitoring method according to another embodiment of the present invention;
FIG. 6 is a diagram illustrating a text classification noise monitoring apparatus according to another embodiment of the present invention;
FIG. 7 is an internal block diagram of a similarity distribution obtaining module according to another embodiment of the present invention;
FIG. 8 is a diagram of a text classification noise monitoring apparatus according to another embodiment of the present invention;
FIG. 9 is a diagram of a text classification noise monitoring apparatus according to another embodiment of the present invention;
FIG. 10 is a block diagram of a text classification noise monitoring apparatus according to another embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive. The embodiments of the invention mainly provide a text classification noise monitoring method and apparatus, and the technical solution is elaborated through the following embodiments.
The embodiment of the invention provides a text classification noise monitoring method and a text classification noise monitoring device, which are used for detecting noise of classification training samples. The following describes a specific processing flow and principle of the text classification noise monitoring method and apparatus according to the embodiment of the present invention in detail.
Fig. 1 is a flowchart of a text classification noise monitoring method according to an embodiment of the present invention. The text classification noise monitoring method of the embodiment of the invention can comprise the following steps:
S110: Obtain a similarity distribution of title pairs belonging to the same text category, wherein the similarity of a title pair is the similarity of any two titles belonging to the same text category.
In order to identify the noise titles belonging to the same text category, in the present embodiment, similarity distribution of the title pairs is obtained by counting the similarity between two titles belonging to the same text category, so as to further perform screening of the noise titles.
As shown in fig. 2, in one embodiment, the step S110 may include:
S111: Randomly extract a set number of titles from the text category.
The number of titles to be randomly extracted may be set as a ratio of the total number of titles belonging to the text category, or may be set to a specific value. For example, assuming that the current text category includes 200 titles, 100 titles may be extracted as the comparison samples, or 60% of the titles (that is, 120 titles) may be extracted as the comparison samples.
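The sampling in step S111 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `sample_titles` and its `count`, `ratio`, and `seed` parameters are hypothetical, and the titles are placeholder strings.

```python
import random


def sample_titles(titles, count=None, ratio=None, seed=None):
    """Randomly draw titles from one text category, by count or by ratio.

    Hypothetical helper: exactly one of `count` or `ratio` must be given.
    """
    if (count is None) == (ratio is None):
        raise ValueError("specify exactly one of count or ratio")
    if ratio is not None:
        count = round(len(titles) * ratio)
    # Never ask for more titles than the category contains.
    count = min(count, len(titles))
    rng = random.Random(seed)
    return rng.sample(titles, count)  # sampling without replacement


# The example from the text: a category containing 200 titles.
titles = [f"title {k}" for k in range(1, 201)]
by_count = sample_titles(titles, count=100, seed=0)  # fixed number
by_ratio = sample_titles(titles, ratio=0.6, seed=0)  # 60% of 200
```

Sampling without replacement guarantees that no title is paired with itself in the next step.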
S112: Pair the extracted titles two by two, and calculate the similarity of each title pair.
In one embodiment, the extracted titles may be paired pairwise. For example, 100 titles are extracted from 200 titles, and then pairwise matching is performed on the 100 titles, and the similarity is calculated respectively, so as to reduce the calculated sample capacity.
In another embodiment, the extracted titles may be paired with the current title respectively. For example, when 200 titles are numbered in order and the similarity between the title with the number 1 and another title is determined, 100 titles may be extracted from the remaining 199 titles as matching targets. Then, similarity calculation is performed on the 200 titles in sequence according to the sequence numbers.
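The two pairing strategies described above can be sketched as follows. The function names `all_pairs` and `pairs_per_title` and the `partners` parameter are hypothetical; integers stand in for titles.

```python
import random
from itertools import combinations


def all_pairs(sampled):
    """Strategy 1: pair every extracted title with every other extracted title."""
    return list(combinations(sampled, 2))


def pairs_per_title(titles, partners, seed=None):
    """Strategy 2: for each title in turn, draw `partners` other titles to compare with."""
    rng = random.Random(seed)
    pairs = []
    for idx, current in enumerate(titles):
        others = titles[:idx] + titles[idx + 1:]  # the remaining titles
        for other in rng.sample(others, min(partners, len(others))):
            pairs.append((current, other))
    return pairs


strategy_1 = all_pairs(list(range(100)))                     # 100 sampled titles
strategy_2 = pairs_per_title(list(range(200)), 100, seed=0)  # 200 titles, 100 partners each
```

With 100 extracted titles the first strategy yields 100 × 99 / 2 = 4950 pairs. The second yields 200 × 100 = 20000 ordered pairs and, as sketched here, may contain both (a, b) and (b, a).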
In one embodiment, the similarity of a title pair is calculated according to formula 1 (equation image not reproduced), wherein the two vectors in formula 1 are the title vectors of the ith and jth titles, respectively.
In one implementation, the title vector may be obtained as follows: the title is segmented by a word segmenter; each resulting word is converted into a corresponding vector by a word-to-vector model; the word vectors are summed to form the title vector of the current title; and the title vector is finally normalized. For example, a 200-dimensional title vector can be obtained by accumulating the word vectors after word segmentation.
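The title-vector construction can be sketched as below. Several things here are assumptions rather than the patent's method: whitespace splitting stands in for the word segmenter, the three-dimensional `WORD_VECTORS` table is a made-up stand-in for a trained word-to-vector model, and cosine similarity is assumed for formula 1, whose equation image is not available in this text.

```python
import math

# Hypothetical toy word vectors; a real model would supply e.g. 200 dimensions.
WORD_VECTORS = {
    "stock": [0.9, 0.1, 0.0],
    "market": [0.8, 0.2, 0.1],
    "rises": [0.5, 0.4, 0.1],
    "football": [0.0, 0.2, 0.9],
    "final": [0.1, 0.3, 0.8],
}


def title_vector(title, word_vectors=WORD_VECTORS):
    """Sum the word vectors of a title and normalize the sum to unit length."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in title.lower().split():  # stand-in for a word segmenter
        vec = word_vectors.get(word)
        if vec is not None:  # out-of-vocabulary words are skipped
            total = [a + b for a, b in zip(total, vec)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total


def similarity(u, v):
    """Cosine similarity of two title vectors (an assumed form of formula 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


finance_a = title_vector("stock market rises")
finance_b = title_vector("stock market")
sports = title_vector("football final")
```

In this toy data the two finance titles are far more similar to each other than either is to the sports title, which is the property the later steps rely on.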
S113: Perform distribution statistics on the obtained title-pair similarities.
Fig. 3 is a schematic diagram of the normal distribution of the title pairs. Based on the obtained title-pair similarity data, a normal distribution diagram can be obtained by using ln(similarity) as the abscissa, where similarity is the similarity of a title pair, and the frequency of occurrence as the ordinate.
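The distribution statistics of step S113 can be sketched as a histogram over ln(similarity). The function name, the `bin_width` parameter, and the decision to skip non-positive similarities (whose logarithm is undefined) are assumptions made for illustration.

```python
import math
from collections import Counter


def log_similarity_histogram(similarities, bin_width=0.5):
    """Count title pairs per ln(similarity) bin; keys are the left bin edges."""
    counts = Counter()
    for s in similarities:
        if s <= 0:
            continue  # ln is undefined here; skipping is an assumption
        x = math.log(s)
        left_edge = math.floor(x / bin_width) * bin_width
        counts[left_edge] += 1
    return dict(sorted(counts.items()))


hist = log_similarity_histogram([1.0, 0.9, 0.5, 0.45, 0.05, 0.0])
```

Plotting the bin edges on the abscissa and the counts on the ordinate gives a diagram of the kind described for fig. 3.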
S120: Acquire a confidence interval of the similarity distribution according to a set confidence level.
After the similarity of the title pairs has been determined to obey a normal distribution through the normalizing transformation, the title pairs containing noise samples need to be identified and removed to reduce the influence of noise samples on classification performance. To this end, the confidence interval of the probability distribution obeyed by the title-pair similarity is calculated. According to the properties of the normal distribution, for a given confidence level 1-α, see formula 3:
[Formula 3: equation image not reproduced]
In formula 3, μ represents the center position of the normal curve; -μ_{α/2} and μ_{α/2} represent the positions of the lower and upper limits of the confidence interval, respectively; the mean symbol (image not reproduced) represents the mean of the abscissa; S represents the variance; and n represents the number of samples (i.e., the number of title pairs).
From formula 3, the confidence interval for μ at confidence level 1-α is given by formula 4:
[Formula 4: equation image not reproduced]
where the symbols have the same meanings as in formula 3.
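The equation images for formulas 3 and 4 are not available in this text. As a hedged reading, the symbols defined above match the textbook confidence interval for the mean of a normal distribution, which is written below. Two assumptions are made: the mean of the abscissa is written as X̄, and S is taken to be the sample standard deviation, which is what the textbook form requires even though the text calls S the variance. The quantile keeps the text's notation μ_{α/2}.

```latex
% Assumed form of formula 3: probability statement at confidence level 1 - \alpha
P\left\{ -\mu_{\alpha/2} < \frac{\bar{X} - \mu}{S/\sqrt{n}} < \mu_{\alpha/2} \right\} = 1 - \alpha

% Assumed form of formula 4: the resulting confidence interval for \mu
\left( \bar{X} - \frac{S}{\sqrt{n}}\,\mu_{\alpha/2},\;\; \bar{X} + \frac{S}{\sqrt{n}}\,\mu_{\alpha/2} \right)
```

Solving the inequality in the first expression for μ yields the interval in the second, which is the step the text describes as "from formula 3".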
The confidence interval is determined by looking up the quantile (symbol image not reproduced) in a table. In addition, the larger the value of α, the narrower the confidence interval: fewer noise samples go undetected, but the probability that correct data is flagged as abnormal also increases. For example, the confidence level 1-α may be 0.95 (that is, α = 0.05).
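Step S120 can be sketched in code under the textbook reading of the interval, mean ± quantile × S / √n. This is an assumption: the equation images are unavailable, S is taken as the sample standard deviation, and the quantile comes from Python's `statistics.NormalDist` rather than from a printed table. The function name and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(log_similarities, confidence=0.95):
    """Return (lower, upper) limits for the mean of the ln(similarity) values."""
    n = len(log_similarities)
    alpha = 1.0 - confidence
    # Two-sided standard normal quantile, replacing the table lookup.
    quantile = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    centre = mean(log_similarities)
    half_width = quantile * stdev(log_similarities) / math.sqrt(n)
    return centre - half_width, centre + half_width


sample = [-1.0, -2.0, -3.0, -2.0, -2.0]
lower_95, upper_95 = confidence_interval(sample, confidence=0.95)
lower_99, upper_99 = confidence_interval(sample, confidence=0.99)
```

Raising the confidence level (lowering α) widens the interval, which is the adjustment knob the text describes.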
S130: Determine noise title pairs according to the confidence interval of the similarity distribution.
By calculating the confidence interval of the probability distribution obeyed by the title-pair similarity, the title pairs lying to the left of the one-sided lower confidence limit are obtained. For example, if the lower limit of the calculated confidence interval is -6, then the title pairs whose ln(similarity) lies in [-8, -6] are noise title pairs. The similarity values of these title pairs occur less frequently and are also smaller. A noise sample represents its category only weakly and has low similarity to the other titles, so the title pairs below the lower limit of the confidence interval are regarded as noise title pairs.
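The selection in step S130 can be sketched as a filter on ln(similarity). The function name and the dictionary layout are hypothetical, and treating non-positive similarities as noise is an assumption, since their logarithm is undefined.

```python
import math


def noise_title_pairs(pair_similarities, lower_limit):
    """Return the title pairs whose ln(similarity) lies below the lower limit.

    `pair_similarities` maps (title_a, title_b) to the similarity of that pair.
    """
    noisy = []
    for pair, sim in pair_similarities.items():
        # Assumption: a non-positive similarity counts as below any limit.
        if sim <= 0 or math.log(sim) < lower_limit:
            noisy.append(pair)
    return noisy


pairs = {
    ("a", "b"): math.exp(-7.0),  # ln(similarity) = -7, below the limit
    ("a", "c"): math.exp(-2.0),  # well inside the interval
    ("b", "c"): math.exp(-5.9),  # just above the limit
}
flagged = noise_title_pairs(pairs, lower_limit=-6.0)
```

With the lower limit of -6 from the example in the text, only the pair with ln(similarity) = -7 is flagged.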
After the noise header pairs are obtained, the noise headers can also be extracted from these noise-containing header pairs. As shown in fig. 4, in another embodiment, the text classification noise monitoring method may further include the steps of:
S140: For each title, calculate the ratio of the number of noise title pairs in which the title appears to the total number of noise title pairs; when the ratio reaches a set threshold, the title is a noise sample.
For example, suppose 10 noise title pairs are obtained, that is, 20 title occurrences in total, and that these occurrences involve 11 distinct titles, title 1 to title 11. Title 1 occurs 10 times, and titles 2 to 11 occur once each. Title 1 therefore appears in 10/10 = 100% of the noise title pairs. It can thus be determined that title 1 has low similarity to the other titles, and title 1 can be determined to be a noise title.
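Step S140 and the worked example above can be reproduced with a short counting sketch. The function name `noise_titles` and the threshold value of 0.5 are hypothetical; the text only says that a threshold is set.

```python
from collections import Counter


def noise_titles(noise_pairs, threshold):
    """Map each noise title to the share of noise pairs in which it appears."""
    if not noise_pairs:
        return {}
    appearances = Counter()
    for first, second in noise_pairs:
        appearances[first] += 1
        appearances[second] += 1
    total = len(noise_pairs)
    return {
        title: count / total
        for title, count in appearances.items()
        if count / total >= threshold
    }


# The example from the text: title 1 is paired with each of titles 2 to 11.
example_pairs = [("title 1", f"title {k}") for k in range(2, 12)]
result = noise_titles(example_pairs, threshold=0.5)
```

Title 1 appears in all 10 pairs (ratio 1.0), while titles 2 to 11 each appear in one pair (ratio 0.1) and fall below the threshold.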
In addition, after the similarity of the title pairs under the same classified text is obtained, the clustering density under the whole classification can be calculated to measure the average similarity of the current classification. As shown in fig. 5, in another embodiment, the text classification noise monitoring method may further include the steps of:
S150: Calculate the clustering density of the text category according to the title similarity of the same text category.
The clustering density of the text category is calculated according to formula 2:
[Formula 2: equation image not reproduced]
wherein ξ_H(z) represents the clustering density of the corpus H in the text category z; N represents the number of titles belonging to the text category z; the vector indexed by i is the word vector of the ith title belonging to the text category z; the pairwise term represents the similarity between the title vector of the ith title and the title vector of the jth title; and i and j are positive integers.
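Because the image of formula 2 is not available, the sketch below only illustrates the stated purpose of the clustering density, which is to measure the average similarity within a category. It assumes the density is the mean similarity over all unordered title pairs; the actual normalization in formula 2 may differ. The function names are hypothetical, and a plain dot product stands in for the similarity of unit-length title vectors.

```python
from itertools import combinations


def clustering_density(title_vectors, similarity):
    """Mean pairwise similarity of the title vectors in one text category."""
    n = len(title_vectors)
    if n < 2:
        return 0.0  # no pairs to average over
    total = sum(similarity(u, v) for u, v in combinations(title_vectors, 2))
    return total / (n * (n - 1) / 2)


def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two identical unit vectors and one orthogonal outlier.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
density = clustering_density(vectors, dot)
```

Here one of the three pairs has similarity 1 and the other two have similarity 0, so the density is 1/3; removing the outlier would raise it to 1.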
According to the embodiment of the invention, the corresponding similarity distribution can be obtained by calculating the similarity between the title pairs, so that the noise sample is determined by setting the threshold value, and the accuracy of noise monitoring can be ensured, wherein the higher the threshold value is set, the higher the accuracy of the category to which the filtered text belongs is. In addition, the embodiment of the invention is suitable for the condition of more classification quantity, and can quickly monitor the noise data of each classification.
As shown in fig. 6, in another embodiment, an embodiment of the present invention further provides a text classification noise monitoring apparatus, including:
The similarity distribution obtaining module 110 is configured to obtain a similarity distribution of title pairs belonging to the same text category, where the similarity of a title pair is the similarity of any two titles belonging to the same text category.
The confidence interval obtaining module 120 is configured to obtain a confidence interval of the similarity distribution according to a set confidence level.
The noise title pair obtaining module 130 is configured to determine noise title pairs according to the confidence interval of the similarity distribution.
As shown in fig. 7, the similarity distribution obtaining module 110 includes:
The extraction submodule 111 is configured to randomly extract a set number of titles from the text category.
The calculation submodule 112 is configured to pair the extracted titles two by two and calculate the similarity of each title pair.
In the calculation submodule 112, the similarity between two titles is calculated according to formula 1:
[Formula 1: equation image not reproduced]
wherein the two vectors in formula 1 are the title vectors of the ith and jth titles, respectively.
The statistics submodule 113 is configured to perform distribution statistics on the obtained title-pair similarities.
As shown in fig. 8, in another embodiment, the text classification noise monitoring apparatus further includes:
A noise title judging module 140, configured to calculate, for each title, the ratio of the number of noise title pairs in which the title appears to the total number of noise title pairs; when the ratio reaches a set threshold, the title is a noise sample.
As shown in fig. 9, in another embodiment, the text classification noise monitoring apparatus further includes:
A clustering density calculating module 150, configured to calculate the clustering density of the text category according to the title similarity of the same text category.
In the clustering density calculating module 150, the clustering density of the text category is calculated according to formula 2:
[Formula 2: equation image not reproduced]
wherein ξ_H(z) represents the clustering density of the corpus H in the text category z; N represents the number of titles belonging to the text category z; the vector indexed by i is the word vector of the ith title belonging to the text category z; the pairwise term represents the similarity between the title vector of the ith title and the title vector of the jth title; and i and j are positive integers.
The functions of the modules of the apparatus of this embodiment are similar to the principles of the text classification noise monitoring method of the above embodiment, and therefore are not described again.
In another embodiment, the present invention further provides a text classification noise monitoring apparatus, as shown in fig. 10, including: a memory 510 and a processor 520, the memory 510 storing a computer program that is executable on the processor 520. The processor 520, when executing the computer program, implements the text classification noise monitoring method in the above embodiments. There may be one or more memories 510 and one or more processors 520.
The apparatus further comprises:
a communication interface 530, configured to communicate with an external device for data exchange.
The memory 510 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
If the memory 510, the processor 520, and the communication interface 530 are implemented independently, they may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 10, but this does not mean that there is only one bus or one type of bus.
Optionally, in an implementation, if the memory 510, the processor 520, and the communication interface 530 are integrated on a chip, they may communicate with each other through an internal interface.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The computer readable medium described in embodiments of the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
In embodiments of the present invention, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, radio frequency (RF), etc., or any suitable combination of the preceding.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
In summary, the embodiments of the present invention obtain a similarity distribution by calculating the similarity between title pairs and determine noise samples by setting a threshold, which helps ensure the accuracy of noise monitoring; the greater the threshold, the higher the accuracy of the category assigned to the filtered text. In addition, the embodiments of the present invention are suited to cases with a large number of categories and can quickly monitor the noise data of each category.
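The summarized procedure can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: cosine similarity as the similarity measure, the normal approximation used to form the confidence interval, the rule that only pairs below the lower bound count as noise, and all names (`cosine_similarity`, `find_noise_pairs`, `title_vectors`, `sample_size`) are assumptions made for illustration.

```python
import itertools
import math
import random
from statistics import NormalDist, mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def find_noise_pairs(title_vectors, sample_size, confidence=0.95, seed=0):
    """Return (noise_pairs, (lower, upper)) for one text category.

    title_vectors maps a title id to its vector (hypothetical input format).
    """
    rng = random.Random(seed)
    ids = sorted(title_vectors)
    sampled = rng.sample(ids, min(sample_size, len(ids)))
    if len(sampled) < 2:
        raise ValueError("need at least two titles to form a pair")

    # Pair the sampled titles two by two and score every pair.
    sims = {
        (a, b): cosine_similarity(title_vectors[a], title_vectors[b])
        for a, b in itertools.combinations(sampled, 2)
    }

    # Assumption: approximate the similarity distribution as normal and take
    # the two-sided interval mean +/- z * std at the set confidence level.
    values = list(sims.values())
    mu, sigma = mean(values), pstdev(values)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lower, upper = mu - z * sigma, mu + z * sigma

    # Assumption: a pair whose similarity falls below the interval is noise.
    noise_pairs = [pair for pair, s in sims.items() if s < lower]
    return noise_pairs, (lower, upper)
```

Under these assumptions, a category whose titles mostly point in one direction yields a similarity distribution concentrated near 1, and pairs involving an off-topic title fall below the lower bound of the interval.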
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (14)

1. A text classification noise monitoring method is characterized by comprising the following steps:
obtaining similarity distribution of title pairs belonging to the same text category, wherein the similarity of the title pairs is the similarity of any two titles belonging to the same text category;
obtaining a confidence interval of the similarity distribution according to a set confidence level;
and determining a noise title pair according to the confidence interval of the similarity distribution.
2. The text classification noise monitoring method according to claim 1, wherein the obtaining the similarity distribution of the title pairs belonging to the same text category comprises:
randomly extracting a set number of titles from the text categories;
pairing the extracted titles pairwise, and calculating the similarity of each title pair;
and carrying out distribution statistics according to the obtained similarity of the title pairs.
3. The text classification noise monitoring method according to claim 2, wherein the similarity of a title pair is calculated by formula 1:
[Formula 1: equation image FDA0001708613030000011, not reproduced in this text]
wherein the two vector symbols (equation image FDA0001708613030000012 and a second symbol missing from this text) represent the title vectors of the ith and jth titles, respectively.
4. The text classification noise monitoring method of claim 1, further comprising the steps of:
calculating, for each title, the ratio of the number of noise title pairs in which the title appears to the total number of noise title pairs; and when the ratio reaches a set threshold, determining that the title is a noise sample.
5. The text classification noise monitoring method of claim 1, further comprising the steps of: and calculating the clustering density of the text category according to the title similarity of the same text category.
6. The text classification noise monitoring method according to claim 5, wherein the clustering density of the text category is calculated by formula 2:
[Formula 2: equation image FDA0001708613030000014, not reproduced in this text]
wherein ξ_H(z) represents the clustering density of the corpus H in the text category z; N represents the number of titles belonging to the text category z; a vector symbol missing from this text represents the word vector of the ith title belonging to the text category z; the symbol in equation image FDA0001708613030000016 represents the similarity of the title vectors shown in equation images FDA0001708613030000017 and FDA0001708613030000018; and i and j are positive integers.
7. A text classification noise monitoring apparatus, comprising:
the similarity distribution acquisition module is used for acquiring the similarity distribution of the title pairs belonging to the same text category, wherein the similarity of the title pairs is the similarity of any two titles belonging to the same text category;
the confidence interval acquisition module is used for acquiring the confidence interval of the similarity distribution according to a set confidence level;
and the noise title pair acquisition module is used for determining a noise title pair according to the confidence interval of the similarity distribution.
8. The text classification noise monitoring device according to claim 7, wherein the similarity distribution obtaining module comprises:
the extraction submodule is used for randomly extracting a set number of titles from the text categories;
the calculation submodule is used for pairwise matching the extracted titles and calculating the similarity of each title pair;
and the statistic submodule is used for carrying out distribution statistics according to the obtained similarity of the title pairs.
9. The text classification noise monitoring device according to claim 8, wherein in the calculation submodule, the similarity of a title pair is calculated by formula 1:
[Formula 1: equation image FDA0001708613030000021, not reproduced in this text]
wherein the two vector symbols (a first symbol missing from this text and equation image FDA0001708613030000023) represent the title vectors of the ith and jth titles, respectively.
10. The text classification noise monitoring device of claim 7, further comprising:
the noise title judging module is used for calculating, for each title, the ratio of the number of noise title pairs in which the title appears to the total number of noise title pairs; and when the ratio reaches a set threshold, determining that the title is a noise sample.
11. The text classification noise monitoring device of claim 7, further comprising:
and the clustering density calculating module is used for calculating the clustering density of the text category according to the title similarity of the same text category.
12. The text classification noise monitoring device according to claim 11, wherein in the clustering density calculating module, the clustering density of the text category is calculated by formula 2:
[Formula 2: equation image FDA0001708613030000024, not reproduced in this text]
wherein ξ_H(z) represents the clustering density of the corpus H in the text category z; N represents the number of titles belonging to the text category z; a vector symbol missing from this text represents the word vector of the ith title belonging to the text category z; the symbol in equation image FDA0001708613030000032 represents the similarity of the title vectors shown in equation images FDA0001708613030000033 and FDA0001708613030000034; and i and j are positive integers.
13. A text classification noise monitoring apparatus, characterized in that the apparatus comprises:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text classification noise monitoring method of any of claims 1-6.
14. A computer-readable medium, in which a computer program is stored which, when being executed by a processor, carries out a text classification noise monitoring method as claimed in any one of claims 1 to 6.
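The per-title noise decision of claims 4 and 10 and the clustering density of claims 5 and 11 can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and because formula 2 is not reproduced in this text, `clustering_density` simply assumes the density is the mean of the pairwise title similarities, which may differ from the claimed formula.

```python
from collections import Counter


def noise_title_ratios(noise_pairs):
    """Ratio of noise pairs containing each title to the total number of noise pairs."""
    total = len(noise_pairs)
    if total == 0:
        return {}
    # set(pair) guards against counting a title twice within one pair.
    counts = Counter(title for pair in noise_pairs for title in set(pair))
    return {title: n / total for title, n in counts.items()}


def noise_samples(noise_pairs, ratio_threshold):
    """Titles whose ratio reaches the set threshold are treated as noise samples."""
    ratios = noise_title_ratios(noise_pairs)
    return sorted(t for t, r in ratios.items() if r >= ratio_threshold)


def clustering_density(pair_similarities):
    """Assumed density: mean pairwise similarity within one text category.

    This is a stand-in for formula 2, which is not available in this text.
    """
    values = list(pair_similarities)
    if not values:
        raise ValueError("need at least one title pair")
    return sum(values) / len(values)
```

For example, with noise pairs (a, x), (b, x), (c, x) and (a, b), title x appears in three of the four pairs, so a threshold of 0.6 marks only x as a noise sample.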
CN201810668995.7A | 2018-06-26 | 2018-06-26 | Text classification noise monitoring method, device, equipment and computer readable medium | Pending | CN110717033A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810668995.7A, CN110717033A (en) | 2018-06-26 | 2018-06-26 | Text classification noise monitoring method, device, equipment and computer readable medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810668995.7A, CN110717033A (en) | 2018-06-26 | 2018-06-26 | Text classification noise monitoring method, device, equipment and computer readable medium

Publications (1)

Publication Number | Publication Date
CN110717033A (en) | 2020-01-21

Family

ID=69208827

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810668995.7A (Pending, CN110717033A (en)) | Text classification noise monitoring method, device, equipment and computer readable medium | 2018-06-26 | 2018-06-26

Country Status (1)

Country | Link
CN (1) | CN110717033A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114764872A (en) * | 2020-12-31 | 2022-07-19 | 富泰华工业(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103294817A (en) * | 2013-06-13 | 2013-09-11 | 华东师范大学 | Text feature extraction method based on categorical distribution probability
CN103729798A (en) * | 2014-01-29 | 2014-04-16 | 河南理工大学 | Coal mine safety evaluation system based on improved k-means clustering
CN104142918A (en) * | 2014-07-31 | 2014-11-12 | 天津大学 | Short text clustering and hot topic extraction method based on TF-IDF feature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李湘东等: "文本分类中基于类别数据分布特性的噪声处理方法", 《现代图书情报技术》*

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114764872A (en) * | 2020-12-31 | 2022-07-19 | 富泰华工业(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium
CN114764872B (en) * | 2020-12-31 | 2025-07-15 | 富泰华工业(深圳)有限公司 | Data processing method, device, electronic device and storage medium

Similar Documents

Publication | Title
CN113177409B (en) | Intelligent sensitive word recognition system
CN112580734B (en) | Target detection model training method, system, terminal equipment and storage medium
CN118279304B (en) | Abnormal recognition method, device and medium for special-shaped metal piece based on image processing
CN112561080A (en) | Sample screening method, sample screening device and terminal equipment
CN114120127A (en) | Target detection method, device and related equipment
CN107341143A (en) | A kind of sentence continuity determination methods and device and electronic equipment
CN113177479A (en) | Image classification method and device, electronic equipment and storage medium
CN112766387B (en) | Training data error correction method, device, equipment and storage medium
CN115546811A (en) | Method, device and equipment for identifying seal and storage medium
CN112508062A (en) | Open set data classification method, device, equipment and storage medium
CN111738290A (en) | Image detection method, model construction and training method, device, equipment and medium
CN110135428B (en) | Image segmentation processing method and device
CN113688263B (en) | Method, computing device, and storage medium for searching for image
CN110717033A (en) | Text classification noise monitoring method, device, equipment and computer readable medium
CN111860299B (en) | Method and device for determining grade of target object, electronic equipment and storage medium
CN113537253A (en) | Infrared image target detection method and device, computing equipment and storage medium
CN113705625A (en) | Method and device for identifying abnormal life guarantee application families and electronic equipment
CN111523322A (en) | Requirement document quality evaluation model training method and requirement document quality evaluation method
CN115564776B (en) | Abnormal cell sample detection method and device based on machine learning
CN108985350B (en) | Method and device for recognizing blurred image based on gradient amplitude sparse characteristic information, computing equipment and storage medium
CN115620083B (en) | Model training method, face image quality evaluation method, equipment and medium
CN111428767B (en) | Data processing method and device, processor, electronic equipment and storage medium
CN113554474B (en) | Model verification method and device, electronic equipment and computer readable storage medium
CN115272682A (en) | Target object detection method, target detection model training method and electronic equipment
CN112015960B (en) | Clustering method for vehicle-mounted radar measurement data, storage medium and electronic device

Legal Events

Code | Title | Description
PB01 | Publication | Publication
SE01 | Entry into force of request for substantive examination | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-01-21
