Summary of the invention
The purpose of the present invention is to provide a short text classification method, apparatus, device, and medium for shopping web pages which, in view of the characteristics of short text data in shopping web pages, can improve the accuracy of short text classification through the use of denoising and a domain word set.
To solve the above technical problem, embodiments of the present invention disclose a short text classification method for shopping web pages, the method comprising:
obtaining a short text to be classified from a shopping web page;
performing word segmentation on the short text to be classified to obtain a first segmented-word set of the short text to be classified;
performing denoising on the first segmented-word set to obtain a second segmented-word set of the short text to be classified;
extracting keywords of the short text to be classified based on the second segmented-word set;
classifying the short text to be classified according to the extracted keywords and a product domain word set.
In one example, performing denoising on the first segmented-word set to obtain the second segmented-word set of the short text to be classified comprises:
calculating the importance of each word in the first segmented-word set using the term frequency-inverse document frequency (TF-IDF) algorithm;
selecting noise words, using the document frequency (DF) algorithm, from the words whose importance is lower than a predetermined importance;
deleting the noise words from the first segmented-word set to obtain the second segmented-word set.
In another example, classifying the short text to be classified according to the extracted keywords and the product domain word set comprises:
performing vectorization on the short text to be classified based on the extracted keywords to obtain a vectorized short text;
determining the product domain to which the short text to be classified belongs by matching the extracted keywords against the product domain word set;
performing clustering on the vectorized short text based on the product domain to which the short text to be classified belongs;
classifying the short text to be classified according to the result of the clustering.
In another example, performing clustering on the vectorized short text based on the product domain to which the short text to be classified belongs comprises:
performing a Mahalanobis-distance soft clustering calculation on the vectorized short text to obtain the similarity of the document set,
and classifying the short text to be classified according to the clustering result comprises:
classifying the short text to be classified according to the similarity of the document set.
Embodiments of the present invention also disclose a short text classification apparatus for shopping web pages, the apparatus comprising:
an acquisition unit, configured to obtain a short text to be classified from a shopping web page;
a word segmentation unit, configured to perform word segmentation on the short text to be classified to obtain a first segmented-word set of the short text to be classified;
a denoising unit, configured to perform denoising on the first segmented-word set to obtain a second segmented-word set of the short text to be classified;
an extraction unit, configured to extract keywords of the short text to be classified based on the second segmented-word set;
a classification unit, configured to classify the short text to be classified according to the extracted keywords and a product domain word set.
In one example, the denoising unit comprises:
a calculation subunit, configured to calculate the importance of each word in the first segmented-word set using the term frequency-inverse document frequency (TF-IDF) algorithm;
a selection subunit, configured to select noise words, using the document frequency (DF) algorithm, from the words whose importance is lower than a predetermined importance;
a deletion subunit, configured to delete the noise words from the first segmented-word set to obtain the second segmented-word set.
In another example, the classification unit comprises:
a vectorization subunit, configured to determine the product domain to which the short text to be classified belongs by matching the extracted keywords against the product domain word set;
a clustering subunit, configured to perform clustering on the vectorized short text based on the product domain to which the short text to be classified belongs;
a classification subunit, configured to classify the short text to be classified according to the result of the clustering.
In another example, the clustering subunit performs the clustering as follows:
performing a Mahalanobis-distance soft clustering calculation on the vectorized short text to obtain the similarity of the document set,
and classifying the short text to be classified according to the clustering result comprises:
classifying the short text to be classified according to the similarity of the document set.
Embodiments of the present invention also disclose a non-volatile computer storage medium encoded with a computer program, the computer program comprising instructions which, when executed by one or more computers, cause the one or more computers to perform operations comprising:
obtaining a short text to be classified from a shopping web page;
performing word segmentation on the short text to be classified to obtain a first segmented-word set of the short text to be classified;
performing denoising on the first segmented-word set to obtain a second segmented-word set of the short text to be classified;
extracting keywords of the short text to be classified based on the second segmented-word set;
classifying the short text to be classified according to the extracted keywords and a product domain word set.
Embodiments of the present invention also disclose a device comprising a memory storing computer-executable instructions and a processor, the processor being configured to execute a short text classification method for shopping web pages, the method comprising:
obtaining a short text to be classified from a shopping web page;
performing word segmentation on the short text to be classified to obtain a first segmented-word set of the short text to be classified;
performing denoising on the first segmented-word set to obtain a second segmented-word set of the short text to be classified;
extracting keywords of the short text to be classified based on the second segmented-word set;
classifying the short text to be classified according to the extracted keywords and a product domain word set.
Compared with the prior art, the main differences and effects of the embodiments of the present invention are as follows:
In view of the characteristics of short text data in shopping web pages, the accuracy of short text classification can be improved through the use of denoising and a domain word set.
Further, because web crawlers are generally used to automatically capture product information from shopping web pages, some noise is inevitably introduced; the double selection using TF-IDF and DF can therefore effectively filter out noise words.
Further, by using a predetermined product domain word set, the domain of the product described in a short text can be roughly determined, which effectively improves the efficiency of clustering the short texts.
Specific embodiment
In the following description, many technical details are set forth so that the reader can better understand this application. However, those of ordinary skill in the art will appreciate that the technical solutions claimed in the claims of this application can be realized even without these technical details and with various changes and modifications based on the following embodiments.
To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
A first embodiment of the present invention relates to a short text classification method for shopping web pages. Fig. 1 is a flow diagram of the method. Specifically, as shown in Fig. 1, the short text classification method for shopping web pages includes the following steps:
In step 101, a short text to be classified is obtained from a shopping web page, for example by a web crawler.
Thereafter, the process proceeds to step 102.
In step 102, word segmentation is performed on the short text to be classified to obtain a first segmented-word set of the short text to be classified.
In this application, a short text refers to product description information in a shopping web page. Word segmentation divides the text content of the short text to be classified into multiple words: the text is read and split into words, and the division takes the semantics of the context into account so that the segmentation is more accurate. For example, performing word segmentation on the product title "Baofeng AI TV full HD ultra-thin LCD television set (black)" from a shopping web page yields "Baofeng", "AI", "TV", "full HD", "ultra-thin", "LCD", "television set", and "(black)". When word segmentation is performed, the resulting words may also be tagged with their parts of speech, for example "noun", "adjective", or "adverb", in order to improve the efficiency and accuracy of the short text classification method.
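As a rough illustration of word segmentation, the sketch below uses forward maximum matching against a small dictionary. This is a simpler stand-in technique, not the context-aware segmentation described above, and the function name `forward_max_match`, the dictionary, and the input string are all hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` greedily, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionary and input, for illustration only.
DICTIONARY = {"超薄", "液晶", "电视", "电视机"}
first_word_set = forward_max_match("超薄液晶电视机", DICTIONARY)
```

Greedy matching prefers "电视机" (television set) over the shorter "电视" (TV); a production segmenter would normally also resolve ambiguities from context and attach part-of-speech tags.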
Thereafter, the process proceeds to step 103.
In step 103, denoising is performed on the first segmented-word set to obtain a second segmented-word set of the short text to be classified.
In a preferred example, step 103 includes:
calculating the importance of each word in the first segmented-word set using the term frequency-inverse document frequency (TF-IDF) algorithm; selecting noise words, using the document frequency (DF) algorithm, from the words whose importance is lower than a predetermined importance; and deleting the noise words from the first segmented-word set to obtain the second segmented-word set.
Because web crawlers are generally used to automatically capture product information from shopping web pages, some noise is inevitably introduced; the double selection using TF-IDF and DF can therefore effectively filter out noise words.
After the type of the short text to be classified has been determined, if a noise word occurring in the short text does not yet appear in the noise word set corresponding to that type, it may be added to the noise word set of that type.
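The double selection in step 103 can be sketched as follows. The text does not state the exact DF criterion or the threshold values, so this sketch assumes that a low-importance word counts as noise when it appears in more than a given share of the texts; the function name `denoise`, both thresholds, and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def denoise(first_word_set, corpus, importance_threshold=0.1, df_ratio_threshold=0.5):
    """Return the second word set: the first word set with noise words removed.

    corpus: list of tokenized short texts (lists of words), normally
    including the short text that is being denoised.
    """
    n_docs = len(corpus)
    # Document frequency: number of texts in which each word appears.
    df = Counter(word for doc in corpus for word in set(doc))
    tf = Counter(first_word_set)
    total = len(first_word_set)

    noise_words = set()
    for word, count in tf.items():
        doc_freq = max(df[word], 1)  # guard against words missing from the corpus
        # TF-IDF importance of the word within this short text.
        importance = (count / total) * math.log(n_docs / doc_freq)
        # First filter: importance below the predetermined threshold.
        # Second filter (assumed rule): the word appears in a large share of texts.
        if importance < importance_threshold and doc_freq / n_docs > df_ratio_threshold:
            noise_words.add(word)

    return [w for w in first_word_set if w not in noise_words]


# Hypothetical crawled short texts, already segmented.
corpus = [
    ["free", "shipping", "ultra-thin", "LCD", "television"],
    ["free", "shipping", "dual-SIM", "phone"],
    ["free", "shipping", "cotton", "shirt"],
]
second_word_set = denoise(corpus[0], corpus)
```

In this sample, "free" and "shipping" occur in every text, so their TF-IDF importance is zero and their document frequency ratio is 1; both filters flag them and only the product words remain.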
Thereafter, the process proceeds to step 104.
In step 104, keywords of the short text to be classified are extracted based on the second segmented-word set.
For example, the keywords of a short text to be classified are the words most relevant to the product it describes and can represent the attributes of that product; extracting the keywords therefore helps to classify the short text accurately. Keywords may be extracted according to the frequency with which words appear in the short text, or according to the parts of speech of its segmented words. For example, in a short text, function words such as "is" may occur very frequently, yet they cannot express the meaning of the short text; nouns in the short text may therefore be given priority in keyword extraction so that the extracted keywords are more accurate.
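A minimal sketch of keyword extraction that combines the two criteria mentioned above, part of speech and frequency. The ranking rule (nouns first, then higher frequency, then earlier position), the function name `extract_keywords`, and the tagged sample are assumptions made for illustration.

```python
from collections import Counter


def extract_keywords(tagged_words, top_k=5, preferred_pos=("noun",)):
    """Pick keywords from (word, part_of_speech) pairs.

    Words with a preferred part of speech are ranked first; ties are broken
    by frequency in the short text, then by first appearance.
    """
    freq = Counter(word for word, _ in tagged_words)
    pos_of = {}
    order = {}
    for index, (word, pos) in enumerate(tagged_words):
        pos_of.setdefault(word, pos)
        order.setdefault(word, index)

    ranked = sorted(
        freq,
        key=lambda w: (pos_of[w] not in preferred_pos, -freq[w], order[w]),
    )
    return ranked[:top_k]


# Hypothetical part-of-speech tagged second word set.
tagged = [
    ("ultra-thin", "adjective"),
    ("LCD", "noun"),
    ("television", "noun"),
    ("is", "verb"),
    ("television", "noun"),
]
keywords = extract_keywords(tagged, top_k=2)
```

Here "television" ranks ahead of "LCD" because it occurs twice, and both nouns rank ahead of "ultra-thin" and "is" regardless of frequency.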
Thereafter, the process proceeds to step 105.
In step 105, the short text to be classified is classified according to the extracted keywords and a product domain word set.
It can be understood that the product domain word set is prepared in advance and contains core words that can effectively distinguish each product domain. For example, for the electronic device domain, certain brands and device parameters allow a short text to be quickly assigned to the domain: based on well-known brands in the electronic device domain (such as Huawei and Xiaomi) or device parameters (such as dual-SIM dual-standby or 5 megapixels), it can be determined that a short text containing these domain words belongs to the electronic device domain.
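Matching keywords against predefined domain word sets might look like the sketch below. The text does not say how competing matches are resolved, so picking the domain with the most overlapping words is an assumption, and the name `match_domain` and the contents of `DOMAIN_WORD_SETS` are hypothetical.

```python
def match_domain(keywords, domain_word_sets):
    """Return the domain whose word set shares the most words with the
    keywords, or None if no domain word matches."""
    best_domain, best_hits = None, 0
    for domain, words in domain_word_sets.items():
        hits = len(set(keywords) & words)
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain


# Hypothetical product domain word sets, prepared in advance.
DOMAIN_WORD_SETS = {
    "electronic devices": {"Huawei", "Xiaomi", "dual-SIM dual-standby", "5 megapixels"},
    "household appliances": {"TV", "television set"},
}

domain = match_domain(["AI", "TV", "full HD", "television set"], DOMAIN_WORD_SETS)
```

Returning `None` when nothing matches leaves the caller free to fall back to clustering over all domains.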
Thereafter, the process ends.
In one example, the above step 105 includes:
1) Based on the extracted keywords, vectorization is performed on the short text to be classified to obtain a vectorized short text.
Vectorization may be performed on the short text to be classified using any of the algorithms used in text classification. For example, a continuous bag-of-words (CBOW) model, the Skip-Gram training algorithm, or a word-embedding vector model may be used to vectorize the target short text and obtain the vectorized short text.
2) The product domain to which the short text to be classified belongs is determined by matching the extracted keywords against the product domain word set. For example, in the above example, the keywords of the short text to be classified, "Baofeng AI TV full HD ultra-thin LCD television set (black)", are "AI", "TV", "full HD", "ultra-thin", "LCD", "television set", and "black". If the household appliance domain word set contains the words "TV" and "television set", it can be determined that the short text to be classified belongs to the television domain.
3) Based on the product domain to which the short text to be classified belongs, clustering is performed on the vectorized short text.
A clustering calculation may be used to analyze the short texts in order to calculate the similarity between the short texts to be classified. Optionally, the clustering algorithm may be a partition-based clustering method, a hierarchical clustering method, a density-based clustering method, a grid-based clustering method, a model-based clustering method, or the like.
By using a predetermined product domain word set, the domain of the product described in a short text can be roughly determined, which effectively improves the efficiency of clustering the short texts.
In one example, performing clustering on the vectorized short text based on the product domain to which the short text to be classified belongs includes:
performing a Mahalanobis-distance soft clustering calculation on the vectorized short text to obtain the similarity of the document set.
4) The short text to be classified is classified according to the result of the clustering. That is, according to the clustering result, target short texts of the same category are treated as one class of text.
In one example, classifying the short text to be classified according to the clustering result includes:
classifying the short text to be classified according to the similarity of the document set.
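The text does not spell out how the Mahalanobis-distance soft clustering turns distances into similarities. The sketch below assumes one common choice: each cluster has a known center, all clusters share one inverse covariance matrix, and the membership degree of a vector in a cluster is proportional to exp(-d²/2), where d is its Mahalanobis distance to that center. The function names, the centers, and the matrix are hypothetical; in practice the candidate centers could be limited to the product domain determined in step 2).

```python
import math


def mahalanobis_sq(x, center, inv_cov):
    """Squared Mahalanobis distance (x - c)^T S^-1 (x - c)."""
    diff = [a - b for a, b in zip(x, center)]
    return sum(
        diff[i] * inv_cov[i][j] * diff[j]
        for i in range(len(diff))
        for j in range(len(diff))
    )


def soft_memberships(x, centers, inv_cov):
    """Membership degree of x in each cluster; the degrees sum to 1."""
    d2 = [mahalanobis_sq(x, c, inv_cov) for c in centers]
    d_min = min(d2)  # shift by the minimum so the exponentials cannot all underflow
    weights = [math.exp(-0.5 * (d - d_min)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def classify(x, centers, inv_cov):
    """Assign x to the cluster with the highest membership degree."""
    memberships = soft_memberships(x, centers, inv_cov)
    return max(range(len(centers)), key=memberships.__getitem__)


# Hypothetical 2-D cluster centers and shared inverse covariance
# (variance 1 along the first axis, 4 along the second).
centers = [[0.0, 0.0], [4.0, 4.0]]
inv_cov = [[1.0, 0.0], [0.0, 0.25]]
label = classify([1.0, 1.0], centers, inv_cov)
```

Unlike a hard assignment, the membership vector keeps a graded similarity to every cluster, so borderline short texts can be inspected or thresholded before a final class is chosen.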
In view of the characteristics of short text data in shopping web pages, the present invention can improve the accuracy of short text classification through the use of denoising and a domain word set.
A second embodiment of the present invention discloses a short text classification apparatus for shopping web pages. Fig. 2 is a structural diagram of the apparatus.
Specifically, as shown in Fig. 2, the short text classification apparatus for shopping web pages includes:
an acquisition unit, configured to obtain a short text to be classified from a shopping web page;
a word segmentation unit, configured to perform word segmentation on the short text to be classified to obtain a first segmented-word set of the short text to be classified;
a denoising unit, configured to perform denoising on the first segmented-word set to obtain a second segmented-word set of the short text to be classified;
an extraction unit, configured to extract keywords of the short text to be classified based on the second segmented-word set;
a classification unit, configured to classify the short text to be classified according to the extracted keywords and a product domain word set.
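One way to picture how the five units cooperate is to model each as a replaceable callable, as in the sketch below. The class name `ShortTextClassifier`, the field names, the URL, and the stub behaviors are all hypothetical and serve only to show the data flow between units.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ShortTextClassifier:
    """Logical units of the apparatus, each modeled as a plain callable."""
    acquire: Callable[[str], str]               # acquisition unit
    segment: Callable[[str], List[str]]         # word segmentation unit
    denoise: Callable[[List[str]], List[str]]   # denoising unit
    extract: Callable[[List[str]], List[str]]   # extraction unit
    classify: Callable[[List[str]], str]        # classification unit

    def run(self, page_url: str) -> str:
        short_text = self.acquire(page_url)
        first_word_set = self.segment(short_text)
        second_word_set = self.denoise(first_word_set)
        keywords = self.extract(second_word_set)
        return self.classify(keywords)


# Stub units, for illustration only.
device = ShortTextClassifier(
    acquire=lambda url: "free shipping ultra-thin LCD television",
    segment=str.split,
    denoise=lambda words: [w for w in words if w not in {"free", "shipping"}],
    extract=lambda words: words[-2:],
    classify=lambda kws: "household appliances" if "television" in kws else "unknown",
)
result = device.run("https://example.com/item")
```

Because each unit is just a callable, a physical implementation can swap any of them (for example, a different segmenter) without touching the others.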
In one example, the denoising unit includes:
a calculation subunit, configured to calculate the importance of each word in the first segmented-word set using the term frequency-inverse document frequency (TF-IDF) algorithm;
a selection subunit, configured to select noise words, using the document frequency (DF) algorithm, from the words whose importance is lower than a predetermined importance;
a deletion subunit, configured to delete the noise words from the first segmented-word set to obtain the second segmented-word set.
In another example, the classification unit includes:
a vectorization subunit, configured to determine the product domain to which the short text to be classified belongs by matching the extracted keywords against the product domain word set;
a clustering subunit, configured to perform clustering on the vectorized short text based on the product domain to which the short text to be classified belongs;
a classification subunit, configured to classify the short text to be classified according to the result of the clustering.
In another example, the clustering subunit performs the clustering as follows:
performing a Mahalanobis-distance soft clustering calculation on the vectorized short text to obtain the similarity of the document set,
and classifying the short text to be classified according to the clustering result includes:
classifying the short text to be classified according to the similarity of the document set.
The first embodiment is the method embodiment corresponding to this embodiment, and this embodiment can be implemented in cooperation with the first embodiment. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the first embodiment.
In view of the characteristics of short text data in shopping web pages, the accuracy of short text classification can be improved through the use of denoising and a domain word set.
A third embodiment of the present invention discloses a non-volatile computer storage medium encoded with a computer program, wherein the computer program comprises instructions which, when executed by one or more computers, cause the one or more computers to perform operations comprising:
obtaining a short text to be classified from a shopping web page;
performing word segmentation on the short text to be classified to obtain a first segmented-word set of the short text to be classified;
performing denoising on the first segmented-word set to obtain a second segmented-word set of the short text to be classified;
extracting keywords of the short text to be classified based on the second segmented-word set;
classifying the short text to be classified according to the extracted keywords and a product domain word set.
In view of the characteristics of short text data in shopping web pages, the accuracy of short text classification can be improved through the use of denoising and a domain word set.
The first embodiment is the method embodiment corresponding to this embodiment, and this embodiment can be implemented in cooperation with the first embodiment. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the first embodiment.
A fourth embodiment of the present invention discloses a device comprising a memory storing computer-executable instructions and a processor, the processor being configured to execute a short text classification method for shopping web pages, wherein the short text classification method for shopping web pages comprises:
obtaining a short text to be classified from a shopping web page;
performing word segmentation on the short text to be classified to obtain a first segmented-word set of the short text to be classified;
performing denoising on the first segmented-word set to obtain a second segmented-word set of the short text to be classified;
extracting keywords of the short text to be classified based on the second segmented-word set;
classifying the short text to be classified according to the extracted keywords and a product domain word set.
In view of the characteristics of short text data in shopping web pages, the accuracy of short text classification can be improved through the use of denoising and a domain word set.
The first embodiment is the method embodiment corresponding to this embodiment, and this embodiment can be implemented in cooperation with the first embodiment. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the first embodiment.
Each method embodiment of the present invention can be implemented in software, hardware, firmware, and so on. Regardless of whether the present invention is implemented in software, hardware, or firmware, the instruction code can be stored in any type of computer-accessible memory (for example, permanent or modifiable, volatile or non-volatile, solid-state or non-solid-state, fixed or replaceable media, and so on). Likewise, the memory may be, for example, programmable array logic (Programmable Array Logic, "PAL"), random access memory (Random Access Memory, "RAM"), programmable read-only memory (Programmable Read Only Memory, "PROM"), read-only memory (Read-Only Memory, "ROM"), electrically erasable programmable read-only memory (Electrically Erasable Programmable ROM, "EEPROM"), a magnetic disk, an optical disc, a digital versatile disc (Digital Versatile Disc, "DVD"), and so on.
It should be noted that each unit mentioned in the device embodiments of the present invention is a logical unit. Physically, a logical unit may be a single physical unit, part of a physical unit, or a combination of multiple physical units. The physical implementation of these logical units is not in itself the most important point; rather, the combination of functions they implement is the key to solving the technical problem raised by the present invention. In addition, in order to highlight the innovative part of the present invention, the above device embodiments do not introduce units that are less closely related to solving the technical problem raised by the present invention; this does not mean that no other units exist in the above device embodiments.
It should be noted that, in the claims and specification of this patent, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise", and any other variants thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a" does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Although the present invention has been shown and described with reference to certain preferred embodiments thereof, those of ordinary skill in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention.