Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is that existing approaches rely on a single algorithm and therefore cannot combine multiple scanning results to perform an accurate, comprehensive analysis.
(2) Technical solution
To solve the above technical problem, the present invention provides a recognition method that improves the accuracy of file keyword recognition based on multiple algorithms. The recognition method is implemented on a recognition system, and the recognition system comprises: an original text input module, a text preprocessing module, a Chinese keyword extraction module based on the disjunctive model, a Chinese keyword extraction module based on high-dimensional clustering technology, a semantic-based Chinese keyword extraction module, a Chinese keyword extraction module based on the naive Bayes model, an algorithm weight ratio allocation module, and a keyword recognition result generation module. Specifically,
the recognition method comprises the following steps:
Step 1: the original text on which keyword recognition is to be performed is input through the original text input module;
Step 2: the text preprocessing module performs text format conversion preprocessing on the original text input through the original text input module, forming candidate words for processing by the subsequent recognition algorithms;
Step 3: the Chinese keyword extraction module based on the disjunctive model performs, based on the disjunctive model, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on the disjunctive model, and obtains keyword quantity extraction information;
Step 4: the Chinese keyword extraction module based on high-dimensional clustering technology performs, based on high-dimensional clustering technology, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on high-dimensional clustering technology, and obtains keyword quantity extraction information;
Step 5: the semantic-based Chinese keyword extraction module applies the semantic-based Chinese text keyword extraction algorithm to the candidate words from the text preprocessing module, performs keyword extraction and keyword-string extraction, generates the semantic-based calculation result, and obtains keyword quantity extraction information;
Step 6: the Chinese keyword extraction module based on the naive Bayes model performs, based on the naive Bayes model, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on the naive Bayes model, and obtains keyword quantity extraction information;
Step 7: the algorithm weight ratio allocation module configures the weight ratio that the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the process of computing and generating the final keyword result;
Step 8: the keyword recognition result generation module compares the keyword hit counts in the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model, and, according to the preconfigured weight ratio, comprehensively calculates the final keyword recognition result.
Wherein the Chinese keyword extraction module based on the disjunctive model uses a Chinese keyword extraction algorithm based on the disjunctive model, which treats keyword recognition as a classification problem and classifies each candidate keyword in the text as a keyword or a non-keyword.
Wherein the disjunctive model builds separate models for keywords and keyword strings, and in selecting keyword features, each separately built model uses different features.
Wherein the Chinese keyword extraction module based on high-dimensional clustering technology extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.
Wherein the semantic-based Chinese keyword extraction module incorporates the semantic features of words into the keyword extraction process, constructs a semantic similarity network, and uses betweenness density to measure the semantic criticality of words.
Wherein the Chinese keyword extraction module based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in a testing process.
Wherein the algorithm weight ratio allocation module sets, in the ratio 2:3:4:3, the weight ratio that the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the process of computing and generating the final keyword result.
Wherein the 2:3:4:3 weight ratio is the default configuration.
Wherein the weight ratio can be configured freely according to the specific application scenario.
Wherein the formats of the original text include Word format and PDF format.
(3) Beneficial effects
Compared with the prior art, the present invention combines the Chinese keyword extraction algorithm based on the disjunctive model, the Chinese keyword extraction algorithm based on high-dimensional clustering technology, the semantic-based Chinese text keyword extraction algorithm, and the Chinese keyword extraction algorithm based on the naive Bayes model, and makes a comprehensive matching judgment across them to improve the accuracy of keyword extraction and recognition.
The keyword hit counts of the algorithms are compared. By default, the weight ratio 2:3:4:3 assigned to the algorithms is used to calculate the recognition result, and the weights can be configured freely according to the specific application scenario. The hit counts are weighted by each algorithm's weight ratio, and the weighted calculation is taken as the final result.
In this way, in the technical field of keyword retrieval, the accuracy of file keyword recognition is improved through a recognition method based on multiple algorithms.
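The text does not give an explicit formula for the weighted calculation. One possible reading, offered only as an assumption, is a weighted sum of hit counts in which the ratio is normalised by its total. Here h_i(k) denotes the hit count of keyword k in the result of algorithm i, and w_i denotes that algorithm's weight; both symbols are introduced for illustration.

```latex
S(k) = \sum_{i=1}^{4} \frac{w_i}{\sum_{j=1}^{4} w_j}\, h_i(k),
\qquad (w_1, w_2, w_3, w_4) = (2, 3, 4, 3)
```

Under this reading the default ratio sums to 12, so the four algorithms contribute normalised shares of 1/6, 1/4, 1/3, and 1/4 respectively.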
Specific embodiment
To make the purpose, content, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples.
To solve the above technical problem, the present invention provides a recognition method that improves the accuracy of file keyword recognition based on multiple algorithms. The recognition method is implemented on a recognition system, and the recognition system comprises: an original text input module, a text preprocessing module, a Chinese keyword extraction module based on the disjunctive model, a Chinese keyword extraction module based on high-dimensional clustering technology, a semantic-based Chinese keyword extraction module, a Chinese keyword extraction module based on the naive Bayes model, an algorithm weight ratio allocation module, and a keyword recognition result generation module. Specifically,
the recognition method comprises the following steps:
Step 1: the original text on which keyword recognition is to be performed is input through the original text input module;
Step 2: the text preprocessing module performs text format conversion preprocessing on the original text input through the original text input module, forming candidate words for processing by the subsequent recognition algorithms;
Step 3: the Chinese keyword extraction module based on the disjunctive model performs, based on the disjunctive model, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on the disjunctive model, and obtains keyword quantity extraction information;
Step 4: the Chinese keyword extraction module based on high-dimensional clustering technology performs, based on high-dimensional clustering technology, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on high-dimensional clustering technology, and obtains keyword quantity extraction information;
Step 5: the semantic-based Chinese keyword extraction module applies the semantic-based Chinese text keyword extraction algorithm to the candidate words from the text preprocessing module, performs keyword extraction and keyword-string extraction, generates the semantic-based calculation result, and obtains keyword quantity extraction information;
Step 6: the Chinese keyword extraction module based on the naive Bayes model performs, based on the naive Bayes model, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on the naive Bayes model, and obtains keyword quantity extraction information;
Step 7: the algorithm weight ratio allocation module configures the weight ratio that the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the process of computing and generating the final keyword result;
Step 8: the keyword recognition result generation module compares the keyword hit counts in the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model, and, according to the preconfigured weight ratio, comprehensively calculates the final keyword recognition result.
Wherein the Chinese keyword extraction module based on the disjunctive model uses a Chinese keyword extraction algorithm based on the disjunctive model, which treats keyword recognition as a classification problem and classifies each candidate keyword in the text as a keyword or a non-keyword.
Wherein the disjunctive model builds separate models for keywords and keyword strings, and in selecting keyword features, each separately built model uses different features.
Wherein the Chinese keyword extraction module based on high-dimensional clustering technology extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.
Wherein the semantic-based Chinese keyword extraction module incorporates the semantic features of words into the keyword extraction process, constructs a semantic similarity network, and uses betweenness density to measure the semantic criticality of words.
Wherein the Chinese keyword extraction module based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in a testing process.
Wherein the algorithm weight ratio allocation module sets, in the ratio 2:3:4:3, the weight ratio that the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the process of computing and generating the final keyword result.
Wherein the 2:3:4:3 weight ratio is the default configuration.
Wherein the weight ratio can be configured freely according to the specific application scenario.
Wherein the formats of the original text include Word format and PDF format.
In addition, the present invention also provides a recognition system that improves the accuracy of file keyword recognition based on multiple algorithms. As shown in Fig. 1, the recognition system comprises:
an original text input module, used to input the original text on which keyword recognition is to be performed;
a text preprocessing module, used to perform text format conversion preprocessing on the original text input through the original text input module, forming candidate words for processing by the subsequent recognition algorithms;
a Chinese keyword extraction module based on the disjunctive model, used to perform, based on the disjunctive model, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generate the calculation result based on the disjunctive model, and obtain keyword quantity extraction information;
a Chinese keyword extraction module based on high-dimensional clustering technology, used to perform, based on high-dimensional clustering technology, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generate the calculation result based on high-dimensional clustering technology, and obtain keyword quantity extraction information;
a semantic-based Chinese keyword extraction module, used to apply the semantic-based Chinese text keyword extraction (SKE) algorithm to the candidate words from the text preprocessing module, perform keyword extraction and keyword-string extraction, generate the semantic-based calculation result, and obtain keyword quantity extraction information;
a Chinese keyword extraction module based on the naive Bayes model, used to perform, based on the naive Bayes model, keyword extraction and keyword-string extraction on the candidate words from the text preprocessing module, generate the calculation result based on the naive Bayes model, and obtain keyword quantity extraction information;
an algorithm weight ratio allocation module, used to configure, according to the specific application scenario, the weight ratio that the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the process of computing and generating the final keyword result;
a keyword recognition result generation module, used to compare the keyword hit counts in the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model, and, according to the preconfigured weight ratio, comprehensively calculate the final keyword recognition result.
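The module arrangement above can be sketched as a small pipeline in which every extraction module receives the same candidate words and returns its own keyword hit counts. This is a minimal illustration, not the patented implementation: the names `preprocess`, `run_extractors`, `long_words`, and `repeated_words` are hypothetical, the two toy extractors merely stand in for the four algorithms, and whitespace splitting stands in for real preprocessing (Chinese text would need word segmentation).

```python
from collections import Counter
from typing import Callable, Dict, List

# An extractor maps candidate words to {keyword: hit count}.
Extractor = Callable[[List[str]], Dict[str, int]]


def preprocess(raw_text: str) -> List[str]:
    """Stand-in for the text preprocessing module: whitespace split only."""
    return raw_text.split()


def run_extractors(raw_text: str,
                   extractors: Dict[str, Extractor]) -> Dict[str, Dict[str, int]]:
    """Feed the same candidate words to every extractor and collect the results."""
    candidates = preprocess(raw_text)
    return {name: extract(candidates) for name, extract in extractors.items()}


def long_words(candidates: List[str]) -> Dict[str, int]:
    """Toy extractor: counts words longer than five characters."""
    return dict(Counter(w for w in candidates if len(w) > 5))


def repeated_words(candidates: List[str]) -> Dict[str, int]:
    """Toy extractor: keeps words that occur more than once."""
    return {w: n for w, n in Counter(candidates).items() if n > 1}


results = run_extractors(
    "keyword extraction improves keyword search",
    {"long": long_words, "repeated": repeated_words},
)
```

Keeping each extractor behind the same call signature means the weight allocation and result generation modules only ever see per-algorithm hit counts, regardless of how each algorithm works internally.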
Wherein the Chinese keyword extraction module based on the disjunctive model uses a Chinese keyword extraction algorithm based on the disjunctive model, which treats keyword recognition as a classification problem and classifies each candidate keyword in the text as a keyword or a non-keyword;
Wherein the disjunctive model builds separate models for keywords and keyword strings, and in selecting keyword features, each separately built model uses different features.
Keyword extraction and keyword-string extraction use different features to improve extraction accuracy. This algorithm is the most commonly used keyword recognition algorithm, and its calculation result accounts for 2/10 of the result computation.
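The idea of separate models with separate features can be illustrated as follows. The text does not name the features or the classifier, so everything here is an assumption made for illustration: the features (term frequency, first position, string length), the hand-set thresholds, and the function names are all hypothetical, and a real system would learn the decision rule from labelled data.

```python
def word_features(candidate, tokens):
    """Features for a single-word candidate (hypothetical choice)."""
    return {
        "tf": tokens.count(candidate) / len(tokens),
        "first_pos": tokens.index(candidate) / len(tokens),
    }


def string_features(candidate, tokens):
    """Features for a multi-word candidate given as a tuple (hypothetical choice)."""
    n = len(candidate)
    windows = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {
        "tf": windows.count(candidate) / max(len(windows), 1),
        "length": n,
    }


def classify(candidate, tokens):
    """Route the candidate to its own model; thresholds are illustrative only."""
    if isinstance(candidate, tuple):
        f = string_features(candidate, tokens)
        return f["tf"] >= 0.2 and f["length"] >= 2
    f = word_features(candidate, tokens)
    return f["tf"] >= 0.1 and f["first_pos"] <= 0.5


tokens = "search engine index search engine query cache search".split()
```

Each call to `classify` returns True for "keyword" and False for "non-keyword", which mirrors the two-class framing described above.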
Wherein the Chinese keyword extraction module based on high-dimensional clustering technology addresses the low accuracy of keyword extraction methods based on statistical information by using a Chinese keyword extraction algorithm based on high-dimensional clustering technology, which extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.
Theoretical analysis and experiments show that the Chinese keyword extraction method based on high-dimensional clustering technology offers better stability, higher efficiency, and more accurate results. The algorithm is fast and highly accurate, and its calculation result accounts for 3/10 of the result computation.
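The text does not specify how the fast word segmentation based on a small dictionary works. As a stand-in, the sketch below uses forward maximum matching, a common dictionary-based segmentation technique; the function name, the `max_len` parameter, and the sample dictionary are assumptions. The later steps (secondary segmentation, clustering, selection) are not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary entry
    starting at the current position, or a single character if none matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


small_dictionary = {"关键词", "提取", "算法"}
segments = forward_max_match("关键词提取算法", small_dictionary)
```

Characters that are not covered by the dictionary fall out as single-character tokens, which is the kind of residue a secondary segmentation pass would then need to handle.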
Wherein the semantic-based Chinese keyword extraction module uses the semantic-based Chinese text keyword extraction (SKE) algorithm, which incorporates the semantic features of words into the keyword extraction process, constructs a semantic similarity network, and uses betweenness density to measure the semantic criticality of words.
Compared with keyword extraction algorithms based on statistical features, the SKE algorithm extracts keywords with better performance. The algorithm recognizes keywords with high accuracy, and its calculation result accounts for 4/10 of the result computation.
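To make the network-based measure concrete, the sketch below builds an undirected graph from pairwise word similarities and computes plain betweenness centrality with Brandes' algorithm. This is a simplification: the text refers to betweenness density, whose exact definition is not given here, so ordinary betweenness centrality is substituted. The similarity values, the threshold, and the function names are assumptions.

```python
from collections import deque


def build_graph(words, similarity, threshold=0.5):
    """Link two words when their similarity reaches the threshold."""
    graph = {w: set() for w in words}
    for (a, b), sim in similarity.items():
        if sim >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        dist = {v: -1 for v in graph}   # -1 marks "not reached yet"
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}


words = ["search", "retrieval", "index", "cache"]
similarity = {("search", "retrieval"): 0.9,
              ("retrieval", "index"): 0.7,
              ("index", "cache"): 0.2}
centrality = betweenness(build_graph(words, similarity))
```

In this toy network "retrieval" is the only word lying on a shortest path between two others, so it receives the highest score and would be the strongest keyword candidate under this measure.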
Wherein the Chinese keyword extraction module based on the naive Bayes model uses a Chinese keyword extraction algorithm based on the naive Bayes model, which first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in a testing process. Experiments show that, relative to traditional methods, the algorithm can extract more accurate keywords from small-scale document sets and can flexibly add feature items that characterize word importance, giving it better scalability. The algorithm recognizes keywords in small documents with high accuracy, and its calculation result accounts for 3/10 of the result computation.
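The train-then-test flow described above can be sketched with a small Bernoulli naive Bayes classifier. The feature names (`in_title`, `high_tf`), the training samples, and the function names are hypothetical; the text does not say which features or which naive Bayes variant are used, and Laplace smoothing is added here as an assumption to avoid zero probabilities.

```python
import math
from collections import defaultdict


def train(samples):
    """samples: list of (features, is_keyword), features being {name: bool}."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    for feats, label in samples:
        class_count[label] += 1
        for name, value in feats.items():
            if value:
                feat_count[label][name] += 1
    names = {name for feats, _ in samples for name in feats}
    return class_count, feat_count, names


def predict(model, feats):
    """Return the label with the highest log posterior."""
    class_count, feat_count, names = model
    total = sum(class_count.values())
    best, best_lp = None, -math.inf
    for label, n in class_count.items():
        lp = math.log(n / total)
        for name in names:
            p = (feat_count[label][name] + 1) / (n + 2)  # Laplace smoothing
            lp += math.log(p if feats.get(name, False) else 1 - p)
        if lp > best_lp:
            best, best_lp = label, lp
    return best


training = [
    ({"in_title": True, "high_tf": True}, True),
    ({"in_title": True, "high_tf": False}, True),
    ({"in_title": False, "high_tf": False}, False),
    ({"in_title": False, "high_tf": True}, False),
    ({"in_title": False, "high_tf": False}, False),
]
model = train(training)
```

Adding another feature item only requires adding a key to the feature dictionaries, which reflects the flexibility the text attributes to this approach.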
Wherein the algorithm weight ratio allocation module sets, in the ratio 2:3:4:3, the weight ratio that the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering technology, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the process of computing and generating the final keyword result.
Wherein the 2:3:4:3 weight ratio is the default configuration.
Wherein the weight ratio can be configured freely according to the specific application scenario.
Wherein the formats of the original text include Word format and PDF format.
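Because the weight ratio is configurable, an implementation might read it from a setting such as the string "2:3:4:3". The helper below is a hypothetical sketch of that step: the function name, the string format, and the choice to normalise the ratio into shares that sum to 1 are all assumptions rather than details given in the text.

```python
def parse_weight_ratio(text, expected=4):
    """Turn a ratio string such as "2:3:4:3" into normalised shares."""
    parts = [float(p) for p in text.split(":")]
    if len(parts) != expected or any(p <= 0 for p in parts):
        raise ValueError(f"expected {expected} positive weights, got {text!r}")
    total = sum(parts)
    return [p / total for p in parts]


default_shares = parse_weight_ratio("2:3:4:3")
```

Validating the number of entries up front keeps a misconfigured ratio from silently dropping or misaligning one of the four algorithms.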
Embodiment 1
This embodiment provides a method that improves the accuracy of file keyword recognition based on multiple algorithms. The file is parsed for keywords using the Chinese keyword extraction algorithm based on the disjunctive model, the Chinese keyword extraction algorithm based on high-dimensional clustering technology, the semantic-based Chinese text keyword extraction (SKE) algorithm, and the Chinese keyword extraction algorithm based on the naive Bayes model, and accuracy is improved through a weighted judgment.
Wherein the Chinese keyword extraction algorithm based on the disjunctive model covers keyword extraction and keyword-string extraction; for these two problems, the algorithm designs different features to improve extraction accuracy.
Wherein the Chinese keyword extraction algorithm based on high-dimensional clustering technology addresses the low accuracy of keyword extraction methods based on statistical information. The algorithm extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection. Theoretical analysis and experiments show that the Chinese keyword extraction method based on high-dimensional clustering technology offers better stability, higher efficiency, and more accurate results.
Wherein the semantic-based Chinese text keyword extraction (SKE) algorithm incorporates the semantic features of words into the keyword extraction process, constructs a semantic similarity network, and uses betweenness density to measure the semantic criticality of words. Compared with keyword extraction algorithms based on statistical features, the SKE algorithm extracts keywords with better performance.
Wherein the Chinese keyword extraction algorithm based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in a testing process. Experiments show that, relative to the traditional tf*idf method, the algorithm can extract more accurate keywords from small-scale document sets and can flexibly add feature items that characterize word importance, giving it better scalability.
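For reference, the traditional baseline mentioned in the comparison can be sketched in a few lines. This is a textbook form of tf*idf (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the function name and the sample documents are illustrative, and the text does not state which tf*idf variant the experiments used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores


docs = [["cache", "index", "cache"], ["index", "query"]]
scores = tf_idf(docs)
```

With only two documents, any term shared by both scores zero, which hints at why a purely statistical measure can struggle on small document sets.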
Each algorithm extracts keywords so as to accurately obtain keyword quantity extraction information for the file or folder. The keyword hit counts of the algorithms are compared. By default, the weight ratio 2:3:4:3 assigned to the algorithms is used to calculate the recognition result, and the weights can be configured freely according to the specific application scenario. The hit counts are weighted by each algorithm's weight ratio, and the weighted calculation is taken as the final result.
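The weighted combination of hit counts might look like the following sketch. It assumes that each algorithm reports a hit count per keyword and that the final score is the sum of those counts weighted by the normalised 2:3:4:3 ratio; the exact calculation is not spelled out in the text, and the dictionary keys, the function name, and the `top_n` parameter are hypothetical.

```python
from collections import defaultdict

# Default 2:3:4:3 ratio from the text; the key names are illustrative labels.
DEFAULT_WEIGHTS = {
    "disjunctive": 2,
    "clustering": 3,
    "semantic": 4,
    "naive_bayes": 3,
}


def combine_results(hits, weights=DEFAULT_WEIGHTS, top_n=5):
    """hits maps algorithm name -> {keyword: hit count}.
    Returns up to top_n (keyword, score) pairs, highest score first."""
    total = sum(weights.values())
    scores = defaultdict(float)
    for algo, counts in hits.items():
        share = weights[algo] / total
        for keyword, n in counts.items():
            scores[keyword] += share * n
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


hits = {
    "disjunctive": {"network": 2, "cache": 1},
    "clustering": {"network": 1},
    "semantic": {"cache": 3},
    "naive_bayes": {"network": 1, "index": 1},
}
ranking = combine_results(hits)
```

In this example "cache" ranks first even though fewer algorithms found it, because the semantic algorithm carries the largest weight. Passing a different `weights` dictionary corresponds to configuring the ratio for a specific application scenario.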
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the technical principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.