Summary of the invention
In view of this, the main purpose of the present invention is to provide a method for classifying and retrieving files, so as to solve the above-mentioned problems in the prior art.
The invention provides a file classification method, comprising the following steps:
A. Add an index to an obtained file, and perform word segmentation on the content of the file;
B. Use the words in the segmentation result that reach a set liveness as labels of the file;
C. Create as many folders as the file has labels, and assign the labels to the folders respectively, to be used as attributes of the folders;
D. Create as many shortcuts to the file as the file has labels, each shortcut being associated with the physical storage address of the file through the file index, and identify the shortcuts of the file with the respective labels;
E. Place each shortcut of the file into the folder corresponding to its label.
As can be seen from the above, using the segmented words in the file content that meet a certain "liveness" standard as folder labels allows files to be classified according to different application requirements. Moreover, because each file is stored at only one physical address, no redundancy of file data is caused. In addition, each folder holds only file shortcuts bearing a single label, so file lookup speed can be greatly improved.
The above method further comprises:
F. When the content of the file changes, perform word segmentation on the content of the file again and then return to step B, so as to redetermine the labels of the file and update the folders and shortcuts corresponding to the file.
As can be seen from the above, the labels of a file can be updated as the file content changes, so that the number of corresponding folders and file shortcuts can be increased or decreased in real time.
In the above method, the word segmentation method is a mechanical word segmentation method.
In the above method, the liveness is the usage frequency of a word.
In the above method, the liveness is a measure of whether the frequency of occurrence of a word is greater than a certain value.
The present invention also provides a file retrieval method based on the file classification method described in any of the above, characterized by comprising the following steps:
Search the attributes of the folders according to the input keyword: if there is a hit, return the file shortcuts in that folder;
Otherwise, search the shortcuts of the file according to the input keyword: if there is a hit, return the corresponding file shortcut; otherwise, exit the retrieval.
Embodiment
The file classification and retrieval method provided by the present invention is described in detail below with reference to the accompanying drawings.
As shown in Figure 1, the file classification method comprises the following steps:
Step 100: Add an index to an obtained file, and perform word segmentation on the content of the file.
In the present embodiment, a conventional segmentation method may be used to segment the content of the file, for example a dictionary-based segmentation method (also called mechanical word segmentation or dictionary matching).
This method matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds and a word is identified. According to the scanning direction, the method is divided into forward matching and reverse matching. According to which length is matched preferentially, it is divided into maximum (longest) matching and minimum (shortest) matching. According to whether it is combined with part-of-speech tagging, it can further be divided into pure segmentation methods and integrated methods that combine segmentation with tagging. Commonly used methods include forward maximum matching, reverse maximum matching, minimum segmentation (which minimizes the number of words cut out of each sentence), and bidirectional matching. Because these segmentation methods are in common use, they are not described further here.
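The forward maximum matching strategy named above can be sketched as follows. This is a minimal illustration, not the segmenter of the invention: the function name `forward_max_match`, the `max_word_len` parameter, and the four-entry dictionary are all assumptions made for the example, and a practical machine dictionary would be far larger.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment `text` left to right, preferring the longest dictionary entry."""
    if max_word_len is None:
        # Assumption: the longest dictionary entry bounds the match window.
        max_word_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink by one character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"世界杯", "世界", "巴西", "夺冠"}
segments = forward_max_match("巴西世界杯夺冠", toy_dictionary)
print(segments)  # ['巴西', '世界杯', '夺冠']
```

Because the longest candidate is tried first, "世界杯" is preferred over the shorter entry "世界". Reverse maximum matching would scan from the end of the string instead.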
Step 200: Use the words with higher "liveness" in the segmentation result as labels of the file.
In the present embodiment, the "liveness" of a word may mean the usage frequency of the word (the number of times it occurs in the file). Multiple words may be selected in order of "liveness" from high to low, each serving as one of the labels of the file. For example, if the three words "world cup", "Brazil", and "winning the championship" rank in the top three by frequency of occurrence among all words in the file content, these three words are each used as a label of the file. Alternatively, words whose frequency of occurrence is greater than a certain value are used as labels having high "liveness".
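Both selection rules described above (the top-ranked words, or the words whose count exceeds a value) can be sketched with a frequency counter. The function name `select_labels` and its parameters `top_n` and `min_count` are hypothetical, and the word counts in the example are invented for illustration.

```python
from collections import Counter


def select_labels(words, top_n=None, min_count=None):
    """Pick labels from segmented words by frequency of occurrence.

    top_n:     keep only the N most frequent words (ranking rule).
    min_count: keep only words occurring more than this many times
               (threshold rule).
    With neither set, every distinct word is returned, most frequent first.
    """
    ranked = Counter(words).most_common()  # sorted by count, descending
    if min_count is not None:
        ranked = [(word, count) for word, count in ranked if count > min_count]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [word for word, _ in ranked]


# Invented sample of segmented words, for illustration only.
sample = (["world cup"] * 5 + ["Brazil"] * 4
          + ["winning the championship"] * 3 + ["match"])
print(select_labels(sample, top_n=3))
print(select_labels(sample, min_count=3))
```

In this sample the ranking rule yields the three labels used in the example above, while the threshold rule with a value of 3 keeps only the two words that occur more than three times.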
Step 300: Create as many folders as the file has labels, and assign each label to one of the newly created folders as its folder attribute.
Taking the above labels as an example, three folders are created whose attributes are "world cup", "Brazil", and "winning the championship" respectively.
Step 400: According to the number of labels, create shortcuts to the file, each associated with the physical storage address of the file through the file index, and identify each shortcut of the file with one of the labels.
For example, if the file has three labels, three shortcuts are created. By adding a logical association mark to each file shortcut, a logical association is established between the file shortcut and the file index. In this example, the three shortcuts of the file are identified by the labels "world cup", "Brazil", and "winning the championship" respectively.
Step 500: Place each shortcut of the file into the folder corresponding to its label.
For example, the file shortcut labeled "world cup" is stored in the folder whose attribute is "world cup", the file shortcut labeled "Brazil" is stored in the folder whose attribute is "Brazil", and so on.
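Steps 100 and 300 to 500 can be modeled in memory as follows. This is a sketch under stated assumptions rather than the implementation of the invention: the names `Shortcut`, `Catalog`, `add_file`, and `resolve`, the integer `file_id`, and the path `/data/report.txt` are all hypothetical, and reusing a folder that already carries the same attribute is a choice made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Shortcut:
    label: str    # the label that identifies this shortcut
    file_id: int  # key into the file index; no file data is copied


@dataclass
class Catalog:
    index: dict = field(default_factory=dict)    # file_id -> physical storage address
    folders: dict = field(default_factory=dict)  # folder attribute -> list of shortcuts

    def add_file(self, file_id, physical_path, labels):
        self.index[file_id] = physical_path              # step 100: index the file
        for label in labels:
            folder = self.folders.setdefault(label, [])  # step 300: one folder per label
            shortcut = Shortcut(label, file_id)          # step 400: one shortcut per label
            if shortcut not in folder:
                folder.append(shortcut)                  # step 500: store under its label

    def resolve(self, shortcut):
        """Follow a shortcut through the index to the physical address."""
        return self.index[shortcut.file_id]


catalog = Catalog()
catalog.add_file(1, "/data/report.txt",
                 ["world cup", "Brazil", "winning the championship"])
for attribute, shortcuts in catalog.folders.items():
    print(attribute, [catalog.resolve(s) for s in shortcuts])
```

All three shortcuts resolve to the same physical address, which reflects the point that the file itself is stored only once.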
Step 600: When the file content changes, perform word segmentation on the content of the file again and then return to step 200 to redetermine the file labels, so as to update the folders and file shortcuts corresponding to the file in real time.
In the present embodiment, if, after the file content changes, the word "Italy" also satisfies the condition for being a label (it has high "liveness"), "Italy" is taken as a new label of the file: a new folder whose attribute is "Italy" and a file shortcut identified by the label "Italy" are created, and the shortcut is placed into the new folder. Labels that have not changed are left untouched. The specific steps are the same as above and are not repeated. In this way, the file classification is updated in real time as the file content changes.
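The update in step 600 amounts to comparing the old and new label sets and touching only the difference. The sketch below assumes a simplified model in which each folder is a set of file identifiers standing in for shortcuts; the function name `update_labels` is hypothetical, and deleting a folder once it becomes empty is an assumption, since the text states only that the number of folders and shortcuts can decrease.

```python
def update_labels(folders, file_id, new_labels):
    """Bring the folders into line with a file's redetermined labels.

    `folders` maps a folder attribute (label) to the set of file ids
    that have a shortcut in that folder. Returns (added, removed).
    """
    old_labels = {label for label, ids in folders.items() if file_id in ids}
    new_labels = set(new_labels)
    added = new_labels - old_labels
    removed = old_labels - new_labels
    for label in added:
        # New label: create the folder if needed and place a shortcut in it.
        folders.setdefault(label, set()).add(file_id)
    for label in removed:
        # Dropped label: remove the shortcut.
        folders[label].discard(file_id)
        if not folders[label]:
            del folders[label]  # assumption: empty folders are deleted
    return added, removed


folders = {"world cup": {1}, "Brazil": {1}, "winning the championship": {1}}
added, removed = update_labels(
    folders, 1, ["world cup", "Brazil", "winning the championship", "Italy"])
print(added, removed)  # {'Italy'} set()
```

Labels present in both sets are never visited, which matches the statement that unchanged labels are left untouched.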
Based on the above file classification method, the present invention also provides a file retrieval method, which comprises the following steps:
First, search the attributes of the above folders according to the input keyword. If there is a hit (that is, a folder has an attribute corresponding to the keyword), return the file shortcuts in that folder. Otherwise, search the above file shortcuts according to the input keyword: if there is a hit, return the corresponding file shortcut; otherwise, exit the retrieval.
In the above retrieval process, different search conditions may correspond to different folders (having different folder attributes). Although different folders hold different file shortcuts, these shortcuts may all point to the same file (for the reason, see above). Thus, when retrieving a file, only the folders or file shortcuts need to be searched to find the physical address of the corresponding file, and the physical addresses of files need not be searched directly. This greatly saves retrieval time and improves retrieval efficiency.
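The two-stage lookup can be sketched as follows. The text does not specify what the keyword is compared against in the second stage, so this sketch assumes that each shortcut carries a name and that a substring match on that name counts as a hit; the names `Shortcut`, `retrieve`, `name`, and `target`, and the sample paths, are hypothetical.

```python
from typing import NamedTuple


class Shortcut(NamedTuple):
    name: str    # assumption: a display name, e.g. derived from the file name
    label: str   # the label identifying this shortcut
    target: str  # physical storage address reached through the file index


def retrieve(keyword, folders):
    """Return matching shortcuts; an empty list means retrieval exits."""
    # Stage 1: match the keyword against folder attributes.
    if keyword in folders:
        return list(folders[keyword])
    # Stage 2: fall back to matching individual shortcuts (by name here).
    return [shortcut
            for shortcuts in folders.values()
            for shortcut in shortcuts
            if keyword in shortcut.name]


folders = {
    "world cup": [Shortcut("final_report", "world cup", "/data/final_report.txt")],
    "Brazil": [Shortcut("final_report", "Brazil", "/data/final_report.txt")],
}
print(retrieve("Brazil", folders))   # stage 1 hit on a folder attribute
print(retrieve("final", folders))    # stage 2 hit on shortcut names
print(retrieve("Italy", folders))    # no hit: []
```

In this sketch the second stage returns one shortcut per folder even though both point to the same physical address; a caller that wants distinct files could deduplicate on `target`.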
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.