Chinese and English thesis data classification and query method
Technical Field
The invention belongs to the technical field of data classification, and particularly relates to a Chinese and English thesis data classification and query method.
Background
A knowledge base is a collection that stores, organizes and processes knowledge and provides knowledge services. It helps users understand and discover the current state of research and the development trends in a given field, and building knowledge bases in various industries is gradually becoming the foundation of knowledge service management. Because English is the international common language, there are countless excellent English papers, so a knowledge base needs to include both Chinese and English papers when it is built.
Building a knowledge base involves two important steps. One is classifying the papers, that is, determining which field each paper belongs to, which amounts to labeling the papers. The other is querying the papers. The same holds for a knowledge base that fuses Chinese and English papers.
Labeling a paper means determining which field it belongs to, which in turn allows the current research trends of a field to be analyzed. Labeling first requires word segmentation of the text of the paper, but a conventional word segmenter often fails to produce the desired result on papers. For example, when segmenting the Chinese sentence meaning "machine learning is a popular technology", the desired segment is the single term "machine learning", but a conventional word segmenter splits it into "machine" and "learning", which is not the desired result. The same problem arises when segmenting English. In addition, when Chinese and English papers are fused, accurately identifying Chinese and English papers of the same type is a further problem.
Chinese and English papers also raise the problem of cross-language query: when the Chinese term for "machine learning" is queried, papers containing the English term "Machine Learning" must be retrieved together with the documents containing the Chinese term. However, with the rapid development of science and technology, new technical terms keep emerging in different disciplines and fields, and these terms are often the keywords of those disciplines and fields. Because such keywords are typically abstractions of a technical field or newly coined words within it, machine translation of them is often unsatisfactory, so the retrieved results frequently fall short of expectations. This seriously reduces retrieval efficiency.
Disclosure of Invention
To address the defects of existing data classification, namely that word segmentation does not achieve the desired effect, that Chinese and English papers of the same type are difficult to identify accurately when fused, and that cross-language query does not achieve the expected effect, the invention provides a Chinese and English thesis data classification and query method.
The technical scheme adopted by the invention for solving the technical problems is as follows: a Chinese and English thesis data classification and query method comprises the following steps:
step one, traversing the original data of Chinese papers according to the Chinese and English keywords that each Chinese paper carries at publication, extracting the Chinese and English keywords of all the Chinese papers, cleaning and filtering the extracted data, aggregating the English translations of each Chinese keyword after abnormal data is eliminated, taking the list of translations whose count is larger than a threshold agg as the entry for that Chinese keyword in a Chinese-English comparison library, and extracting the Chinese keywords from the Chinese-English comparison library to generate a Chinese word library;
step two, acquiring an English academic field label library through an existing model or a constructed LDA field model, wherein the English academic field label library has a two-layer tree structure comprising large-field labels and small-field labels, and each small-field label belongs to a large-field label;
step three, associating the English academic field label library with the Chinese-English comparison library: if a label in the English label library can be found in the Chinese-English comparison library, associating the corresponding Chinese term with that English label; if a label in the English label library cannot be found in the Chinese-English comparison library, converting it through existing machine translation, that is, machine-translating the label tag_1 with a machine translation model; and finally generating, from the English label library, a Chinese-English field label library that corresponds to the English label library;
step four, combining the Chinese word library generated in step one with a Chinese word segmenter to produce a word segmenter with a user-defined dictionary, and segmenting the keywords, abstract and title in the original data of the Chinese papers and of the English papers with this word segmenter to generate the corresponding Chinese word segmentation list and English word segmentation list;
step five, calculating the field of each paper by using a KNN algorithm;
and step six, querying information in combination with the word library.
In the above Chinese and English thesis data classification and query method, the data in step one are processed as follows: the original data of Chinese papers are traversed, the Chinese and English keywords are extracted, abnormal data are eliminated, the English translations of each Chinese keyword are aggregated, and the list of translations whose count is larger than a certain threshold is taken as the translation result for that Chinese keyword.
In the above Chinese and English thesis data classification and query method, the threshold agg in step one is determined as follows: let max_trans be the maximum translation count of a keyword and min_trans the minimum translation count;
if max_trans - min_trans < 3, the threshold is agg = 1; if max_trans - min_trans ≥ 3, the threshold is agg = max_trans - 3.
In the above Chinese and English thesis data classification and query method, step five comprises the following:
(1) traversing the word segmentation list generated for each paper;
(2) calculating the relevance of the word segmentation list by adopting a K nearest neighbor algorithm, wherein the formula is as follows:
\[ \mathrm{Count}_{i,k} = \sum_{j} \delta(x_j, q_k), \qquad \delta(x_j, q_k) = \begin{cases} 1, & x_j = q_k \\ 0, & x_j \neq q_k \end{cases} \]
in the formula: q_k is a label in the Chinese-English label library and x_j is the j-th segmented word of the i-th paper; if x_j equals a label q_k in the Chinese-English label library, the value is 1, otherwise the value is 0; Count is the value calculated over all segmented words of the i-th paper with respect to the label q_k in the label library;
if Count is greater than the set threshold, the paper is recorded as belonging to the field specified by that label;
if Count is not greater than the set threshold, the numbers of matched small-field labels under each large-field label in the English label library are compared, and the paper is recorded as belonging to the field with the largest number of matched small-field labels.
The invention has the following beneficial effects: the Chinese and English keywords of Chinese papers are extracted and processed to form a Chinese-English comparison library and a Chinese word library; an English label library is acquired with a model and fused with the Chinese-English comparison library to form a Chinese-English label library; word segmentation is performed on the original data of Chinese and English papers to obtain Chinese and English word segmentation lists, and the fields of the papers are determined by calculating relevance. In this way the research field labels of Chinese and English papers are effectively unified, Chinese and English papers of the same type are accurately identified, and the accuracy of retrieval and of cross-language query is improved.
Drawings
FIG. 1 is a flow chart of acquiring the Chinese word library and the Chinese-English comparison library in the present invention.
FIG. 2 is a flow chart of implementing unified labeling in the present invention.
FIG. 3 is a flow chart of the query method.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
Example 1: To address the defects of existing data classification, namely that word segmentation does not achieve the desired effect, that Chinese and English papers of the same type are difficult to identify accurately when fused, and that cross-language query does not achieve the expected effect, and starting from the questions of how to unify paper labels across Chinese and English texts and how to improve cross-language query accuracy, the invention provides a Chinese and English thesis data classification and query method. The method comprises the following steps.
Step one: first, the original data of Chinese papers are traversed according to the Chinese and English keywords that each Chinese paper carries at publication, and the Chinese and English keywords of all the Chinese papers are extracted.
Then, abnormal data among the Chinese and English keywords are excluded, mainly records lacking either the Chinese or the English keyword. After the abnormal data are excluded, the English translations of each Chinese keyword are aggregated, and the list of translations whose count is larger than a certain threshold agg is taken as the translation result for that Chinese keyword. For example, different authors translate the Chinese keyword meaning "clustering coefficient" differently: some translate it as "Clustering coefficient" and others as "Cluster coefficient". After aggregation and sorting, the counts of both "Clustering coefficient" and "Cluster coefficient" are larger than the specified threshold, so the Chinese-English comparison library finally contains an entry of the form {Chinese keyword: [Clustering coefficient, Cluster coefficient]}.
Let max_trans be the maximum translation count of a keyword and min_trans the minimum translation count.
If max_trans - min_trans < 3, the threshold is agg = 1;
if max_trans - min_trans ≥ 3, the threshold is agg = max_trans - 3.
Finally, the Chinese keywords are extracted from the Chinese-English comparison library to generate the Chinese word library.
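Step one can be sketched roughly as follows. This is a minimal illustration rather than the implementation of the invention: the function names, the input shape (a list of Chinese/English keyword pairs) and the sample data are assumptions. The threshold rule is read as agg = 1 when the spread of translation counts is below 3 and agg = max_trans - 3 otherwise, with max_trans and min_trans taken per Chinese keyword.

```python
from collections import Counter, defaultdict


def choose_threshold(counts):
    """Return the threshold agg for one keyword from its per-translation counts."""
    counts = list(counts)
    max_trans, min_trans = max(counts), min(counts)
    # Assumed reading of the rule: agg = 1 if the spread is below 3,
    # otherwise agg = max_trans - 3.
    return 1 if max_trans - min_trans < 3 else max_trans - 3


def build_comparison_library(records):
    """Build {chinese_keyword: [english_translation, ...]} from keyword pairs.

    records is a hypothetical input: one (chinese, english) pair per keyword
    occurrence found in the original data of the Chinese papers.
    """
    tallies = defaultdict(Counter)
    for zh, en in records:
        zh, en = (zh or "").strip(), (en or "").strip()
        # Abnormal data: pairs lacking the Chinese or the English keyword.
        if not zh or not en:
            continue
        tallies[zh][en] += 1

    library = {}
    for zh, counter in tallies.items():
        agg = choose_threshold(counter.values())
        # Keep only translations whose count is larger than the threshold.
        kept = [en for en, n in counter.most_common() if n > agg]
        if kept:
            library[zh] = kept
    return library


# Illustrative data only.
records = (
    [("聚类系数", "Clustering coefficient")] * 5
    + [("聚类系数", "Cluster coefficient")] * 4
    + [("聚类系数", "Clustering coef")]
    + [("机器学习", "Machine Learning")] * 2
    + [("机器学习", "")]
)
comparison_library = build_comparison_library(records)
chinese_word_library = list(comparison_library)
```

In this sample the rare variant "Clustering coef" is dropped because its count does not exceed the threshold, while both common translations are kept.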
Step two: an English label library is acquired through an existing model or a constructed LDA field model. The English label library has a two-layer tree structure consisting of large-field labels and small-field labels, where each small-field label belongs to a large-field label.
Step three: the English academic field label library is associated with the Chinese-English comparison library.
If a label in the English label library can be found in the Chinese-English comparison library, the corresponding Chinese term is associated with that English label;
if a label in the English label library cannot be found in the Chinese-English comparison library, it is converted through existing machine translation, that is, the label tag_1 is machine-translated with a machine translation model;
finally, a Chinese-English field label library is generated from the English label library.
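The association in step three can be illustrated with the sketch below. The dictionary layout of the two-layer label tree, the case-insensitive matching and the `translate` callable are all assumptions made for illustration; `translate` stands in for whatever machine translation model is available and is not a real API.

```python
def build_bilingual_label_library(english_labels, comparison_library, translate):
    """Attach Chinese terms to every label of a two-layer English label tree.

    english_labels:     {large_field: [small_field, ...]}
    comparison_library: {chinese_keyword: [english_translation, ...]}
    translate:          hypothetical fallback, English label -> Chinese string
    """
    # Invert the comparison library: lower-cased English term -> Chinese keywords.
    en_to_zh = {}
    for zh, translations in comparison_library.items():
        for en in translations:
            en_to_zh.setdefault(en.lower(), []).append(zh)

    def chinese_for(label):
        found = en_to_zh.get(label.lower())
        # Found in the comparison library: reuse it; otherwise machine-translate.
        return found if found else [translate(label)]

    return {
        large: {
            "zh": chinese_for(large),
            "small": {small: chinese_for(small) for small in smalls},
        }
        for large, smalls in english_labels.items()
    }


# Illustrative data only; the lambda is a placeholder for a translation model.
bilingual = build_bilingual_label_library(
    {"Computer Science": ["Machine Learning", "Algorithm"]},
    {"机器学习": ["Machine Learning"]},
    translate=lambda label: "MT(" + label + ")",
)
```

Labels present in the comparison library take the author-supplied Chinese term, so the machine translation fallback is only used for labels that no author has translated.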
Step four: the Chinese word library generated in step one is combined with a Chinese word segmenter to produce a word segmenter with a user-defined dictionary, and this word segmenter is used to segment the keywords, abstract and title in the original data of the Chinese papers and of the English papers, generating the corresponding Chinese word segmentation list and English word segmentation list.
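The effect of a user-defined dictionary can be shown with a small self-contained sketch. It does not use any particular Chinese word segmenter; instead it substitutes plain forward maximum matching, which is an assumption chosen only to keep the example runnable. The function names and the `max_len` limits are likewise illustrative.

```python
def max_match(units, lexicon, joiner="", max_len=8):
    """Forward maximum matching over a sequence of units.

    At each position the longest run of units found in the lexicon is taken
    as one token; if none matches, a single unit is emitted.
    """
    tokens, i = [], 0
    while i < len(units):
        for size in range(min(max_len, len(units) - i), 0, -1):
            piece = joiner.join(units[i:i + size])
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_chinese(text, lexicon):
    # Units are single characters; lexicon is the Chinese word library.
    return max_match(list(text), lexicon)


def segment_english(text, phrases):
    # Units are whitespace-separated words; phrases are lower-cased terms.
    return max_match(text.lower().split(), phrases, joiner=" ", max_len=4)


zh_tokens = segment_chinese("机器学习是一门热门的技术", {"机器学习", "热门", "技术"})
en_tokens = segment_english(
    "Machine learning is a popular technology", {"machine learning"}
)
```

With the term in the dictionary, "机器学习" and "machine learning" each come out as a single token instead of being split in two.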
Step five, calculating the relevancy of the word segmentation list by adopting a K nearest neighbor algorithm, wherein the method comprises the following steps:
(1) traversing a word segmentation list generated by each paper;
(2) calculating the relevance of the word segmentation list, wherein the formula is as follows:
\[ \mathrm{Count}_{i,k} = \sum_{j} \delta(x_j, q_k), \qquad \delta(x_j, q_k) = \begin{cases} 1, & x_j = q_k \\ 0, & x_j \neq q_k \end{cases} \]
in the formula: q_k is a label in the Chinese-English label library and x_j is the j-th segmented word of the i-th paper; if x_j equals a label q_k in the Chinese-English label library, the value is 1, otherwise the value is 0; Count is the value calculated over all segmented words of the i-th paper with respect to the label q_k in the label library.
If Count is greater than the set threshold, the paper belongs to the field specified by that label; verification shows that a threshold of 3 is reasonable;
if Count is not greater than the set threshold, the numbers of matched small-field labels under each large-field label in the English label library are compared. For example, if the small-field labels Algorithm and Artificial Intelligence under Computer Science have the largest number of matches among all segmented words of the paper, the paper belongs to the field Computer Science → [Algorithm, Artificial Intelligence].
Therefore, the fields of all the papers are calculated, and the field classification of Chinese and English papers is completed.
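The decision rule of step five can be sketched as below. Several details are assumptions where the text leaves room: labels are matched case-insensitively, only the single best label is returned when several exceed the threshold, and "matching number" is read as the number of distinct small-field labels that occur in the paper. The function name and the dictionary layout of the label library are illustrative.

```python
from collections import Counter


def classify_paper(tokens, label_library, threshold=3):
    """Assign a field to one paper from its word segmentation list.

    tokens:        word segmentation list of the paper
    label_library: {large_field: [small_field_label, ...]}
    Returns (large_field, [small_field_label, ...]) or None if nothing matches.
    """
    token_counts = Counter(t.lower() for t in tokens)
    # Count for a label = number of segmented words equal to that label.
    per_label = {
        (large, small): token_counts[small.lower()]
        for large, smalls in label_library.items()
        for small in smalls
    }

    # Case 1: a label whose Count exceeds the threshold decides the field.
    best = max(per_label, key=per_label.get, default=None)
    if best is not None and per_label[best] > threshold:
        return best[0], [best[1]]

    # Case 2: pick the large field with the most matched small-field labels.
    matched = {
        large: [s for s in smalls if per_label[(large, s)] > 0]
        for large, smalls in label_library.items()
    }
    large = max(matched, key=lambda k: len(matched[k]), default=None)
    if large is None or not matched[large]:
        return None
    return large, matched[large]


# Illustrative label library and papers.
labels = {
    "Computer Science": ["Algorithm", "Artificial Intelligence"],
    "Physics": ["Optics"],
}
by_count = classify_paper(["optics"] * 4 + ["algorithm"], labels)
by_matches = classify_paper(
    ["algorithm", "artificial intelligence", "optics", "graph"], labels
)
```

The default threshold of 3 follows the value stated in the text; the second call falls through to the comparison of matched small-field labels because no single label occurs more than three times.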
Step six: the information to be queried is looked up in combination with the English keyword lexicon, as follows:
(1) inputting the information to be queried;
(2) performing word segmentation on the information to be queried;
(3) using the Chinese-English comparison library generated in step one as a term library and looking up the word segmentation result in the term library; if the result exists in the term library, the translation is retrieved from the term library and output; if not, the translation is performed with conventional machine translation.
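The lookup in step six can be sketched as follows, assuming the query has already been segmented and that the term library maps Chinese keywords to English translations. The function name and the `machine_translate` callable are hypothetical; the callable stands in for a conventional machine translation service.

```python
def expand_query(query_tokens, term_library, machine_translate):
    """Pair each query token with the translations to search for.

    query_tokens:      word segmentation result of the query
    term_library:      {chinese_keyword: [english_translation, ...]}
    machine_translate: hypothetical fallback, token -> translated string
    """
    expanded = []
    for token in query_tokens:
        translations = term_library.get(token)
        if translations:
            # Term found: use the translations stored in the term library.
            expanded.append((token, list(translations)))
        else:
            # Term missing: fall back to conventional machine translation.
            expanded.append((token, [machine_translate(token)]))
    return expanded


# Illustrative data only; the lambda is a placeholder for a translator.
expanded = expand_query(
    ["机器学习", "方法"],
    {"机器学习": ["Machine Learning"]},
    machine_translate=lambda token: "MT:" + token,
)
```

Each token can then be searched together with its translations, so a query for the Chinese term also retrieves papers that contain "Machine Learning".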
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be covered thereby.