CN112632282A - Chinese and English thesis data classification and query method - Google Patents

Chinese and English thesis data classification and query method

Info

Publication number
CN112632282A
Authority
CN
China
Prior art keywords
chinese
english
library
label
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011613854.9A
Other languages
Chinese (zh)
Other versions
CN112632282B (en)
Inventor
康锐文
冯凯
王元卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Original Assignee
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority to CN202011613854.9A
Publication of CN112632282A
Application granted
Publication of CN112632282B
Legal status: Active
Anticipated expiration

Abstract

The invention belongs to the technical field of data classification and relates in particular to a method for classifying and querying Chinese and English thesis data. The method extracts the Chinese and English keywords of Chinese theses and processes the data to form a Chinese-English comparison library and a Chinese word library; an English label library is obtained with a model and fused with the Chinese-English comparison library to form a Chinese-English label library. In parallel, the original data of Chinese and English papers are segmented to obtain Chinese and English word segmentation lists, and the field of each paper is determined by calculating relevance. The research-field labels of Chinese and English papers are thereby unified, Chinese and English papers of the same type can be identified accurately, and the accuracy of retrieval and of cross-language query is improved.

Description

Chinese and English thesis data classification and query method
Technical Field
The invention belongs to the technical field of data classification, and particularly relates to a Chinese and English thesis data classification and query method.
Background
A knowledge base is a collection that stores, organizes, and processes knowledge and provides knowledge services. It helps users understand and discover the current state of research and the development trends in a given field, and building knowledge bases across industries is gradually becoming the foundation for managing knowledge services. Because English is the international common language, there are countless excellent English papers, so a knowledge base should include both Chinese and English papers when it is constructed.
Constructing a knowledge base involves two important steps. One is classifying the papers, that is, determining which field each paper belongs to, also called labeling the papers. The other is querying the papers. The same holds for a knowledge base that fuses Chinese and English.
Labeling a paper, that is, determining which field it belongs to, makes it possible to analyze the current research trend in a field. Labeling first requires word segmentation of the text of the paper, but a conventional word segmenter often fails to produce the desired result. For example, when segmenting the Chinese sentence "machine learning is a popular technology", the desired token is "machine learning", but a conventional segmenter splits it into "machine" and "learning". The same problem arises when segmenting English. In addition, once Chinese and English papers are fused, accurately identifying Chinese and English papers of the same type is a further problem.
Chinese and English papers also raise the problem of cross-language query: when the Chinese term for "machine learning" is queried, articles containing the English term "Machine Learning" must be found along with documents containing the Chinese term. With science and technology developing rapidly, technical terms keep emerging in different disciplines and fields, and these terms are often the keywords of those disciplines and fields. Because the keywords of a text are generally abstractions of a technical field or newly coined words in it, machine translation of them is often unsatisfactory, so the retrieved results are often not the expected ones. This seriously reduces search efficiency.
Disclosure of Invention
To address the shortcomings of existing data classification, namely that word segmentation does not achieve the desired effect, that Chinese and English papers of the same type are difficult to identify accurately after fusion, and that cross-language query does not achieve the expected effect, the invention provides a method for classifying and querying Chinese and English thesis data.
The technical solution adopted by the invention is a Chinese and English thesis data classification and query method comprising the following steps:
step one, traversing the original data of Chinese theses according to the Chinese and English keywords that each Chinese thesis carries at publication, extracting the Chinese and English keywords of all the Chinese theses, cleaning and filtering the extracted data, aggregating the translations of each Chinese keyword after abnormal data are eliminated, taking the list of translations whose count is larger than a threshold agg as the Chinese-English comparison library entry for that Chinese keyword, and extracting the Chinese keywords from the Chinese-English comparison library to generate a Chinese word library;
step two, acquiring an English academic-field label library through an existing model or a constructed LDA field model, wherein the English academic-field label library is a two-layer tree structure comprising large-field labels and small-field labels, each small-field label belonging to a large-field label;
step three, associating the English academic-field label library with the Chinese-English comparison library: if a label in the English label library can be found in the Chinese-English comparison library, associating the corresponding Chinese with that English label; if a label in the English label library cannot be found in the Chinese-English comparison library, converting it through existing machine translation, that is, machine-translating the label tag_1 through a machine translation model; and finally generating from the English label library a Chinese-English field label library that corresponds to the English label library;
step four, combining the Chinese word library generated in step one with a Chinese word segmenter to produce a word segmenter with a user-defined word library, and using it to segment the keywords, abstract, and title in the original data of the Chinese theses and of the English theses respectively, generating a corresponding Chinese word segmentation list and English word segmentation list;
step five, calculating the field of each thesis using a KNN algorithm;
step six, querying information in combination with the word library.
In the method, the data processing in step one is as follows: traverse the original data of the Chinese theses, extract the Chinese and English keywords, eliminate abnormal data, aggregate the translations of each Chinese keyword, and take the list of translations whose count is larger than a certain threshold as the translation result for that Chinese keyword.
In the method, the threshold agg in step one is set by the following policy: record the maximum translation count of a keyword as max_trans and the minimum translation count as min_trans;
if max_trans - min_trans < 3, the threshold value is agg-1; if max_trans - min_trans ≥ 3, the threshold value is max_trans - 3.
In the method, step five comprises:
(1) traversing the word segmentation list generated for each paper;
(2) calculating the relevance of the word segmentation list with a K-nearest-neighbor algorithm, using the following formulas:
[Formula image BDA0002875814650000041]
[Formula image BDA0002875814650000042]
where q_x is a label in the Chinese-English label library; if a segmented word x_j is equal to a label q_k in the Chinese-English label library, the value is recorded as 1, otherwise 0; Count is the value calculated for all segmented words of the i-th paper against the label q_k in the label library;
if Count is greater than the set threshold, the paper is recorded as belonging to the field specified by that label;
if Count is not greater than the set threshold, the numbers of matching small-field labels under all the large-field labels in the English label library are compared, and the paper is recorded as belonging to the field with the larger number of small-field label matches.
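The two formula images are not reproduced in this text. Going only by the symbol definitions given above, one plausible reading is an indicator function summed over the segmented words of a paper. The block below is a reconstruction under that assumption, not the formulas of the original figures; the symbols \delta and n_i (the number of segmented words in the i-th paper) are introduced here for illustration.

```latex
\delta(x_j, q_k) =
\begin{cases}
1, & x_j = q_k \\
0, & \text{otherwise}
\end{cases}
\qquad
\mathrm{Count}_i(q_k) = \sum_{j=1}^{n_i} \delta(x_j, q_k)
```

Under this reading, Count is simply the number of segmented words in the i-th paper that are identical to the label q_k, which is the quantity compared against the set threshold.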
The beneficial effects of the invention are as follows. The method extracts the Chinese and English keywords of Chinese theses and processes the data to form a Chinese-English comparison library and a Chinese word library; an English label library is obtained with a model and fused with the Chinese-English comparison library to form a Chinese-English label library. In parallel, the original data of Chinese and English papers are segmented to obtain Chinese and English word segmentation lists, and the field of each paper is determined by calculating relevance. The research-field labels of Chinese and English papers are thereby unified, Chinese and English papers of the same type can be identified accurately, and the accuracy of retrieval and of cross-language query is improved.
Drawings
FIG. 1 is a flow chart of the acquisition of Chinese word library and Chinese-English reference library in the present invention.
FIG. 2 is a flow chart of the present invention for implementing unified tagging.
FIG. 3 is a flow chart of the query method.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
Example 1: To address the shortcomings of existing data classification (word segmentation that does not achieve the desired effect, Chinese and English papers that are difficult to identify accurately after fusion, and cross-language query that does not achieve the expected effect), the invention provides a Chinese and English thesis data classification and query method that unifies paper labels across the constructed Chinese and English texts and improves cross-language query accuracy. The method comprises the following steps.
Step one: first, traverse the original data of the Chinese papers according to the Chinese and English keywords each paper carries at publication, and extract the Chinese and English keywords of all the Chinese papers.
Then exclude abnormal data among the Chinese and English keywords, mainly records lacking either the Chinese or the English keywords. After the abnormal data are excluded, aggregate the translations of each Chinese keyword and take the list of translations whose count is larger than a threshold agg as the translation result for that keyword. For example, different authors translate the Chinese keyword for "clustering coefficient" differently: some as "Clustering coefficient" and others as "Cluster coefficient". If aggregation and sorting show that the counts of both translations exceed the specified threshold, the resulting Chinese-English comparison library entry has the form {<Chinese keyword>: [Clustering coefficient, Cluster coefficient]}.
Let max_trans be the maximum translation count of a keyword and min_trans the minimum translation count:
if max_trans - min_trans < 3, the threshold value is agg-1;
if max_trans - min_trans ≥ 3, the threshold value is max_trans - 3.
Finally, extract the Chinese keywords from the Chinese-English comparison library to generate the Chinese word library.
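The aggregation in step one can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation, and it rests on several assumptions: "agg-1" is read as agg = 1, max_trans and min_trans are taken to be the highest and lowest occurrence counts among the translations of one keyword, "larger than" is applied as a strict comparison, and the function names and the sample keyword are hypothetical.

```python
from collections import Counter, defaultdict


def choose_threshold(counts):
    """Threshold policy from the text, reading 'agg-1' as agg = 1 (assumption)."""
    counts = list(counts)
    max_trans, min_trans = max(counts), min(counts)
    if max_trans - min_trans < 3:
        return 1
    return max_trans - 3


def build_comparison_library(keyword_pairs):
    """Build {chinese_keyword: [english_translation, ...]}.

    keyword_pairs is an iterable of (chinese_keyword, english_keyword)
    tuples taken from the papers' own keyword fields.
    """
    tallies = defaultdict(Counter)
    for zh, en in keyword_pairs:
        # Abnormal data: records missing the Chinese or the English keyword.
        if not zh or not en:
            continue
        tallies[zh][en] += 1

    library = {}
    for zh, counter in tallies.items():
        agg = choose_threshold(counter.values())
        # Keep translations whose count is strictly larger than agg.
        kept = [en for en, n in counter.most_common() if n > agg]
        if kept:
            library[zh] = kept
    return library


# Illustrative data only: one keyword with three competing translations.
pairs = (
    [("聚类系数", "Clustering coefficient")] * 5
    + [("聚类系数", "Cluster coefficient")] * 4
    + [("聚类系数", "Gathering coefficient")]
    + [("", "orphan keyword")]
)
library = build_comparison_library(pairs)
chinese_word_library = sorted(library)  # the Chinese word library of step one
```

With the sample data the counts are 5, 4, and 1, so the threshold is 5 - 3 = 2 and only the two common translations survive. Whether the comparison should be strict is not settled by the text; changing `n > agg` to `n >= agg` gives the other reading.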
Step two: acquire an English label library through an existing model or a constructed LDA field model. The English label library is a two-layer tree structure consisting of large-field labels and small-field labels, with each small-field label belonging to a large-field label.
Step three: associate the English academic-field label library with the Chinese-English comparison library.
If a label in the English label library can be found in the Chinese-English comparison library, associate the corresponding Chinese with that English label;
if a label in the English label library cannot be found in the Chinese-English comparison library, convert it through existing machine translation, that is, machine-translate the label tag_1 through a machine translation model;
finally, generate the Chinese-English field label library from the English label library.
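The association in step three amounts to a lookup with a fallback. The sketch below assumes the two-layer label tree is held as a dictionary of large-field labels to lists of small-field labels and that matching is case-insensitive; the `translate` argument is a hypothetical stand-in for whatever machine translation model is used, and the output layout is one possible choice rather than the structure the patent prescribes.

```python
def build_bilingual_label_library(english_labels, comparison_library, translate):
    """Attach Chinese terms to every label of a two-layer English label tree.

    english_labels: {large_label: [small_label, ...]}
    comparison_library: {chinese_keyword: [english_translation, ...]}
    translate: callable mapping an English label to Chinese; a placeholder
        for a machine translation model (hypothetical interface).
    """
    # Invert the comparison library: lower-cased English -> Chinese keywords.
    en_to_zh = {}
    for zh, translations in comparison_library.items():
        for en in translations:
            bucket = en_to_zh.setdefault(en.lower(), [])
            if zh not in bucket:
                bucket.append(zh)

    def lookup(label):
        found = en_to_zh.get(label.lower())
        # Fall back to machine translation only when the library has no entry.
        return found if found else [translate(label)]

    return {
        large: {
            "zh": lookup(large),
            "children": {small: lookup(small) for small in smalls},
        }
        for large, smalls in english_labels.items()
    }


# Illustrative usage with a fake translator that just tags its input.
labels = {"Computer Science": ["Machine Learning", "Algorithm"]}
comparison = {"机器学习": ["Machine Learning", "machine learning"]}
bilingual = build_bilingual_label_library(labels, comparison, lambda s: "MT:" + s)
```

Preferring the author-supplied translations over machine translation is the point of the step: the comparison library reflects how researchers in the field actually render a term.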
Step four: combine the Chinese word library generated in step one with a Chinese word segmenter to produce a word segmenter with a user-defined word library, and use it to segment the keywords, abstract, and title in the original data of the Chinese papers and of the English papers respectively, generating a corresponding Chinese word segmentation list and English word segmentation list.
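The effect of a user-defined word library on segmentation can be shown with a simple forward maximum matching routine. This is a substitute technique chosen to keep the example self-contained; the text does not say which segmenter is used, and a real system would more likely load the word library into an existing segmenter. The function names and the `max_len` and `max_words` parameters are assumptions.

```python
import re


def segment_chinese(text, lexicon, max_len=8):
    """Forward maximum matching over characters.

    At each position the longest lexicon entry is taken; if none matches,
    a single character is emitted.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_english(text, phrases, max_words=4):
    """Same idea over lower-cased words, so multi-word terms stay whole."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens, i = [], 0
    while i < len(words):
        for size in range(min(max_words, len(words) - i), 0, -1):
            piece = " ".join(words[i:i + size])
            if size == 1 or piece in phrases:
                tokens.append(piece)
                i += size
                break
    return tokens


zh_tokens = segment_chinese("机器学习是一门热门技术", {"机器学习", "技术"})
en_tokens = segment_english("Machine learning is a popular technology",
                            {"machine learning"})
```

Because the word library contains the whole term, "machine learning" survives as one token in both languages instead of being split into "machine" and "learning".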
Step five: calculate the relevance of the word segmentation list with a K-nearest-neighbor algorithm, as follows:
(1) traverse the word segmentation list generated for each paper;
(2) calculate the relevance of the word segmentation list using the following formulas:
[Formula image BDA0002875814650000061]
[Formula image BDA0002875814650000062]
where q_x is a label in the Chinese-English label library; if a segmented word x_j is equal to a label q_k in the Chinese-English label library, the value is recorded as 1, otherwise 0; Count is the value calculated for all segmented words of the i-th paper against the label q_k in the label library.
If Count is greater than the set threshold, the paper belongs to the field specified by that label; verification shows that a threshold of 3 is reasonable.
If Count is not greater than the set threshold, compare the numbers of matching small-field labels under all the large-field labels in the English label library. For example, if the small-field labels Algorithm and Artificial Intelligence under Computer Science have the largest number of matches among all segmented words of the article, the article belongs to the field Computer Science → [Algorithm, Artificial Intelligence].
Therefore, the fields of all the papers are calculated, and the field classification of Chinese and English papers is completed.
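The decision rule of step five can be sketched as a match count followed by a fallback. Although the text names a K-nearest-neighbor algorithm, the rule it describes is a count of exact matches between segmented words and labels, and that is what the code below implements. The assumptions: tokens and labels are already normalized to the same form, the "matching number" of a large field is the total number of token matches on its small-field labels, and the function name and return format are hypothetical.

```python
from collections import Counter


def classify_paper(tokens, label_tree, threshold=3):
    """Return the labels assigned to one paper.

    tokens: segmented words of the paper, normalized like the labels.
    label_tree: {large_label: [small_label, ...]}
    threshold: 3, the value the text reports as reasonable.
    """
    if not label_tree:
        return []
    # Count(q_k): how many tokens of the paper equal the label q_k.
    token_counts = Counter(tokens)

    all_labels = [label
                  for large, smalls in label_tree.items()
                  for label in (large, *smalls)]
    direct = [label for label in all_labels if token_counts[label] > threshold]
    if direct:
        return direct

    # Fallback: pick the large field whose small-field labels match most often.
    totals = {large: sum(token_counts[s] for s in smalls)
              for large, smalls in label_tree.items()}
    best = max(totals, key=totals.get)
    if totals[best] == 0:
        return []
    return [best] + [s for s in label_tree[best] if token_counts[s] > 0]


tree = {
    "computer science": ["algorithm", "artificial intelligence"],
    "physics": ["optics"],
}
strong = classify_paper(["algorithm"] * 4 + ["optics"], tree)
weak = classify_paper(
    ["algorithm", "artificial intelligence", "algorithm", "optics"], tree)
```

In the first call one label exceeds the threshold and is assigned directly. In the second no label does, so the large field with the most small-field matches is chosen together with the small-field labels that matched.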
Step six: query the information to be queried in combination with the English keyword lexicon, as follows:
(1) input the information to be queried;
(2) perform word segmentation on the information to be queried;
(3) use the Chinese-English comparison library generated in step one as the term library and look up each word segmentation result in it; if the result exists, retrieve its translation from the term library and output it; if not, translate it with a conventional machine translator.
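The query flow above reduces to a term-library lookup with a machine translation fallback. The sketch assumes the query has already been segmented and that the term library is the dictionary produced in step one; `fallback_translate` is a hypothetical placeholder for a conventional machine translator, and the output format is chosen only for illustration.

```python
def translate_query(tokens, term_library, fallback_translate):
    """Map each segmented query token to its candidate translations.

    tokens: word segmentation result of the query.
    term_library: {chinese_keyword: [english_translation, ...]}
    fallback_translate: callable used when the term library has no entry
        (hypothetical interface).
    """
    expanded = []
    for token in tokens:
        translations = term_library.get(token)
        if translations:
            # Term found: use the author-supplied translations.
            expanded.append((token, list(translations)))
        else:
            # Term missing: fall back to machine translation.
            expanded.append((token, [fallback_translate(token)]))
    return expanded


# Illustrative usage with a fake translator that just tags its input.
terms = {"机器学习": ["Machine Learning"]}
result = translate_query(["机器学习", "综述"], terms, lambda s: "MT:" + s)
```

A retrieval layer would then search for both the original token and each translation, so that a Chinese query also returns papers that use the English term.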
The above describes only preferred embodiments of the invention and is not to be construed as limiting it; any modification, equivalent substitution, or improvement made within the spirit and scope of the invention is intended to be covered.

Claims (4)

1. A Chinese and English thesis data classification and query method, characterized by comprising the following steps:
step one, traversing the original data of Chinese theses according to the Chinese and English keywords that each Chinese thesis carries at publication, extracting the Chinese and English keywords of all the Chinese theses, cleaning and filtering the extracted data, aggregating the translations of each Chinese keyword after abnormal data are eliminated, taking the list of translations whose count is larger than a threshold agg as the Chinese-English comparison library entry for that Chinese keyword, and extracting the Chinese keywords from the Chinese-English comparison library to generate a Chinese word library;
step two, acquiring an English academic-field label library through an existing model or a constructed LDA field model, wherein the English academic-field label library is a two-layer tree structure comprising large-field labels and small-field labels, each small-field label belonging to a large-field label;
step three, associating the English academic-field label library with the Chinese-English comparison library: if a label in the English label library can be found in the Chinese-English comparison library, associating the corresponding Chinese with that English label; if a label in the English label library cannot be found in the Chinese-English comparison library, converting it through existing machine translation, that is, machine-translating the label tag_1 through a machine translation model; and finally generating from the English label library a Chinese-English field label library that corresponds to the English label library;
step four, combining the Chinese word library generated in step one with a Chinese word segmenter to produce a word segmenter with a user-defined word library, and using it to segment the keywords, abstract, and title in the original data of the Chinese theses and of the English theses respectively, generating a corresponding Chinese word segmentation list and English word segmentation list;
step five, calculating the field of each thesis using a KNN algorithm;
step six, querying information in combination with the word library.
2. The Chinese and English thesis data classification and query method as claimed in claim 1, wherein the data processing in step one is as follows: traversing the original data of the Chinese theses, extracting the Chinese and English keywords, eliminating abnormal data, aggregating the translations of each Chinese keyword, and taking the list of translations whose count is larger than a certain threshold as the translation result for that Chinese keyword.
3. The Chinese and English thesis data classification and query method as claimed in claim 1, wherein the threshold agg in step one is set by the following policy: recording the maximum translation count of a keyword as max_trans and the minimum translation count as min_trans;
if max_trans - min_trans < 3, the threshold value is agg-1; if max_trans - min_trans ≥ 3, the threshold value is max_trans - 3.
4. The Chinese and English thesis data classification and query method as claimed in claim 1, wherein step five comprises:
(1) traversing the word segmentation list generated for each paper;
(2) calculating the relevance of the word segmentation list with a K-nearest-neighbor algorithm, using the following formulas:
[Formula image FDA0002875814640000021]
[Formula image FDA0002875814640000022]
where q_x is a label in the Chinese-English label library; if a segmented word x_j is equal to a label q_k in the Chinese-English label library, the value is recorded as 1, otherwise 0; Count is the value calculated for all segmented words of the i-th paper against the label q_k in the label library;
if Count is greater than the set threshold, the paper is recorded as belonging to the field specified by that label;
if Count is not greater than the set threshold, the numbers of matching small-field labels under all the large-field labels in the English label library are compared, and the paper is recorded as belonging to the field with the larger number of small-field label matches.
CN202011613854.9A | 2020-12-30 | 2020-12-30 | Chinese and English thesis data classification and query method | Active | CN112632282B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011613854.9A, CN112632282B (en) | 2020-12-30 | 2020-12-30 | Chinese and English thesis data classification and query method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011613854.9A, CN112632282B (en) | 2020-12-30 | 2020-12-30 | Chinese and English thesis data classification and query method

Publications (2)

Publication Number | Publication Date
CN112632282A | 2021-04-09
CN112632282B | 2021-11-19

Family

ID=75286956

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011613854.9A (Active, CN112632282B (en)) | Chinese and English thesis data classification and query method | 2020-12-30 | 2020-12-30

Country Status (1)

Country | Link
CN (1) | CN112632282B (en)


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
