KR20010102687A


Info

Publication number: KR20010102687A
Application number: KR1020000024030A
Authority: KR
Inventors: 김성주; 최성택; 성정호
Original assignee: 정만원; 주식회사 아이윙즈
Priority date: 2000-05-04
Filing date: 2000-05-04
Publication date: 2001-11-16

Abstract


The present invention relates to a method and system for automatically classifying web documents by topic using a category learning technique. The method according to the invention comprises:

(a) requesting classification of a document into a specific category; (b) acquiring learning documents using upper-category representative words and lower-category representative words; (c) processing the acquired learning documents to generate learning data; and (d) using the generated learning data to determine whether the requested document can be included in the category.

The system according to the invention determines, in response to a classification request for a document, whether that document belongs to a specific category. It comprises: a user computer for entering the document to be classified together with the upper-category and lower-category representative words needed for classification, checking the classification result, and managing the classification system; a learning document acquirer that builds one search expression from the upper-category representative words and another from the lower-category representative words and submits them to a commercial search engine to acquire learning documents; a learning data generator that uses the acquired learning documents to generate the learning data needed to classify the document; and a classifier that uses the learning data to determine whether the document can be included in the specific category.

Because the method and system collect documents from the web and learn online, no learning documents need to be prepared in advance for each category. Categories can therefore be organized and learned flexibly.

In addition, the classification step that decides the final category applies a two-stage technique using the upper category and the lower category. This improves classification accuracy and yields a structure that does not require separate learning documents for every category.

Description


Method and System for Automatically Classifying Web Documents by Topic Using a Category Learning Technique {Method and System for Web Documents Sort Using Category Learning Skill}

The "information overload" we experience on the Internet today is no longer news, and lengthy explanations or warnings about it have become unnecessary. The volume of information at the heart of the problem is growing ever faster, and controlling it appears impossible. To address this, general search services collect high-quality documents on particular topics from the Internet and offer them to users through directory services. However, this collection is done manually by many people, and it is becoming increasingly inefficient as the Internet grows explosively. As an alternative, automatic document collection based on automatic web document classification is being studied. In automatic web document collection, automatic classification plays the key role of determining which category a given document belongs to.

Research on general automatic document classification, which underlies automatic web document classification, follows two broad approaches. One uses the linguistic meaning of a document, based on natural language processing techniques. The other simply models surface phenomena of a document (e.g., word frequency) using statistics and probability.

For both general documents and web documents, both approaches are being studied in fields such as artificial intelligence, information retrieval, and computational language processing, at many university and corporate laboratories including the CMU Text Learning Group, the IBM Almaden Research Center, and Microsoft Research Lab.

Commercial products based on automatic document classification are more often components built into EDMS, KMS, and search engine products than standalone classification engines. Representative examples include IBM's Lotus Notes, Autonomy's KMS product line, and Inktomi's Directory Engine.

To determine which category a given web document belongs to, automatic web document classification must find the feature information of each category by learning about the defined categories. In the conventional learning method, the user supplies learning documents for all categories to the classification system, and the system extracts the feature information of each category by comparing the categories with one another. This method has two problems: the user must prepare the learning documents personally, and, because the feature information of one category is extracted by comparing documents across categories, the learning documents for all categories must be supplied together in a batch.

Because learning in conventional classification systems requires the user to gather a collection of learning documents for every category and supply it to the learning system in a batch, administrative tasks such as building the initial categories and adding categories have been difficult. Moreover, classification research to date has focused on improving learning algorithms, while learning techniques that simplify the overall classification process from the user's point of view have been neglected. Whatever learning and classification algorithms it applied, a machine-learning document classification system could be used only after it had been trained offline with predefined categories and the learning documents for those categories. As a result, such systems were hard to apply when a category was difficult to define or when no learning documents existed for it.

Accordingly, an object of the present invention is to solve the problems described above by providing a method and system for automatically classifying web documents that, for the user's convenience, receives category representative words from the user instead of category learning documents, and that extracts feature information by comparing a lower category with its upper category instead of comparing categories with one another, thereby removing the need to supply learning documents for all categories in a batch.

FIG. 1 is a flowchart showing the overall steps of classifying a web document.

FIG. 2 shows the configuration and an embodiment of the automatic web document classification system according to the present invention.

FIG. 3 illustrates the steps by which the learning document acquirer (21) acquires learning documents.

FIG. 4 shows the process by which the learning data generator (22) generates learning data.

FIG. 5 shows the process by which the classifier (23) determines whether a document submitted for classification can be included in the lower category or the upper category.

* Explanation of reference numerals for the main parts of the drawings *

10: User computer

20: Automatic topic-based web document classification system

21: Learning document acquirer

22: Learning data generator

23: Classifier

30: Internet

40a, 40b, 40c: Search engines

To achieve the object described above, the method for automatically classifying web documents by topic using a category learning technique according to the present invention comprises:

(a) requesting classification of a document into a specific category; (b) acquiring learning documents using upper-category representative words and lower-category representative words; (c) processing the acquired learning documents to generate learning data; and (d) using the generated learning data to determine whether the requested document can be included in the category.

To achieve another object of the present invention, the system for automatically classifying web documents by topic using a category learning technique according to the present invention is as follows.

The system determines, in response to a classification request for a document, whether that document belongs to a specific category, and comprises: a user computer for entering the document to be classified together with the upper-category and lower-category representative words needed for classification, checking the classification result, and managing the classification system; a learning document acquirer that builds one search expression from the upper-category representative words and another from the lower-category representative words and submits them to a commercial search engine to acquire learning documents; a learning data generator that uses the acquired learning documents to generate the learning data needed to classify the document; and a classifier that uses the learning data to determine whether the document can be included in the specific category.

The present invention is described in detail below with reference to the accompanying drawings. Before the detailed description, the terms used to explain the method and system are defined.

A 'category' is one part of a classification scheme that expresses a topic of interest to the user.

A 'category representative word' is a keyword that best represents a category and that the user supplies to the learning system.

An 'upper category' is a category that contains the topic of interest of a lower category at a higher conceptual level.

A 'lower category' is the category to which the document currently submitted for classification is to be assigned.

A 'learning document' is a web document that the learning system processes to extract feature information about the user's topic of interest.

'Learning data' is the feature information, representing the user's interest, that is extracted from the collection of learning documents provided by the user.

'Learning' is the process by which the system receives a collection of learning documents for a category from the user and derives the learning data for that category.

A 'commercial search service' is a commercial service that returns the matching web documents for a query composed of keywords.

'Classification' is the process of deciding whether a particular document belongs to a particular category.

FIG. 1 is a flowchart showing the overall steps of classifying a web document.

When a request arrives to classify a document into a specific category and no learning data exists for that category, the system retrieves learning documents using the upper-category and lower-category representative words received from the user, processes them to generate learning data, and then uses the learning data to determine whether the given document can be included in the category.
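The flow summarized above can be sketched as a single function that generates learning data only when none is stored for the category. This is a minimal sketch, not the patented implementation: the function name, the `store` dictionary, and the four injected callables (`ask_words`, `acquire`, `generate`, `judge`) are hypothetical placeholders for the steps described in the text.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation of learning data: word -> weight.
LearningData = Dict[str, float]


def classify_request(
    document: str,
    category: str,
    store: Dict[str, Tuple[LearningData, LearningData]],
    ask_words: Callable[[str], Tuple[List[str], List[str]]],
    acquire: Callable[[List[str], List[str]], Tuple[List[str], List[str]]],
    generate: Callable[[List[str], List[str]], Tuple[LearningData, LearningData]],
    judge: Callable[[str, LearningData, LearningData], bool],
) -> bool:
    """Classify `document` for `category`, learning on demand."""
    if category not in store:
        # No learning data yet: ask the user for representative words,
        # fetch learning documents, and derive the learning data.
        upper_words, lower_words = ask_words(category)
        upper_docs, lower_docs = acquire(upper_words, lower_words)
        store[category] = generate(upper_docs, lower_docs)
    # Learning data is available (stored earlier or just generated).
    upper_data, lower_data = store[category]
    return judge(document, upper_data, lower_data)
```

Passing the steps in as callables keeps the sketch independent of any particular search engine or learning algorithm; a second request for the same category reuses the stored learning data.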

The user requests classification of a document into a specific category (s100). The request may concern any web document on the Internet.

When the classification request arrives, the system looks up the learning data for the specific category (s110). If learning data for the category exists, the system immediately determines whether the document can be classified into it (s150). If not, the system asks the user to enter the upper-category and lower-category representative words.

Learning documents are acquired using the upper-category and lower-category representative words (s120). The representative words supplied by the user are combined into queries, which are sent to a commercial search service to retrieve the related learning documents.

The search expressions are built as follows. One expression joins the upper-category representative words with the AND condition, and another joins the lower-category representative words with the AND condition. For the upper category, the upper-category search expression is submitted to a commercial search engine to acquire learning documents. For the lower category, the upper-category and lower-category search expressions are combined with the MINUS condition into a final search expression, which is then submitted to the commercial search engine. The purpose is to maximize the difference between the upper-category and lower-category learning documents, so that lower-category learning data relative to the upper category can be extracted. FIG. 3 shows the acquisition of learning documents by the learning document acquirer.
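The construction of the two search expressions can be sketched as plain string building. This is an illustration under stated assumptions: the text does not give the query syntax of any particular search engine or the operand order of the MINUS condition, so the `AND` and `MINUS` keywords, the parentheses, and the function names below are hypothetical, and the two expressions are joined in the order in which the text names them.

```python
from typing import List, Tuple


def and_expression(words: List[str]) -> str:
    """Join representative words with the AND condition."""
    return " AND ".join(words)


def build_queries(upper_words: List[str], lower_words: List[str]) -> Tuple[str, str]:
    """Return (upper-category query, final lower-category query)."""
    upper_expr = and_expression(upper_words)
    lower_expr = and_expression(lower_words)
    # Assumed syntax: the two expressions combined with MINUS,
    # in the order the text mentions them.
    final_lower_expr = f"({upper_expr}) MINUS ({lower_expr})"
    return upper_expr, final_lower_expr


upper_query, lower_query = build_queries(["sports"], ["soccer", "league"])
print(upper_query)
print(lower_query)
```

A real deployment would translate these expressions into whatever operators the chosen search service actually supports.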

The acquired upper-category and lower-category learning documents are processed to generate learning data for each (s130). The generated learning data serves as the basis for the decision made during classification. Several algorithms can extract learning data from learning documents, such as the Bayesian method and feature selection; here a feature selection algorithm is used. Its basic approach is to examine which words appear in the current learning documents and in the learning documents they are compared against, and how the frequencies of those words differ. Based on this, each word in the current learning documents is given a weight, and a fixed number of the highest-weighted words are selected and used as learning data. Upper-category learning data is produced from the upper-category learning documents, with random learning documents as the comparison documents. Random learning documents can be obtained by querying a commercial search service with meaningless words. Lower-category learning data is produced from the lower-category learning documents, with the upper-category learning documents as the comparison documents. This process yields lower-category learning data relative to the upper category. FIG. 4 shows the extraction of learning data.
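The frequency-difference idea behind this feature selection step can be sketched as follows. The text does not give a weighting formula, so the one used here (relative frequency in the target documents minus relative frequency in the comparison documents) is an assumption, as are the function names, whitespace tokenization, and the `top_k` parameter.

```python
from collections import Counter
from typing import Dict, List


def relative_frequencies(documents: List[str]) -> Dict[str, float]:
    """Relative frequency of each word across a list of documents."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def select_features(
    target_docs: List[str], comparison_docs: List[str], top_k: int = 3
) -> Dict[str, float]:
    """Weight target words by frequency difference and keep the top_k."""
    target = relative_frequencies(target_docs)
    comparison = relative_frequencies(comparison_docs)
    # Assumed weight: how much more frequent the word is in the target set.
    weights = {w: f - comparison.get(w, 0.0) for w, f in target.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_k])


# Upper-category data would use random documents as the comparison set;
# lower-category data would use the upper-category documents instead.
features = select_features(
    ["soccer goal soccer", "soccer match"], ["weather match", "news"], top_k=2
)
print(features)
```

Words that are common in both sets, such as "match" in the example, receive low or negative weights and drop out, which is the effect the comparison documents are meant to have.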

The upper-category and lower-category learning data are used to determine whether the requested document can be included in the category (s140). When a classification request for a document is received, the system first determines whether the document can be included in the upper category. If so, it checks whether the document can also be included in the lower category, and only if both checks succeed is the document finally judged classifiable. Classification uses the learning data generated in step s130. The usual way to check whether a document can be included in a category is to test whether the sum of the weights of the words in the document reaches a given value. The same test is used here, but instead of the usual practice of checking only the lower category, classification is split into two stages: first check the upper category, and, if that succeeds, check the lower category. FIG. 5 shows the classification method performed by the classifier.
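The two-stage threshold test can be sketched as follows. The threshold values, the function names, and the choice to count every occurrence of a word (rather than each distinct word once) are assumptions; the text states only that the weight sum must reach a given value.

```python
from typing import Dict


def weight_sum(document: str, learning_data: Dict[str, float]) -> float:
    """Sum the learned weights of the words that occur in the document."""
    return sum(learning_data.get(word, 0.0) for word in document.lower().split())


def is_includable(
    document: str,
    upper_data: Dict[str, float],
    lower_data: Dict[str, float],
    upper_threshold: float = 1.0,
    lower_threshold: float = 1.0,
) -> bool:
    """Stage 1: upper category. Stage 2: lower category, only if stage 1 passes."""
    if weight_sum(document, upper_data) < upper_threshold:
        return False
    return weight_sum(document, lower_data) >= lower_threshold


upper = {"sports": 1.0, "team": 0.5}
lower = {"soccer": 1.0, "goal": 0.5}
print(is_includable("sports team soccer goal", upper, lower))
print(is_includable("soccer goal goal", upper, lower))
```

The second call fails at the first stage because the document carries no upper-category words, even though its lower-category weight sum alone would pass.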

Classification is split into two stages for two reasons: a document that belongs to a lower category inherently belongs to its upper category, and passing the document through classification twice improves accuracy.

FIG. 2 shows the configuration and an embodiment of the automatic topic-based web document classification system according to the present invention.

The user computer (10) is used, over the Internet or offline, to enter the document to be classified and the upper-category and lower-category representative words needed for classification, and to check the classification result.

The automatic topic-based web document classification system (20) comprises a learning document acquirer (21), a learning data generator (22), and a classifier (23).

The learning document acquirer (21) uses the category representative words supplied by the user to acquire the learning documents that the classification system needs for learning. The acquisition process is shown in FIG. 3. The acquirer builds search expressions from the upper-category and lower-category representative words entered by the user and sends them to a commercial search engine, which returns the URLs of documents containing the representative words. The acquirer then requests each URL to obtain the web document at that address.

Once the upper-category and lower-category learning documents have been acquired, the learning data generator (22) generates the learning data used for classification. Several algorithms can extract learning data from learning documents, such as the Bayesian method and feature selection; the learning data generator (22) uses a feature selection algorithm. Its basic approach is to examine which words appear in the current learning documents and in the learning documents they are compared against, and how the frequencies of those words differ. Based on this, each word in the current learning documents is given a weight, and a fixed number of the highest-weighted words are selected and used as learning data.

Upper-category learning data is produced from the upper-category learning documents, with random learning documents as the comparison documents. Random learning documents can be obtained by querying a commercial search engine with meaningless words. This yields upper-category learning data relative to random documents. Lower-category learning data is produced from the lower-category learning documents, with the upper-category learning documents as the comparison documents. This yields lower-category learning data relative to the upper category. FIG. 4 shows the process of generating learning data.

Learning data is generated this way because a vertical comparison between a lower category and its upper category is simpler and more extensible than a horizontal comparison across categories, which requires the learning documents for all categories to be supplied in a batch.

The classifier (23) determines whether a given document can be included in the lower category, using the learning data produced by the learning data generator (22). The usual way to check whether a document can be included in a category is to test whether the sum of the weights of the words in the document reaches a given value. The classifier (23) uses the same test, but instead of the usual practice of checking only the lower category, it splits classification into two stages: it first checks whether the document can be included in the upper category and, if so, checks whether it can be included in the lower category. FIG. 5 shows the classification process.

Classification is split into two stages for two reasons: a document that belongs to a lower category inherently belongs to its upper category, and passing the document through classification twice improves accuracy.

The Internet (30) is a worldwide computer network system, a "network of networks" in which a user at any computer can connect to any other computer and obtain information, provided the user has permission. One of the most widely used services on the Internet is the World Wide Web, which gives easy access to a vast amount of information.

The search engines (40a, 40b, 40c) are programs or websites that find requested information on the Internet. A search engine has three parts: a spider (sometimes called a "crawler" or "bot") that visits each web page submitted for indexing, or the representative page of a website, reads it, and follows the hypertext links on each page to read the site's other pages; a program that builds a huge index (sometimes called a "catalog") of the pages that have been read; and a program that accepts a user's search request, compares it with the contents of the index, and returns the results. An alternative to using a search engine is to browse a directory organized by topic. Yahoo also offers a search engine, but it is the most widely used directory site on the web. Many web portals provide a directory alongside a search engine for finding information. Most large search engines try to index as much of the web as possible; once a site's pages have been indexed, the search engine revisits the site periodically to update the index. Some search engines give special weight to the following items when indexing: words in the title or subject description, keywords listed in HTML meta tags, the first words of each page, and the words repeated most often within a page.

FIG. 3 illustrates the steps by which the learning document acquirer (21) acquires learning documents.

The learning document acquirer (21) builds search expressions from the upper-category and lower-category representative words entered by the user and sends them to a commercial search engine, which returns the URLs of documents containing the representative words. The acquirer then requests each URL to obtain the web document at that address.

FIG. 4 shows the process by which the learning data generator (22) generates learning data.

The learning data generator (22) compares randomly obtained comparison documents with the upper-category learning documents to generate upper-category learning data, and compares the upper-category learning documents with the lower-category learning documents to generate lower-category learning data.

The classification process is divided into two stages: upper-category classification and lower-category classification. The system first determines whether the document submitted for classification can be included in the upper category and, if it can, then determines whether it can be included in the lower category.