KR20220041336A


Info

Publication number: KR20220041336A
Application number: KR1020200124399A
Authority: KR
Inventors: 이대희; 강호석
Original assignee: 주식회사 테크플럭스
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2022-04-01

Abstract

The present invention relates to a graph generation method for recommending a significant keyword and extracting a core document, which comprises the following steps: generating a keyword graph formed of keyword nodes for each of a plurality of searched documents; measuring document similarity between the generated keyword graphs; setting each keyword graph corresponding to the plurality of searched documents as a document node; generating a document graph formed of the document nodes by using weights of links connecting the set document nodes; clustering the document nodes into k groups from the generated document graph; combining graphs of the documents clustered by the k groups into a single graph by group; and extracting a significant keyword or a core document from the combined single graph. According to the present invention, the method recommends a significant keyword node from a search word entered by a user and automatically extracts a core document.

Description

Translated from Korean

The present invention relates to a graph generation method, and more particularly to a graph generation method that uses graph analysis to recommend important keyword nodes from a search term entered by a user and to automatically extract core documents.

Existing methods for classifying and analyzing text documents determine keywords from the frequency of words or terms in a document. Documents are classified or analyzed by matching against keywords that occur frequently in a given set of words.

Another document classification method classifies electronic documents based on their feature vectors (Prior Art 1, Korean Patent Publication No. 10-2016-0081604).

According to Prior Art 1, a probability vector recording the probability of occurrence of each word in an electronic document is defined as the feature vector, and the similarity between documents is determined from it.

Meanwhile, when a large document collection is searched with a user's search term, the results usually contain a great deal of noise. When similar documents must be extracted from such results, a person first has to pick out the valid documents from the search results and then classify the core documents, which takes a long time, and the classification varies considerably from analyst to analyst.

Conventionally, even when a user extracted core documents, it was difficult to identify what they had in common and to derive technology trends from them, and it was difficult to update the results when the search term changed or when the technology of interest shifted as the field developed.

Accordingly, the first problem to be solved by the present invention is to provide a graph generation method that uses graph analysis to recommend important keywords from a search term entered by a user and to automatically extract core documents.

The second problem to be solved by the present invention is to provide a graph generation system that classifies groups of similar documents using distance information from the automatically extracted core documents, recommends important keywords present in the similar document groups, and extracts core documents.

A further object is to provide a computer-readable recording medium on which a program for executing the above method on a computer is recorded.

To achieve the first object, the present invention provides a graph generation method comprising: generating, for each of a plurality of searched documents, a keyword graph composed of keyword nodes; measuring the document similarity between the generated keyword graphs; setting each keyword graph corresponding to the plurality of searched documents as a document node; generating a document graph composed of the document nodes using the weights of links connecting the set document nodes; clustering the document nodes into k groups from the generated document graph; combining, for each of the k groups, the graphs of the clustered documents into a single graph; and extracting important keywords or core documents from the combined single graph.

According to an embodiment of the present invention, the weight of a link connecting document nodes is preferably calculated from distance information between the document nodes, and the document nodes are preferably clustered into k groups using that distance information.

In addition, among the keyword nodes constituting the single graph, a node with high centrality may be extracted as an important keyword.

To achieve the second object, the present invention provides a graph generation system comprising: a graph generation unit that generates, for each of a plurality of searched documents, a keyword graph composed of keyword nodes; a document similarity measurement unit that measures the document similarity between the generated keyword graphs; a document clustering unit that sets each keyword graph corresponding to the plurality of searched documents as a document node, generates a document graph composed of the document nodes using the weights of links connecting the set document nodes, and then clusters the document nodes into k groups from the generated document graph; and a keyword and document extraction unit that combines, for each of the k groups, the graphs of the clustered documents into a single graph and extracts important keywords or core documents from the combined single graph.

According to an embodiment of the present invention, the weight of a link connecting document nodes may be calculated from distance information between the document nodes, and the document nodes may be clustered into k groups using that distance information.

In addition, among the keyword nodes constituting the single graph, a node with high centrality may be recommended as an important keyword.

To solve the further technical problem, the present invention provides a computer-readable recording medium on which is recorded a program for executing, on a computer, the above graph generation method for recommending important keywords and extracting core documents.

According to the present invention, graph analysis can be used to recommend important keywords from a search term entered by a user and to automatically extract core documents.

In addition, according to the present invention, groups of similar documents can be classified using distance information from the automatically extracted core documents, and core documents can be extracted by recommending important keywords present in the similar document groups.

Furthermore, according to the present invention, information on similar terms shared between core documents can be extracted to identify what the technical documents have in common, which can be used to identify technology trends.

FIG. 1 is a block diagram of a graph generation system for recommending important keywords and extracting core documents according to an embodiment of the present invention.
FIG. 2 shows examples of graphs generated by the graph generation unit 110.
FIG. 3 is a flowchart of a graph generation method for recommending important keywords and extracting core documents according to a preferred embodiment of the present invention.

Hereinafter, the present invention is described in more detail with reference to preferred embodiments. These embodiments are intended only to describe the invention more concretely, and it will be apparent to those of ordinary skill in the art that the scope of the invention is not limited by them.

FIG. 1 is a block diagram of a graph generation system for recommending important keywords and extracting core documents according to an embodiment of the present invention.

Referring to FIG. 1, the graph generation system for recommending important keywords and extracting core documents according to this embodiment comprises a document search unit 100, a graph generation unit 110, a document similarity measurement unit 120, a document clustering unit 130, and a keyword and document extraction unit 140.

The document search unit 100 searches for documents using a search term entered by the user or an updated search term received from a search term update unit 150.

The document search unit 100 includes a dictionary that stores the search terms used for document search, and the dictionary preferably stores each search term together with its group of synonyms.
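As a non-limiting illustration, a dictionary that stores a search term together with its synonym group could be used to expand a query before searching. The sketch below is only an assumption about how such a dictionary might be laid out: the `SYNONYMS` entries and the `expand_query` helper are hypothetical and do not appear in the specification.

```python
# Hypothetical synonym dictionary: search term -> synonym group.
SYNONYMS = {
    "cryptographic-key": ["encryption-key", "cipher-key"],
}

def expand_query(term, dictionary):
    """Return the search term followed by its stored synonyms, without duplicates."""
    expanded = [term]
    for synonym in dictionary.get(term, []):
        if synonym not in expanded:
            expanded.append(synonym)
    return expanded

# A term with no dictionary entry is searched as-is.
queries = expand_query("cryptographic-key", SYNONYMS)
```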

When the document search unit 100 retrieves a plurality of documents, the graph generation unit 110 generates, from each of the retrieved documents, a keyword graph composed of keyword nodes.

The graph generation unit 110 preferably analyzes the connection-relationship information of a document using natural language processing and generates, from each document, a graph composed of nouns or noun phrases and the links connecting them.

As one embodiment, a graph may be generated from each document using a parse tree algorithm. A parse tree algorithm is a type of algorithm that uses syntax rules to build a tree-shaped graph.

The graph generation unit 110 extracts keywords from a document received from the document search unit 100 and selects, from the extracted keywords, the keyword nodes judged to be important.

In addition, the graph generation unit 110 converts the relationships between the selected keyword nodes into scores to determine the weight of each link connecting one keyword node to another. The keyword nodes judged to be important are preferably determined in consideration of the weights of the links between keyword nodes.

The graph generation unit 110 generates a weighted graph using the selected keyword nodes and the weights of the links connecting them.
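As a non-limiting illustration, a weighted keyword graph could be built as follows. The specification does not state how the relationship between keyword nodes is scored, so this sketch assumes the link weight is the number of sentences in which two keywords co-occur, and that a node counts as important when it touches a link at or above a threshold. The function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_keyword_graph(sentences):
    """Map each undirected link (a, b), with a < b, to its co-occurrence count.

    `sentences` is a list of keyword lists, one per sentence (assumed to be
    produced by an earlier noun / noun-phrase extraction step).
    """
    weights = Counter()
    for keywords in sentences:
        for a, b in combinations(sorted(set(keywords)), 2):
            weights[(a, b)] += 1
    return dict(weights)

def important_nodes(graph, threshold):
    """Nodes attached to at least one link whose weight meets the threshold."""
    return {node for link, w in graph.items() if w >= threshold for node in link}

sentences = [
    ["cryptographic-key", "authentication"],
    ["cryptographic-key", "authentication", "memory"],
    ["memory", "host-device"],
]
graph = build_keyword_graph(sentences)
selected = important_nodes(graph, threshold=2)
```

Storing each link under a sorted pair keeps the graph undirected without a separate adjacency structure.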

Once the graph generation unit 110 has generated a graph for each of the retrieved documents, the document similarity measurement unit 120 measures the similarity between the generated graphs.

According to an embodiment of the present invention, the document similarity measurement unit 120 measures the similarity of the graphs generated by the graph generation unit 110 through the following process.

First, a connection-relationship similarity is calculated based on the connections between each keyword node constituting a graph generated by the graph generation unit 110 and its neighboring keyword nodes. The connection-relationship similarity does not take the names or positions of the keyword nodes into account.

Next, a node similarity that takes into account the names and positions of the keyword nodes constituting the graph is calculated, and finally the document similarity is measured by combining the connection-relationship similarity and the node similarity.

FIG. 2 shows examples of graphs generated by the graph generation unit 110.

Referring to FIGS. 2(a) to 2(c), keyword nodes are shown as circles and selected keyword nodes as circles with hatching inside. A selected keyword node is preferably a keyword node connected to a link whose weight is at or above a threshold.

The selected keyword nodes in FIG. 2(a) are cryptographic-key, authentication, and memory; those in FIG. 2(b) are cryptographic-key, synchronization, and content-provider; and those in FIG. 2(c) are cryptographic-key, token-sequence, user-interface, host-device, and content-provider.

Comparing FIG. 2(a) with FIG. 2(b), the document similarity measurement unit 120 will judge the connection-relationship similarity to be high, because the connection relationships in both figures form a triangle.

That is, because the triangular connection relationships linking the selected keyword nodes are similar, the connection-relationship similarity can be judged to be high. The connection relationships between selected keyword nodes can be compared by image comparison that accounts for scaling.

However, the selected keyword nodes forming the triangle in FIG. 2(a) are cryptographic-key, authentication, and memory, whereas those forming the triangle in FIG. 2(b) are cryptographic-key, synchronization, and content-provider. Because the names of the selected keyword nodes differ, the node similarity will be low. Since the final document similarity takes both the connection-relationship similarity and the node similarity into account, the low node similarity results in a low document similarity.

Meanwhile, when FIG. 2(a) and FIG. 2(b) are compared with FIG. 2(c), FIG. 2(c) has a low connection-relationship similarity with either FIG. 2(a) or FIG. 2(b), and because the selected keyword nodes also differ, the node similarity is low as well, so the document similarity is low.
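As a non-limiting illustration of the comparison between FIG. 2(a) and FIG. 2(b), the sketch below computes a name-independent connection-relationship similarity and a name-based node similarity and combines them. The specification does not fix any of these formulas, so everything here is an assumption: shapes are compared through sorted degree sequences (instead of the image comparison mentioned above), names are compared with the Jaccard index, and the two scores are combined by multiplication so that a low node similarity pulls the result down.

```python
def degree_sequence(links):
    """Sorted node degrees of an undirected graph given as (a, b) pairs."""
    degree = {}
    for a, b in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sorted(degree.values(), reverse=True)

def connection_similarity(links_a, links_b):
    """Compare shapes only; node names and positions are ignored."""
    da, db = degree_sequence(links_a), degree_sequence(links_b)
    size = max(len(da), len(db))
    da += [0] * (size - len(da))
    db += [0] * (size - len(db))
    total = sum(da) + sum(db)
    if total == 0:
        return 1.0
    return 1.0 - sum(abs(x - y) for x, y in zip(da, db)) / total

def node_similarity(links_a, links_b):
    """Jaccard index over node names."""
    names_a = {n for link in links_a for n in link}
    names_b = {n for link in links_b for n in link}
    union = names_a | names_b
    return len(names_a & names_b) / len(union) if union else 1.0

def document_similarity(links_a, links_b):
    return connection_similarity(links_a, links_b) * node_similarity(links_a, links_b)

# Triangles formed by the selected keyword nodes of FIG. 2(a) and FIG. 2(b).
fig_2a = [("cryptographic-key", "authentication"),
          ("authentication", "memory"),
          ("memory", "cryptographic-key")]
fig_2b = [("cryptographic-key", "synchronization"),
          ("synchronization", "content-provider"),
          ("content-provider", "cryptographic-key")]
```

With these assumptions the two triangles have identical shape, but they share only one of five distinct node names, so the combined score is low, in line with the discussion of FIG. 2.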

According to another embodiment, the document similarity measurement unit 120 may measure the document similarity using recommended keyword nodes chosen from among the keyword nodes constituting the graphs generated by the graph generation unit 110.

A recommended keyword node may be a node with high network centrality, based on the links between the keyword nodes constituting a graph generated by the graph generation unit 110. Network centrality is a measure proportional to the number of links, or to the sum of the link weights, between a keyword node and its neighboring keyword nodes.
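As a non-limiting illustration, network centrality measured as the sum of incident link weights (weighted degree) could be computed as follows. The graph representation, the function names, and the sample weights are assumptions made for this sketch.

```python
def weighted_degree(graph):
    """graph: {(a, b): weight}. Returns {node: sum of incident link weights}."""
    score = {}
    for (a, b), w in graph.items():
        score[a] = score.get(a, 0) + w
        score[b] = score.get(b, 0) + w
    return score

def recommended_nodes(graph, top_n=1):
    """Nodes ranked by weighted degree, ties broken alphabetically."""
    score = weighted_degree(graph)
    return sorted(score, key=lambda n: (-score[n], n))[:top_n]

graph = {
    ("cryptographic-key", "authentication"): 3,
    ("cryptographic-key", "memory"): 2,
    ("authentication", "memory"): 1,
}
top = recommended_nodes(graph)
```

Setting every weight to 1 reduces the score to the plain link count, the other variant named in the text.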

A recommended keyword node may also be determined using TF-IDF (Term Frequency - Inverse Document Frequency). TF-IDF is a weight used in information retrieval and text mining: a statistic that indicates how important a word is within a particular document, given a collection made up of many documents. It can be used to extract keywords from documents, to rank search results in a search engine, or to measure how similar documents are to one another.

Here, TF (term frequency) indicates how often a particular word appears in a document; the higher the value, the more important the word can be considered to be in that document. However, a word that is used in many documents of the collection is simply a common word. This is measured by DF (document frequency), and the inverse of that value is called IDF (inverse document frequency). TF-IDF is the product of TF and IDF.

The IDF value depends on the nature of the document collection. For example, the word 'atom' rarely appears in general documents, so its IDF is high and it can serve as a key term for a document. In a collection of documents about atoms, however, the word becomes commonplace, and other words that can distinguish the individual documents more finely receive higher weights.
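As a non-limiting illustration of the 'atom' example, the sketch below computes TF-IDF in the plain form described above, with IDF taken as the number of documents divided by the document frequency. Practical implementations commonly apply a logarithm to the IDF term; that variant is omitted here to stay close to the text. The function name and the toy collection are assumptions.

```python
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    # DF: number of documents in which each term appears at least once.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * (n_docs / df[t]) for t in tf})
    return scores

# A collection made up entirely of documents about atoms.
documents = [
    ["atom", "electron", "atom"],
    ["atom", "nucleus"],
    ["atom", "orbital", "orbital"],
]
scores = tf_idf(documents)
```

Because 'atom' occurs in every document its IDF is 1, so in the first document 'electron' outranks it even though 'atom' appears twice.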

Alternatively, a recommended keyword node may be defined with reference to a recommended link, that is, a link with a high weight or high network centrality among the links between the keyword nodes constituting the graph. A recommended link may be a link between similar terms contained in the dictionary stored in the document search unit 100, or a link contained in documents that are highly similar to each other.

In addition, when recommended keyword node A, recommended from graph A generated by the graph generation unit 110, and recommended keyword node B, recommended from graph B, are identical, that node may be defined as a recommended keyword node.

The document clustering unit 130 sets each keyword graph corresponding to the retrieved documents as a document node and generates a document graph composed of the document nodes using the weights of the links connecting them. It then clusters the document nodes of the generated document graph into k groups. The weight of a link connecting document nodes is preferably generated in consideration of the weights of the links between keyword nodes.
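As a non-limiting illustration, document nodes could be clustered into k groups from the link weights of the document graph as follows. The specification does not name a clustering algorithm, so this sketch assumes a greedy single-link merge: the strongest links are followed first, and merging stops once k groups remain. The function name and the sample weights are hypothetical.

```python
def cluster_documents(link_weights, k):
    """link_weights: {(doc_a, doc_b): weight}, higher meaning more similar.

    Returns a sorted list of k groups (fewer if there are fewer documents),
    each a sorted list of document ids.
    """
    docs = sorted({d for pair in link_weights for d in pair})
    group = {d: {d} for d in docs}          # document -> its current group
    n_groups = len(docs)
    for (a, b), _ in sorted(link_weights.items(), key=lambda kv: -kv[1]):
        if n_groups <= k:
            break
        if group[a] is not group[b]:        # merge two distinct groups
            merged = group[a] | group[b]
            for d in merged:
                group[d] = merged
            n_groups -= 1
    unique = {id(g): g for g in group.values()}
    return sorted(sorted(g) for g in unique.values())

link_weights = {
    ("D1", "D2"): 0.9, ("D3", "D4"): 0.8,
    ("D1", "D3"): 0.2, ("D2", "D3"): 0.1,
    ("D1", "D4"): 0.1, ("D2", "D4"): 0.05,
}
groups = cluster_documents(link_weights, k=2)
```

If the link weights are derived from distances rather than similarities, the sort order would be reversed so that the shortest distances are merged first.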

The keyword and document extraction unit 140 combines, for each of the k groups, the graphs of the clustered documents into a single graph, and extracts as important keywords the nodes with high centrality among the keyword nodes constituting that single graph. In addition, the keyword and document extraction unit 140 preferably extracts the core documents in consideration of the extracted important keywords.
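As a non-limiting illustration, the keyword graphs of one group could be combined into a single graph and its most central node extracted as follows. The sketch assumes that links shared by several documents have their weights summed and that centrality is the weighted degree; neither choice is stated in the specification, and the function names and sample graphs are hypothetical.

```python
from collections import Counter

def merge_graphs(graphs):
    """Combine {(a, b): weight} graphs, summing the weights of shared links."""
    merged = Counter()
    for graph in graphs:
        for (a, b), w in graph.items():
            merged[tuple(sorted((a, b)))] += w
    return dict(merged)

def important_keywords(graph, top_n=1):
    """Keyword nodes ranked by the sum of their incident link weights."""
    score = Counter()
    for (a, b), w in graph.items():
        score[a] += w
        score[b] += w
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [node for node, _ in ranked[:top_n]]

doc_a = {("cryptographic-key", "authentication"): 2, ("authentication", "memory"): 1}
doc_b = {("cryptographic-key", "synchronization"): 2,
         ("synchronization", "content-provider"): 1}
single_graph = merge_graphs([doc_a, doc_b])
keywords = important_keywords(single_graph)
```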

FIG. 3 is a flowchart of a graph generation method for recommending important keywords and extracting core documents according to a preferred embodiment of the present invention.

Referring to FIG. 3, the graph generation method according to this embodiment consists of steps processed in time sequence by the graph generation system shown in FIG. 1. Accordingly, even where omitted below, the description given above of the graph generation system shown in FIG. 1 also applies to the graph generation method according to this embodiment.

In step 300, the graph generation system searches for documents using a search term entered by the user or an updated search term received from the search term update unit 150.

In step 310, the graph generation system generates, for each of the retrieved documents, a keyword graph composed of keyword nodes.

The keyword graph is preferably a graph composed of nouns or noun phrases and the links connecting them, obtained from each document by analyzing its connection-relationship information using natural language processing.

Keywords are extracted from the documents retrieved in step 300, and the keyword nodes judged to be important are selected from the extracted keywords. The relationships between the selected keyword nodes are then converted into scores to determine the weight of each link connecting one keyword node to another.

Accordingly, a weighted graph is preferably generated using the selected keyword nodes and the weights of the links connecting them.

In step 320, the graph generation system measures the document similarity between the generated keyword graphs.

According to an embodiment of the present invention, a connection-relationship similarity is calculated based on the connections between each keyword node constituting a graph generated in step 310 and its neighboring keyword nodes. The connection-relationship similarity does not take the names or positions of the keyword nodes into account.

Next, a node similarity that takes into account the names and positions of the keyword nodes constituting the graph generated in step 310 is calculated, and finally the document similarity is preferably measured by combining the connection-relationship similarity and the node similarity.

According to another embodiment of the present invention, the document similarity may be measured using recommended keyword nodes chosen from among the keyword nodes constituting the graphs generated in step 310.

A recommended keyword node may be a node with high network centrality, based on the links between the keyword nodes constituting a graph generated in step 310. Network centrality is a measure proportional to the number of links, or to the sum of the link weights, between a keyword node and its neighboring keyword nodes. A recommended keyword node may also be determined using TF-IDF (Term Frequency - Inverse Document Frequency).

Alternatively, a recommended keyword node may be defined with reference to a recommended link, that is, a link with a high weight or high network centrality among the links between the keyword nodes constituting the graph. A recommended link may be a link between similar terms contained in the dictionary, or a link contained in documents that are highly similar to each other.

In addition, when recommended keyword node A, recommended from graph A generated in step 310, and recommended keyword node B, recommended from graph B, are identical, that node may be defined as a recommended keyword node.

In step 330, the graph generation system sets each keyword graph corresponding to the retrieved documents as a document node. A document node may be a single node set to represent the keyword graph composed of the keyword nodes.

In step 340, the graph generation system generates a document graph composed of the document nodes using the weights of the links connecting the set document nodes.

The weight of a link connecting document nodes is preferably determined in consideration of the keyword nodes contained in the document nodes and the weights of the links between those keyword nodes.

In step 350, the graph generation system clusters the document nodes of the generated document graph into k groups. The clustering may take into account the weights of the links between document nodes from step 340.

In step 360, the graph generation system combines, for each of the k groups, the graphs of the clustered documents into a single graph. The single graph is preferably produced by decomposing the document graph composed of document nodes back into keyword nodes and regenerating a single graph in consideration of the links between the decomposed keyword nodes.

In step 370, the graph generation system extracts important keywords or core documents from the combined single graph. The important keywords may be determined using the method for determining selected keyword nodes or the method for determining recommended keyword nodes. The core document is preferably the document with the highest link weight to the important keywords.
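As a non-limiting illustration of the last step, the core document of a group could be chosen as the document whose keyword graph carries the most link weight on the important keywords. The specification only says that the document with the highest link weight to the important keywords is preferred; the scoring rule, the function name, and the sample graphs below are assumptions.

```python
def core_document(doc_graphs, keywords):
    """doc_graphs: {doc_id: {(a, b): weight}}; keywords: set of important keywords.

    Scores each document by the total weight of its links that touch an
    important keyword and returns the id of the best-scoring document
    (the alphabetically first id on a tie).
    """
    def score(graph):
        return sum(w for (a, b), w in graph.items()
                   if a in keywords or b in keywords)
    return max(sorted(doc_graphs), key=lambda doc_id: score(doc_graphs[doc_id]))

doc_graphs = {
    "D1": {("cryptographic-key", "authentication"): 2,
           ("authentication", "memory"): 1},
    "D2": {("cryptographic-key", "synchronization"): 3,
           ("cryptographic-key", "content-provider"): 1},
}
core = core_document(doc_graphs, {"cryptographic-key"})
```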

본 발명의 실시 예들은 다양한 컴퓨터 수단을 통하여 수행될 수 있는 프로그램 명령 형태로 구현되어 컴퓨터 판독 가능 매체에 기록될 수 있다. 상기 컴퓨터 판독 가능 매체는 프로그램 명령, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있다. 상기 매체에 기록되는 프로그램 명령은 본 발명을 위하여 특별히 설계되고 구성된 것들이거나 컴퓨터 소프트웨어 당업자에게 공지되어 사용 가능한 것일 수도 있다. 컴퓨터 판독 가능 기록 매체의 예에는 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체(magnetic media), CD-ROM, DVD와 같은 광기록 매체(optical media), 플롭티컬 디스크(floptical disk)와 같은 자기-광 매체(magneto-optical media), 및 롬(ROM), 램(RAM), 플래시 메모리 등과 같은 프로그램 명령을 저장하고 수행하도록 특별히 구성된 하드웨어 장치가 포함된다. 프로그램 명령의 예에는 컴파일러에 의해 만들어지는 것과 같은 기계어 코드뿐만 아니라 인터프리터 등을 사용해서 컴퓨터에 의해서 실행될 수 있는 고급 언어 코드를 포함한다. 상기된 하드웨어 장치는 본 발명의 동작을 수행하기 위해 하나 이상의 소프트웨어 모듈로서 작동하도록 구성될 수 있으며, 그 역도 마찬가지이다.Embodiments of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc. alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks and magnetic tapes, optical media such as CD-ROMs and DVDs, and magnetic media such as floppy disks. - includes magneto-optical media, and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and the like. Examples of program instructions include not only machine language codes such as those generated by a compiler, but also high-level language codes that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

As described above, the present invention has been described with reference to specific matters such as specific components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the described embodiments. Not only the claims set forth below, but also everything equal or equivalent to those claims, falls within the scope of the spirit of the present invention.

Claims

A graph generation method comprising:
generating, for each of a plurality of searched documents, a keyword graph composed of keyword nodes;
measuring document similarity between the generated keyword graphs;
setting each keyword graph corresponding to the plurality of searched documents as a document node;
generating a document graph composed of the document nodes by using weights of links connecting the set document nodes;
clustering the document nodes into k groups from the generated document graph;
combining, for each of the k groups, the graphs of the documents clustered into that group into a single graph; and
extracting an important keyword or a core document from the combined single graph.

The graph generation method of claim 1,
wherein the weight of a link connecting the document nodes is calculated from distance information between the document nodes, and the document nodes are clustered into k groups by using the distance information between the document nodes.

The graph generation method of claim 1,
wherein, among the keyword nodes constituting the single graph, a node having high centrality is extracted as the important keyword.

A graph generating system comprising:
a graph generating unit that generates, for each of a plurality of searched documents, a keyword graph composed of keyword nodes;
a document similarity measuring unit that measures document similarity between the generated keyword graphs;
a document clustering unit that sets each keyword graph corresponding to the plurality of searched documents as a document node, generates a document graph composed of the document nodes by using weights of links connecting the set document nodes, and then clusters the document nodes into k groups from the generated document graph; and
a keyword and document extraction unit that combines, for each of the k groups, the graphs of the documents clustered into that group into a single graph, and extracts an important keyword or a core document from the combined single graph.

The graph generating system of claim 4,
wherein the weight of a link connecting the document nodes is calculated from distance information between the document nodes, and the document nodes are clustered into k groups by using the distance information between the document nodes.

The graph generating system of claim 4,
wherein, among the keyword nodes constituting the single graph, a node having high centrality is extracted as the important keyword.

A computer-readable recording medium on which a program for executing the method of any one of claims 1 to 3 on a computer is recorded.
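For illustration only, the steps from keyword graph generation through per-group merging can be sketched as follows. The sketch makes several assumptions that the claims do not fix: keyword graphs link adjacent tokens, document distance is one minus the Jaccard similarity of the keyword node sets, the link weight is one minus the distance, and the k groups are formed by single-linkage agglomerative clustering. All function names and sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def keyword_graph(tokens):
    """Keyword graph as {(kw_a, kw_b): count}; adjacent tokens are linked (assumption)."""
    g = Counter()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            g[tuple(sorted((a, b)))] += 1
    return g

def nodes(graph):
    """Set of keyword nodes appearing in a keyword graph."""
    return {kw for edge in graph for kw in edge}

def distance(g1, g2):
    """1 - Jaccard similarity of the keyword node sets (assumed distance measure)."""
    n1, n2 = nodes(g1), nodes(g2)
    union = n1 | n2
    return 1.0 - (len(n1 & n2) / len(union) if union else 0.0)

def document_graph(graphs):
    """Document nodes linked by weight = 1 - distance (assumed weighting)."""
    return {frozenset(pair): 1.0 - distance(graphs[pair[0]], graphs[pair[1]])
            for pair in combinations(sorted(graphs), 2)}

def cluster(graphs, k):
    """Single-linkage agglomerative clustering until k groups remain (assumed method)."""
    dist = {pair: 1.0 - w for pair, w in document_graph(graphs).items()}
    groups = [[name] for name in sorted(graphs)]

    def linkage(ij):
        # Smallest distance between any member of group i and any member of group j.
        i, j = ij
        return min(dist[frozenset((a, b))] for a in groups[i] for b in groups[j])

    while len(groups) > k:
        i, j = min(combinations(range(len(groups)), 2), key=linkage)
        absorbed = groups.pop(j)      # j > i, so index i is unaffected
        groups[i].extend(absorbed)
    return groups

def merge(graphs, group):
    """Combine the keyword graphs of one group into a single graph by summing weights."""
    merged = Counter()
    for name in group:
        merged.update(graphs[name])
    return merged

# Hypothetical searched documents, already reduced to keyword sequences.
docs = {
    "d1": ["graph", "keyword", "node"],
    "d2": ["graph", "keyword", "link"],
    "d3": ["battery", "anode", "cell"],
    "d4": ["battery", "cell", "anode"],
}
graphs = {name: keyword_graph(tokens) for name, tokens in docs.items()}
groups = cluster(graphs, k=2)
combined = [merge(graphs, group) for group in groups]
print(groups)
```

Each entry of `combined` is the single graph for one group, from which an important keyword or core document would then be extracted. The distance measure and the clustering method are interchangeable, since only `distance` and `cluster` depend on them.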