CN106227788A

Info

Publication number: CN106227788A
Application number: CN201610571519.4A
Authority: CN
Inventors: 胡光阳; 杨培强
Original assignee: Inspur Software Group Co Ltd
Current assignee: Inspur Software Group Co Ltd
Priority date: 2016-07-20
Filing date: 2016-07-20
Publication date: 2016-12-14

Abstract

The invention discloses a Lucene-based database query method. Following the hierarchical structure of the Lucene index, the method first creates an index for the resources in a database and adds it to an index repository; it then queries the index repository with the retrieval conditions, obtains the retrieval results, and returns them. The method consists mainly of an index creation part and an index search part. The invention can markedly improve the speed and recall of database queries and gives the querier a friendlier retrieval experience.

Description

Translated from Chinese

A Lucene-Based Database Query Method

Technical Field

The invention relates to database query and analysis technology, and specifically to a Lucene-based database query method.

Background

Internet technology is advancing rapidly, the world's information resources keep growing, and every industry is moving toward informatization. As a result, the volume of important data that must be stored on computers increases daily, and website databases keep getting larger; e-commerce sites, for example, hold massive amounts of data. Conventional database query methods built for a single site have run into many problems: search is slow, response times are long, the results do not closely match the querier's real intent, the results are incomplete, and the results cannot be ranked according to the querier's search intent, so user satisfaction with search cannot be improved.

Information retrieval means searching through all the entries of an information resource to find the content that matches the querier's intent. Among information retrieval methods, full-text search is highly practical and is the most general-purpose. Full-text search compares the search conditions supplied by the querier against every word in the documents. Unlike the field comparison of a database query, a full-text search tool covers a wide scope thoroughly and can give the querier the most complete results. Moreover, full-text search matches the querier's search terms against the related index terms in the index repository; compared with the sequential scan of a database query, it is several orders of magnitude more efficient.

Lucene is an open-source full-text search engine toolkit. It is not a complete full-text search engine but a full-text search engine architecture that provides a complete query engine and indexing engine, along with part of a text analysis engine. As an open-source library for full-text indexing and search, Lucene gives software developers an easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it. The Lucene index format is self-contained and independent of the platform on which it is used. The basic unit of representation in a Lucene index is the 8-bit byte, so mutually compatible systems can share the same index resources.

Summary of the Invention

To address the needs and shortcomings of current technology, the invention provides a Lucene-based database query method.

The technical solution adopted by the Lucene-based database query method of the invention to solve the above technical problem is as follows. Following the hierarchical structure of the Lucene index, the method first creates an index for the resources in the database and adds the index to the index repository; it then queries the index repository with the retrieval conditions, obtains the retrieval results, and returns them. Its main steps comprise an index creation part and an index search part.

Preferably, the index creation part collects resources from the database at fixed intervals, analyzes and processes them appropriately, then creates an index for those resources and adds it to the index repository.

Preferably, the index creation part specifically comprises the following steps:

1) First, obtain the information resources: at fixed intervals, fetch the records stored in the database tables to serve as the source material for creating the index files;

2) Next, filter the information resources: within each record, select the fields to be stored so that useless content is excluded while the information remains complete;

3) Then analyze the filtered content and perform word segmentation;

4) Finally, create the index: load the record content into the repository and create an index over the segmented words. The index can be kept on disk or in memory, and the index files are finally placed in the index repository.

Preferably, the index search part obtains query statements from the retrieval conditions supplied by the querier, analyzes and processes those query statements, queries the index repository, and returns the final retrieval results to the querier.

Preferably, the retrieval conditions supplied by the querier are obtained first; the syntactic structure of the resulting condition statement is then analyzed, the corresponding keywords are extracted, and a syntax tree is built according to certain rules; finally, the index is searched to find the database records that satisfy the syntax tree.

Compared with the prior art, the Lucene-based database query method of the invention has the following beneficial effects. When the conditions supplied by the querier contain several words, the invention segments them into individual words and matches them against the index repository as Terms. It can also match database content in which the words appear in a different order, so related records are found and presented to the querier, which markedly improves the recall of database queries;

When the same search term is queried repeatedly, the invention loads the result of the first query into the computer's cache. The next time the same term is queried, the result is read directly from the cache without searching the index repository again, which markedly improves query efficiency;

A term-frequency and position weighted ranking algorithm can be implemented from the term-frequency attributes in the index lists, so search results can be sorted by how well they match the query terms. This yields records closer to the querier's intent, improves search quality, gives the querier a friendlier retrieval experience, and strengthens users' reliance on the system.
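The source does not give the formula for the term-frequency and position weighted ranking, so the following is only a minimal sketch under an assumed weighting: every occurrence of a query term contributes 1 / (1 + position), so frequent and early matches score higher. The function names score and rank and the sample documents are hypothetical.

```python
def score(query_terms, doc_terms):
    """Assumed weighting: each occurrence of a query term adds 1 / (1 + position)."""
    wanted = set(query_terms)
    return sum(
        1.0 / (1 + position)
        for position, term in enumerate(doc_terms)
        if term in wanted
    )


def rank(query_terms, documents):
    """Return ids of matching documents, best score first (ties broken by id)."""
    scored = [(score(query_terms, terms), doc_id) for doc_id, terms in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for value, doc_id in ordered if value > 0]


# Hypothetical, already segmented documents.
documents = {
    "a": ["lucene", "index", "lucene"],
    "b": ["database", "query", "lucene"],
    "c": ["database", "query"],
}
ranking = rank(["lucene"], documents)
```

Here document "a" ranks ahead of "b" because the term occurs twice and earlier, and "c" is dropped because it does not contain the term at all.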

Brief Description of the Drawings

Figure 1 is a flowchart of the Lucene-based database query method;

Figure 2 is a schematic diagram of the hierarchical structure of the Lucene index.

Detailed Description

To make the purpose, technical solution, and advantages of the invention clearer, the Lucene-based database query method of the invention is described in further detail below with reference to a specific embodiment.

Drawing on the hierarchical structure of the full-text search tool Lucene, the invention proposes a Lucene-based database query method and constructs a Lucene-based database query expansion method. Experimental analysis shows that this query expansion method markedly improves the speed and recall of database queries and gives the querier a friendlier retrieval experience.

Embodiment:

Following the hierarchical structure of the full-text search tool Lucene, the Lucene-based database query method of this embodiment first creates an index for the resources in the database and adds the index to the index repository; it then queries the index repository with the retrieval conditions, obtains the retrieval results, and returns them. Its main steps comprise an index creation part and an index search part.

Lucene's interface for analyzing and processing documents is independent of the language and format of the documents; to create an index, it is enough to have the indexing tool process a data stream. The index creation part collects resources from the database at fixed intervals, analyzes and processes them appropriately, then creates an index for those resources and adds it to the index repository.

Lucene comes with a mature set of search tools that the querier can use to meet retrieval needs. The querier can define specific search rules as required, such as fuzzy search and range search. The index search part obtains query statements from the retrieval conditions supplied by the querier, analyzes and processes those query statements, queries the index repository, and returns the final retrieval results to the querier.

Figure 1 is a flowchart of the Lucene-based database query method. As shown in Figure 1, the method proceeds as follows. First, resource information is collected from the database at fixed intervals and analyzed and processed appropriately; the index creation module then indexes those resources and adds them to the index repository (index library). Next, the query conditions supplied by the user are obtained and are analyzed and processed by the query condition analysis module. The query index module then searches the index repository (index library), the query results are presented, and the final retrieval results are returned to the user.

Figure 2 is a schematic diagram of the hierarchical structure of the Lucene index. As shown in Figure 2, the Lucene index is divided into the following levels: the Index level at the top, then the Segment level, then the Document level, and the Field level at the bottom. An Index consists of a number of Segments, each Segment contains many Documents, and each Document contains many Fields; finally, each Field is made up of individual Terms.
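The containment hierarchy described above can be illustrated with plain data classes. This is a simplified sketch for illustration only, not Lucene's actual API or on-disk format; the class layout, the all_terms helper, and the sample field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Field:
    """A named field whose content has already been split into terms."""
    name: str
    terms: List[str]


@dataclass
class Document:
    """A document groups several fields, for example one database record."""
    fields: Dict[str, Field] = field(default_factory=dict)

    def add_field(self, name: str, terms: List[str]) -> None:
        self.fields[name] = Field(name, terms)


@dataclass
class Segment:
    """A segment holds many documents."""
    documents: List[Document] = field(default_factory=list)


@dataclass
class Index:
    """An index is made up of segments."""
    segments: List[Segment] = field(default_factory=list)

    def all_terms(self) -> List[str]:
        """Walk Index -> Segment -> Document -> Field -> Term."""
        return [
            term
            for segment in self.segments
            for document in segment.documents
            for fld in document.fields.values()
            for term in fld.terms
        ]


doc = Document()
doc.add_field("major", ["computer", "science"])
doc.add_field("course", ["databases"])
index = Index(segments=[Segment(documents=[doc])])
```

Walking the structure from the top visits every level in the order given in the text, ending at the individual terms.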

Because the Lucene index is organized according to a defined structure, a search can locate matches directly in the index without scanning the original resources sequentially, which greatly narrows the search space and greatly improves retrieval efficiency. Lucene does not require its data source to be in any particular format; it works at the level of documents. The data source from which an index is built can be in various formats, such as XML documents, strings, TXT documents, or data held in a database.

The specific implementation of the Lucene-based database query method of this embodiment is described below to explain its technical content and technical effect in more detail.

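First, an index of the data is created, which comprises the following steps:

1) First, obtain the information resources: at fixed intervals, fetch the records stored in the database tables to serve as the source material for creating the index files;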

2) Next, filter the information resources: within each record, decide which fields to store so that useless content is excluded while the information remains complete. For example, for university students the fields generally worth storing are "student ID", "major", "courses", and "exam results", whereas information such as occasional in-class quizzes and mock exams generally need not be stored;

3) Then analyze the filtered content and perform word segmentation. The most commonly used word segmentation tool at present is the Mmseg4j tokenizer, which is based on the Sogou lexicon;

4) Finally, create the index: load the record content into the repository and create an index over the segmented words. The index can be kept on disk or in memory, and the index files are finally placed in the index repository.
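The indexing steps above can be sketched end to end as follows. This is a minimal illustration using only the Python standard library, not the Lucene or Mmseg4j APIs: the students table, its column names, the regular-expression tokenizer standing in for a real word segmenter, and the in-memory dictionary standing in for the index repository are all assumptions. Scheduling the collection at fixed intervals is left out.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical fields considered worth storing (step 2).
STORED_FIELDS = ("student_id", "major", "course")


def fetch_records(conn):
    """Step 1: collect the rows of the (hypothetical) students table."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute("SELECT * FROM students")]


def filter_fields(record):
    """Step 2: keep only the fields worth storing."""
    return {name: record[name] for name in STORED_FIELDS}


def tokenize(text):
    """Step 3: stand-in segmenter that extracts lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def build_index(records):
    """Step 4: map each term to the sorted ids of the records containing it."""
    postings = defaultdict(set)
    for record in records:
        for value in filter_fields(record).values():
            for term in tokenize(value):
                postings[term].add(record["id"])
    return {term: sorted(ids) for term, ids in postings.items()}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, student_id TEXT, "
    "major TEXT, course TEXT, quiz_note TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (1, "S001", "computer science", "databases", "pop quiz 7/10"),
        (2, "S002", "mathematics", "databases", "mock exam"),
    ],
)
index = build_index(fetch_records(conn))
```

Because quiz_note is not among the stored fields, its words never reach the index, which mirrors the filtering described in step 2.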

Once the index has been created, index search and basic queries are performed on it. The process is as follows: the retrieval conditions supplied by the querier are obtained first; the syntactic structure of the resulting condition statement is then analyzed, the corresponding keywords are extracted, and a syntax tree is built according to certain rules; finally, the index is searched to find the database records that satisfy the syntax tree.
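The source does not specify the rules used to build the syntax tree. As a hedged sketch, the parser below assumes whitespace-separated keywords joined by the operators AND, OR, and NOT, applied left to right, and represents the tree as nested tuples. The function name parse_query and this grammar are assumptions, not Lucene's query parser.

```python
def parse_query(query):
    """Parse 'a AND b NOT c' into a left-leaning tree of (operator, left, right) tuples."""
    tokens = query.split()
    if not tokens:
        raise ValueError("empty query")
    tree = tokens[0]  # the first token is taken as a keyword
    position = 1
    while position < len(tokens):
        operator = tokens[position].upper()
        if operator not in ("AND", "OR", "NOT"):
            raise ValueError(f"expected an operator, got {tokens[position]!r}")
        if position + 1 >= len(tokens):
            raise ValueError(f"operator {operator} has no right operand")
        # Fold the next keyword into the tree built so far.
        tree = (operator, tree, tokens[position + 1])
        position += 2
    return tree


tree = parse_query("key1 and key2 not key3")
```

The resulting tree places the NOT node at the root with the AND node as its left child, so the conjunction is evaluated before the exclusion.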

The retrieval steps are illustrated with the example syntax tree "key1 and key2 not key3":

(1) First, look up in the inverted lists of the database the records that contain key1, key2, and key3 respectively; then merge the record lists for key1 and key2 by intersection to obtain the list of records that contain both key1 and key2;

(2) Then take the difference between that list and the list of records containing key3, removing every record that contains key3. What remains is the list of records that contain key1 and key2 but not key3, which is the final result list.
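Steps (1) and (2) can be sketched on sorted lists of record ids. This is a minimal illustration; the sample posting lists and the function names intersect and difference are assumptions, and a real inverted index would hold more than bare ids.

```python
def intersect(left, right):
    """Ids present in both sorted lists, found with a linear merge."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def difference(left, right):
    """Ids in left that do not appear in right, keeping the order of left."""
    excluded = set(right)
    return [doc_id for doc_id in left if doc_id not in excluded]


# Hypothetical inverted lists: keyword -> sorted ids of records containing it.
postings = {
    "key1": [1, 2, 3, 5],
    "key2": [2, 3, 5, 8],
    "key3": [3, 9],
}
both = intersect(postings["key1"], postings["key2"])  # step (1)
matches = difference(both, postings["key3"])          # step (2)
```

With these sample lists, records 2, 3, and 5 contain both key1 and key2, and removing record 3 (which contains key3) leaves records 2 and 5.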

In the Lucene-based database query method of this embodiment, Lucene extends the inverted structure of earlier query tools with the ability to build index files in blocks: a small index can be created for new resources to improve query efficiency and later merged with the existing index, which optimizes the index resources as a whole. When the same search term is queried repeatedly, the invention loads the result of the first query into the computer's cache. The next time the same term is queried, the result is read directly from the cache without searching the index repository again, so the search speed of the invention can be several times higher. When the conditions supplied by the querier contain several words, the invention segments them into individual words and matches them against the index repository as Terms. It can also match database content in which the words appear in a different order, so related records are found and presented to the user; this database query expansion method therefore achieves high recall.
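The caching behaviour described above can be sketched as a thin wrapper around an index lookup. This is an illustrative sketch only: the class name CachedSearcher, the dictionary standing in for the index repository, and the lookup counter are assumptions, and cache invalidation when the index changes (which the source does not discuss) is not handled.

```python
class CachedSearcher:
    """Answer repeated queries for the same term from an in-memory cache."""

    def __init__(self, index):
        self._index = index      # stand-in for the index repository
        self._cache = {}         # term -> cached result
        self.index_lookups = 0   # how many times the index was actually searched

    def search(self, term):
        if term in self._cache:
            return self._cache[term]
        self.index_lookups += 1
        result = tuple(self._index.get(term, ()))
        self._cache[term] = result
        return result


searcher = CachedSearcher({"lucene": [1, 4], "database": [2, 4]})
first = searcher.search("lucene")   # searches the index
second = searcher.search("lucene")  # served from the cache
```

Both calls return the same result, but the index is consulted only once, which is the saving the embodiment attributes to caching.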

The above embodiment is only a specific case of the invention. The scope of patent protection of the invention includes but is not limited to the above embodiment; any appropriate change or substitution made by a person of ordinary skill in the art that is consistent with the claims of the invention shall fall within the scope of patent protection of the invention.

Claims

1. A Lucene-based database query method, characterized in that, following the hierarchical structure of the Lucene index, an index is first created for the resources in the database and added to the index repository; the index repository is then queried with the retrieval conditions, and the retrieval results are obtained and returned; the main steps comprise an index creation part and an index search part.

2. The Lucene-based database query method according to claim 1, characterized in that the index creation part collects resources from the database at fixed intervals, analyzes and processes them appropriately, then creates an index for those resources and adds it to the index repository.

3. The Lucene-based database query method according to claim 2, characterized in that the index creation part specifically comprises the following steps:

4. The Lucene-based database query method according to claim 2, characterized in that the index search part obtains query statements from the retrieval conditions supplied by the querier, analyzes and processes those query statements, queries the index repository, and returns the final retrieval results to the querier.

5. The Lucene-based database query method according to claim 4, characterized in that the retrieval conditions supplied by the querier are obtained first; the syntactic structure of the resulting condition statement is then analyzed, the corresponding keywords are extracted, and a syntax tree is built according to certain rules; finally, the index is searched to find the database records that satisfy the syntax tree.