





Technical Field
The invention relates to a user-oriented information search engine system and method, and belongs to the technical field of information search.
Background
Search engines have become the main tool for information retrieval. As the volume of information grows explosively, intelligent and efficient search methods can greatly increase query speed and improve recall and precision, so that users obtain as much relevant information as possible in the shortest possible time, which is a great convenience to them.
Depending on where they intervene, current improvements to search engine design fall into two main categories: methods that expand the semantics of the search terms, and methods that infer user interests.
Methods that expand the semantics of the search terms use ontology technology to parse the semantic network of the search terms, thereby expanding the terms and widening the query scope. They have two drawbacks. First, they analyze only the semantics of the search terms and do not consider key information in the full text of the search results that may carry aggregated semantics. Second, they tend to focus on the semantics of the search terms themselves and neglect the user's intent, which makes it difficult for the search results to meet the user's needs.
Methods that infer user interests record and analyze the user's operations on historical search results to identify the information the user is interested in, and from this infer the user's areas of interest. Their drawback is that they consider only the user's points of interest and do not expand the query at the semantic level. Because users' grasp of their own real intent is often limited and inaccurate, these methods also often fail to deliver results that truly match that intent.
In addition, existing search engine systems all require the user to type keywords manually. Even where search suggestions are provided, they merely list the user's past searches in order; they do not parse that history, push terms according to usage frequency, or allow the user to select and order individual terms, which makes user interaction more cumbersome.
Summary of the Invention
The technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a search engine system and method whose query results are more complete in scope and higher in precision. The method reconstructs the search terms on the basis of inferred user interests, and during reconstruction also expands their semantics by reference to an authoritative thesaurus, which widens the search scope. In addition, the system lets users enter search terms by selection and order them freely, and improves the accuracy of subsequent query results through interaction, giving users a flexible, convenient and intelligent interface for information search.
The technical solution of the invention: a user-oriented information search engine system which, as shown in Fig. 1, consists of a client and a server. The server is responsible for back-end parsing and processing of the data passed by the client; deployed on the server are the search term push module, the user focus update module, the initial search module, the user interest inference module, the search term reconstruction module and the secondary search module. The client host interacts with the server in B/S (browser/server) mode; deployed on the client are the user-initiated search module and the initial search module. The modules are implemented as follows:
Search term push module: the server queries the user focus library according to the identity of the current user. The user focus library consists of two parts, the user's own historical focus points and the historical focus points of users with the same interests, each made up of historical search terms and their usage frequencies. The module first parses the user's own historical search terms, sorts them by usage frequency from high to low, selects those whose usage frequency exceeds a given threshold, and writes them in order into the user's own historical focus term set, the searchVoc_past set. It then traverses the searchVoc_past set, obtains for each historical search term the other users who have used it (excluding the current user), and writes them into the same-interest user set, the user_sameInt set. It obtains the historical search terms of each user in the user_sameInt set in turn, queries the usage frequency of each term, and writes the terms, from high frequency to low, into the same-interest users' historical focus term set, the searchVoc_past_other set. Finally it traverses the searchVoc_past_other set, appends its terms in order to the searchVoc_past set while avoiding duplicates, forms the search term push list from the searchVoc_past set, and outputs the list to the client for the user-initiated search module to call;
User-initiated search module: receives the search term push list output by the search term push module, parses the search terms in it and displays them on the client in order. It provides check boxes and sort buttons so that the user can select or deselect each search term and set the priority of the search terms, and it updates the search term set dynamically according to the user's selections. It also lets the user add to or modify the search term set manually. The result is the search request that is finally submitted, for the user focus update module and the initial search module to call;
User focus update module: receives the search request and records the search behavior initiated by the user, where the search behavior consists of the search terms entered by the user and their order. It writes the entered search terms in order into the user-selected search term set, the searchVoc_select set, then traverses the searchVoc_select set and checks whether each search term already exists in the user focus library. If the term exists, its current usage frequency is updated; otherwise the term is written into the user's own historical focus point set in the user focus library and its current usage frequency is set to the initial value;
Initial search module: performs the initial search according to the search behavior initiated by the user. It first forms all permutations and combinations of the search terms in the searchVoc_select set according to their priority; the resulting set, denoted the searchVoc_select recombination set, contains both independent terms and compound terms. It traverses the searchVoc_select recombination set and queries, in turn, the search results that match each term: a result matches an independent term if it contains that term, and matches a compound term if it contains every component of the compound. For the results matching each search term, it counts how often the term is matched in the full text and sorts the results by match frequency from high to low. The matched result lists are then combined in the term order of the searchVoc_select recombination set and written into the initial search result set, the result_first set. Each entry in a search result list consists of the title, abstract and source of the result, where the abstract is the passage of the full text that matches the search term most often. The result_first set is output to the client for the user to view;
User interest inference module: records the user's operations on the result_first set and writes the user's filtering behavior into the user filtering set for the initial search results, the result_userSelect set. The filtering behavior consists of the ID of each result the user selected, the number of clicks on the result and the time spent viewing the result. For each result, the products "number of clicks × viewing time" are summed to give the user's degree of attention to that result. The results are sorted by degree of attention from high to low, the abstract of each result is parsed out, and the abstracts are written in that order into the user-filtered result abstract set, the result_abstract set, which is output to the user attention result word segmentation module;
User attention result word segmentation module: traverses the result_abstract set, parses out in turn the abstracts of the results the user attended to, and segments them into words against a dictionary set using a reverse matching algorithm. The dictionary set is an array of hash tables (HashMap): the array length is the number of Chinese characters that can begin a word in the dictionary, the array index is the location code (区位码) of that character, and each array element is a HashMap of all words beginning with that character, with the word itself as the key and its word frequency as the value. After segmentation, meaningless words are removed by checking against a stop-word lexicon. The segmentation result of each abstract is stored as a separate array in the discrete set of abstract segmentation results, the abstract_cut_apart set. At the same time the union of the segmentation results, that is, the largest set containing no repeated words, is extracted and written into the combined set of abstract segmentation results, the abstract_cut_unit set. Both the abstract_cut_apart set and the abstract_cut_unit set are output to the search term reconstruction module;
Search term reconstruction module: traverses the words in the abstract_cut_unit set and, by comparison with the abstract_cut_apart set, determines the number of different abstracts in which each word appears; repeated occurrences of a word within the same abstract are not counted. Words whose count equals the number of abstracts, that is, words that appear in every abstract, are collected and written into the intersection of the abstract segmentation results, the abstract_cut_same set. The abstract_cut_same set is then analyzed against the Chinese Classified Thesaurus, and words that stand in a use/used-for relation or a related-term relation to the words in it are written into the reorganized set of abstract segmentation results, the abstract_cut_reorg set. Both the abstract_cut_same set and the abstract_cut_reorg set are output to the secondary search module;
Secondary search module: first parses the abstract_cut_same set and forms permutations and combinations of its words using the method of the initial search module. It traverses the search terms in the abstract_cut_same set and obtains, in turn, the documents whose full text matches each term and the images and videos whose titles match it; for a compound term, a match means that every component of the compound is satisfied. It then parses the abstract_cut_reorg set and obtains the documents, images and videos that match each independent word in it. All document files are written in search order into the secondary search document result set, the result_second_doc set; all image files are written in search order into the secondary search image result set, the result_second_image set; and all video files are written in search order into the secondary search video result set, the result_second_vedio set. The three sets result_second_doc, result_second_image and result_second_vedio are returned to the client, where the search results are shown to the user by category together with a prompt that these results may better match the user's intent, for the user to examine in more depth.
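The reverse matching segmentation used by the user attention result word segmentation module can be illustrated with a minimal sketch. It assumes a reverse maximum matching variant with a fixed maximum word length, and it keys the dictionary by the first character itself rather than by its location code; the names `build_dictionary`, `reverse_max_match`, `max_len` and `stop_words` are hypothetical and do not come from the original text.

```python
def build_dictionary(words):
    """Group words by first character: first char -> {word: frequency}.

    Assumption: a Python dict keyed by the first character stands in for
    the array of HashMaps indexed by location code described in the text.
    """
    index = {}
    for word, freq in words.items():
        index.setdefault(word[0], {})[word] = freq
    return index


def reverse_max_match(text, index, max_len=4, stop_words=frozenset()):
    """Segment text from right to left, preferring the longest dictionary word."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in index.get(candidate[0], {}):
                break
        tokens.append(candidate)
        end -= size
    tokens.reverse()
    # Remove meaningless (stop) words after segmentation.
    return [t for t in tokens if t not in stop_words]


index = build_dictionary({"研究": 10, "研究生": 5, "生命": 8, "起源": 6, "的": 100})
tokens = reverse_max_match("研究生命的起源", index, max_len=3, stop_words={"的"})
```

Scanning from the end of the string is what lets this example split "研究生命" as "研究" + "生命" instead of taking "研究生" first, which is a common reason for choosing the reverse direction for Chinese text.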
The search term push module is implemented as follows:
(1) Capture the user information: from the session that stores the identity information at login, obtain the user name and user number (ID) of the currently logged-in user;
(2) Query the user focus library by user ID and extract the historical search terms matching that ID together with their usage frequencies; denote a search term by V and its usage frequency by F, and arrange the results in descending order of F;
(3) Let the preset term frequency threshold be E, and compare the usage frequency F with the threshold E;
a. If F >= E, write the V corresponding to F into the user's own historical focus term set, denoted the searchVoc_past set;
b. If F < E, do nothing;
(4) Parse the searchVoc_past set, traverse the search terms V in it in turn, query the user focus library to obtain the IDs of the other users (excluding the current user) that match V, and write them into the same-interest user set, the user_sameInt set;
(5) For each user ID in the user_sameInt set, query the user focus library to obtain the historical search term records matching that ID; traverse the historical search terms in the records, count the usage frequency of each term in the user focus library, and write the terms, from high frequency to low, into the same-interest users' historical focus term set, the searchVoc_past_other set;
(6) Traverse the searchVoc_past_other set and check in turn whether each word already exists in the searchVoc_past set;
a. If it already exists, do nothing with this word and continue with the next word;
b. If it does not exist, add the word to the searchVoc_past set;
(7) Store the searchVoc_past set in the cache as an array and output it to the client as the search term push list, for the user-initiated search module to call.
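Steps (2) to (7) above can be sketched in a few lines. This is only an illustration under stated assumptions: the user focus library is modeled as an in-memory dict (`focus_db`, user ID to term to usage frequency), and the frequency used in step (5) is taken to be the total across all users in the library. The function name `build_push_list` and the data layout are hypothetical.

```python
from collections import defaultdict


def build_push_list(user_id, focus_db, threshold):
    """Return the search term push list (the final searchVoc_past set).

    Assumption: focus_db maps user_id -> {search_term: usage_frequency}.
    """
    own = focus_db.get(user_id, {})

    # Steps (2)-(3): own terms in descending frequency, kept if F >= E.
    search_voc_past = [
        term
        for term, freq in sorted(own.items(), key=lambda kv: -kv[1])
        if freq >= threshold
    ]

    # Step (4): other users who have used any of those terms.
    user_same_int = [
        uid
        for uid, terms in focus_db.items()
        if uid != user_id and any(t in terms for t in search_voc_past)
    ]

    # Step (5): terms of those users, ordered by library-wide frequency
    # (assumed interpretation), with the term itself as a tie-breaker.
    total = defaultdict(int)
    for terms in focus_db.values():
        for term, freq in terms.items():
            total[term] += freq
    candidates = {t for uid in user_same_int for t in focus_db[uid]}
    search_voc_past_other = sorted(candidates, key=lambda t: (-total[t], t))

    # Step (6): append terms that are not already present.
    seen = set(search_voc_past)
    for term in search_voc_past_other:
        if term not in seen:
            search_voc_past.append(term)
            seen.add(term)

    # Step (7): the list is what would be cached and sent to the client.
    return search_voc_past


focus_db = {
    "u1": {"radar": 5, "antenna": 2, "filter": 1},
    "u2": {"radar": 3, "satellite": 4},
    "u3": {"filter": 9, "laser": 7},
}
push_list = build_push_list("u1", focus_db, threshold=2)
```

In this example `u3` is not treated as a same-interest user, because the only term it shares with `u1` ("filter") falls below the threshold and never enters the searchVoc_past set.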
The search term reconstruction module is implemented as follows:
(1) Traverse the user-filtered result abstract set, the result_abstract set, parse out in turn the abstracts of the results the user attended to, and segment them against the dictionary set using the reverse matching algorithm; store the segmentation result of each abstract as a separate array in the discrete set of abstract segmentation results, the abstract_cut_apart set, and denote the number of arrays by N;
(2) Extract the union of the segmentation results, that is, the largest set containing no repeated words, and write it into the combined set of abstract segmentation results, the abstract_cut_unit set;
(3) Traverse the abstract_cut_unit set and perform the following operations for each search term in it;
(3.1) Initialize the occurrence frequency of the current search term, Fabs = 0;
(3.2) Traverse the array elements of the abstract_cut_apart set and check whether each array element contains the current search term;
a. If it does, set Fabs = Fabs + 1 and continue with the next array element;
b. If it does not, leave Fabs unchanged.
(3.3) Compare the Fabs value of the current search term with N, the number of arrays in the abstract_cut_apart set;
a. If Fabs = N, write the current search term into the intersection of the abstract segmentation results, the abstract_cut_same set;
b. If Fabs < N, do nothing and continue with the next search term.
(4) Traverse the abstract_cut_same set and, for each search term in it, look up in the Chinese Classified Thesaurus the semantic network whose entry descriptor is that term;
(4.1) If the semantic network contains a relation word labeled "Y", the term has a formal (preferred) expression; write the formal expression into the abstract_cut_reorg set;
(4.2) If the semantic network contains a relation word labeled "D", the term has an informal (non-preferred) expression; write the informal expression into the abstract_cut_reorg set;
(4.3) If the semantic network contains a relation word labeled "C", the term has an expression related to it in meaning; write the related expression into the abstract_cut_reorg set;
(5) Output both the abstract_cut_same set and the abstract_cut_reorg set to the secondary search module as arrays.
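Steps (2) to (5) above amount to computing the intersection of the per-abstract word lists and then expanding it through thesaurus relations. The sketch below assumes the segmentation of step (1) has already been done and models the thesaurus as a plain dict from a term to its "Y", "D" and "C" relation words; `reconstruct_terms` and that layout are hypothetical stand-ins, not an actual thesaurus interface.

```python
def reconstruct_terms(abstract_cut_apart, thesaurus):
    """Return (abstract_cut_same, abstract_cut_reorg).

    abstract_cut_apart: list of word lists, one per abstract.
    thesaurus: assumed layout {term: {"Y": [...], "D": [...], "C": [...]}}.
    """
    n = len(abstract_cut_apart)

    # Step (2): union without repeats, in order of first appearance.
    abstract_cut_unit = list(
        dict.fromkeys(w for words in abstract_cut_apart for w in words)
    )

    # Step (3): count the abstracts containing each word; repeats inside
    # one abstract are not counted because each abstract becomes a set.
    word_sets = [set(words) for words in abstract_cut_apart]
    abstract_cut_same = []
    for word in abstract_cut_unit:
        f_abs = sum(1 for s in word_sets if word in s)
        if f_abs == n:
            abstract_cut_same.append(word)

    # Step (4): collect formal (Y), informal (D) and related (C) words.
    abstract_cut_reorg = []
    for word in abstract_cut_same:
        relations = thesaurus.get(word, {})
        for label in ("Y", "D", "C"):
            for related in relations.get(label, []):
                if related not in abstract_cut_reorg:
                    abstract_cut_reorg.append(related)

    # Step (5): both sets go to the secondary search.
    return abstract_cut_same, abstract_cut_reorg


apart = [
    ["radar", "signal", "noise"],
    ["signal", "radar", "radar"],
    ["radar", "antenna", "signal"],
]
thesaurus = {
    "radar": {"D": ["radio detection and ranging"], "C": ["sonar"]},
    "signal": {"C": ["sonar", "waveform"]},
}
same, reorg = reconstruct_terms(apart, thesaurus)
```

Skipping words already present in `abstract_cut_reorg` is a design choice of this sketch; the text does not say whether that set should be free of duplicates.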
The user-oriented information search engine method is implemented in the following steps:
(1) The server queries the user focus library according to the identity of the current user. The user focus library consists of two parts, the user's own historical focus points and the historical focus points of users with the same interests, each made up of historical search terms and their usage frequencies. The user's own historical search terms are first parsed and sorted by usage frequency from high to low; those whose usage frequency exceeds a given threshold are selected and written in order into the user's own historical focus term set, the searchVoc_past set. The searchVoc_past set is then traversed, the other users who have used each historical search term (excluding the current user) are obtained and written into the same-interest user set, the user_sameInt set. The historical search terms of each user in the user_sameInt set are obtained in turn, the usage frequency of each term is queried, and the terms are written, from high frequency to low, into the same-interest users' historical focus term set, the searchVoc_past_other set. The searchVoc_past_other set is traversed and its terms are appended in order to the searchVoc_past set while avoiding duplicates; the search term push list is formed from the searchVoc_past set and output to the client for the user-initiated search module to call;
(2) Receive the search term push list, parse the search terms in it and display them on the client in order; provide check boxes and sort buttons so that the user can select or deselect each search term and set the priority of the search terms; update the search term set dynamically according to the user's selections, and let the user add to or modify the search term set manually, so as to form the search request that is finally submitted;
(3) Receive the search request and record the search behavior initiated by the user, where the search behavior consists of the search terms entered by the user and their order. Write the entered search terms in order into the user-selected search term set, the searchVoc_select set, then traverse the searchVoc_select set and check whether each search term already exists in the user focus library. If the term exists, update its current usage frequency; otherwise write the term into the user's own historical focus point set in the user focus library and set its current usage frequency to the initial value. This provides the data basis for subsequent search term pushes;
(4) Perform the initial search according to the search behavior initiated by the user. First form all permutations and combinations of the search terms in the searchVoc_select set according to their priority; the resulting searchVoc_select set contains both independent terms and compound terms. Traverse the recombined searchVoc_select set and query, in turn, the search results that match each term: a result matches an independent term if it contains that term, and matches a compound term if it contains every component of the compound. For the results matching each search term, count how often the term is matched in the full text and sort the results by match frequency from high to low. Combine the matched result lists in the term order of the searchVoc_select set and write them into the initial search result set, the result_first set. Each entry in a search result list consists of the title, abstract and source of the result, where the abstract is the passage of the full text that matches the search term most often. Output the result_first set to the client for the user to view;
(5) Record the user's operations on the result_first set and write the user's filtering behavior into the user filtering set for the initial search results, the result_userSelect set. The filtering behavior consists of the ID of each result the user selected, the number of clicks on the result and the time spent viewing the result. For each result, sum the products "number of clicks × viewing time" to obtain the user's degree of attention to that result; sort the results by degree of attention from high to low, parse out the abstract of each result, and write the abstracts in that order into the user-filtered result abstract set, the result_abstract set, for use in word segmentation;
(6) Traverse the result_abstract set, parse out in turn the abstracts of the results the user attended to, and segment them into words against a dictionary set using a reverse matching algorithm. The dictionary set is an array of hash tables (HashMap): the array length is the number of Chinese characters that can begin a word in the dictionary, the array index is the location code (区位码) of that character, and each array element is a HashMap of all words beginning with that character, with the word itself as the key and its word frequency as the value. After segmentation, remove meaningless words by checking against a stop-word lexicon. Store the segmentation result of each abstract as a separate array in the discrete set of abstract segmentation results, the abstract_cut_apart set; at the same time extract the union of the segmentation results, that is, the largest set containing no repeated words, and write it into the combined set of abstract segmentation results, the abstract_cut_unit set;
(7) Traverse the words in the abstract_cut_unit set and, by comparison with the abstract_cut_apart set, determine the number of different abstracts in which each word appears; repeated occurrences of a word within the same abstract are not counted. Collect the words whose count equals the number of abstracts, that is, the words that appear in every abstract, and write them into the intersection of the abstract segmentation results, the abstract_cut_same set. Analyze the abstract_cut_same set against the Chinese Classified Thesaurus and write the words that stand in a use/used-for relation or a related-term relation to the words in it into the reorganized set of abstract segmentation results, the abstract_cut_reorg set, for use in the secondary search.
(8) First parse the abstract_cut_same set and form permutations and combinations of its words using the method of the initial search module. Traverse the search terms in the abstract_cut_same set and obtain, in turn, the documents whose full text matches each term and the images and videos whose titles match it; for a compound term, a match means that every component of the compound is satisfied. Then parse the abstract_cut_reorg set and obtain the documents, images and videos that match each independent word in it. Write all document files in search order into the secondary search document result set, the result_second_doc set; write all image files in search order into the secondary search image result set, the result_second_image set; and write all video files in search order into the secondary search video result set, the result_second_vedio set. Return the three sets result_second_doc, result_second_image and result_second_vedio to the client and show the search results to the user by category, providing the user with more accurate search results.
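The attention ranking in step (5) can be sketched as follows. The sketch assumes that the result_userSelect set may hold several interaction records per result and that the degree of attention is the sum of "number of clicks × viewing time" over those records; the record layout, the unit of time (seconds) and the name `rank_by_attention` are hypothetical.

```python
from collections import defaultdict


def rank_by_attention(result_user_select, abstracts):
    """Return the abstracts ordered by the user's degree of attention.

    result_user_select: iterable of (result_id, clicks, view_seconds)
        records (assumed layout of the result_userSelect set).
    abstracts: dict mapping result_id -> abstract text.
    """
    attention = defaultdict(float)
    for result_id, clicks, view_seconds in result_user_select:
        # Sum "number of clicks x viewing time" per result.
        attention[result_id] += clicks * view_seconds

    # Highest degree of attention first; this is the result_abstract order.
    ranked_ids = sorted(attention, key=lambda rid: -attention[rid])
    return [abstracts[rid] for rid in ranked_ids]


records = [
    ("r1", 1, 10.0),
    ("r2", 2, 30.0),
    ("r1", 1, 20.0),
    ("r3", 1, 5.0),
]
abstracts = {"r1": "abstract one", "r2": "abstract two", "r3": "abstract three"}
result_abstract = rank_by_attention(records, abstracts)
```

Here `r2` scores 60, `r1` scores 30 and `r3` scores 5, so the abstracts are passed on to word segmentation in that order.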
Compared with the prior art, the invention has the following advantages:
(1) The invention combines semantic expansion of keywords with inference of user interests. On the basis of captured user interactions, it expands semantics by extracting and parsing key information from the full text and thereby reconstructs the search terms, which improves the authority and convergence of the search terms and makes the search results match the user's real intent more closely.
(2) The invention can push historical search terms automatically by capturing user information, and lets users enter search terms by selection and order them freely. This reduces the effort of entering search terms compared with existing search engines and gives users a flexible and convenient interface for information search.
(3) By recording the search requests initiated by the user, the invention continually supplements and refines the user's focus points, which improves the accuracy of subsequent query results and makes the search engine system more intelligent.
(4) When the user submits a search request, the invention first returns a certain number of search results based on the initial search terms, so that the request is answered quickly. While the user views this information, the invention reconstructs the search terms and performs a secondary search based on the user's operations, and feeds the more in-depth results back to the user as recommendations. This improves recall and precision while maintaining search efficiency.
Description of the drawings
Fig. 1 is the architecture diagram of the system of the present invention;
Fig. 2 shows the implementation process of the search term push module in the system of the present invention;
Fig. 3 shows the implementation process of the user-initiated search module in the system of the present invention;
Fig. 4 shows the implementation process of the user focus update and initial search module in the system of the present invention;
Fig. 5 shows the implementation process of the user interest inference module in the system of the present invention;
Fig. 6 shows the implementation process of the user attention result word segmentation, search term reconstruction, and secondary search module of the present invention.
Detailed description
The user-oriented information search engine system of the present invention consists of servers and a client. The database server uses a dual-core Xeon 2.8 GHz processor, 16 GB of memory, and a 2 TB hard disk and stores all data; it is also configured with a tape library and backup software for backing up and restoring historical data. The application server runs the Linux operating system with Oracle9i or later as the data management software. It hosts the search term push module, the user focus update and initial search module, the user interest inference module, and the search term reconstruction and secondary search module, and it is responsible for the back-end parsing and processing of the data passed from the client. The client host uses a 3.0 GHz CPU, 4 GB of memory, and a 500 GB hard disk, runs the Windows XP operating system, and interacts with the server in browser/server (B/S) mode. Its main function is front-end presentation, which covers the user-initiated search module and the display of the initial and secondary search results.
To make the invention easier to understand, some basic concepts are explained first.
Search term push list: an array of recommended search terms. Its elements are the user's own historical search terms and the historical search terms of users with the same interests. Each element is one search term, and the elements are ordered by usage frequency from high to low.
User-selected search term set: the list of search terms that the user forms, according to his or her own search intent, by manually filtering the search terms pushed by the system. Filtering operations include selecting a pushed term as a search term, removing a pushed term from the current search term list, adjusting the order of the terms in the list, and adding new search terms. The user-selected search term set is denoted the searchVoc_select set.
searchVoc_select recombination set: the set obtained by forming all combinations of the search terms in the searchVoc_select set, where the word order inside each combination follows the priority of the words in the original set. If the searchVoc_select set is (A,B,C), the searchVoc_select recombination set is (ABC,AB,AC,BC,A,B,C). The recombination set contains independent words and compound words. An independent word is a word with a meaning of its own (A, B, and C in this example); a compound word is several words combined together (ABC, AB, AC, and BC in this example).
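The recombination described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `recombine` is hypothetical, and each combination is represented as a tuple of independent words rather than as a concatenated string.

```python
from itertools import combinations


def recombine(search_terms):
    """Return all non-empty combinations of the terms, longest first.

    itertools.combinations keeps the input order, so the priority order
    of the original searchVoc_select set is preserved inside every
    combination. (Hypothetical helper, not taken from the patent text.)
    """
    recombined = []
    for size in range(len(search_terms), 0, -1):
        recombined.extend(combinations(search_terms, size))
    return recombined


# Joining each tuple reproduces the notation used in the example above.
labels = ["".join(combo) for combo in recombine(["A", "B", "C"])]
```

For the set (A,B,C), `labels` holds ABC, AB, AC, BC, A, B, C in that order; tuples of length 1 correspond to independent words and longer tuples to compound words.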
Reverse matching algorithm: a basic word segmentation algorithm. Its idea is as follows. Suppose the longest entry in the dictionary contains n Chinese characters. Starting from the end of the string to be processed, take the last n characters as the matching field and look it up in the segmentation dictionary. If the dictionary contains the field, the match succeeds and the word is split off; the next n characters, starting from the (n+1)-th position counted from the end of the string, are then taken as the new field and matched against the dictionary. If the match fails, the character of the field that lies farthest from the end of the string is dropped and the remaining n-1 characters are matched against the dictionary, and so on until a segmentation succeeds. For example, suppose the string in the text is ABC and W is the dictionary. If C∈W, BC∈W, and ABC∉W, the segmentation A/BC is taken.
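A minimal Python sketch of reverse maximum matching follows, written to agree with the A/BC example above. The function name, the plain `set` used as the dictionary, and the fallback of emitting an unmatched single character as its own token are assumptions made for illustration.

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Segment text from its end, preferring the longest dictionary word."""
    if max_len is None:
        # n in the description: length of the longest dictionary entry.
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    end = len(text)
    while end > 0:
        size = min(max_len, end)
        # Shrink the field from its left side until it is a dictionary
        # word or only a single character is left (assumed fallback).
        while size > 1 and text[end - size:end] not in dictionary:
            size -= 1
        words.append(text[end - size:end])
        end -= size
    words.reverse()  # words were collected from the end of the string
    return words


segments = reverse_max_match("ABC", {"C", "BC"})
```

With the dictionary {C, BC}, the string ABC is split into A and BC, matching the example in the definition.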
Discrete set of abstract segmentation results: an array built by segmenting each abstract in the user-filtered search result set separately and collecting the per-abstract results. The array length is the number of search results, and each array element is the set of words segmented from one abstract. For example, if the discrete set is denoted abstract_cut_apart, its first element has the form abstract_cut_apart[0]={liquid engine, composition, including......}.
Combined set of abstract segmentation results: an array formed from the union of the segmentation results of all abstracts. The array length is 1, and its element is the union of all segmented words.
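The relationship between the discrete set and the combined set can be illustrated with a short Python sketch. It also derives the words common to every abstract, which the later steps call the abstract_cut_same set. The function name is hypothetical, and keeping the words in first-seen order is an assumption, since the text does not specify an order for the union.

```python
def combine_segmentations(abstract_cut_apart):
    """Build the union (abstract_cut_unit) and the intersection
    (abstract_cut_same) from the per-abstract word lists."""
    abstract_cut_unit = []
    seen = set()
    for words in abstract_cut_apart:
        for word in words:
            if word not in seen:  # the union holds no repeated words
                seen.add(word)
                abstract_cut_unit.append(word)

    total = len(abstract_cut_apart)
    abstract_cut_same = []
    for word in abstract_cut_unit:
        # Count abstracts containing the word; repeats inside one
        # abstract are counted once.
        hits = sum(1 for words in abstract_cut_apart if word in words)
        if hits == total:
            abstract_cut_same.append(word)
    return abstract_cut_unit, abstract_cut_same


unit, same = combine_segmentations([
    ["liquid engine", "composition", "including"],
    ["liquid engine", "thrust"],
    ["thrust", "liquid engine", "liquid engine"],
])
```

In this made-up input only "liquid engine" appears in all three abstracts, so it is the only member of `same`.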
Chinese Classified Thesaurus: a standardized, dynamic retrieval-language vocabulary that shows subject terms and the semantic relationships between them. It is the main tool for subject indexing, retrieval, and the organization of catalogs and indexes. Its subject coverage includes the disciplines and subject concepts of all fields, including philosophy, the social sciences, the natural sciences, and engineering technology. In the present invention, the semantics of the search terms are expanded by looking up the semantic relationships of the search terms in the Chinese Classified Thesaurus.
Semantic network: the combination of the word-to-word relationships in the Chinese Classified Thesaurus. The relationships between subject terms are use, used-for, broader, narrower, family, and see-also, and the corresponding relationship symbols are "Y", "D", "S", "F", "Z", and "C". The word after "Y" is the formal expression of the entry descriptor; the word after "D" is an informal expression of the entry descriptor; the word after "S" is the hypernym of the entry descriptor, one level above it; the word after "F" is a hyponym of the entry descriptor, one level below it; the word after "Z" is the top term of the family to which the entry descriptor belongs; and the word after "C" is a reference (related) term of the entry descriptor.
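One way to picture the semantic network is as a lookup table from a word to its (relationship symbol, related word) pairs. The sketch below expands a word list through selected relationships. The in-memory `thesaurus` dictionary, its sample entries, and the default choice of the "Y", "D", and "C" relationships are illustrative assumptions, not the thesaurus's real data format.

```python
# Relationship symbols as listed in the definition above.
RELATION_CODES = {
    "Y": "use (formal expression)",
    "D": "used for (informal expression)",
    "S": "broader term",
    "F": "narrower term",
    "Z": "top term of the family",
    "C": "related (reference) term",
}


def expand_terms(words, thesaurus, relations=("Y", "D", "C")):
    """Collect the words linked to the input words by the given relations."""
    expanded = []
    for word in words:
        for code, target in thesaurus.get(word, []):
            if code in relations and target not in words and target not in expanded:
                expanded.append(target)
    return expanded


# Hypothetical sample entry, made up for the example.
sample_thesaurus = {
    "rocket engine": [("D", "rocket motor"), ("S", "engine"), ("C", "propellant")],
}
extra_terms = expand_terms(["rocket engine"], sample_thesaurus)
```

Here the broader term "engine" is skipped because only the use, used-for, and related relationships are followed.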
The present invention is described in detail below with reference to the drawings.
As shown in Fig. 1, the user-oriented information search engine system of the present invention consists of a search term push module, a user-initiated search module, a user focus update module, an initial search module, a user interest inference module, a user attention result word segmentation module, a search term reconstruction module, and a secondary search module.
The overall implementation process is as follows:
(1) The server queries the user attention library according to the identity of the current user. The user attention library has two parts, the user's own historical points of interest and the historical points of interest of users with the same interests, and each part consists of historical search terms and their usage frequencies. The server first parses the user's own historical search terms, sorts them by usage frequency from high to low, selects the terms whose usage frequency exceeds a given threshold, and writes them in order into the user's own historical attention word set (the searchVoc_past set). It then traverses the searchVoc_past set, finds for each historical search term the other users, apart from the current user, who have used it, and writes them into the same-interest user set (the user_sameInt set). For each user in the user_sameInt set it retrieves the historical search terms, looks up the usage frequency of each, and writes the terms from high to low frequency into the same-interest users' historical attention word set (the searchVoc_past_other set). Finally it traverses the searchVoc_past_other set and appends its words in order to the searchVoc_past set, skipping duplicates. The search term push list is formed from the resulting searchVoc_past set and output to the client;
(2) The client receives the search term push list, parses the search terms in it, and displays them in order together with check boxes and sort buttons. The user can select or deselect each search term and set the priority of the search terms. The search term set changes dynamically with the user's choices, and the user can also add to or modify it manually to form the search request that is finally submitted;
(3) The server receives the search request and records the search behavior initiated by the user, which consists of the search terms entered by the user and their order. The entered search terms are written in order into the user-selected search term set (the searchVoc_select set). The server traverses the searchVoc_select set and checks whether each search term already exists in the user attention library. If it does, the term's current usage frequency is updated; otherwise the term is written into the user's own historical points of interest in the user attention library with its usage frequency set to the initial value. This provides the data for subsequent search term pushes. At the same time, the initial search is executed according to the search behavior. All search terms in the searchVoc_select set are first fully combined according to their priority, so that the combined searchVoc_select set contains both independent words and compound words. The server traverses the combined set and queries, in turn, the search results that match each entry: a result matches an independent word if it contains that word, and it matches a compound word if it contains every constituent element. For the results matching each search term, the number of matches with the term in the full text is counted and the results are sorted by this match frequency from high to low. The lists of matching results are then concatenated in the word order of the searchVoc_select set and written into the initial search result set (the result_first set). Each entry of a search result list consists of the title, abstract, and source of the result, where the abstract is the passage of the full text that matches the search term most often. The resulting result_first set is output to the client for the user to view;
(4) The user's operations on the result_first set are recorded, and the user's filtering behavior is written into the user-filtered set of initial search results (the result_userSelect set). The filtering behavior consists of the ID of each result the user selects, the number of clicks on the result, and the viewing time of the result. For each result, the sum of "result click count × result viewing time" is calculated to obtain the user's degree of attention to that result. The results are sorted by degree of attention from high to low, the abstract of each result is extracted, and the abstracts are written in order into the user-filtered result abstract set (the result_abstract set) for word segmentation;
(5) The result_abstract set is traversed and the abstracts of the results the user paid attention to are parsed in turn and segmented against the dictionary set using the reverse matching algorithm. The dictionary set is a hash table, namely an array of HashMaps: the array length is the number of Chinese characters that can appear as the first character of a dictionary word, the array index is the location code of that character, and each array element is a HashMap of all words beginning with that character, with the word itself as the key and the word frequency as the value. After segmentation, meaningless words are removed by checking against the meaningless-word list. The segmentation result of each abstract is written as an independent array into the discrete set of abstract segmentation results (the abstract_cut_apart set). The union of the segmentation results, that is, the largest set without repeated words, is extracted and written into the combined set of abstract segmentation results (the abstract_cut_unit set). The words in the abstract_cut_unit set are traversed and compared against the abstract_cut_apart set to determine the number of abstracts in which each word appears; repeated occurrences within the same abstract are not counted. The words whose count equals the number of abstracts, that is, the words that appear in every abstract, are collected and written into the intersection of the abstract segmentation results (the abstract_cut_same set). The abstract_cut_same set is then analyzed against the Chinese Classified Thesaurus, and the words that have a use/used-for relationship or a related relationship with the words in it are written into the recombined set of abstract segmentation results (the abstract_cut_reorg set). Next, the abstract_cut_same set is parsed and its words are combined using the method of the initial search module. The search terms of the abstract_cut_same set are traversed, and the documents whose full text matches each term and the images and videos whose titles match it are retrieved in turn; for a compound word, a match means that every constituent element is satisfied. The abstract_cut_reorg set is then parsed to retrieve the documents, images, and videos that match each independent word in it. All document files are written in search order into the secondary search document result set (the result_second_doc set), all image files into the secondary search image result set (the result_second_image set), and all video files into the secondary search video result set (the result_second_vedio set). The result_second_doc, result_second_image, and result_second_vedio sets are returned to the client, and the search results are displayed to the user by category, giving the user more precise search results.
The specific implementation of each module is as follows:
1. Search term push module
The implementation process of this module is shown in Fig. 2:
(1) Capture the user information: obtain the user name and ID of the currently logged-in user from the session that stored the identity information at login;
(2) Query the user attention table, denoted searchVoc_past_table, by user ID and extract the historical search terms matching the ID together with their usage frequencies. A search term is denoted V and its usage frequency F; the results are sorted in descending order of F;
(3) Let the preset word frequency threshold be E, and compare each usage frequency F with the threshold E;
a. If F>=E, write the V corresponding to F into the set of historical words the user pays attention to, denoted the searchVoc_past set;
b. If F<E, do nothing;
(4) Parse the searchVoc_past set, traverse its search terms V in turn, query searchVoc_past_table for the IDs of the users other than the current user that match V, and write them into the user_sameInt set;
(5) For each user ID in the user_sameInt set, query searchVoc_past_table for the matching historical search term records, count the usage frequency of each search term in searchVoc_past_table, and write the terms into the searchVoc_past_other set in order of frequency from high to low;
(6) Traverse the searchVoc_past_other set and check, for each word in turn, whether it already exists in the searchVoc_past set;
a. If it already exists, skip the word and continue with the next one;
b. If it does not exist, write the word into the searchVoc_past set.
(7) When the traversal ends, the resulting searchVoc_past set is the search term push list.
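The seven steps above can be sketched in Python as follows. The sketch assumes that searchVoc_past_table is available as an in-memory list of (user ID, search term, frequency) rows and that the frequency of a same-interest term is the sum over the same-interest users' rows; the text leaves both points open, so treat them as illustrative choices.

```python
def build_push_list(user_id, attention_table, threshold):
    """Return the search term push list (searchVoc_past) for one user."""
    # Steps (2)-(3): own terms, sorted by frequency, filtered by threshold E.
    own_rows = sorted(
        (row for row in attention_table if row[0] == user_id),
        key=lambda row: row[2],
        reverse=True,
    )
    search_voc_past = [term for _, term, freq in own_rows if freq >= threshold]

    # Step (4): other users who share at least one of those terms.
    user_same_int = {
        uid for uid, term, _ in attention_table
        if uid != user_id and term in search_voc_past
    }

    # Step (5): terms of same-interest users, highest frequency first.
    totals = {}
    for uid, term, freq in attention_table:
        if uid in user_same_int:
            totals[term] = totals.get(term, 0) + freq
    search_voc_past_other = sorted(totals, key=lambda t: totals[t], reverse=True)

    # Steps (6)-(7): append without duplicates.
    for term in search_voc_past_other:
        if term not in search_voc_past:
            search_voc_past.append(term)
    return search_voc_past


table = [
    (1, "engine", 5), (1, "valve", 1),
    (2, "engine", 3), (2, "nozzle", 4), (2, "turbine", 2),
    (3, "valve", 9), (3, "pump", 7),
]
push_list = build_push_list(1, table, threshold=2)
```

In this made-up table, user 1 keeps only "engine" after thresholding, user 2 shares that term, and user 2's other terms are appended in frequency order. User 3 is ignored because "valve" fell below the threshold.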
2. User-initiated search module
The implementation process of this module is shown in Fig. 3:
(1) Receive the search term push list, that is, the searchVoc_past set, and read it into a buffer;
(2) Determine the length of the searchVoc_past set, denoted L;
(3) If L>0, then
(3.1) Using L as the loop bound, read the search terms in the push list in turn, including each term's ID and content. Display the content on the client and generate a check box in front of each search term; the ID of the check box is the ID of the search term currently being read;
(3.2) When the traversal ends, generate the sort buttons and display them on the client;
(4) Store the content of the search box as a string, denoted str_searchVoc;
(5) Check whether str_searchVoc is empty; if it is, initialize str_searchVoc to a single space character;
(6) Parse the user's operations:
(6.1) When the check box of a search term is selected, check whether str_searchVoc already contains that term. If it does, do nothing; if it does not, append the term to str_searchVoc followed by a space separator;
(6.2) When the check box of a search term is deselected, check whether str_searchVoc contains that term. If it does, remove the term and the space separator that follows it; if it does not, do nothing;
(6.3) When a term is moved up or down in the order, check whether any search term is selected:
(6.3.1) If none is selected, prompt the user to select one;
(6.3.2) If more than one search term is selected, prompt the user that only one search term can be selected for this operation;
(6.3.3) If exactly one search term is selected, then
a. Move the term up or down by one position and join the search terms, in their current order, into a string denoted str_searchVoc_newSeq, with the words separated by spaces;
b. Compare str_searchVoc with str_searchVoc_newSeq: parse the search terms in str_searchVoc_newSeq in turn and check whether each exists in str_searchVoc; any term that does not exist in str_searchVoc is removed from str_searchVoc_newSeq;
c. Replace str_searchVoc with the processed string and write the words of str_searchVoc back into the search box on the client;
(6.4) After the user confirms the operations, the search request is submitted and str_searchVoc is sent to the server.
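The string handling in steps (6.1) to (6.3.3) can be sketched in Python as below. The three function names are hypothetical, and the sketch assumes that a search term contains no internal spaces, because the space is the separator. As in step c, terms that are not in the reordered push list do not survive the reorder.

```python
def toggle_term(str_search_voc, term, checked):
    """Steps (6.1)/(6.2): add or remove a term in the search string."""
    terms = str_search_voc.split()  # also handles the single-space initial value
    if checked and term not in terms:
        terms.append(term)
    elif not checked and term in terms:
        terms.remove(term)
    return " ".join(terms)


def move_term(push_list, term, direction):
    """Step (6.3.3)a: move a term one position up (-1) or down (+1)."""
    order = list(push_list)
    i = order.index(term)
    j = i + direction
    if 0 <= j < len(order):
        order[i], order[j] = order[j], order[i]
    return order


def reorder_search_string(str_search_voc, new_order):
    """Steps (6.3.3)b-c: keep only selected terms, in the new order."""
    selected = set(str_search_voc.split())
    return " ".join(term for term in new_order if term in selected)


query = toggle_term(" ", "engine", True)
query = toggle_term(query, "nozzle", True)
order = move_term(["engine", "turbine", "nozzle"], "nozzle", -1)
query = reorder_search_string(query, order)
```

After these calls `order` lists engine, nozzle, turbine and `query` is "engine nozzle", since "turbine" was never selected.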
3. User focus update and initial search module
The user focus update and initial search module consists of two submodules, the user focus update module and the initial search module. Its execution process is shown in Fig. 4:
(1) Receive the search request, that is, str_searchVoc, and store it in a buffer;
(2) Parse str_searchVoc: split the search terms in str_searchVoc into an array at the separators and write them in turn into the searchVoc_select set;
(3) Traverse the searchVoc_select set and check whether each search term exists in the user attention table of the user attention library, that is, searchVoc_past_table:
a. If it exists, read the term's current usage frequency, denoted f, convert f to an integer, and add 1;
b. If it does not exist, write the term into the user attention table of the user attention library. The inserted values are the user ID, the search term content, and the usage frequency, which is set to the initial value.
(4) Parse the searchVoc_select set. Because too many search terms would make the search results highly redundant, only a certain number of search terms are extracted, from front to back;
(5) Form all combinations of the extracted search terms and build a rearranged string in which compound words are separated by semicolons (";") and the elements of a compound word (the independent words that make it up) are separated by commas (",");
(6) Split the rearranged string at the semicolon separators (";") and write the parts into the searchVoc_select recombination set, denoted searchVoc_select_reArr;
(7) Traverse searchVoc_select_reArr and perform the following operations on each element:
(7.1) Split the element at the comma separators (",") to produce the arr_conVoc array;
(7.2) Determine the length L of the arr_conVoc array:
(7.2.1) If L=1, the element is an independent word. Run a matching search for the word and, for each search result, do the following:
a. Parse the full text of the search result and split it into natural paragraphs at the line breaks, giving the arr_result_para array;
b. Traverse the arr_result_para array, count how many times each element contains the independent word, and extract the paragraph with the highest count as the abstract;
c. Combine the title, abstract, and source into a search result list entry and write it into the initial search result set, that is, the result_first set;
(7.2.2) If L>1, the element is a compound word. Extract the sub-elements of the element's arr_conVoc array, that is, the independent words, in turn and run the search according to the steps in (7.2.1) to obtain the results that satisfy all of the independent words;
(8) When the traversal ends, the final result_first set has been formed and is output to the client for the user to view.
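The matching and abstract extraction of step (7) can be sketched in Python as follows. This is a simplified illustration: documents are assumed to be dictionaries with hypothetical `title`, `source`, and `text` keys, matching is plain substring search, and for a compound word the abstract is taken as the paragraph with the most occurrences of all its elements combined, which the text does not spell out.

```python
def count_matches(text, words):
    """Total number of occurrences of the given words in the text."""
    return sum(text.count(word) for word in words)


def search(documents, words):
    """Return title/abstract/source entries for documents matching all words.

    `words` has one element for an independent word (L = 1) and several
    for a compound word (L > 1); a compound matches only when every
    element is present.
    """
    hits = [d for d in documents if all(w in d["text"] for w in words)]
    # Sort by match frequency in the full text, highest first.
    hits.sort(key=lambda d: count_matches(d["text"], words), reverse=True)

    results = []
    for doc in hits:
        paragraphs = doc["text"].split("\n")  # one paragraph per line break
        abstract = max(paragraphs, key=lambda p: count_matches(p, words))
        results.append(
            {"title": doc["title"], "abstract": abstract, "source": doc["source"]}
        )
    return results


docs = [
    {"title": "T1", "source": "S1",
     "text": "engine intro\nengine thrust and engine nozzle"},
    {"title": "T2", "source": "S2", "text": "nozzle only"},
    {"title": "T3", "source": "S3",
     "text": "engine engine engine engine\nnozzle"},
]
result_first = search(docs, ("engine",)) + search(docs, ("engine", "nozzle"))
```

Concatenating the per-term result lists in the order of the recombination set, as in the last line, mirrors how the result_first set is assembled.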
4. User interest inference module
The implementation process of the user interest inference module is shown in Figure 5:
(1) Initialize the user selection set for the initial search results, i.e. the result_userSelect set;
(2) For each user operation, record the ID of the selected result;
(3) Determine whether this ID already exists in the result_userSelect set:
(3.1) If it does not exist, then
a. Initialize the sequence number P of this search result ID to 1 and its click count N to 1, record the current operation time Tcurrent, and write these values to the result_userSelect set;
b. Take the largest T value in the result_userSelect set and denote it TmaxLast; if no T value exists, set TmaxLast to 0;
c. Compute the browsing time of the search result corresponding to TmaxLast as Tcurrent - TmaxLast, and write it to the result_userSelect set;
(3.2) If it already exists, read the click count N currently associated with the ID, increment N by 1, and update the corresponding value in the result_userSelect set;
(4) Traverse the result_userSelect set and compute user attention as "first-click order of the result / sum(click count of the result × viewing time of the result)", obtaining the user's attention to each selected result;
(5) Sort the result_userSelect set by user attention from high to low, take a certain number of top-ranked entries, fetch the abstract corresponding to each ID, and write it to the abstract set of user-selected results, i.e. the result_abstract set.
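The click-recording steps (2) to (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and method names are hypothetical, the previous latest click time is looked up before the new entry is stored so that TmaxLast refers to the previously clicked result, and P is assumed to count up with each newly clicked result. The attention formula is stated only tersely in the text, so the score used here (click count × viewing time) is a stand-in chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Selection:
    order: int               # P: first-click order (assumed to count up from 1)
    clicks: int              # N: number of clicks on this result
    clicked_at: float        # T: time of the first click on this result
    view_time: float = 0.0   # browsing time, filled in when the next new result is clicked


class UserSelectSet:
    """Hypothetical in-memory stand-in for the result_userSelect set."""

    def __init__(self) -> None:
        self.items: Dict[str, Selection] = {}

    def record_click(self, result_id: str, t_current: float) -> None:
        if result_id in self.items:
            # Step (3.2): the ID already exists, so only N is incremented.
            self.items[result_id].clicks += 1
            return
        # Steps (3.1) b-c: find the entry with the largest T (TmaxLast) and
        # close its browsing time as Tcurrent - TmaxLast.
        last_id = max(self.items, key=lambda rid: self.items[rid].clicked_at, default=None)
        if last_id is not None:
            last = self.items[last_id]
            last.view_time = t_current - last.clicked_at
        # Step (3.1) a: store the new entry with N = 1 and T = Tcurrent.
        self.items[result_id] = Selection(
            order=len(self.items) + 1, clicks=1, clicked_at=t_current
        )

    def top_ids(self, k: int) -> List[str]:
        # Steps (4)-(5): rank by an assumed attention score and keep the top k IDs.
        def score(rid: str) -> float:
            sel = self.items[rid]
            return sel.clicks * sel.view_time  # stand-in score, not the source formula

        return sorted(self.items, key=score, reverse=True)[:k]
```

The IDs returned by `top_ids` would then be used to fetch the abstracts that make up the result_abstract set. Note that the most recently clicked result keeps a viewing time of 0 until another result is clicked, which follows from step (3.1) c as written.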
5. User-attended result word segmentation, reconstruction, and secondary search module
This module consists of three sub-modules: the user-attended result word segmentation module, the search word reconstruction module, and the secondary search module. The implementation process is shown in Figure 6:
(1) Traverse the result_abstract set and parse out the abstract of each result the user attended to. Segment each abstract against the dictionary set using the reverse matching algorithm. Store the segmentation result of each abstract as a separate array in the discrete set of abstract segmentation results, i.e. the abstract_cut_apart set, and denote the number of arrays by N;
(2) Extract the union of the segmentation results, i.e. the largest set containing no repeated words, and write it to the combined set of abstract segmentation results, i.e. the abstract_cut_unit set;
(3) Traverse the set formed by all abstract segmentation results, and for each search word in it perform the following operations:
(3.1) Initialize the occurrence frequency of the current search word, Fabs = 0;
(3.2) Traverse each array element in the abstract_cut_apart set and determine whether it contains the current search word;
a. If it does, set Fabs = Fabs + 1 and continue with the next array element;
b. If it does not, leave Fabs unchanged.
(3.3) Compare the Fabs value of the current search word with N, the number of arrays in the abstract_cut_apart set;
a. If Fabs = N, write the current search word to the intersection of the abstract segmentation results, i.e. abstract_cut_same;
b. If Fabs < N, do nothing and continue with the next search word.
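Steps (1) to (3) can be sketched as below. The text names only a "reverse matching algorithm", so the sketch assumes the common reverse maximum matching variant, in which the longest dictionary word ending at the current position is taken and unmatched single characters are emitted as they are. The function names, the `max_len` parameter, and the sample dictionary are hypothetical.

```python
from typing import List, Set, Tuple


def reverse_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text from right to left, preferring the longest dictionary word."""
    words: List[str] = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()  # restore left-to-right order
    return words


def union_and_intersection(cut_apart: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Build abstract_cut_unit (union) and abstract_cut_same (intersection)."""
    n = len(cut_apart)
    cut_unit: List[str] = []
    for words in cut_apart:
        for word in words:
            if word not in cut_unit:
                cut_unit.append(word)
    cut_same: List[str] = []
    for word in cut_unit:
        # Fabs counts how many abstracts contain the word at least once.
        f_abs = sum(1 for words in cut_apart if word in words)
        if f_abs == n:
            cut_same.append(word)
    return cut_unit, cut_same


# Sample data: "rocket engine design" and "rocket engine test" in Chinese.
dictionary = {"火箭", "发动机", "设计", "试验"}
abstracts = ["火箭发动机设计", "火箭发动机试验"]
cut_apart = [reverse_max_match(a, dictionary, max_len=3) for a in abstracts]
cut_unit, cut_same = union_and_intersection(cut_apart)
```

With this sample data, only the two words shared by both abstracts ("火箭" and "发动机") reach `cut_same`, which corresponds to the Fabs = N case in step (3.3).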
(4) Traverse the abstract_cut_same set and, for each search word in it, look up in the Chinese Classified Thesaurus the semantic network that has this word as its entry descriptor;
(4.1) If the semantic network contains a relation word marked "Y", the word has a formal expression; write the formal expression to the reorganized set of abstract segmentation results, i.e. abstract_cut_reorg;
(4.2) If the semantic network contains a relation word marked "D", the word has an informal expression; write the informal expression to the abstract_cut_reorg set;
(4.3) If the semantic network contains a relation word marked "C", the word has an expression related to it in meaning; write the related expression to the abstract_cut_reorg set.
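Step (4) amounts to a lookup over labelled relations. The sketch below assumes the thesaurus has already been loaded into a plain dictionary mapping each entry descriptor to a list of (relation mark, related word) pairs; that data layout, the function name, and the sample entries are hypothetical, and skipping duplicates is an added assumption rather than something the text specifies.

```python
from typing import Dict, List, Tuple

# Relation marks named in the text: "Y" formal, "D" informal, "C" related.
RELATION_MARKS = ("Y", "D", "C")

Thesaurus = Dict[str, List[Tuple[str, str]]]


def expand_terms(cut_same: List[str], thesaurus: Thesaurus) -> List[str]:
    """Collect formal, informal, and related expressions into abstract_cut_reorg."""
    cut_reorg: List[str] = []
    for word in cut_same:
        # Words without a thesaurus entry contribute nothing.
        for mark, related in thesaurus.get(word, []):
            if mark in RELATION_MARKS and related not in cut_reorg:
                cut_reorg.append(related)
    return cut_reorg


# Illustrative entries only; they are not taken from the real thesaurus.
sample_thesaurus: Thesaurus = {
    "rocket": [("Y", "launch vehicle"), ("C", "missile")],
    "engine": [("D", "motor"), ("X", "ignored")],
}
cut_reorg = expand_terms(["rocket", "engine", "unknown"], sample_thesaurus)
```

Relation marks other than "Y", "D", and "C" are ignored here, since the text describes handling only for those three.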
(5) Arrange and combine the words in the abstract_cut_same set following the method of the initial search module. Traverse each search word in the abstract_cut_same set and, using the search method of the initial search module, retrieve in turn the documents whose full text matches it and the images and videos whose titles match it;
(6) Traverse the abstract_cut_reorg set, read each independent word, and retrieve in turn the documents whose full text matches it and the images and videos whose titles match it;
(7) Write all document files to the result_second_doc set in search order, all image files to the result_second_image set in search order, and all video files to the result_second_vedio set in search order;
(8) Return the three sets result_second_doc, result_second_image, and result_second_vedio to the client, present the search results to the user by category, and inform the user that these results may better match their intent and are available for closer inspection.
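Steps (5) to (8) can be outlined as a loop over both word sets that sorts hits into three per-type lists. In this sketch the `search` callable is a hypothetical stand-in for the initial search module and is assumed to return (kind, item ID) pairs; the permutation and combination of words from step (5) is omitted, and skipping repeated hits is an assumption the text does not state.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Hit = Tuple[str, str]  # (kind, item_id), where kind is "doc", "image", or "video"

# Map each hit kind to the result set named in the text.
KIND_TO_SET = {
    "doc": "result_second_doc",
    "image": "result_second_image",
    "video": "result_second_vedio",
}


def secondary_search(
    cut_same: Iterable[str],
    cut_reorg: Iterable[str],
    search: Callable[[str], List[Hit]],
) -> Dict[str, List[str]]:
    """Run the search for every word and group hits by type in search order."""
    results: Dict[str, List[str]] = {name: [] for name in KIND_TO_SET.values()}
    # Words from abstract_cut_same are searched first, then abstract_cut_reorg.
    for term in list(cut_same) + list(cut_reorg):
        for kind, item_id in search(term):
            bucket = results[KIND_TO_SET[kind]]
            if item_id not in bucket:  # assumed: keep only the first occurrence
                bucket.append(item_id)
    return results
```

The returned dictionary mirrors the three sets that step (8) sends back to the client, so a caller could render one result list per category.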
Application example: the system and method of the present invention have been successfully applied in the development of spacecraft models at the Aerospace Launch Vehicle Technology Research Institute, helping R&D and design personnel obtain the knowledge they most need quickly and conveniently. This demonstrates that the system and method of the present invention offer flexibility, convenience, and intelligence.
Parts of the present invention that are not described in detail belong to techniques well known in the art.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201210433731.6A (CN102930022B) | 2012-10-31 | 2012-10-31 | User oriented information search engine system and method |
| Publication Number | Publication Date |
|---|---|
| CN102930022A (en) | 2013-02-13 |
| CN102930022B (en) | 2015-11-25 |
| Code | Title |
|---|---|
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| C14 | Grant of patent or utility model |
| GR01 | Patent grant |