CN101158971A

Info

Publication number: CN101158971A
Application number: CNA2007101872765A
Authority: CN
Inventors: 刘汉洲
Original assignee: Shenzhen Xunlei Networking Technologies Co Ltd
Current assignee: Beijing Zhigu Ruituo Technology Services Co Ltd
Priority date: 2007-11-15
Filing date: 2007-11-15
Publication date: 2008-04-09
Anticipated expiration: 2027-11-15
Also published as: CN100557612C

Abstract

The invention discloses a method and device for sorting search results based on a search engine. It relates to the field of search engines and brings the sorted results closer to users' needs. The method includes: performing word segmentation on the search term entered by the user; looking up each resulting segment in a keyword index to determine the keyword weight of the search term in each network resource to be sorted; determining the total weight of the search term in each network resource to be sorted; and sorting the network resources by total weight and presenting them to the user. The device includes a word segmentation unit, a keyword weight determination unit, a total weight determination unit, a sorting unit, and a presentation unit.

Description

Translated from Chinese

Method and device for sorting search results based on a search engine

Technical field

The invention relates to the field of search engines, and in particular to a method and device for sorting search results based on a search engine.

Background

With the continuous development of search engine technology and advances in information processing, demand for search engines keeps broadening and the types of search engine keep diversifying. At present, mainstream search engines fall into three types: full-text search engines, directory search engines, and meta-search engines. Recently, vertical search engines have also come into view.

An important criterion for evaluating a search engine is whether it lets users find the information they need as quickly as possible, that is, the various information related to the topic of the user's search.

In recent years, the major search engines have all optimized the relevance ranking of search results. The relevance of a search result is the degree to which a page relates to the user's search term, and it is usually an important basis for ranking. The main methods for calculating page relevance include Google's PageRank, Bharat's HillTop, and Baidu's hyperlink analysis. Their basic principle is to rank web pages according to how they are cited.

Chinese search engines, however, face the problem of word segmentation, and the dictionary is the foundation of any search engine that uses search terms as queries. The quality of the dictionary determines, to a certain extent, how well search results are sorted. If the dictionary is too small, too much irrelevant information appears; if it is too large, some terms may return results covering too few topics. How to determine the dictionary and add new extended dictionary sets, so that search results become more accurate and more user-friendly, has therefore become a problem of considerable concern.

Summary of the invention

Embodiments of the present invention provide a method and device for sorting search results based on a search engine, so that the sorted results are closer to users' needs.

A method for sorting search results based on a search engine according to an embodiment of the present invention includes the following steps: performing word segmentation on the search term entered by the user; looking up each resulting segment in a keyword index to determine the keyword weight of the search term in each network resource to be sorted; determining the total weight of the search term in each network resource to be sorted; and sorting the network resources by total weight and presenting them to the user.

A device for sorting search results based on a search engine according to an embodiment of the present invention includes: a word segmentation unit, configured to perform word segmentation on the search term entered by the user; a keyword weight determination unit, configured to look up each resulting segment in a keyword index to determine the keyword weight of the search term in each network resource to be sorted; a total weight determination unit, configured to determine the total weight of the search term in each network resource to be sorted; a sorting unit, configured to sort the network resources by total weight; and a presentation unit, configured to present the sorted results to the user.

In summary, in the embodiments of the present invention the search term entered by the user is segmented, each resulting segment is looked up in the keyword index to determine the keyword weight of the search term in each network resource to be sorted, and the total weight of the search term in each of those resources is determined. Because the total weight takes into account how well the search term matches the keywords, sorting the network resources by total weight and presenting them to the user brings the results closer to the user's needs.

Description of drawings

FIG. 1 is a flow chart of the method steps of an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of the device of an embodiment of the present invention;

FIG. 3 is a schematic diagram of an optimized structure of the device of an embodiment of the present invention;

FIG. 4 is a schematic diagram of the indexes of an embodiment of the present invention;

FIG. 5 is a schematic diagram of determining the network resources to be sorted in an embodiment of the present invention;

FIG. 6 is a schematic diagram of looking up segment weights in an embodiment of the present invention.

Detailed description

To bring the sorted results closer to users' needs, embodiments of the present invention provide a method and a device for sorting search results based on a search engine, each briefly outlined below.

In the method provided by an embodiment of the present invention, after certain settings have been made in advance, the user has entered a search term, and the network resources to be sorted have been determined, the following main steps are performed, as shown in FIG. 1:

S1. Perform word segmentation on the search term entered by the user (this step may also be performed before the network resources to be sorted are determined).

S2. Look up each segment obtained from the word segmentation in the keyword index to determine the keyword weight of the search term in each network resource to be sorted (network resources include, but are not limited to, web page resources and download resources; this is not repeated below).

S3. Determine the total weight of the search term in each network resource to be sorted.

S4. Sort the network resources by total weight and present them to the user.

The settings made before the user enters a search term include the following steps:

Customizing the keyword dictionary: with (word, attribute) as the basic structure, the customized keyword dictionary includes every valid word with its attribute and every invalid word with its attribute. The set of invalid words and the set of valid words are mutually exclusive, and the characters of an invalid word contain the characters of some valid word. The attribute of a word is represented as a character-type number, each bit of which represents one attribute of the word.

Extracting keywords: the topic information of each network resource is segmented against the keyword dictionary using the maximum matching principle, and the resulting segments are filtered by their attributes to extract the keywords of the topic information of each network resource. The title of a web page may serve as its topic information, the topic information may be extracted from the content of the page, or the information describing a download resource may serve as its topic information.

Building the keyword index: each keyword of the topic information of each network resource is segmented using a basic word segmentation dictionary, and a keyword index is built from each segment of a keyword to the network resources.

Building the resource index: the topic information of each network resource is segmented using the basic word segmentation dictionary, and a resource index is built from each segment to the network resources.

Configuring weights: a segment weight is configured for each segment of a keyword according to the ratio of the segment's length to the keyword's length. Alternatively, a static weight is additionally configured for each network resource according to information about it (including but not limited to its view count, citations, download count, and/or file format; this is not repeated below), together with the segment weights as above. The configured weights can be recorded in the resource index and keyword index described above.

Once weights are configured, in S2 each segment of the search term can be looked up in the keyword index to determine its segment weight in the keywords of the topic information of each network resource to be sorted. The segment weights found for the same network resource are then added together to give the keyword weight of the search term in that resource.

In S3, the keyword weight of the search term in the current network resource can be taken as the total weight. Alternatively, the static weight configured from the resource's information can be combined with the keyword weight to form the total weight, or other relevant weights can be combined with the keyword weight to form the total weight.

After the user enters a search term, the network resources to be sorted are determined as follows: each segment of the search term is looked up in the resource index to determine the set of network resources to which that segment belongs, and the intersection of these sets is taken as the network resources to be sorted.

An embodiment of the present invention also provides a device for sorting search results based on a search engine. As shown in FIG. 2, it includes a word segmentation unit, a keyword weight determination unit, a total weight determination unit, a sorting unit, and a presentation unit.

The word segmentation unit is configured to perform word segmentation on the search term entered by the user.

The keyword weight determination unit is configured to look up each segment obtained from the word segmentation in the keyword index to determine the keyword weight of the search term in each network resource to be sorted.

The total weight determination unit is configured to determine the total weight of the search term in each network resource to be sorted.

The sorting unit is configured to sort the network resources by total weight.

The presentation unit is configured to present the sorted results to the user.

Further, to supply the information required by the units above, the device also includes, as shown in FIG. 3: a customization unit, an extraction unit, a keyword index building unit, a resource index building unit, a determination unit, and a configuration unit.

The customization unit is configured to customize the keyword dictionary with (word, attribute) as the basic structure. The customized keyword dictionary includes every valid word with its attribute and every invalid word with its attribute.

The extraction unit is configured to segment the topic information of each network resource against the keyword dictionary using the maximum matching principle, and to filter the resulting segments by their attributes to extract the keywords of the topic information of each network resource.

The keyword index building unit is configured to segment each keyword of the topic information of each network resource using the basic word segmentation dictionary, and to build a keyword index from each segment of a keyword to the network resources, for use by the keyword weight determination unit.

The resource index building unit is configured to segment the topic information of each network resource using the basic word segmentation dictionary, and to build a resource index from each segment to the network resources.

The determination unit is configured to look up each segment of the search term in the resource index to determine the set of network resources to which that segment belongs, and to take the intersection of these sets as the network resources to be sorted.

The configuration unit is configured to assign a segment weight to each segment of a keyword according to the ratio of the segment's length to the keyword's length; or to assign a static weight to each network resource according to information about it, together with the segment weights as above.

Once the configuration unit has configured the weights, the keyword weight determination unit can look up each segment of the search term in the keyword index to determine its segment weight in the keywords of the topic information of each network resource to be sorted, and add together the segment weights found for the same network resource to give the keyword weight of the search term in that resource.

The total weight determination unit can take the keyword weight of the search term in the current network resource as the total weight. Alternatively, it can combine the static weight configured from the resource's information with the keyword weight to form the total weight, or combine other relevant weights with the keyword weight to form the total weight.

This completes the overview of the method and device of the embodiments of the present invention. The invention is described in further detail below through one embodiment.

Embodiment 1. This embodiment includes a setting step, a step of determining the network resources to be sorted, a weight calculation step, a sorting step, and a presentation step. The setting step includes the sub-steps of customizing the keyword dictionary, extracting keywords, building the keyword index, building the resource index, and configuring weights.

101. Customizing the keyword dictionary.

A keyword is a term that can identify the topic information of a network resource (a web page resource or a download resource). For example, search engine users often enter phrases such as a software name plus "下载" (download) or a film title plus "高清晰" (high definition); the software name and the film title can be defined as the keywords of these phrases.

To extract the keywords of a network resource's topic information effectively, a keyword dictionary must first be built. Statistics on users' everyday search habits show that in film and television search engines, music search engines, and general search engines, users often enter film titles, song titles, singer names, and similar terms as search terms. A keyword dictionary can therefore be built from information about currently popular films, TV dramas, songs, singers, actors, and so on. The basic structure of the dictionary is (word, attribute), where the attribute describes the validity and category of the word, such as whether it is valid and whether it is a film title, song title, or software name.

This embodiment describes attributes in the following way (without being limited to it): the attribute information is described bit by bit in a one-byte character-type number, 8 bits in total. Each bit represents one attribute of the word, with 1 meaning the word has the attribute and 0 meaning it does not. For example, "英雄" (Hero) can be both a film title and a TV drama title, so its attribute can be expressed as 11100000. The meaning of each bit is shown in Table 1:

Bit        7         6     5         4           3       2         1      0
Attribute  Validity  Film  TV drama  Song title  Singer  Director  Actor  Software name

Table 1

The highest bit (bit 7 in Table 1) is defined as follows. It records whether a word in the keyword dictionary is valid; the set of invalid words and the set of valid words are mutually exclusive. A word A in the invalid set literally contains some word B in the valid set. For example, if "东" (East), the title of a film, is a valid word, then "东方" (Orient) and "东门" (East Gate) are invalid words. The guiding principle for identifying an invalid word is that it literally contains some valid word, does not belong to the valid set, and is not itself a term, such as a film title or song title, that could serve as a keyword.
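The bit layout of Table 1 can be sketched as a set of flags. The following is a minimal illustration, assuming Python's `enum.IntFlag`; the names `WordAttr` and `parse_attr` are hypothetical and do not appear in the original text.

```python
from enum import IntFlag


class WordAttr(IntFlag):
    """Hypothetical flag names for the 8 attribute bits of Table 1."""
    VALID = 1 << 7
    FILM = 1 << 6
    TV_DRAMA = 1 << 5
    SONG_TITLE = 1 << 4
    SINGER = 1 << 3
    DIRECTOR = 1 << 2
    ACTOR = 1 << 1
    SOFTWARE = 1 << 0


def parse_attr(bits: str) -> WordAttr:
    """Parse a bit string such as '1110 0000' (spaces allowed) into flags."""
    return WordAttr(int(bits.replace(" ", ""), 2))


# "英雄" (Hero): valid, film title, and TV drama title.
hero = parse_attr("11100000")
is_valid = bool(hero & WordAttr.VALID)
is_singer = bool(hero & WordAttr.SINGER)
```

Keeping validity in the same byte as the category bits means a single lookup answers both whether a segment should be kept and which kind of keyword it is.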

102. Extracting keywords.

For each network resource in the search engine database, the corresponding keywords must be extracted from its topic information.

First, the topic information of the network resource is segmented against the keyword dictionary using the maximum matching principle, and the resulting segments are filtered by their attributes. Words whose attribute marks them as invalid are removed, words whose attribute marks them as valid are kept, and the words kept become the keywords of the topic information of that network resource.

For example, suppose the keyword dictionary contains the following words:

东 (East) 1100 0000

东方 (Orient) 0000 0000

东游记 (Journey to the East) 1010 0000

东北 (Northeast) 0000 0000

The keywords extracted from the following web page titles are:

影片东的花絮 (Highlights of the film "East") ------ 东

东游记高清晰版 (Journey to the East, HD version) ------ 东游记

东北的小路 (The path in the Northeast) ------ (none)
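The extraction step above can be sketched as forward maximum matching followed by a validity filter. This is a minimal sketch under stated assumptions: the dictionary holds only the four example words, characters that match no dictionary word are skipped, and the names `KEYWORD_DICT`, `VALID`, and `extract_keywords` are hypothetical.

```python
# Hypothetical keyword dictionary: word -> 8-bit attribute (bit 7 = validity).
KEYWORD_DICT = {
    "东": 0b11000000,      # valid, film title
    "东方": 0b00000000,    # invalid
    "东游记": 0b10100000,  # valid, TV drama title
    "东北": 0b00000000,    # invalid
}
VALID = 0b10000000
MAX_LEN = max(len(word) for word in KEYWORD_DICT)


def extract_keywords(title: str) -> list[str]:
    """Segment by forward maximum matching, then keep only valid words."""
    keywords = []
    i = 0
    while i < len(title):
        match = None
        # Try the longest candidate first so "东北" wins over "东".
        for size in range(min(MAX_LEN, len(title) - i), 0, -1):
            candidate = title[i:i + size]
            if candidate in KEYWORD_DICT:
                match = candidate
                break
        if match is None:
            i += 1  # character not covered by the dictionary: skip it
            continue
        if KEYWORD_DICT[match] & VALID:
            keywords.append(match)
        i += len(match)
    return keywords
```

Because the invalid word "东北" is matched before the shorter valid word "东", the title "东北的小路" yields no keyword, which is the purpose of keeping invalid words in the dictionary.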

For a vertical search engine, such as a film and television search engine, the final set of keywords can be filtered further according to the other attributes of the extracted keywords. For example, the keywords extracted from the web page title "龙虎门甄子丹主演" (Dragon Tiger Gate, starring Donnie Yen) are "龙虎门" (Dragon Tiger Gate) and "甄子丹" (Donnie Yen). Since "甄子丹" is a person's name rather than a film or television title, it should be filtered out. The filtering rule can be chosen according to the specific category the search engine covers.

103. Building the keyword index.

Using a basic word segmentation dictionary (without being limited to it), each keyword of the topic information of each network resource is segmented, and a keyword index is built from each segment of a keyword to the network resources.

For example, consider the topic information of the following network resources:

Doc1: 不能说的秘密全集中文字幕 (The Secret That Cannot Be Told, complete series, Chinese subtitles);

Doc2: 不能说的秘密全集 (The Secret That Cannot Be Told, complete series);

Doc3: 铁三角DVD中文字幕 (Iron Triangle, DVD, Chinese subtitles);

Doc4: 铁三角全集 (Iron Triangle, complete series);

Doc5: 铁三角(主演任达华) (Iron Triangle, starring Simon Yam);

Doc6: 秘密全集 (Secret, complete series);

Their keywords are:

Doc1: 不能说的秘密;

Doc2: 不能说的秘密;

Doc3: 铁三角;

Doc4: 铁三角;

Doc5: 铁三角;

Doc6: 秘密.

Segmenting each keyword gives the following segments: 不能 (cannot), 说 (tell), 的 (particle), 秘密 (secret), 铁三角 (Iron Triangle).

The keyword index is built as follows:

"不能" maps to Doc1 and Doc2; "说" maps to Doc1 and Doc2; "的" maps to Doc1 and Doc2; "秘密" maps to Doc1, Doc2, and Doc6; "铁三角" maps to Doc3, Doc4, and Doc5.
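The keyword index in this example is an inverted index from keyword segments to documents. The sketch below is one possible construction, assuming a tiny basic dictionary and forward maximum matching as the segmenter; `BASE_DICT`, `segment`, and `build_keyword_index` are hypothetical names, and the text does not prescribe a particular segmentation algorithm for this step.

```python
from collections import defaultdict

# Hypothetical basic word segmentation dictionary for the example.
BASE_DICT = {"不能", "说", "的", "秘密", "铁三角"}

# Keywords of each document's topic information, as listed in the example.
DOC_KEYWORDS = {
    "Doc1": ["不能说的秘密"],
    "Doc2": ["不能说的秘密"],
    "Doc3": ["铁三角"],
    "Doc4": ["铁三角"],
    "Doc5": ["铁三角"],
    "Doc6": ["秘密"],
}


def segment(text, dictionary=BASE_DICT):
    """Forward maximum matching; unknown characters become 1-char segments."""
    max_len = max(len(word) for word in dictionary)
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary or size == 1:
                segments.append(piece)
                i += size
                break
    return segments


def build_keyword_index(doc_keywords):
    """Map each keyword segment to the set of documents whose keywords contain it."""
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            for seg in segment(keyword):
                index[seg].add(doc_id)
    return index


index = build_keyword_index(DOC_KEYWORDS)
```

The resource index of step 104 can be built the same way by segmenting the full topic information instead of the extracted keywords.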

104. Building the resource index (this step and building the keyword index may be performed in either order).

Using the basic word segmentation dictionary (without being limited to it), the topic information of each network resource is segmented, and a resource index is built from each segment to the network resources. For the six network resources of the previous example:

Doc1: 不能说的秘密全集中文字幕 (The Secret That Cannot Be Told, complete series, Chinese subtitles);

Doc2: 不能说的秘密全集 (The Secret That Cannot Be Told, complete series);

Doc3: 铁三角DVD中文字幕 (Iron Triangle, DVD, Chinese subtitles);

Doc4: 铁三角全集 (Iron Triangle, complete series);

Doc5: 铁三角(主演任达华) (Iron Triangle, starring Simon Yam);

Doc6: 秘密全集 (Secret, complete series);

After segmentation, the resource index is built as follows:

"不能" maps to Doc1, Doc2; "说" maps to Doc1, Doc2; "的" maps to Doc1, Doc2; "秘密" maps to Doc1, Doc2, Doc6; "全集" (complete series) maps to Doc1, Doc2, Doc4, Doc6; "中文" (Chinese) maps to Doc1, Doc3; "字幕" (subtitles) maps to Doc1, Doc3; "铁三角" maps to Doc3, Doc4, Doc5; "DVD" maps to Doc3; "主演" (starring) maps to Doc5; "任达华" (Simon Yam) maps to Doc5.

105. Configuring weights.

Weight configuration has two parts: configuring the static weight of each network resource, and configuring the weight of each segment of a keyword.

The static weight of a web page resource is determined by information such as the page's view count, its source, and how it is cited. The static weight of a download resource is determined by information such as its download count, file size, and file format. For example, for a download resource docid1, the static weight W1 can be determined from information such as the download count and size of docid1.

Configuring the weight of each segment of a keyword includes the following steps. First, the keyword is segmented using the basic word segmentation dictionary (without being limited to it). For example, the keyword "不能说的秘密" is split into four words: 不能, 说, 的, 秘密. Second, assume every keyword has weight = 1. Let the weight of word1 "不能" be W11, the weight of word2 "说" be W21, the weight of word3 "的" be W31, and the weight of word4 "秘密" be W41. Then W11 = W41 = 1/3 and W21 = W31 = 1/4; that is, the weight of each segment is determined by the ratio of the segment's length to the keyword's length.
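The length-ratio rule can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `segment_weights` is hypothetical, and the keyword weight defaults to 1 as assumed in the text. Note that a strict length ratio assigns 1/6 to each single-character segment of this six-character keyword, whereas the values listed in the text give 1/4; the sketch follows the ratio rule as stated.

```python
from fractions import Fraction


def segment_weights(segments, keyword_weight=Fraction(1)):
    """Weight each segment by its share of the keyword's total length.

    Assumes the segments of one keyword are distinct; a repeated segment
    would simply overwrite its earlier entry.
    """
    total_length = sum(len(seg) for seg in segments)
    return {
        seg: keyword_weight * Fraction(len(seg), total_length)
        for seg in segments
    }


weights = segment_weights(["不能", "说", "的", "秘密"])
```

Using `Fraction` keeps the weights exact, so the segment weights of a keyword always sum to the keyword's own weight.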

The configured static weights and segment weights can be added to the resource index and keyword index described above. As shown in FIG. 4, in a concrete implementation the static weights of all network resources are recorded together and indexed by the docid of each network resource. Word1, Word2, ..., Wordn each record the segment weight of that word in the keywords of the topic information of each network resource, indexed by the docid of the network resource to which the keyword belongs.

106. Determining the network resources to be sorted.

As shown in FIG. 5, when the user enters a word as the search term, the search term is first segmented using the basic word segmentation dictionary, giving the segment sequence word1, word2, ..., wordn. The resource index shown in FIG. 4 is then searched for the docid sequence of each segment wordk, k = 1, 2, ..., n, and the intersection of these docid sequences is taken, for example docid2, docid4, docid5. The network resources corresponding to that intersection are the network resources to be sorted.
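Candidate selection is an intersection of posting sets. The sketch below reuses the resource index from step 104 as data; the function name `candidates` and the sample queries are hypothetical, and a segment missing from the index is assumed to match no resource.

```python
# Resource index from the example in step 104: segment -> set of documents.
RESOURCE_INDEX = {
    "不能": {"Doc1", "Doc2"},
    "说": {"Doc1", "Doc2"},
    "的": {"Doc1", "Doc2"},
    "秘密": {"Doc1", "Doc2", "Doc6"},
    "全集": {"Doc1", "Doc2", "Doc4", "Doc6"},
    "中文": {"Doc1", "Doc3"},
    "字幕": {"Doc1", "Doc3"},
    "铁三角": {"Doc3", "Doc4", "Doc5"},
    "DVD": {"Doc3"},
    "主演": {"Doc5"},
    "任达华": {"Doc5"},
}


def candidates(query_segments, resource_index):
    """Return the documents that contain every segment of the search term."""
    posting_sets = [resource_index.get(seg, set()) for seg in query_segments]
    if not posting_sets:
        return set()
    return set.intersection(*posting_sets)


# Hypothetical query "秘密全集", already segmented into two words.
result = candidates(["秘密", "全集"], RESOURCE_INDEX)
```

Only resources that contain every segment survive, so the later weight calculation runs over a much smaller set than the whole index.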

107. Calculate the weights.

The total weight of each network resource to be sorted is calculated. docid2 is used as the example below.

Referring to FIG. 6, the keyword index (see FIG. 4) is searched for the segment weights of word1, word2, ..., wordn in the topic information of the network resource corresponding to docid2. The segment weights W12, W22, ..., Wn2 are retrieved and summed to give the keyword weight of the search term in that topic information: Wk(docid2) = W12 + W22 + ... + Wn2, or in general Wk(docid) = ∑Wmn, where m runs over the segments and the second subscript identifies the docid. If the docid list of some wordk does not contain docid2, the corresponding weight is Wk2 = 0; that is, the word is not a keyword segment of the topic information of the network resource corresponding to docid2.
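The accumulation, including the zero contribution of segments that are absent for a given docid, can be written as a short sum. The mapping layout (segment to docid to weight), the function name `keyword_weight`, and the sample weights are assumptions made for illustration.

```python
def keyword_weight(segments, docid, keyword_index):
    """Sum the segment weights of the search term for one docid.

    A segment that is not a keyword segment of this docid contributes 0.
    """
    return sum(keyword_index.get(seg, {}).get(docid, 0.0) for seg in segments)

# Hypothetical keyword index: segment -> {docid: segment weight}.
keyword_index = {
    "word1": {"docid2": 0.25, "docid4": 0.5},
    "word2": {"docid4": 0.5},
}
wk_docid2 = keyword_weight(["word1", "word2", "word3"], "docid2", keyword_index)
wk_docid4 = keyword_weight(["word1", "word2", "word3"], "docid4", keyword_index)
print(wk_docid2, wk_docid4)
```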

In addition, the static weight Ws(docid) of the network resource corresponding to docid2 is read from the resource index shown in FIG. 4.

Finally, the total weight W(docid) of the network resource corresponding to docid2 is calculated. The shares of Ws(docid) and Wk(docid) in W(docid) can be chosen according to the specific situation. For example, if Ws(docid) has share q1 and Wk(docid) has share q2, then W(docid) = q1*Ws(docid) + q2*Wk(docid).
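As a minimal sketch, the combination is a weighted sum of the two components. The default shares of 0.5 and the sample inputs are assumptions; the text only says that q1 and q2 are chosen according to the situation.

```python
def total_weight(ws, wk, q1=0.5, q2=0.5):
    """Combine static weight ws and keyword weight wk: W = q1*ws + q2*wk."""
    return q1 * ws + q2 * wk

# Hypothetical values for one docid: Ws = 0.8, Wk = 0.25.
w = total_weight(0.8, 0.25)
print(w)
```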

108. Sort.

Once the total weight of each network resource to be sorted has been calculated, the resources are sorted in descending order of total weight.

Sorting search results with the above scheme gives fairly good results. For example, suppose a user searches for "秘密预告片" ("secret trailer") and the results include web page title 1, "秘密预告片", and web page title 2, "不能说的秘密预告片" ("the secret that cannot be told, trailer"). Title 1 then receives a greater weight than title 2. The reason is that the keyword of title 1 is "秘密", the keyword of title 2 is "不能说的秘密", and "预告片" ("trailer") is an invalid keyword. When the keywords are segmented, "不能说的秘密" is split into the four segments 不能, 说, 的, 秘密. In the keyword index, "秘密" therefore has a weight equal to weight in the keyword of title 1 and weight/3 in the keyword of title 2.
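The example can be traced end to end with a short script. Several values here are assumptions made only so the script can run: both titles are given the same static weight of 1/2, q1 = q2 = 1/2, keyword weight = 1, and only the index entries for "秘密" are listed because the other segments do not occur in the search term.

```python
from fractions import Fraction

# Keyword index restricted to the one segment the query shares with the keywords.
keyword_index = {
    "秘密": {"title1": Fraction(1), "title2": Fraction(1, 3)},
}
static_weights = {"title1": Fraction(1, 2), "title2": Fraction(1, 2)}  # assumed equal
q1, q2 = Fraction(1, 2), Fraction(1, 2)                                # assumed shares

query_segments = ["秘密", "预告片"]  # "预告片" is an invalid keyword, so it is not indexed

totals = {}
for docid, ws in static_weights.items():
    wk = sum(keyword_index.get(seg, {}).get(docid, Fraction(0)) for seg in query_segments)
    totals[docid] = q1 * ws + q2 * wk

# Highest total weight first.
ranked = sorted(totals, key=totals.get, reverse=True)
print(ranked, totals)
```

With equal static weights the order is decided entirely by the keyword weight, so title 1 is placed ahead of title 2.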

109. Present the sorted results to the user.

The network resource with the highest actual total weight is placed first, so that the ranking better matches the user's needs.

As Embodiment 1 shows, q1 and q2 are adjustable. In special cases, because of how keywords are extracted, a user may enter a single character that is itself a film title, such as "东" ("East"). Many results may then have exactly "东" as their keyword, which makes the results too uniform: the whole page shows films about "东", which may differ from what the user actually wants. For this special case, q2 can be lowered and q1 raised.
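The effect of shifting weight from q2 to q1 can be seen on a two-resource toy case. All numbers are hypothetical: resource A has "东" as its entire keyword (keyword weight 1) but a low static weight, while resource B only contains "东" as one of three equally long segments (keyword weight 1/3) and has a high static weight.

```python
def rank(resources, q1, q2):
    """Sort docids by W = q1*Ws + q2*Wk, highest first."""
    return sorted(
        resources,
        key=lambda d: q1 * resources[d][0] + q2 * resources[d][1],
        reverse=True,
    )

# Hypothetical resources: docid -> (static weight Ws, keyword weight Wk).
resources = {"A": (0.2, 1.0), "B": (0.9, 1 / 3)}

keyword_heavy = rank(resources, q1=0.3, q2=0.7)  # exact keyword match dominates
static_heavy = rank(resources, q1=0.7, q2=0.3)   # static weight dominates
print(keyword_heavy, static_heavy)
```

Lowering q2 lets a resource with a strong static weight overtake exact single-character keyword matches, which is one way to diversify such a result page.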

Further, the embodiments of the present invention provide specific implementations of the setting step, the step of determining the network resources to be sorted, the weight calculation step, the sorting step, and the presentation step. The setting step comprises the sub-steps of customizing the keyword dictionary, extracting keywords, building the keyword index, building the resource index, and configuring weights. These implementations better support the present invention.

Further, q1 and q2 in Embodiment 1 of the present invention are adjustable, so they can be tuned to the specific situation to meet users' various needs.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.