CN103778206A - Method for providing network service resources - Google Patents

Method for providing network service resources

Info

Publication number
CN103778206A
Authority
CN
China
Prior art keywords
interest
user
classification
document
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410015647.1A
Other languages
Chinese (zh)
Inventor
张明川
郑瑞娟
吴庆涛
杨春蕾
魏汪洋
白秀玲
崔敏
李晨
李莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Science and Technology
Original Assignee
Henan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Science and Technology
Priority to CN201410015647.1A
Publication of CN103778206A
Legal status: Pending

Abstract

A method for providing network service resources: the network service resources are first classified, a retrieval scheme is then provided according to the user's interests, and the classified network service resources are provided according to that scheme. To classify a service resource, its feature vector is extracted, the probability with which each attribute of the feature vector occurs in each category and the weight of each attribute are calculated, and the weighted naive Bayes formula is then used to obtain the probability of belonging to each category; the category with the largest probability is selected as the classification category of the service resource. Classification combines the Bayesian classification algorithm with the calculation of attribute similarity, which improves the accuracy of service resource classification. Retrieval is a personalized, interest-based method developed within the framework of the vector space model, which reduces the time and space complexity of the retrieval algorithm and improves retrieval efficiency.

Description

A method for providing network service resources

Technical field

The invention relates to the field of providing Internet service resources, and in particular to a method for providing network service resources.

Background

Service resource classification means analyzing the services and resources that already exist in a network according to their attributes and characteristics, and assigning each of them to a specific category. The rapid development of Internet technology and the spread of computers have made people increasingly dependent on network service resources. Classifying these resources is a complex process that involves preprocessing the resources, extracting feature vector sets, and performing the classification itself. It can be understood as a systematic process in which, using certain methods and models and following certain rules, the resources on the network are comprehensively analyzed, selected, processed, arranged, organized, and classified, so that they form an ordered body of service resources that users can obtain and use efficiently. Classification turns a complex, scattered collection of resources into an ordered structure and a meaningful whole, so that the resources can be accessed and used at a higher level according to rules of a specific form. Services and resources on the network are now abundant, and classifying this mass of service resources accurately has become a key problem for resource classification technology.

At present, service resources are mostly classified manually by users according to certain rules. When the volume of resources is large, this consumes a great deal of labor and is very inefficient. When a user defines a new category, the resources that previously fell outside the defined categories must be classified again, and doing this manually is too costly. In recent years some researchers have applied intelligent learning methods to classification with some success. Commonly used intelligent classification methods include clustering algorithms and decision tree algorithms. Clustering algorithms can learn without supervision, but in a high-dimensional data space clusters often exist only in certain subspaces, and different clusters are associated with different subspaces. Because of this "dimension effect", traditional clustering algorithms generally cannot cluster high-dimensional data effectively without special processing. The intuitive representation of a decision tree is easy to convert into standard database queries, and its inductive method is effective, especially for large data sets. Its scalability is poor, however: the running time grows sharply as the data volume increases.

Personalized retrieval is currently both a focus and a difficulty in the field of service resource retrieval. Research in this field is broad and touches on many problems. Researchers have proposed a variety of techniques from different angles, chiefly:

① Web database technology (Web Database): building databases of users and related data;
② Process tracking technology (Process Tracking), such as cookies;
③ Agent technology (Agent): an agent is a computing entity that acts continuously and autonomously in a distributed system; it is independent, autonomous, and interactive, and the interaction between the user and the system can be handled well through agents;
④ Data mining technology (Data Mining): extracting implicit, previously unknown knowledge and rules of potential value to decision-making from massive data, and using these rules to predict users' upcoming behavior;
⑤ Push technology (Push): automatically searching for service resources of interest to the user according to user-defined criteria and actively delivering them to a "place" the user specifies;
⑥ Information filtering technology (Information Filtering): a technique for filtering large information streams and providing the user with a relevant subset of the information.

Information filtering can be divided into rule-based filtering, collaborative filtering, and content-based filtering. All three aim to automatically recommend the most valuable service resource information to users according to their interests and needs, and to save as much of the user's reading time as possible.

Traditional service resource retrieval technology meets some retrieval needs, but because it is general-purpose it cannot satisfy users' more complex queries. With the information explosion, people demand more from retrieval systems in functionality, intelligence, and retrieval quality, and expect results that are more accurate, more concise, and better matched to individual needs.

Summary of the invention

To solve the problem that traditional retrieval technology struggles to meet people's requirements for the functionality, intelligence, and retrieval quality of retrieval systems, the invention provides a method for providing network service resources that meets users' diverse real-time needs and serves them better and faster.

The technical solution adopted by the invention is a method for providing network service resources: the network service resources are first classified, a retrieval scheme is then provided according to the user's interests, and the classified network service resources are provided according to that scheme. Classifying the network service resources comprises the following steps:

1) Predefine m categories with class labels [formula image]. Then extract several features of the service resource X to be classified and combine them into an n-dimensional feature vector [formula image] that represents the resource; its components describe the n measurements taken on the n attributes;

2) For each attribute [formula image] of the n-dimensional feature vector with attribute value [formula image], and for each category [formula image], calculate the probability [formula image] that the attribute value occurs in that category, as well as the probability [formula image] that the category itself occurs;

3) Determine the weight [formula image] of each attribute [formula image] in the n-dimensional feature vector;

The method is as follows. First, define two object spaces [formula image] and [formula image], and let [formula image] denote the distance between the d-th dimension attribute sets [formula image] and [formula image] of the two object spaces:

[formula images]

In the formula, [formula image] and [formula image] are the center values of the attribute sets [formula image] and [formula image], respectively, and [formula image] and [formula image] are each half of the range covered by the corresponding attribute set, that is:

[formula image]

where [formula image] and [formula image] are the minimum and maximum values of the attribute set [formula image];

Then, define the training sample set of category [formula image] as [formula image], where [formula image] is the number of samples of class i. Its [formula image]-th attribute set is denoted [formula image]; the expected value of the set [formula image] is [formula image], its minimum value is [formula image], and its maximum value is [formula image];

Sort the values [formula image] and [formula image] of attribute [formula image] in ascending order, and let [formula image] denote the category identifiers after sorting. The distance [formula image] of attribute [formula image] between categories is expressed as:

[formula image]

The normalized attribute similarity [formula image] of attribute [formula image] is:

[formula image]

The weight of each attribute is then calculated by:

[formula image];
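The distance, similarity, and weight formulas of step 3) are images that are not reproduced in this text, so they are not implemented here. What the text does state explicitly can be sketched: for each class, the attribute set has a center value, a minimum, a maximum, and a half-range equal to half the span between minimum and maximum, and the classes are then sorted in ascending order of center value. The sketch below assumes the center value is the arithmetic mean of the class's attribute values (the text calls it both "center value" and "expected value"); the function names and the toy data are hypothetical.

```python
from statistics import mean


def attribute_stats(samples_by_class, k):
    """Summarize attribute k per class: center, min, max, and half-range."""
    stats = {}
    for label, samples in samples_by_class.items():
        values = [sample[k] for sample in samples]
        lo, hi = min(values), max(values)
        stats[label] = {
            "center": mean(values),       # assumption: center = arithmetic mean
            "min": lo,
            "max": hi,
            "half_range": (hi - lo) / 2,  # half of the range covered by the set
        }
    return stats


def classes_sorted_by_center(stats):
    """Return class labels in ascending order of the attribute's center value."""
    return sorted(stats, key=lambda label: stats[label]["center"])


# Toy training data (hypothetical): two numeric attributes per sample.
training = {
    "doc": [(1.0, 10.0), (3.0, 14.0)],
    "img": [(8.0, 11.0), (10.0, 13.0)],
}
stats = attribute_stats(training, k=0)
order = classes_sorted_by_center(stats)
```

These per-class summaries are the inputs that the between-category distance, the normalized similarity, and finally the attribute weight are computed from.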

4) Using the results of step 2) and step 3), apply the weighted naive Bayes formula [formula image] to obtain the probability that attribute [formula image] belongs to each category, compare these probabilities, and select the largest one as the classification category of the service resource represented by the n-dimensional feature vector, which completes the classification of the service resource.

In the weighted naive Bayes formula [formula image] of step 4), [formula image] denotes the class label obtained once the service resource has been classified;

[formula image] denotes the posterior probability that the n-dimensional feature vector belongs to a given class [formula image], [formula image], where [formula image] is the number of training samples of class [formula image] that have the value [formula image] in attribute [formula image], and [formula image] is the number of training samples in [formula image].
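The classification rule in steps 2) to 4) can be sketched as follows. The formula itself is an image that is not reproduced in this text, so the sketch assumes the common weighted naive Bayes form, in which each class is scored by P(C_i) multiplied by the product of P(x_k | C_i) raised to the attribute weight w_k, with P(x_k | C_i) estimated as s_ik / s_i as described above. The symbol names, the function names `train` and `classify`, the toy data, and the weights are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train(samples, labels):
    """Count s_i (samples per class) and s_ik (samples of a class with a given value in attribute k)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, k) -> Counter of attribute values
    for x, y in zip(samples, labels):
        for k, value in enumerate(x):
            value_counts[(y, k)][value] += 1
    return class_counts, value_counts


def classify(x, class_counts, value_counts, weights):
    """Return the class with the largest weighted score, computed in log space."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, s_i in class_counts.items():
        score = math.log(s_i / total)        # log P(C_i)
        for k, value in enumerate(x):
            s_ik = value_counts[(label, k)][value]
            if s_ik == 0:                    # value never seen in this class
                score = -math.inf
                break
            score += weights[k] * math.log(s_ik / s_i)  # w_k * log P(x_k | C_i)
        if score > best_score:
            best_label, best_score = label, score
    return best_label                        # None if every class scores zero


# Toy data (hypothetical): attributes are (file extension, size bucket).
samples = [("pdf", "large"), ("pdf", "small"), ("jpg", "large"), ("jpg", "large")]
labels = ["document", "document", "image", "image"]
class_counts, value_counts = train(samples, labels)
weights = [0.7, 0.3]  # would come from step 3); fixed here for illustration

predicted = classify(("pdf", "small"), class_counts, value_counts, weights)
```

Working in log space avoids underflow when many attribute probabilities are multiplied. The sketch applies no smoothing, so an attribute value never observed in a class rules that class out.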

In step 1), the features extracted from the service resource X to be classified include the file name, the file extension, the text content, and the file size.
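As a rough illustration of this feature extraction, the sketch below reads the four listed features from a local file. It assumes the service resource is available as a file on disk and that its text is UTF-8; the function name `extract_features`, the dictionary keys, and the `max_chars` cutoff are hypothetical.

```python
import tempfile
from pathlib import Path


def extract_features(path, max_chars=1000):
    """Collect file name, extension, leading text content, and size for one file."""
    p = Path(path)
    # Undecodable bytes are skipped; only the first max_chars characters are kept.
    text = p.read_text(encoding="utf-8", errors="ignore")[:max_chars]
    return {
        "file_name": p.stem,
        "extension": p.suffix.lower(),
        "text": text,
        "size_bytes": p.stat().st_size,
    }


# Usage on a temporary file, so that the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Report.TXT"
    sample.write_text("quarterly service report", encoding="utf-8")
    features = extract_features(sample)
```

In practice the text content would still have to be turned into discrete or numeric attributes before it can enter the n-dimensional feature vector.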

To provide the retrieval scheme according to the user's interests, first define the total number of documents in the document collection D as N. Any document belonging to D can be expressed as a t-dimensional vector:

[formula image]

where t is the number of index terms, and the vector component [formula image] is the weight of the i-th index term [formula image] in document [formula image]. Retrieval is then performed according to the user's interests, in the following steps:

Step 1. Obtain the user's interest information and represent it formally, as a vector or a graph, to form the user interest profile;

Step 2. Represent the user's interests by means of a classification directory, and map the directory to a tree structure to form the user interest tree. Each node in the tree represents a category, and the node's weight represents the user's degree of interest in that category;

Step 3. Use two-tuple interest vectors [formula image] to represent the user interest profile; the interest profile library formed by the interest profile of user i is then represented as:

[formula image]

where [formula image] represents a category in the classification directory, and [formula image] is the weight of [formula image] in the user interest tree, representing the user's degree of interest in [formula image], [formula image];

Step 4. From the formula [formula image], obtain the categories to which a document belongs, and calculate the interest correlation factor J of the document [formula image] from them: J equals the sum of the weights of the categories to which the document belongs;

Step 5. Extract the query vector [formula image] from the user's retrieval request, then use the formula for the similarity between two vectors in a space, from matrix analysis, to calculate the similarity between the query vector [formula image] and the document vector [formula image]. Record this as the document's correlation factor I, and take the m documents with the highest I values;

Step 6. Extract the user's interest profile [formula image] from the user interest profile library [formula image], and then, from the interest correlation factor J and the document correlation factor I obtained in steps 4 and 5, calculate the interest similarity between document [formula image] and the query vector [formula image] with the following formula:

[formula image]

where [formula image] is the interest weight, [formula image], which reflects how strongly the document's interest correlation factor influences the result;

Step 7. Sort the m documents by the interest similarity SCOREi obtained in step 6 and display them on the interface, giving priority to those of the m documents that relate to the user's interests;

Step 8. Track and record the user's access to the retrieval results, and use this to update the user's interest profile library.
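Steps 4 to 7 can be sketched as a small ranking routine. Two points are assumptions rather than the patent's own formulas, which are images not reproduced in this text: the vector similarity of step 5 is taken to be cosine similarity, and the interest similarity of step 6 is taken to be a linear blend (1 - alpha) * I + alpha * J, with alpha standing in for the interest weight. All function names, the document layout, and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def interest_factor(categories, profile):
    """Step 4: J is the sum of the profile weights of the document's categories."""
    return sum(profile.get(category, 0.0) for category in categories)


def rank(query, docs, profile, m=2, alpha=0.5):
    """Steps 5 to 7: keep the m most similar documents, then re-rank them by interest."""
    by_similarity = sorted(
        ((cosine(query, doc["vector"]), doc) for doc in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )[:m]
    scored = [
        ((1 - alpha) * i_factor + alpha * interest_factor(doc["categories"], profile), doc["id"])
        for i_factor, doc in by_similarity
    ]  # assumption: linear blend of I and J
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


profile = {"sports": 0.6, "news": 0.3, "music": 0.1}
docs = [
    {"id": "a", "vector": (1, 0, 0), "categories": ["music"]},
    {"id": "b", "vector": (1, 1, 0), "categories": ["sports", "news"]},
    {"id": "c", "vector": (0, 0, 1), "categories": ["sports"]},
]
ranking = rank((1, 0, 0), docs, profile, m=2, alpha=0.5)
```

In this toy run document "a" matches the query best, but "b" belongs to categories the user cares about more, so the blended score places "b" first; "c" is dropped in step 5 because it shares no terms with the query.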

The user's interest profile library is updated in step 8 as follows:

① Initialize the user interest tree so that every node has an original weight [formula image] (where 0<k<n+1); this value is the number of times the user has accessed all documents under the node;

② Leave the leaf-node weights unchanged and recalculate the weight of every non-leaf node: [formula image], where [formula image] are the child nodes of the non-leaf node and x is the number of its child nodes;

A leaf node is a smallest classification category in the user interest tree, and a non-leaf node is a classification category that has sub-categories;

③ Whenever the user accesses documents under some nodes, repeat the two steps above;

④ Update the user interest profile from the non-leaf node weights updated in step ②:

[formula image]

where [formula image], [formula image], and [formula image] is the total number of nodes in the interest tree; [formula image] is then the user's personal interest profile.
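The update procedure ① to ④ can be sketched as follows. The recalculation and normalization formulas are images that are not reproduced in this text, so the sketch makes two assumptions: a non-leaf node's weight is the sum of its children's weights (consistent with a weight being the number of visits to all documents under the node), and the profile weight of a node is its weight divided by the total over all nodes of the tree. The tree layout and all names are hypothetical.

```python
def recompute(tree, weights, node):
    """Step ②: keep leaf weights, set each non-leaf weight to the sum of its children."""
    children = tree.get(node, [])
    if children:
        weights[node] = sum(recompute(tree, weights, child) for child in children)
    return weights[node]


def interest_profile(tree, weights, root):
    """Step ④: normalize node weights so that they sum to 1 over the whole tree."""
    recompute(tree, weights, root)
    total = sum(weights.values())
    return {node: (w / total if total else 0.0) for node, w in weights.items()}


# Interest tree as parent -> children; leaves do not appear as keys.
tree = {"root": ["sports", "music"], "sports": ["football", "tennis"]}
# Step ①: visit counts recorded on the leaf categories.
visits = {"root": 0, "sports": 0, "music": 2, "football": 3, "tennis": 1}

profile = interest_profile(tree, visits, "root")

# Step ③: after a new visit, bump the leaf count and repeat the update.
visits["tennis"] += 1
profile = interest_profile(tree, visits, "root")
```

Because only leaf counts are edited directly and every non-leaf weight is derived from them, repeating the update after each visit keeps the tree consistent.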

Beneficial effects: compared with the prior art, the invention has the following advantages:

1) The invention introduces the mathematical concept of similarity and applies the attribute-similarity calculation within the weighted naive Bayes formula to determine the weight of each feature attribute. Applied to service resource classification, the algorithm computes, for an unknown service resource sample, the probability that it belongs to each category according to the weighted naive Bayes formula and assigns it the category with the highest probability. The resulting attribute-similarity-based classification method greatly improves the accuracy of service resource classification;

2) For retrieval, the user's interests are extracted and analyzed, and the sum of the weights of the nodes on the interest path is used as the interest correlation factor, so that the user interest model is built accurately. Because user interests change over time, the model is updated promptly, which keeps it reliable over time, brings retrieval closer to the user's actual needs, achieves personalized, interest-based retrieval of service resources, and markedly improves the retrieval results;

3) By effectively combining the classification and retrieval of service resources, the invention proposes a classification-based scheme for providing service resources. When serving users, the scheme improves the accuracy of network service resource classification, reduces the time needed to search massive network service resources, and markedly improves efficiency.

Description of drawings

Fig. 1 is a schematic diagram of the naive Bayes model used in the invention;

Fig. 2 is a schematic diagram of the user interest tree of the invention;

Fig. 3 is a flow chart of retrieval over the classified service resources.

Detailed description

A method for providing network service resources: the network service resources are first classified, a retrieval scheme is then provided according to the user's interests, and the classified network service resources are provided according to that scheme. Classifying the network service resources comprises the following steps:

1) Predefine m categories with class labels [formula image]. Then extract several features of the service resource X to be classified and combine them into an n-dimensional feature vector [formula image] that represents the resource; its components describe the n measurements taken on the n attributes;

2) For each attribute [formula image] of the n-dimensional feature vector with attribute value [formula image], and for each category [formula image], calculate the probability [formula image] that the attribute value occurs in that category, as well as the probability [formula image] that the category itself occurs;

3) Determine the weight [formula image] of each attribute [formula image] in the n-dimensional feature vector;

The method is as follows. First, define two object spaces [formula image] and [formula image], and let [formula image] denote the distance between the d-th dimension attribute sets [formula image] and [formula image] of the two object spaces:

[formula images]

In the formula, [formula image] and [formula image] are the center values of the attribute sets [formula image] and [formula image], respectively, and [formula image] and [formula image] are each half of the range covered by the corresponding attribute set, that is:

[formula image]

where [formula image] and [formula image] are the minimum and maximum values of the attribute set [formula image];

Then, define the training sample set of category [formula image] as [formula image], where [formula image] is the number of samples of class i. Its [formula image]-th attribute set is denoted [formula image]; the expected value of the set [formula image] is [formula image], its minimum value is [formula image], and its maximum value is [formula image] [formula image];

Arrange the [formula image 036] and [formula image missing] of attribute [formula image 054] in ascending order; [formula image 062] denotes the category identifier after sorting. The between-category distance [formula image 064] of attribute [formula image 054] is expressed as:

[formula image 066]

The normalized attribute similarity of attribute [formula image 054] is [formula image 068]:

[formula image 070]

The weight of each attribute is calculated by the following formula:

[formula image 072];

4) Using the results of steps 2) and 3), apply the weighted naive Bayes formula [formula image 074] to obtain the probability that the attributes belong to each category. Compare these probabilities and select the category with the largest one as the classification category of the service resource represented by the n-dimensional feature vector, which completes the classification of the service resource.

In the weighted naive Bayes formula [formula image 074] of step 4), [formula image 076] denotes the class label obtained after the service resource is classified;

[formula image 014] denotes the posterior probability that the n-dimensional feature vector belongs to a given class [formula image 012], [formula image 078], where [formula image 080] is the number of training samples of class [formula image 012] that have the value [formula image 010] in attribute [formula image 082], and [formula image 084] is the number of training samples in [formula image 012].
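
The classification in steps 2) to 4) can be sketched as follows. This is a minimal illustration, assuming the weighted naive Bayes formula takes the common form argmax over i of P(C_i) · ∏_k P(x_k | C_i)^(w_k); the original formula image is not preserved, so that form, the function names `train` and `classify`, and the toy data are all assumptions. P(x_k | C_i) is estimated as a ratio of training-sample counts, as the text describes, and the attribute weights are passed in directly rather than derived from the attribute-similarity calculation.

```python
import math
from collections import Counter, defaultdict


def train(samples, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)           # training samples per class
    value_counts = defaultdict(Counter)      # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for k, value in enumerate(x):
            value_counts[(c, k)][value] += 1
    return class_counts, value_counts


def classify(x, class_counts, value_counts, weights):
    """Return the class maximizing P(C) * prod_k P(x_k | C) ** w_k (assumed form).

    Works in log space; returns None if every class has probability zero.
    """
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, s_i in class_counts.items():
        score = math.log(s_i / total)                    # log P(C)
        for k, value in enumerate(x):
            s_ik = value_counts[(c, k)][value]
            if s_ik == 0:                                # value never seen in this class
                score = -math.inf
                break
            score += weights[k] * math.log(s_ik / s_i)   # w_k * log P(x_k | C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data (hypothetical): attributes are (file extension, size bucket).
samples = [("pdf", "large"), ("pdf", "small"), ("mp3", "large"),
           ("mp3", "small"), ("pdf", "small")]
labels = ["document", "document", "audio", "audio", "document"]
class_counts, value_counts = train(samples, labels)
predicted = classify(("pdf", "small"), class_counts, value_counts, [1.0, 1.0])
```

No smoothing is applied here, so a class that never exhibits an observed attribute value is ruled out entirely; a practical implementation would probably smooth the counts.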

In step 1), several feature vectors of the service resource X to be classified are extracted; they comprise the file name, file extension, text content, and file size. The full feature set of a file can involve nearly a hundred features, and the quality of the chosen set directly affects the classification result. The selected attributes should therefore represent the characteristics of the file, but should not be too numerous: too many attributes increase the computational cost of classification and introduce noise, which lowers classification accuracy. After comprehensive analysis, the following file attributes were selected:

a. File name: the keywords in the file name can be analyzed to classify the file;

b. File extension: file extensions can be filtered and used for classification;

c. Text content: if the file extension indicates a text-type file, the text content is analyzed and used for classification;

d. File size: the file size is obtained, assigned a weight, and then used for classification.
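
As a rough illustration of the four attributes listed above, the sketch below builds a feature record for a single file. The function name `extract_features`, the set of extensions treated as text-type, and the way keywords are split out of the file name are assumptions made for this example, not details given in the patent.

```python
import os
import re

TEXT_EXTENSIONS = {".txt", ".md", ".csv"}  # assumed set of "text-type" extensions


def extract_features(path, content=b""):
    """Build the four features (name keywords, extension, text, size) for one file."""
    name, ext = os.path.splitext(os.path.basename(path))
    ext = ext.lower()
    # a. keywords from the file name: split on non-alphanumeric characters
    keywords = [w.lower() for w in re.split(r"[^0-9A-Za-z]+", name) if w]
    # c. text content is only considered for text-type files
    text = content.decode("utf-8", errors="ignore") if ext in TEXT_EXTENSIONS else None
    return {
        "name_keywords": keywords,
        "extension": ext,          # b. file extension
        "text": text,
        "size": len(content),      # d. file size in bytes
    }


features = extract_features("/data/NBA_finals-report.TXT", b"Rockets win")
```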

In the present invention, the retrieval scheme is provided according to the user's interests as follows. First, the total number of documents in the document collection D is defined as N, and any document belonging to D can be expressed as a t-dimensional vector:

[formula image 086]

where t is the number of index terms, and the vector component [formula image 088] is the weight of the i-th index term [formula image 090] in document [formula image 092]. Retrieval is then performed according to the user's interests, in the following steps:

Step 1. Obtain the user's interest information, then represent it formally as a vector or a graph, which forms the user interest profile;

Step 2. Represent user interest by means of a classification directory, and map the directory onto a tree structure to form the user interest tree. Each node in the tree represents a category, and the node's weight represents the user's degree of interest in that category;

Step 3. Use the two-tuple interest vector [formula image 094] to represent the user interest profile; the interest profile library formed by the interest profile of user i is then represented as:

[formula image 096]

In the formula, [formula image 098] represents a category in the classification directory;

[formula image 100] is the weight of [formula image 098] in the user interest tree and represents the user's degree of interest in [formula image 102], [formula image 104];

Step 4. From the formula [formula image 096], the categories to which a document [formula image 106] belongs can be obtained, and the interest correlation factor J of document [formula image 106] is calculated from them: J equals the sum of the weights of all categories to which the document belongs;

Step 5. Extract the query vector [formula image 108] from the user's retrieval request, then calculate the similarity between the query vector [formula image 108] and the document vector [formula image 092] using the formula from matrix analysis for the similarity between two vectors in a space. Record this as the document's correlation factor I, and take the m documents with the highest I values;

Step 6. Extract the user's interest profile [formula image 110] from the user interest profile library [formula image 110]. Then, using the interest correlation factor J and the document correlation factor I obtained in Steps 4 and 5, calculate the interest similarity between document [formula image 092] and the query vector [formula image 108] with the following formula:

[formula image 112]

In the formula, [formula image 114] is the interest weight, [formula image 116], which reflects how strongly the document's interest correlation factor influences the result;

Step 7. Sort the m documents by the interest similarity SCOREi obtained in Step 6 and display them on the interface, so that the documents among these m that relate to the user's interests are recommended first;

Step 8. Track and record the user's access to the retrieval results, and use this record to update the user's interest profile library.
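
Steps 4 to 7 can be sketched as below. The similarity and scoring formula images are not preserved in this copy, so two assumptions are made and labeled in the code: the "similarity between two vectors in a space" is taken to be cosine similarity, and the interest similarity is taken to be the linear blend (1 − α)·I + α·J, with α standing in for the interest weight. All function names, field names, and data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def interest_factor(doc_categories, profile):
    """Step 4: J is the sum of the profile weights of the document's categories."""
    return sum(profile.get(c, 0.0) for c in doc_categories)


def rank(query, docs, profile, m, alpha):
    """Steps 5-7: keep the top-m documents by I, then re-rank by the blended score."""
    scored = [(cosine(query, d["vector"]), d) for d in docs]   # correlation factor I
    top_m = sorted(scored, key=lambda p: p[0], reverse=True)[:m]
    blended = [
        # Assumed SCORE form: (1 - alpha) * I + alpha * J
        ((1 - alpha) * i + alpha * interest_factor(d["categories"], profile), d["id"])
        for i, d in top_m
    ]
    return [doc_id for _, doc_id in sorted(blended, reverse=True)]


profile = {"sports": 0.5, "basketball": 0.3, "NBA": 0.2, "finance": 0.0}
docs = [
    {"id": "d1", "vector": [1, 0, 1], "categories": ["finance"]},
    {"id": "d2", "vector": [1, 1, 0], "categories": ["sports", "basketball", "NBA"]},
    {"id": "d3", "vector": [0, 0, 1], "categories": ["sports"]},
]
ranking = rank([1, 0, 1], docs, profile, m=2, alpha=0.5)
```

With α = 0 the ranking reduces to plain vector-space retrieval; raising α lets the interest profile reorder the m candidates, which is the effect the interest weight is described as controlling.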

The user's interest profile library is updated in Step 8 as follows:

① Initialize the user interest tree so that each node has an original weight [formula image 118] (where 0<k<n+1); this value represents the number of times the user has accessed all documents under the node;

② Keep the leaf-node weights unchanged and recalculate the weight of each non-leaf node: [formula image 120], where [formula image 122] is a child node of the non-leaf node and x is the number of its child nodes;

A leaf node is a smallest classification category in the user interest tree; a non-leaf node is a classification category that has sub-categories;

③ If the user accesses documents in some nodes, repeat the two steps above;

④ Update the user interest profile from the non-leaf-node weights updated in step ②:

[formula image 124]

In the formula, [formula image 126], [formula image 128], [formula image 130] is the total number of nodes in the interest tree, and [formula image 110] is then the user's personal interest profile.
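
The update procedure can be sketched as follows. The formula images for the non-leaf weight and for the profile update are not preserved, so this sketch assumes that a non-leaf node's weight is the sum of its children's weights and that the profile is obtained by dividing each node weight by the total over all nodes. Both choices, and all names and data, are assumptions for illustration only.

```python
def recompute(tree, counts, node):
    """Step 2: leaves keep their access counts; a non-leaf node's weight is
    assumed to be the sum of its children's weights."""
    children = tree.get(node, [])
    if not children:
        return counts.setdefault(node, 0)
    counts[node] = sum(recompute(tree, counts, child) for child in children)
    return counts[node]


def build_profile(tree, leaf_counts, root):
    """Step 4: normalize node weights so they sum to 1 (assumed normalization)."""
    counts = dict(leaf_counts)
    recompute(tree, counts, root)
    total = sum(counts.values())
    return {n: (w / total if total else 0.0) for n, w in counts.items()}


# Hypothetical interest tree: parent -> list of child categories.
tree = {"sports": ["basketball", "football"], "basketball": ["NBA", "CBA"]}
leaf_counts = {"NBA": 3, "CBA": 1, "football": 2}   # access counts per leaf
profile = build_profile(tree, leaf_counts, "sports")
```

Step ③ then amounts to incrementing the relevant leaf count when the user opens a document and calling `build_profile` again.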

In Step 1 above, obtaining user interest information means using a specific method to collect information that reflects the user's interests, in order to generate a feature file that represents those interests, namely the user interest profile. If a user visits a page or document frequently, or stays on it for a long time, the user is interested in that page or document. User behavior, such as how the user accesses retrieval results, therefore reflects the user's interests. To learn these interests, a computer can track, record, and mine this access information, extract the information that reflects the user's interests, and generate the user interest profile;

The user interest information obtained is represented formally as a vector or a graph, which forms the user interest profile. The profile is stored on a computer, is highly structured, and can be generated automatically and updated dynamically. "User interest profile" and "interest profile" in this document both refer to the individual user's interest profile. Building the user interest profile is the basis of, and the key to, personalized retrieval.

In Step 2 above, the user interest tree has the following meaning:

In most searches, the user is in fact interested in a topic. If the user is interested in one retrieved document, they should be equally interested in other documents on the same topic. Because documents under the same category of a classification scheme share a topic, user interest is represented by means of a classification directory and mapped onto a tree structure, the user interest tree (shown in Figure 2). Nodes in the user interest tree represent categories. In actual retrieval, the user's interest differs from category to category, so the node weights that represent the degree of interest also differ. When the documents in the corpus are classified, each document falls under some node of the interest tree; accordingly, each document in the interest tree has an "interest path". For example, in the interest tree of Figure 2, the interest path of the document "Yao Ming Recovers from Injury and Returns to the Rockets" is: Sports ~ Basketball ~ NBA. A document's interest correlation factor expresses how much the user prefers the document, and equals the sum of the weights of all nodes on the document's interest path. In this example, the interest correlation factor of "Yao Ming Recovers from Injury and Returns to the Rockets" is: J = w[Sports] + w[Basketball] + w[NBA].
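
The worked example above can be reproduced in a few lines. The node weights below are made-up values for illustration; the text gives no numbers for Figure 2, and the function name `interest_path_factor` is hypothetical.

```python
def interest_path_factor(path, weights):
    """J = sum of the weights of every node on the document's interest path."""
    return sum(weights[node] for node in path)


# Hypothetical node weights; the source gives no concrete values.
weights = {"Sports": 0.5, "Basketball": 0.3, "NBA": 0.2}
path = ["Sports", "Basketball", "NBA"]   # interest path of the example document
J = interest_path_factor(path, weights)  # w[Sports] + w[Basketball] + w[NBA]
```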

Claims (5)

1. A method for providing network service resources, in which the network service resources are first classified, a retrieval scheme is then provided according to the user's interests, and the classified network service resources are provided according to that retrieval scheme, characterized in that classifying the network service resources comprises the following steps:
1) predefine m categories with class labels [formula image 002], then extract several feature vectors of the service resource X to be classified and combine them into an n-dimensional feature vector [formula image 004] that characterizes the service resource, describing the n measurements of the n attributes respectively;
2) for the attribute value [formula image 010] of each attribute [formula image 008] in the n-dimensional feature vector and for each category [formula image 012], calculate the probability [formula image 014] that the attribute value occurs under category [formula image 012], and the probability [formula image 016] that category [formula image 012] occurs;
3) determine the weight [formula image 018] of each attribute [formula image 008] in the n-dimensional feature vector;
the method is as follows: first, define two object spaces [formula image 020] and [formula image 022]; [formula image 024] is the distance between the d-th dimension attribute sets [formula image 026] and [formula image 028] of the two object spaces:
[formula image 030]
[formula image 032]
[formula image 034]
in the formula, [formula image 036] and [formula image 038] are the central values of the attribute sets [formula image 026] and [formula image 028], respectively; [formula image 040] and [formula image 042] are each half of the coverage range of the corresponding attribute set, namely: [formula image missing]
where [formula image 046] and [formula image 048] are the minimum and maximum values of the attribute set [formula image 026], respectively;
then, define the training sample set of category [formula image 012] as [formula image 050], where [formula image 052] is the number of samples of class i; its [formula image 054]-th attribute set is denoted [formula image 026], and the set [formula image 026] has expected value [formula image 056], minimum [formula image 046], and maximum [formula image 048] [formula image 058];
arrange the [formula image 036] and [formula image missing] of attribute [formula image 054] in ascending order; [formula image 062] denotes the category identifier after sorting, and the between-category distance [formula image 064] of attribute [formula image 054] is expressed as:
[formula image 066]
the normalized attribute similarity of attribute [formula image 054] is [formula image 068]:
[formula image 070]
the weight of each attribute is calculated by the following formula:
[formula image 072];
4) using the results of steps 2) and 3), apply the weighted naive Bayes formula [formula image 074] to obtain the probability that the attributes belong to each category, compare these probabilities, and select the category with the largest one as the classification category of the service resource represented by the n-dimensional feature vector, thereby completing the classification of the service resource.
2. The method for providing network service resources according to claim 1, characterized in that: in the weighted naive Bayes formula [formula image 074] of step 4), [formula image 076] denotes the class label obtained after the service resource is classified;
[formula image 014] denotes the posterior probability that the n-dimensional feature vector belongs to a given class [formula image 012], [formula image 078], where [formula image 080] is the number of training samples of class [formula image 012] that have the value [formula image 010] in attribute [formula image 082], and [formula image 084] is the number of training samples in [formula image 012].
3. The method for providing network service resources according to claim 1, characterized in that: in step 1), several feature vectors of the service resource X to be classified are extracted, the feature vectors comprising file name, file extension, text content, and file size.
4. The method for providing network service resources according to claim 1, characterized in that: the retrieval scheme is provided according to the user's interests as follows: first, the total number of documents in the document collection D is defined as N, and any document belonging to D can be expressed as a t-dimensional vector:
[formula image 086]
where t is the number of index terms, and the vector component [formula image 088] is the weight of the i-th index term [formula image 090] in document [formula image 092]; retrieval is then performed according to the user's interests, in the following steps:
Step 1. obtain the user's interest information, then represent it formally as a vector or a graph, which forms the user interest profile;
Step 2. represent user interest by means of a classification directory, and map the directory onto a tree structure to form the user interest tree, in which each node represents a category and the node's weight represents the user's degree of interest in that category;
Step 3. use the two-tuple interest vector [formula image 094] to represent the user interest profile; the interest profile library formed by the interest profile of user i is then represented as:
[formula image 096]
in the formula, [formula image 098] represents a category in the classification directory;
[formula image 100] is the weight of [formula image 098] in the user interest tree and represents the user's degree of interest in [formula image 102], [formula image 104];
Step 4. from the formula [formula image 096], obtain the categories to which a document [formula image 106] belongs and calculate from them the interest correlation factor J of document [formula image 106], where J equals the sum of the weights of all categories to which the document belongs;
Step 5. extract the query vector [formula image 108] from the user's retrieval request, then calculate the similarity between the query vector [formula image 108] and the document vector [formula image 092] using the formula from matrix analysis for the similarity between two vectors in a space, record this as the document's correlation factor I, and take the m documents with the highest I values;
Step 6. extract the user's interest profile [formula image 110] from the user interest profile library [formula image 110], then, using the interest correlation factor J and the document correlation factor I obtained in Steps 4 and 5, calculate the interest similarity between document [formula image 092] and the query vector [formula image 108] with the following formula:
[formula image 112]
in the formula, [formula image 114] is the interest weight, [formula image 116], which reflects how strongly the document's interest correlation factor influences the result;
Step 7. sort the m documents by the interest similarity SCOREi obtained in Step 6 and display them on the interface, so that the documents among these m that relate to the user's interests are recommended first;
Step 8. track and record the user's access to the retrieval results, and use this record to update the user's interest profile library.
5. The method for providing network service resources according to claim 4, characterized in that: the user's interest profile library is updated in Step 8 as follows:
① initialize the user interest tree so that each node has an original weight [formula image 118] (where 0<k<n+1), this value representing the number of times the user has accessed all documents under the node;
② keep the leaf-node weights unchanged and recalculate the weight of each non-leaf node:
[formula image 120]
where [formula image 122] is a child node of the non-leaf node and x is the number of its child nodes;
a leaf node is a smallest classification category in the user interest tree, and a non-leaf node is a classification category that has sub-categories;
③ if the user accesses documents in some nodes, repeat the two steps above;
④ update the user interest profile from the non-leaf-node weights updated in step ②:
[formula image 124]
in the formula, [formula image 126], [formula image 128], [formula image 130] is the total number of nodes in the interest tree, and [formula image 110] is then the user's personal interest profile.
Application CN201410015647.1A | Priority date 2014-01-14 | Filing date 2014-01-14 | Method for providing network service resources | Pending | CN103778206A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201410015647.1A | 2014-01-14 | 2014-01-14 | Method for providing network service resources

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201410015647.1A | 2014-01-14 | 2014-01-14 | Method for providing network service resources

Publications (1)

Publication Number | Publication Date
CN103778206A (en) | 2014-05-07

Family

ID=50570441

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201410015647.1A (Pending; CN103778206A (en)) | Method for providing network service resources | 2014-01-14 | 2014-01-14

Country Status (1)

Country | Link
CN (1) | CN103778206A (en)



Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication
RJ01 | Rejection of invention patent application after publication

Application publication date: 2014-05-07


©2009-2025 Movatter.jp