CN103793504B - A clustering initial point selection method based on user preference and item attributes - Google Patents

A clustering initial point selection method based on user preference and item attributes

Info

Publication number
CN103793504B
Authority
CN
China
Prior art keywords
item
point
cluster
similarity
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410035844.XA
Other languages
Chinese (zh)
Other versions
CN103793504A (en)
Inventor
宿红毅
王彩群
闫波
郑宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology (BIT)
Priority to CN201410035844.XA
Publication of CN103793504A
Application granted
Publication of CN103793504B
Status: Expired - Fee Related
Anticipated expiration


Abstract

The present invention relates to a clustering initial point selection method based on user preference and item attributes, and belongs to the field of machine learning. An item-based similarity matrix and a co-occurrence matrix based on user preferences are determined first, and the final similarity matrix is obtained from these two matrices; edge points are then removed and the initial cluster center points are selected, completing the selection of the initial center points. The present invention can effectively improve the clustering effect.

Description

Translated from Chinese
A clustering initial point selection method based on user preference and item attributes

Technical Field

The present invention relates to a clustering initial point selection method based on user preference and item attributes, and belongs to the field of machine learning.

Background Art

Clustering is an unsupervised learning method that partitions data objects into several classes or clusters according to a defined similarity, so that objects within the same cluster are highly similar to one another while objects in different clusters differ considerably. Cluster analysis is now applied very widely, including in statistics, machine learning, image segmentation, and data mining. The main clustering algorithms are divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods, and partitioning algorithms are the workhorse of cluster analysis in practice. A partitioning algorithm requires the number of clusters or the cluster centers to be specified in advance; through repeated iterations it gradually reduces the error of the objective function, and the final clustering result is obtained when the objective function value converges. Partitioning algorithms are simple, fast, and able to handle large data sets effectively, but they are computationally demanding, sensitive to the input order of the data, and require the number of clusters or the cluster centers to be specified beforehand. The initial cluster centers strongly affect the clustering result: if they are chosen poorly, the result may get stuck in a local optimum and a good clustering cannot be obtained. There are many methods for selecting the initial cluster centers of a partitioning algorithm, the main ones being the following:

Random selection: randomly choose k data points as the initial cluster centers;

Empirical selection: based on experience and the properties of the individual points, choose k representative points as the initial cluster centers;

Recursive selection: first compute the mean of all data samples and take this point as the first cluster center; then take the point farthest from the first center as the second cluster center, and so on, until the data sample farthest from the (k-1)-th cluster center becomes the last cluster center (a sketch of this family of methods follows the list);

Density-estimation selection: compute the density of every data sample within a given radius; the point with the highest density becomes the first cluster center, and the remaining initial centers are chosen in turn: the point with the next-highest density becomes the second initial cluster center if its distance from the first center exceeds a given value, and k centers are selected in this way;

Distance-optimized selection: centers are chosen according to the maximum-minimum distance criterion;

Genetic-algorithm selection: a genetic algorithm is used to compute the initial cluster centers; and so on.
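As a minimal sketch of the distance-based initialization described above (an illustration, not the patent's method), the variant below measures each candidate's distance to the nearest already-chosen center, which is the common maximin formulation:

```python
import numpy as np

def maximin_init(X: np.ndarray, k: int) -> np.ndarray:
    """Maximin-distance initialization: the first center is the mean of all
    samples; each further center is the point farthest from the centers
    chosen so far (distance measured to the nearest existing center)."""
    centers = [X.mean(axis=0)]
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    return np.vstack(centers)

if __name__ == "__main__":
    X = np.random.rand(200, 3)
    print(maximin_init(X, k=5).shape)  # -> (5, 3)
```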

Because the initial cluster centers strongly affect the clustering result, a poor choice may cause the result to get stuck in a local optimum, and a good clustering cannot be obtained. In order to obtain appropriate initial cluster centers and keep the clustering result from falling into a local optimum, this patent proposes a new method for selecting the initial cluster centers.

Summary of the Invention

The purpose of the present invention is to solve the problem of selecting the initial center points of partition-based algorithms by using the users' preference information and the item attributes to construct a similarity matrix, from which the initial center points are obtained.

The technical solution of the present invention is realized as follows:

Step 1. Determine the item-based similarity matrix.

Define the feature vector of an item: itemi = (p1, p2, ..., pm), where m is the number of attributes of the item and pi (1 ≤ i ≤ m) is the value of the item's i-th feature. Each item can then be represented by a vector itemi = (w1, w2, ..., wm), whose dimension m equals the number of attribute features of the item. The similarity between itemi and itemj is then expressed by computing the distance Aij between the vectors representing the items, which yields the similarity matrix A = (Aij) of size n×n, where n is the number of items.

The similarity between item u and item v can be obtained from a distance computed with any of the following: Pearson-correlation distance, Euclidean distance, cosine distance, Spearman distance, or Tanimoto-correlation distance.
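A minimal sketch of Step 1, assuming Euclidean distance and the common conversion similarity = 1/(1 + distance) so that larger values mean more similar (the conversion is an assumption, not stated in the patent):

```python
import numpy as np

def item_similarity_matrix(F: np.ndarray) -> np.ndarray:
    """F is an n x m matrix whose i-th row is item_i = (w1, ..., wm).
    Returns the n x n matrix A with A[i, j] = similarity of item_i and item_j,
    here derived from the Euclidean distance as 1 / (1 + distance)."""
    sq = np.sum(F * F, axis=1)
    # squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * F @ F.T, 0.0)
    return 1.0 / (1.0 + np.sqrt(d2))

if __name__ == "__main__":
    F = np.random.rand(100, 3)      # 100 items, 3 attribute features each
    A = item_similarity_matrix(F)
    print(A.shape, A[0, 0])         # (100, 100), 1.0 on the diagonal
```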

Step 2. Determine the co-occurrence matrix based on user preferences.

Define a user's preference record for an item: prefs = (user_id, item_id, pref), where pref is the user's rating of the item; the ratings of all users form the rating list prefs. The co-occurrence matrix B = (Bij) of size n×n is formed by counting the number of times Bij that itemi and itemj appear together in the same user's preference list.
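A sketch of Step 2, assuming item ids are 0-based integer indices (a simplification; real ids would first be mapped to indices):

```python
from collections import defaultdict
import numpy as np

def cooccurrence_matrix(prefs, n_items: int) -> np.ndarray:
    """prefs is an iterable of (user_id, item_id, pref) triples.
    Returns the n x n matrix B where B[i, j] counts how many users have both
    item i and item j in their preference list."""
    items_of_user = defaultdict(set)
    for user_id, item_id, _pref in prefs:
        items_of_user[user_id].add(item_id)
    B = np.zeros((n_items, n_items))
    for items in items_of_user.values():
        idx = list(items)
        for a in idx:
            for b in idx:
                B[a, b] += 1.0
    return B

if __name__ == "__main__":
    prefs = [(1, 0, 5.0), (1, 2, 3.0), (2, 0, 4.0), (2, 2, 2.0), (2, 3, 1.0)]
    print(cooccurrence_matrix(prefs, n_items=4)[0, 2])  # items 0 and 2 co-occur for two users -> 2.0
```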

Step 3. Determine the final similarity matrix.

The final similarity matrix is defined as TS = αA + βB, where α and β are user-defined weights.
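Step 3 is then a weighted sum of the two matrices. The sketch below adds an optional rescaling of B, which the patent does not mention but which keeps the two terms on comparable scales:

```python
import numpy as np

def final_similarity_matrix(A: np.ndarray, B: np.ndarray,
                            alpha: float = 0.5, beta: float = 0.5,
                            rescale_b: bool = False) -> np.ndarray:
    """TS = alpha * A + beta * B, with alpha and beta the user-defined weights.
    rescale_b is a hypothetical extra option (not in the patent) that maps the
    raw co-occurrence counts into [0, 1] before combining."""
    if rescale_b and B.max() > 0:
        B = B / B.max()
    return alpha * A + beta * B
```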

Step 4. Remove edge points.

In each row of TS, count the number of items whose similarity is greater than a given threshold θ and denote it αi. If αi is smaller than a given threshold μ, the point is an edge point, and the row and column representing this item are deleted from the similarity matrix, thereby removing the edge point. After all rows have been traversed and all edge points removed, the similarity matrix is obtained again.
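A sketch of the edge-point removal of Step 4; theta may be a scalar or a per-row array of thresholds (the embodiment below uses 0.2 times each row's maximum), and mu is the minimum count a row must reach to survive:

```python
import numpy as np

def remove_edge_points(TS: np.ndarray, theta, mu: float):
    """Count, per row of TS, the similarities above theta; rows (and matching
    columns) whose count alpha_i falls below mu are edge points and are
    dropped. Returns the reduced matrix and the indices of the kept items."""
    theta = np.asarray(theta, dtype=float)
    if theta.ndim == 1:
        theta = theta[:, None]          # per-row thresholds broadcast over columns
    counts = np.sum(TS > theta, axis=1)
    keep = np.where(counts >= mu)[0]
    return TS[np.ix_(keep, keep)], keep
```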

Step 5. Select the initial cluster center points:

(1) In the similarity matrix obtained in Step 4, find the maximum similarity, take the center point of the two points having this maximum similarity as a cluster center, and record it in Cluster[]; compute the distance from each of the two points to their center point, find the point with the larger distance, and delete the row and column of that point from the similarity matrix to obtain a new similarity matrix;

(2) Find the maximum similarity in the resulting similarity matrix and compute, in turn, the distances from the two points having this maximum similarity to all initial cluster centers in Cluster[]. If some distance is smaller than a given threshold ω, merge the point into the cluster with the smallest distance and recompute that cluster's center; otherwise, if no distance is smaller than ω, the point becomes a new cluster center and is added to Cluster[] as another initial center. Then delete the rows and columns represented by the two points of maximum similarity to obtain a new similarity matrix. Iterate until the number of cluster centers equals k.

The distance from an item to a cluster center can be computed with any of the following: Pearson-correlation distance, Euclidean distance, cosine distance, Spearman distance, or Tanimoto-correlation distance.

The above operations complete the selection of the initial center points.
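The center-selection loop of Step 5 can be sketched as below. The patent leaves some details open, so this is an assumption-laden sketch: the "center point of two points" is taken as the midpoint of their feature vectors, merged centers are updated by averaging with the new member rather than recomputing over all members, and Euclidean distance is used throughout.

```python
import numpy as np

def select_initial_centers(TS: np.ndarray, F: np.ndarray, k: int, omega: float) -> np.ndarray:
    """TS: similarity matrix after edge-point removal; F: matching item-feature
    matrix; k: number of centers wanted; omega: merge threshold."""
    TS, F = TS.copy().astype(float), F.copy()
    np.fill_diagonal(TS, -np.inf)                 # ignore self-similarity
    centers = []

    def most_similar_pair():
        return np.unravel_index(int(np.argmax(TS)), TS.shape)

    def drop(rows):
        nonlocal TS, F
        keep = np.setdiff1d(np.arange(TS.shape[0]), rows)
        TS, F = TS[np.ix_(keep, keep)], F[keep]

    # (1) first center: midpoint of the most similar pair; the point lying
    #     farther from that midpoint is removed from the matrix.
    i, j = most_similar_pair()
    center = (F[i] + F[j]) / 2.0
    centers.append(center)
    farther = i if np.linalg.norm(F[i] - center) >= np.linalg.norm(F[j] - center) else j
    drop([farther])

    # (2) repeatedly take the most similar remaining pair: each of its points
    #     is merged into the nearest existing center if closer than omega,
    #     otherwise it becomes a new center; then both points are dropped.
    while len(centers) < k and TS.shape[0] >= 2:
        i, j = most_similar_pair()
        for p in (i, j):
            d = np.array([np.linalg.norm(F[p] - c) for c in centers])
            if d.min() < omega:
                nearest = int(d.argmin())
                centers[nearest] = (centers[nearest] + F[p]) / 2.0  # simplified re-centering
            else:
                centers.append(F[p].copy())
        drop([i, j])
    return np.vstack(centers[:k])
```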

Beneficial Effects

The present invention improves the clustering effect by proposing an initial point selection method based on user preference information and item attributes.

Brief Description of the Drawings

Figure 1 is a schematic flowchart of an implementation of the present invention.

Detailed Description

The specific embodiments of the present invention are described in further detail below through an example.

A certain site has 1000 users and 5000 movies, and each movie has three attributes: title, release year, and category. The clustering algorithm based on the improved similarity matrix is used to group the items of this site into 20 clusters. The specific flow of the clustering initial point selection method based on user preferences and item attributes is shown in Figure 1:

According to Step 1: determine the item-based similarity matrix.

Define the feature vector of a movie: itemi = (p1, p2, p3), where pi (1 ≤ i ≤ 3) is the value of the item's i-th feature. Each movie is first represented by a 3-dimensional vector itemi = (w1, w2, w3), where wi (1 ≤ i ≤ 3) is the value of the item's i-th feature. The similarity between itemi and itemj is then expressed by computing the distance Aij between the vectors representing the items, which yields the similarity matrix A.

The similarity between item u and item v is obtained by computing the Euclidean distance.

According to Step 2: determine the co-occurrence matrix based on user preferences.

Define a user's preference record for an item: prefs = (user_id, item_id, pref), where pref is the user's rating of the item; the ratings of all users form the rating list prefs. The co-occurrence matrix B is formed by counting, for every pair of items, the number of times Bij that itemi and itemj appear together in the same user's preference list.

According to Step 3: determine the final similarity matrix.

The final similarity matrix is defined as TS = αA + βB, where α and β are both 0.5.

According to Step 4: remove edge points.

In each row of TS, count the number of items whose similarity is greater than a given threshold θ (θ is defined as 0.2 times the maximum similarity in that row) and denote it αi. If αi is smaller than a given threshold μ (μ is defined as 0.001·N, where N is the total number of points to be clustered, i.e. 5000), the point is an edge point, and the row and column representing this item are deleted from the similarity matrix, thereby removing the edge point. After all rows have been traversed and all edge points removed, the similarity matrix is obtained again.

According to Step 5: select the initial center points.

(1): In the similarity matrix obtained in Step 4, find the maximum similarity, i.e. the largest value in the whole matrix, take the center point of the two points having this maximum similarity as a cluster center, and record it in Cluster[]. Compute the distance from each of the two points to their center point and find the point with the larger distance. Then find the minimum similarity, i.e. the smallest value in the whole matrix, and compute the distance between the two points having this minimum similarity; this value is denoted distance. Delete the row and column of the point with the larger distance from the similarity matrix to obtain a new similarity matrix;

(2): Find the maximum similarity in the resulting similarity matrix and compute, in turn, the distances from the two points having this maximum similarity to all initial cluster centers in Cluster[]. If some distance is smaller than the given threshold ω (ω is distance/20*2, where distance is the value obtained in step (1)), merge the point into the cluster with the smallest distance and recompute that cluster's center; otherwise, if no distance is smaller than ω, the point becomes a new cluster center and is added to Cluster[] as another initial center. Then delete the rows and columns represented by the two points of maximum similarity to obtain a new similarity matrix. Iterate these steps until the number of cluster centers equals 20.

The distance from an item to a cluster center is computed as the Euclidean distance.
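Putting the embodiment together, a hypothetical end-to-end run might look like the following. It reuses the helper functions sketched under Steps 1-5 above (item_similarity_matrix, cooccurrence_matrix, final_similarity_matrix, remove_edge_points, select_initial_centers — all assumed names defined in the same module), scales the data down from the 5000 movies of the example to keep the demo light, and uses a fixed omega instead of the distance-derived value of step (1):

```python
import numpy as np

n_items, n_users, k = 500, 1000, 20                  # the embodiment uses 5000 movies; 500 keeps this demo small
rng = np.random.default_rng(0)
F = rng.random((n_items, 3))                         # stand-in for (title, year, category) features
prefs = [(u, int(rng.integers(n_items)), 5.0)        # stand-in ratings: 30 per user
         for u in range(n_users) for _ in range(30)]

A = item_similarity_matrix(F)                                        # Step 1 (Euclidean-distance based)
B = cooccurrence_matrix(prefs, n_items)                              # Step 2
TS = final_similarity_matrix(A, B, alpha=0.5, beta=0.5,
                             rescale_b=True)                         # Step 3
theta = 0.2 * TS.max(axis=1, keepdims=True)                          # Step 4: 0.2 x each row's maximum
mu = 0.001 * n_items                                                 # mu = 0.001 * N
TS_reduced, kept = remove_edge_points(TS, theta, mu)
centers = select_initial_centers(TS_reduced, F[kept], k, omega=0.1)  # Step 5
print(centers.shape)                                                 # at most (k, 3), i.e. up to (20, 3)
```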

Claims (1)

Translated from Chinese
1. A clustering initial point selection method based on user preference and item attributes, characterized in that:
Step 1. Determine the item-based similarity matrix. Define the feature vector of an item: itemi = (p1, p2, ..., pm), where m is the number of attributes of the item and pr (1 ≤ r ≤ m) is the value of the item's r-th feature; each item can then be represented by a vector itemi = (w1, w2, ..., wm), whose dimension m equals the number of attribute features of the item, and wm is the value of the m-th attribute feature; the similarity between itemi and itemj is then expressed by computing the distance Aij between the vectors representing the items, which yields the similarity matrix A = (Aij) of size n×n, where itemj is the j-th item and n is the number of items;
Step 2. Determine the co-occurrence matrix based on user preferences. Define a user's preference record for an item: prefs = (user_id, item_id, pref), where pref is the user's rating of the item, and the ratings of all users form the rating list prefs; the co-occurrence matrix B = (Bij) of size n×n is formed by counting the number of times Bij that itemi and itemj appear together in the same user's preference list;
Step 3. Determine the final similarity matrix: TS = αA + βB, where α and β are user-defined weights;
Step 4. Remove edge points. In each row of TS, count the number of items whose similarity is greater than a given threshold θ and denote it αq; if αq is smaller than a given threshold μ, the point is an edge point, and the row and column representing this item are deleted from the similarity matrix, thereby removing the edge point; after all rows have been traversed and all edge points removed, the similarity matrix is obtained again;
Step 5. Select the initial cluster center points, which specifically includes:
(1) In the obtained similarity matrix, find the maximum similarity, take the center point of the two points having this maximum similarity as a cluster center, and record it in Cluster[]; compute the distance from each of the two points to their center point, find the point with the larger distance, and delete the row and column of that point from the similarity matrix to obtain a new similarity matrix;
(2) Find the maximum similarity in the resulting similarity matrix and compute, in turn, the distances from the two points having this maximum similarity to all initial cluster centers in Cluster[]; if some distance is smaller than a given threshold ω, merge the point into the cluster with the smallest distance and recompute that cluster's center; otherwise, if no distance is smaller than ω, the point becomes a new cluster center and is added to Cluster[] as another initial center; then delete the rows and columns represented by the two points of maximum similarity to obtain a new similarity matrix; iterate until the number of cluster centers equals k.
CN201410035844.XA | Priority 2014-01-24 | Filed 2014-01-24 | A clustering initial point selection method based on user preference and item attributes | Expired - Fee Related | CN103793504B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201410035844.XA (CN103793504B, en) | 2014-01-24 | 2014-01-24 | A clustering initial point selection method based on user preference and item attributes

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201410035844.XA (CN103793504B, en) | 2014-01-24 | 2014-01-24 | A clustering initial point selection method based on user preference and item attributes

Publications (2)

Publication Number | Publication Date
CN103793504A (en) | 2014-05-14
CN103793504B (en) | 2018-02-27

Family

Family ID: 50669170

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201410035844.XA (CN103793504B, Expired - Fee Related) | A clustering initial point selection method based on user preference and item attributes | 2014-01-24 | 2014-01-24

Country Status (1)

Country | Link
CN (1) | CN103793504B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108268876A (en)* | 2016-12-30 | 2018-07-10 | 广东精点数据科技股份有限公司 | A detection method and device for approximately duplicate records based on clustering
CN110413854A (en)* | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Method for selecting clustering initial points based on user behavior characteristics and related equipment
CN110838123B (en)* | 2019-11-06 | 2022-02-11 | 南京止善智能科技研究院有限公司 | A segmentation method for lighting highlight areas of interior design effect images
CN114201999A (en)* | 2020-08-31 | 2022-03-18 | 中国移动通信集团浙江有限公司 | Identification method, system, computing device and storage medium of abnormal account

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101149759A (en)* | 2007-11-09 | 2008-03-26 | 山西大学 | A K-means initial clustering center selection method based on a neighborhood model
CN102937985A (en)* | 2012-10-25 | 2013-02-20 | 南京理工大学 | Method for classifying, optimizing and analyzing websites based on a user mental model
CN103440275A (en)* | 2013-08-08 | 2013-12-11 | 南京邮电大学 | Prim-based K-means clustering method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8229729B2 (en)* | 2008-03-25 | 2012-07-24 | International Business Machines Corporation | Machine translation in continuous space


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Xiao Qiang et al., "Design and Implementation of a Distributed Collaborative Filtering Algorithm in the Hadoop Environment" (Hadoop环境下的分布式协同过滤算法设计与实现), 《现代图书情报技术》, No. 1, 2013-01-31, pp. 83-89.*
Lei Zhen, "Research on Personalized Recommendation Algorithms Based on Clustering" (基于聚类的个性化推荐算法研究), 《中国优秀硕士学位论文全文数据库 信息科技辑》, No. 01, 2014-01-15, I138-1600.*
Tang Hanqing et al., "Application of an Improved K-means Algorithm in Network Public Opinion Analysis" (改进的K-means算法在网络舆情分析中的应用), 《计算机系统应用》, Vol. 20, No. 3, 2011-03, pp. 165-168, 196.*

Also Published As

Publication number | Publication date
CN103793504A (en) | 2014-05-14

Similar Documents

Publication | Publication Date | Title
CN108733798B (en)Knowledge graph-based personalized recommendation method
CN107330451A (en)Clothes attribute retrieval method based on depth convolutional neural networks
US20160283533A1 (en)Multi-distance clustering
WO2019015246A1 (en)Image feature acquisition
CN109635140B (en) An Image Retrieval Method Based on Deep Learning and Density Peak Clustering
CN103106279A (en)Clustering method simultaneously based on node attribute and structural relationship similarity
WO2016066042A1 (en)Segmentation method for commodity picture and device thereof
CN111754345A (en) A Bitcoin Address Classification Method Based on Improved Random Forest
CN105373597A (en)Collaborative filtering recommendation method for user based on k-medoids project clustering and local interest fusion
CN109471982B (en)Web service recommendation method based on QoS (quality of service) perception of user and service clustering
CN107180093A (en)Information search method and device and ageing inquiry word recognition method and device
WO2018166273A1 (en)Method and apparatus for matching high-dimensional image feature
CN106649877A (en)Density peak-based big data mining method and apparatus
CN103793504B (en)A kind of cluster initial point system of selection based on user preference and item attribute
CN109034953B (en)Movie recommendation method
CN115546538A (en)Three-dimensional model classification method based on point cloud and local shape features
CN102722578B (en)Unsupervised cluster characteristic selection method based on Laplace regularization
EP3452916A1 (en)Large scale social graph segmentation
CN106845462A (en) A Face Recognition Method Based on Simultaneous Selection of Features and Clustering Induced by Triplets
CN104463864B (en)Multistage parallel key frame cloud extracting method and system
WO2020147259A1 (en)User portait method and apparatus, readable storage medium, and terminal device
CN103955524A (en)Event-related socialized image searching algorithm based on hypergraph model
CN107391594B (en)Image retrieval method based on iterative visual sorting
CN110309424A (en) A social recommendation method based on rough clustering
CN110968793A (en)User cold start recommendation algorithm based on collaborative filtering mixed filling

Legal Events

Code | Title
C06, PB01 | Publication
C10, SE01 | Entry into substantive examination / Entry into force of request for substantive examination
GR01 | Patent grant (granted publication date: 2018-02-27)
CF01 | Termination of patent right due to non-payment of annual fee (termination date: 2020-01-24)
