Technical Field
The present invention relates to clustering and data cleaning techniques, and in particular to a similar duplicate record detection optimization method based on the DBSCAN algorithm.
Background Art
With the advent of the big data era, the volume of data is growing rapidly. Big data processing in the Internet environment is no longer limited to data collection; the data must also be analyzed and processed to extract the value hidden behind it, so obtaining clean data is particularly important. With data now collected in many different ways, data cleaning has become an indispensable stage: ample evidence shows that in data mining systems, data preprocessing accounts for 60% to 80% of the total workload (Zhang Yan, Han Feng. A Study of Duplicate Elimination in Data Cleaning. International Computer Science and Technology Conference 2007, Ningbo, China, 2007.5: 141-144). After multiple data sources are merged, differences in representation schemes and conventions, together with input errors, inconsistent abbreviations and other factors, mean that the merged data may contain records that are described differently but refer to the same entity; these are called similar duplicate records. Following the principle of data consistency, such similar duplicate records must be detected and removed.
Given the closeness of similar duplicate records and the respective strengths and weaknesses of clustering algorithms, the density-based DBSCAN algorithm is adopted to detect similar duplicate records. It does not require the user to set the number of clusters in advance, can cluster dense data sets of arbitrary shape, identifies outliers during clustering, and adapts well to large data sets (ESTER M, KRIEGEL H P, SANDER J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]//Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press, 1996: 226-231). The traditional DBSCAN algorithm, however, is overly sensitive to human choices in the selection of the initial point and the parameter settings, which leads to uneven density clustering: huge clusters are formed while low-density clusters are ignored. In 2016, DAI Yangyang et al. (DAI Yangyang, LI Chaofeng, XU Hua, et al. Density spatial clustering algorithm with initial point optimization and parameter self-adaption[J]. Computer Engineering, 2016, 42(1): 203-209) proposed the OS-DBSCAN algorithm, which optimizes the selection of the initial point and adaptively computes the Eps and MinPts values from the characteristics of the data set. However, it introduces three parameters, the number of clusters k, the density parameter α and the magnification parameter β, which not only fails to reduce human intervention but adds further complexity.
The most difficult and critical problem at present is how to optimize the initial point and select the parameters adaptively, reducing the influence of manually set parameters while also lowering the complexity of the algorithm, so that similar duplicate record detection scales to large data sets with reduced running time. Therefore, improving the detection accuracy and timeliness of similar duplicate record detection on the basis of an improved DBSCAN density clustering algorithm is of high research value.
Summary of the Invention
The purpose of the present invention is to apply an improved DBSCAN clustering algorithm to the detection of similar duplicate records in data sets, so as to improve the accuracy and timeliness of similar duplicate record detection. To this end, a similar duplicate record detection optimization method based on the DBSCAN algorithm is proposed.
The technical solution adopted by the present invention is as follows: the similar duplicate record detection optimization method based on the DBSCAN algorithm comprises the following steps:
Step 1, normalize the data in the acquired data set, mainly to perform a preliminary partition of the data set; Step 2, introduce an initial point optimization and parameter self-adaptation method into the DBSCAN algorithm to mitigate uneven clustering of similar duplicate records; Step 3, apply the N-Gram algorithm to perform secondary clustering on the similar duplicate record clusters obtained in Step 2, effectively improving the detection accuracy; Step 4, train and test the method constructed above on a student data set to verify its effectiveness.
Further, said Step 1 specifically comprises:
First, a Chinese word segmentation system is introduced to split the character-type data in the data set into fields, and the numeric data is normalized; the rank-weight conversion method is then applied to the fields obtained from the above segmentation, and a priority queue algorithm is used to perform a preliminary partition of the data set.
Further, the specific sub-steps of Step 1 are:
Step 1.1, use the Chinese word segmentation system ICTCLAS to segment the character-type data in the data set, build an inverted index over the segmentation results, and count the frequency with which each term appears in the data set; convert and normalize the numeric data, handling unrecognizable character strings, punctuation with identifying meaning, and related masking symbols such as "*";
Step 1.2, rank the fields according to term frequency and importance, and convert the ranks into corresponding weights using the rank-weight conversion method;
Step 1.3, according to the field weights, apply the priority queue algorithm to obtain the preliminary partition of the data set.
Further, the specific process of Step 2 is as follows:
Step 2.1, initial point optimization: since the initial clustering point is normally an arbitrary sample drawn from the data set that starts a cluster once it is judged to be a core point, the object densities between samples are computed and the samples are sorted by object density in descending order; the sample with the highest global density in the data set is found and taken as the initial clustering point of the DBSCAN algorithm. First, the pairwise distances between all data points are computed and sorted in ascending order; the smallest percentage (percent) of this distance sequence is then taken; finally, the data point that appears most often among these smallest distances is identified, and that point is the initial point of the current cluster. In other words, for a circle whose radius is the distance value corresponding to the percent ratio, the circle centered at the initial point contains the most data points, i.e. the initial point is the point with the highest global density under the percent condition.
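For illustration only, the following is a minimal sketch of the percent-based densest-point selection described in Step 2.1, assuming plain Euclidean distances on a NumPy array; the function and parameter names are hypothetical and not taken from the patent.

```python
import numpy as np

def pick_initial_point(X, percent=0.02):
    """Return the index of the sample that occurs most often among the
    smallest `percent` fraction of all pairwise distances (Step 2.1 sketch)."""
    n = len(X)
    i_idx, j_idx = np.triu_indices(n, k=1)           # all unordered point pairs
    dists = np.linalg.norm(X[i_idx] - X[j_idx], axis=1)
    order = np.argsort(dists)                        # sort all distances ascending
    keep = max(1, int(len(order) * percent))         # smallest `percent` of the sequence
    smallest = order[:keep]
    # count how often each point participates in one of these smallest distances
    counts = np.bincount(np.concatenate([i_idx[smallest], j_idx[smallest]]), minlength=n)
    return int(np.argmax(counts))                    # densest point under the percent condition
```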
Step 2.2, adaptive selection of the parameters of the DBSCAN clustering algorithm based on function fitting; the main procedure of the adaptive parameter selection method is as follows:
Step 2.2.1, compute the distances between samples. The Euclidean distance is used for numeric data, d(x,y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)^2), where n denotes the dimension of the Euclidean space, x and y denote two sample points, x_i (i = 1, 2, …, n) denotes the i-th coordinate of sample point x, and d(x,y) denotes the distance between the two points x and y;
For character-type data, a distance metric over symbolic attributes is used, where the data set X has n samples x1, x2, x3, …, xn, each sample has A features, denoted a1, a2, a3, …, aA, and all of these features are symbolic attributes;
Step 2.2.2, exploit the triangle inequality property, i.e. that the sum of two sides is greater than the third side, to avoid some unnecessary distance computations and comparisons:
d(C1,C2)≤d(b,C1)+d(b,C2)
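As an illustration of how this inequality can prune distance computations, the sketch below assumes a fixed reference point b with precomputed distances to all samples and uses the reverse form |d(b,p) − d(b,q)| ≤ d(p,q) implied by the stated inequality; the details are assumptions, since the text only states that the inequality reduces unnecessary comparisons.

```python
def eps_neighbors_pruned(X, p, eps, dist_to_ref, dist_fn):
    """Eps-neighborhood of sample index p, skipping pairs ruled out by the
    reverse triangle inequality |d(b,p) - d(b,q)| <= d(p,q)."""
    neighbors = []
    for q in range(len(X)):
        if q == p:
            continue
        if abs(dist_to_ref[p] - dist_to_ref[q]) > eps:
            continue                        # d(p,q) must already exceed eps; skip the exact computation
        if dist_fn(X[p], X[q]) <= eps:      # compute the exact distance only when needed
            neighbors.append(q)
    return neighbors
```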
Step 2.2.3, since the DBSCAN algorithm with Euclidean distance has difficulty describing the latent relationships in data with complex structure, the geodesic distance is used instead of the Euclidean distance. The geodesic distance characterizes the shortest distance between two given points on a connected surface; it represents the true distance between samples on the manifold structure and reflects the global consistency of the data distribution. The specific computation steps are as follows:
First, assume that the given data set X = (x1, x2, ..., xn) lies in a D-dimensional space, where n = |X| denotes the number of samples in X. For any two samples x and y in X, the k nearest neighbors of x are computed from the Euclidean distance: N_k(x) is taken to be the set of the k samples of X with the smallest Euclidean distance d(x,y) to x, where k takes values in the range 1 ≤ k ≤ n−1;
The geodesic distance is then initialized: when sample y is one of the k nearest neighbors of sample x, the Euclidean distance between the two sample points is taken as the geodesic distance, d_G(x,y) = d(x,y); if sample y is not a k nearest neighbor of sample x, the shortest path distance between x and y is used to represent the geodesic distance, computed as d_G(x,y) = min_z { d_G(x,z) + d_G(z,y) }, where d_G(x,y) denotes the geodesic distance between the sample points and min denotes taking the shortest such distance.
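A possible realization of this geodesic-distance computation, assuming an Isomap-style construction (a k-nearest-neighbor graph weighted by Euclidean distance followed by graph shortest paths), is sketched below; the SciPy-based implementation is illustrative rather than the patented procedure.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=10):
    """Isomap-style geodesic distances: k-NN graph plus graph shortest paths."""
    d = cdist(X, X)                              # pairwise Euclidean distances
    n = len(X)
    rows, cols, vals = [], [], []
    for i in range(n):
        nn = np.argsort(d[i])[1:k + 1]           # k nearest neighbors of sample i (skip itself)
        rows.extend([i] * len(nn))
        cols.extend(nn.tolist())
        vals.extend(d[i, nn].tolist())           # Euclidean distance initializes the k-NN edges
    knn_graph = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # shortest paths over the undirected k-NN graph give the geodesic distances
    # (pairs in disconnected components come back as inf)
    return shortest_path(knn_graph, method="D", directed=False)
```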
Step 2.2.4, compute the object density between samples. Separate object-density formulas are used for numeric data and for character-type data, where A is the attribute set and U is the data set; the object density takes a maximum value of 0 and a minimum value of −|A|, that is, −|A| ≤ Dens(x) ≤ 0.
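Since the density formulas themselves are not reproduced here, the following sketch merely illustrates one mismatch-count density over symbolic attributes that satisfies the stated bounds −|A| ≤ Dens(x) ≤ 0; the exact formula used by the method may differ.

```python
def object_density(x, dataset, attributes):
    """Mismatch-count density over symbolic attributes; lies in [-|A|, 0].
    `x` and every record in `dataset` are dicts mapping attribute name -> value."""
    total_mismatch = 0.0
    for y in dataset:
        total_mismatch += sum(1 for a in attributes if x[a] != y[a])
    return -total_mismatch / len(dataset)    # 0 when x matches everyone, -|A| when it matches no one
```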
Step 2.2.5, for each sample, introduce a statistical method and plot the relationship between Eps and MinPts;
Step 2.2.6, fit the relationship between Eps and MinPts with a higher-degree polynomial by the least squares method.
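A minimal sketch of such a least-squares polynomial fit, assuming the per-sample Eps and MinPts statistics from the previous step are available as arrays and that the polynomial degree is chosen by the user, could look as follows.

```python
import numpy as np

def fit_eps_minpts(eps_values, minpts_values, degree=3):
    """Least-squares polynomial fit of MinPts as a function of Eps."""
    coeffs = np.polyfit(eps_values, minpts_values, degree)
    return np.poly1d(coeffs)

# usage: read the MinPts matching a candidate Eps off the fitted curve
# minpts = int(round(fit_eps_minpts(eps_values, minpts_values)(candidate_eps)))
```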
Further, the specific process of Step 3 is as follows:
Step 3.1, as in Step 1.1, handle the unrecognizable character strings, the punctuation with identifying meaning and the related masking symbols such as "*" in the data set;
Step 3.2, build the corpus: sequentially traverse the entire similar duplicate record cluster; according to the N-Gram algorithm, when N = 2 the length of each element string in the corpus is 1;
Step 3.3, split the data records and compute the repetition matrix. Split each string into unit strings and traverse them to generate the N-Gram values of the string, computed with the bigram method under the Markov assumption, and obtain the repetition matrix M, where M(a,b) denotes the number of occurrences of "ab" in the entire similar duplicate record cluster;
Step 3.4, traverse the records again; based on all the N-Gram values appearing in each string and the information in the repetition matrix, assign a WNGN value to each string and compute the record's RNGN value, RNGN = ΣWNGN, which also serves as the key when clustering the records.
Step 3.5, map the corresponding database records to current records in order of their RNGN values, and then compare the current record with all representative records in the queue using a sliding-window pair-wise comparison.
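The sketch below illustrates Steps 3.2 to 3.5 under the assumption that each bigram's WNGN weight is simply its count in the repetition matrix M; the precise weighting scheme is not spelled out in the text, so the code is indicative only.

```python
from collections import Counter

def bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]

def repetition_matrix(cluster):
    """cluster: list of record strings from one DBSCAN similar-duplicate cluster."""
    m = Counter()
    for record in cluster:
        m.update(bigrams(record))                  # M[ab] = occurrences of "ab" in the cluster
    return m

def rngn(record, m):
    return sum(m[b] for b in bigrams(record))      # RNGN = sum of the record's WNGN values

# Records sorted by RNGN put likely duplicates next to each other, ready for the
# sliding-window pair-wise comparison of Step 3.5.
cluster = ["zhang san, computer school", "zhang sann, computer school", "li si, foreign languages"]
m = repetition_matrix(cluster)
ordered = sorted(cluster, key=lambda r: rngn(r, m))
```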
Further, said Step 4 specifically comprises:
A new optimized algorithm is obtained through initial point optimization and parameter self-adaptation. The degree of the polynomial fitted to the Eps–MinPts relationship in Step 2 can be adjusted by comparing training results, and the threshold used by the pair-wise comparison algorithm in Step 3 to judge whether two records are similar can be adjusted according to the actual situation. During training, the model is trained with the student data set as the training set; after the model is optimized, its effectiveness is verified, and the clusters of similar duplicate records are finally obtained.
The beneficial effects of the present invention are:
Traditional similar duplicate detection methods generally fall into two categories: literal string similarity detection and semantic word similarity detection. Research on data similarity detection has so far concentrated on literal string similarity, most commonly the classic edit distance method and its refinements, but the similarity detection results obtained with these methods are not very accurate; research on semantic word similarity detection is comparatively scarce, and the related methods still have many shortcomings. The present invention proposes a similar duplicate record detection optimization method based on the DBSCAN density clustering algorithm. Its main task is to assign every data sample in the data set to an appropriate cluster so that samples within a cluster are similar to one another while samples in different clusters are not. Taking the density clustering algorithm DBSCAN as the basis, the Chinese word segmentation system ICTCLAS is first introduced before clustering to normalize the data in the data set and to handle unrecognizable character strings and punctuation with identifying meaning. Next, the relationship between the DBSCAN parameters Eps and MinPts is fitted with a function so that the most suitable parameters can be selected, avoiding the uneven density clustering, overlapping clusters, huge clusters and neglected low-density clusters that result when the DBSCAN parameters are set manually at the outset; the triangle inequality property is also exploited to reduce the amount of computation. Finally, N-Gram secondary clustering is performed on the similar duplicate record clusters produced by DBSCAN density clustering to reduce the false similar duplicates caused by common spelling errors such as insertions, deletions, transpositions and substitutions, further improving the detection accuracy of similar duplicate records. This similar duplicate record detection method based on the density clustering algorithm DBSCAN exploits the closeness of similar duplicate records in combination with density clustering, effectively improving both the detection accuracy and the processing time of similar duplicate record detection; it is well suited to large data sets and provides technical support for improving similar duplicate record detection and data cleaning.
Brief Description of the Drawings
The present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments:
Figure 1 is a flow chart of the DBSCAN algorithm.
Figure 2 is a flow chart of the secondary clustering algorithm.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings of the embodiments.
Figure 1 is a flow chart of similar duplicate record detection based on the DBSCAN algorithm. In the data normalization step, the Chinese word segmentation system ICTCLAS is used to handle unrecognizable character strings and punctuation with identifying meaning; the rank-weight conversion method assigns corresponding weights to the fields, and the priority queue algorithm performs the preliminary partition of the data set, avoiding overlapping and huge clusters. Initial point optimization and adaptive selection of suitable parameters avoid the influence of manually chosen parameters on the clustering result and prevent uneven clustering. Secondary clustering of the similar duplicate record clusters mainly uses the N-Gram algorithm to further improve the detection accuracy of similar duplicate records, thereby completing the detection of similar duplicate records based on the DBSCAN algorithm.
Step 1: Normalize the character-type data and numeric data in the data set, mainly to perform a preliminary partition of the data set.
(1) Data normalization: handle unrecognizable character strings and punctuation with identifying meaning.
Use the Chinese word segmentation system ICTCLAS to segment the character-type data in the data set, build an inverted index over the segmentation results, and count the frequency with which each term appears in the data set; convert and normalize the numeric data, handling unrecognizable character strings, punctuation with identifying meaning, and related masking symbols such as "*".
(2) Use the rank-weight conversion method to assign corresponding weights to the fields.
Assign each field a rank according to its term frequency and a rank according to its importance, compute the combined rank sum Tk = Σ_i s_ik, and then determine the unified rank Sk, where Tk is the sum of the ranks assigned to field Yk, s_ik is the rank assigned according to the field's term frequency and importance respectively, and Sk is the final unified rank of each field (the field with the highest term frequency is ranked 1, the next 2, and so on; likewise the most important field is ranked 1, the next 2, and so on).
After the fields have been ranked, the Rank-Centroid conversion method is used to convert the field ranks into the corresponding weights, Wk = (1/S) Σ_{i=k}^{S} (1/i), where S denotes the maximum field rank; when several fields share the same rank, the formula is adjusted so that the tied fields receive the same weight.
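A hedged sketch of this Rank-Centroid conversion is given below; it assumes the standard Rank Order Centroid formula and that fields sharing a rank receive the average weight of the positions they occupy, which is one plausible reading of the tie rule.

```python
def rank_centroid_weights(ranks):
    """Convert unified field ranks S_k (1 = highest) into Rank-Centroid weights."""
    s = len(ranks)
    base = [sum(1.0 / i for i in range(k, s + 1)) / s for k in range(1, s + 1)]
    weights = []
    for r in ranks:
        tie_size = sum(1 for other in ranks if other == r)
        start = 1 + sum(1 for other in ranks if other < r)   # first position occupied by rank r
        # tied fields share the average centroid weight of the positions they cover
        weights.append(sum(base[p - 1] for p in range(start, start + tie_size)) / tie_size)
    return weights

# e.g. three fields with unified ranks 1, 2, 2:
# rank_centroid_weights([1, 2, 2]) -> [0.611..., 0.194..., 0.194...]
```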
(3) Use the priority queue algorithm to perform the preliminary partition of the data set.
Step 2: Introduce initial point optimization and parameter self-adaptation into the DBSCAN algorithm to mitigate uneven clustering of similar duplicate records.
(1) Optimize the selection of the initial point by means of the inter-object density.
Since the initial clustering point is normally an arbitrary sample drawn from the data set that starts a cluster once it is judged to be a core point, the object densities between samples are computed and the samples are sorted by object density in descending order; the sample with the highest global density in the data set is found and taken as the initial clustering point of the DBSCAN algorithm:
The object density of numeric data and the object density of character-type data are computed with their respective formulas, where A is the attribute set and U is the data set; the object density takes a maximum value of 0 and a minimum value of −|A|, that is, −|A| ≤ Dens(x) ≤ 0.
(2) Adaptive parameter selection method.
First compute the distances between samples together with the inter-object densities calculated above, then introduce a statistical method for each sample and plot the relationship between Eps and MinPts.
Distances between numeric data are computed with the Euclidean distance d(x,y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)^2);
For character-type data, a distance metric over symbolic attributes is used, where the data set X has n samples x1, x2, x3, …, xn, each sample has A features, denoted a1, a2, a3, …, aA, and all of these features are symbolic attributes;
Meanwhile, the triangle inequality property, i.e. that the sum of two sides is greater than the third side, is exploited to avoid some unnecessary computations and comparisons:
d(C1,C2)≤d(b,C1)+d(b,C2)
Since the DBSCAN algorithm with Euclidean distance has difficulty describing the latent relationships in data with complex structure, the geodesic distance is used instead of the Euclidean distance. The geodesic distance characterizes the shortest distance between two given points on a connected surface; it represents the true distance between samples on the manifold structure and reflects the global consistency of the data distribution. The specific computation steps are as follows:
First, assume that the given data set X = (x1, x2, ..., xn) lies in a D-dimensional space, where n = |X| denotes the number of samples in X. For any two samples x and y in X, the k nearest neighbors of x are computed from the Euclidean distance: N_k(x) is taken to be the set of the k samples of X with the smallest Euclidean distance d(x,y) to x, where k takes values in the range 1 ≤ k ≤ n−1;
The geodesic distance is then initialized: when sample y is one of the k nearest neighbors of sample x, the Euclidean distance between the two sample points is taken as the geodesic distance, d_G(x,y) = d(x,y); if sample y is not a k nearest neighbor of sample x, the shortest path distance between x and y is used to represent the geodesic distance, computed as d_G(x,y) = min_z { d_G(x,z) + d_G(z,y) }, where d_G(x,y) denotes the geodesic distance between the sample points and min denotes taking the shortest such distance.
Finally, the relationship between Eps and MinPts is fitted with a higher-degree polynomial by the least squares method.
Step 3: Apply the N-Gram algorithm to perform secondary clustering on the similar duplicate record clusters produced by the DBSCAN algorithm in Step 2, effectively improving the detection accuracy; the specific flow is shown in Figure 2.
(1) Build the corpus.
Sequentially traverse the entire similar duplicate record cluster; according to the N-Gram algorithm, when N = 2 the length of each element string in the corpus is 1.
(2) Split the data records and compute the repetition matrix.
Split each string into unit strings and traverse them to generate the N-Gram values of the string, computed with the bigram method under the Markov assumption, and obtain the repetition matrix M, where M(a,b) denotes the number of occurrences of "ab" in the entire similar duplicate record cluster.
(3) Compute the RNGN value of each record.
Traverse the records again; based on all the N-Gram values appearing in each string and the information in the repetition matrix, assign a WNGN value to each string and compute the record's RNGN value, RNGN = ΣWNGN, which also serves as the key when clustering the records.
(4) Perform a pair-wise comparison on the current record.
Map the corresponding database records to current records in order of their RNGN values, and then compare the current record with all representative records in the queue using a sliding-window pair-wise comparison. The comparison mainly uses the edit distance method, with two thresholds:
Threshold l1: used to judge whether the same field in two records is similar. If the edit distance between the two fields is greater than l1, the two fields are judged not similar, and the two records are therefore judged not to be similar duplicate records;
Threshold l2: used to judge whether two records are similar duplicate records. The record-level distance is the sum of the edit distances of the corresponding fields; if it is greater than l2, the two records are not similar duplicate records.
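The two-threshold comparison can be illustrated as follows, assuming a plain Levenshtein edit distance at the field level; the thresholds l1 and l2 and the exact distance measure are left to the practitioner, as the text indicates.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two field strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_similar_duplicate(rec1, rec2, l1, l2):
    """rec1, rec2: lists of field strings in the same order."""
    total = 0
    for f1, f2 in zip(rec1, rec2):
        d = edit_distance(f1, f2)
        if d > l1:              # one clearly different field rules the pair out
            return False
        total += d
    return total <= l2          # the record-level distance must also stay within l2
```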
Step 4: Train and test the optimized similar duplicate record detection method described above on a student information data set to verify its effectiveness.
When training on the data sets, the number of similar duplicate records among the selected contents and the overall size of the data set can be modified manually so that the detection experiments for similar duplicate records can be repeated several times; subsequently, several small training data sets can be merged into one large data set and used as the training sample to test the adaptability of the method to large-scale data sets.
Preferred embodiment:
In an optimal specific embodiment of the present invention, the character-type and numeric data in the data set are normalized and a preliminary partition of the data set is performed; in this process the Chinese word segmentation system ICTCLAS is used to process the character-type data and to handle unrecognizable character strings and punctuation with identifying meaning. During the DBSCAN clustering, initial point optimization and adaptive parameter selection are performed: the function fitting method selects the parameters Eps and MinPts best suited to the data set, and the triangle inequality property reduces a certain amount of computation. On the resulting similar duplicate record clusters, N-Gram secondary clustering is then performed: each record is assigned an N-Gram value used as a sort key, the records are first sorted by this key and then compared pair-wise using a priority queue, which accommodates common spelling errors, clusters similar duplicate records better and improves their detection accuracy. When training on the data sets, the number of similar duplicate records among the selected contents and the overall size of the data set can be modified manually so that the detection experiments can be repeated several times; subsequently, several small training data sets can be merged into one large data set and used as the training sample to test the adaptability to large-scale data sets.
In summary, the optimization method for detecting similar duplicate records based on the density clustering algorithm DBSCAN of the present invention consists of four steps. The first step performs data normalization and a preliminary partition of the data set; the second step introduces initial point optimization and parameter self-adaptation into the DBSCAN algorithm to mitigate uneven clustering of similar duplicate records; the third step applies the N-Gram algorithm to perform secondary clustering on the similar duplicate record clusters of the second step, effectively improving the detection accuracy; the fourth step trains and tests the method constructed above on a student data set to verify its effectiveness. On the basis of the density clustering algorithm, the present invention effectively improves the detection accuracy for similar duplicate records, avoids the influence of noise points and reduces the computational cost.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "exemplary embodiment", "example", "specific example" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.