





Technical Field
The invention relates to the fields of remote sensing and geographic information systems.
Background Art
In the existing field of supervised classification of spatial raster data, the main techniques in use include neural networks, support vector machines, decision trees, Bayesian methods, KNN, and similar algorithms. These algorithms take training data as input, learn from it to produce a "classification model", and then use that model to predict the class of new data. For high-dimensional data, an "attribute selection" algorithm is usually applied to reduce the dimensionality and improve speed.
Another important technique in current use is the "ensemble classifier". An ensemble classifier combines multiple heterogeneous classifiers and lets them vote, with the aim of achieving higher classification accuracy than any single classifier.
Processing spatial raster data often means handling massive, ultra-high-dimensional data. For example, some spatial data sets contain more than 2,000 spatial attributes and occupy several TB or more. Processing such data quickly and effectively runs into several difficulties:
(1) Speed: when the data volume is very large, and especially as the dimensionality grows, the cost of training a classification model also grows. Currently popular C++-based SVM programs (such as LIBSVM) may fail to produce a trained model after several hours, or may exhaust memory before the results can be stored.
(2) Attribute subsets: to improve speed, many algorithms use "attribute selection". On the one hand, selecting a suitable attribute subset from a very large attribute set is a non-deterministic polynomial (NP) problem, and the number of combinations is too large to enumerate. On the other hand, near-optimal attribute subsets are usually "biased", so prediction accuracy for some classes suffers.
(3) Accuracy: to improve accuracy, many algorithms use the "ensemble classifier" technique, which divides the training data into multiple subsets, trains on each, and then votes. For high-dimensional raster data, on the one hand, the large data volume makes it hard to guarantee diversity among the sub-classifiers, and sub-classifiers that are too similar defeat the purpose of "ensembling" and "voting". On the other hand, pairing a large number of attributes with only part of the training data leads to "overfitting". Both problems reduce classification accuracy.
In summary, the existing field of supervised classification of spatial raster data suffers from slow speed, low accuracy, biased attribute subsets, and attribute subset selection being an NP problem.
Summary of the Invention
To solve the problems of slow speed, low accuracy, biased attribute subsets, and attribute subset selection being an NP problem in the existing field of supervised classification of spatial raster data, the invention provides an ensemble classifier and a classification method for that device.
The classification method of the ensemble classifier comprises the following steps:
Step one: read the raster data to be processed using a combination of multiple processes and multiple threads. The specific procedure comprises the following steps:
A. Input the number n of sub-classifiers of the ensemble classifier;
where n is the number of sub-classifiers and n is greater than or equal to 2; through the expectation algorithm, all spatial attributes of the raster data are divided into n parts according to decision capability, and each sub-classifier retains the classification capability of the full attribute set;
B. Start n+1 processes;
where the n+1 processes are Rank 0, Rank 1 … Rank n; Rank 0 is the manager process, Rank 1 … Rank n are compute processes, and the compute processes Rank 1 … Rank n correspond one-to-one with the n sub-classifiers;
C. When the current process is the manager process Rank 0, construct an empty rough relation table and divide the raster data to be processed evenly among the compute processes; start n threads, each thread corresponding to exactly one compute process;
where the threads comprise thread 1, thread 2 … thread n;
D. When the current process is a compute process, each compute process reads the raster data to be processed simultaneously;
Step two: the manager process Rank 0 maintains the attribute discretization interval table and divides it evenly among multiple threads, which start simultaneously and discretize the raster data of their respective spatially continuous attributes;
Step three: the manager process Rank 0 distributes the spatial attributes evenly among the n compute processes, collects their results, builds the complete rough relation table, and sends that table to every compute process; each compute process builds one attribute subset from the rough relation table;
Step four: the manager process Rank 0 has each compute process train, in parallel, a sub-classifier on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one with the compute processes; each sub-classifier predicts the class using its corresponding attribute subset, the predictions of all sub-classifiers are tallied, and the prediction with the most votes is selected by voting.
The ensemble classifier comprises the following means:
Means for reading the raster data to be processed using a combination of multiple processes and multiple threads, comprising the following modules:
A module for inputting the number n of sub-classifiers of the ensemble classifier;
where n is the number of sub-classifiers and n is greater than or equal to 2; through the expectation algorithm, all spatial attributes of the raster data are divided into n parts according to decision capability, and each sub-classifier retains the classification capability of the full attribute set;
A module for starting n+1 processes;
where the n+1 processes are Rank 0, Rank 1 … Rank n; Rank 0 is the manager process, Rank 1 … Rank n are compute processes, and the compute processes Rank 1 … Rank n correspond one-to-one with the n sub-classifiers;
A module for, when the current process is the manager process Rank 0, constructing an empty rough relation table, dividing the raster data to be processed evenly among the compute processes, and starting n threads, each thread corresponding to exactly one compute process;
where the threads comprise thread 1, thread 2 … thread n;
A module for, when the current process is a compute process, having each compute process read the raster data to be processed simultaneously;
Means for the manager process Rank 0 to maintain the attribute discretization interval table and divide it evenly among multiple threads, which start simultaneously and discretize the raster data of their respective spatially continuous attributes;
Means for the manager process Rank 0 to distribute the spatial attributes evenly among the n compute processes, collect their results, build the complete rough relation table, and send that table to every compute process, each compute process building one attribute subset from the rough relation table;
Means for the manager process Rank 0 to have each compute process train, in parallel, a sub-classifier on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one with the compute processes, each sub-classifier predicting the class using its corresponding attribute subset, the predictions of all sub-classifiers being tallied, and the prediction with the most votes being selected by voting.
The invention has the following advantages:
(1) Training data subsets are constructed by partitioning attributes rather than by partitioning samples.
(2) Training data subsets are combined with parallel computing and applied to high-dimensional raster data.
(3) Fuzzy rough set theory is used as the criterion for the parallel partitioning of high-dimensional attributes, so that each subset has its own independent characteristics while preserving decision completeness.
(4) The method accommodates heterogeneous data of both discrete and continuous types.
Brief Description of the Drawings
Fig. 1 is a flowchart of the classification method of the ensemble classifier;
Fig. 2 is a flowchart of the specific steps for reading the raster data to be processed using a combination of multiple processes and multiple threads;
Fig. 3 is a flowchart of the specific steps by which each thread discretizes the raster data of its spatially continuous attribute;
Fig. 4 is a diagram of the relationships among the threads during discretization, where 2 ≤ l ≤ n;
Fig. 5 is a diagram of the construction of the rough relation table and its relationship to the attribute usage table;
Fig. 6 is a flowchart of the training stage that produces the models.
Detailed Description of the Embodiments
Embodiment 1. This embodiment is described with reference to Figs. 1 and 2. The classification method of the ensemble classifier of this embodiment comprises the following steps:
Step one: read the raster data to be processed using a combination of multiple processes and multiple threads. The specific procedure comprises the following steps:
A. Input the number n of sub-classifiers of the ensemble classifier;
where n is the number of sub-classifiers and n is greater than or equal to 2; through the expectation algorithm, all spatial attributes of the raster data are divided into n parts according to decision capability, and each sub-classifier retains the classification capability of the full attribute set;
B. Start n+1 processes;
where the n+1 processes are Rank 0, Rank 1 … Rank n; Rank 0 is the manager process, Rank 1 … Rank n are compute processes, and the compute processes Rank 1 … Rank n correspond one-to-one with the n sub-classifiers;
C. When the current process is the manager process Rank 0, construct an empty rough relation table and divide the raster data to be processed evenly among the compute processes; start n threads, each thread corresponding to exactly one compute process;
where the threads comprise thread 1, thread 2 … thread n;
D. When the current process is a compute process, each compute process reads the raster data to be processed simultaneously;
Step two: the manager process Rank 0 maintains the attribute discretization interval table and divides it evenly among multiple threads, which start simultaneously and discretize the raster data of their respective spatially continuous attributes;
Step three: the manager process Rank 0 distributes the spatial attributes evenly among the n compute processes, collects their results, builds the complete rough relation table, and sends that table to every compute process; each compute process builds one attribute subset from the rough relation table;
Step four: the manager process Rank 0 has each compute process train, in parallel, a sub-classifier on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one with the compute processes; each sub-classifier predicts the class using its corresponding attribute subset, the predictions of all sub-classifiers are tallied, and the prediction with the most votes is selected by voting.
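The manager and compute roles in the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: threads from Python's standard library stand in for the compute processes Rank 1 … Rank n, and the names `split_evenly`, `manager`, and `work` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def manager(attributes, n, work):
    """Play the Rank 0 role: give one chunk of attributes to each of n workers
    and gather their results in worker order.

    `work` is a hypothetical callable applied to one chunk; in the method above
    it would compute that worker's share of the rough relation table.
    """
    chunks = split_evenly(attributes, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(work, chunks))


if __name__ == "__main__":
    # Seven attribute indices shared among three workers; each worker just sums its chunk.
    print(split_evenly(list(range(7)), 3))
    print(manager(list(range(7)), 3, sum))
```

A real deployment would presumably replace the thread pool with separate processes and message passing between Rank 0 and the compute ranks; the even split is the only part taken directly from the text.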
In this embodiment, after step three every process has obtained its "attribute subset". Each process then uses its attribute subset to train, in parallel, a specified classifier (for example ID3, SVM, or a neural network; such models are conventional algorithms). Because the data volume is comparatively small (against hundreds of dimensions, each subset in this algorithm typically contains 10 to 20 attributes, which reduces the data volume by a factor of tens), the models can be trained quickly. In the decision stage these models can simply vote, as shown in Fig. 6, which effectively prevents overfitting and increases classification accuracy.
The voting scheme of this embodiment works as follows. Suppose there are n classifiers. For an object x to be predicted, the n classifiers each make a prediction; m1 of them decide "type A" and m2 of them decide "type B". Voting follows majority rule: the decision endorsed by the larger number of classifiers is taken as the decision of the ensemble classifier as a whole. This is the voting process.
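The majority rule just described can be illustrated with a short sketch. The function name `majority_vote` is hypothetical, and the tie-breaking rule (the label that appears first among the tied labels wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most sub-classifiers.

    Ties are broken in favour of the label that appears first in
    `predictions` (an assumption; the source does not specify this).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


if __name__ == "__main__":
    # Three sub-classifiers: two vote "A", one votes "B".
    print(majority_vote(["A", "B", "A"]))
```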
Embodiment 2. This embodiment differs from the classification method of the ensemble classifier of Embodiment 1 in that the raster data in step A is high-dimensional raster data.
For massive high-dimensional raster data, conventional algorithms are slow and inaccurate, whereas this patent processes the raster data and obtains the classification models quickly; because a heterogeneous decision mechanism is used, the classification accuracy is also higher.
Embodiment 3. This embodiment is described with reference to Fig. 3. It differs from the classification method of the ensemble classifier of Embodiment 1 or 2 in that the specific steps by which each thread in step two discretizes the raster data of its spatially continuous attribute are:
Step 2.1: set the number of clusters to ceil;
Step 2.2: compute evenly distributed initial cluster centers between the maximum and minimum values of the spatially continuous attribute handled by the thread;
Step 2.3: starting from the evenly distributed initial centers, cluster with the K-Means algorithm to form ceil clusters;
Step 2.4: for each cluster, output its minimum and maximum values, forming ceil value intervals;
Step 2.5: assemble the ceil value intervals into an interval list.
Discretization in this embodiment yields a set of intervals. With these intervals the original continuous data can be converted into a finite set of numbers such as 1, 2, 3, 4, which makes the relations explicit and speeds up comparison and analysis. In the multi-process case, the processing flow for all data is shown in Fig. 4.
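Steps 2.1 to 2.5 can be sketched for a single continuous attribute as below. This is a minimal pure-Python illustration, not the patented code: the names `discretize_intervals` and `to_code` are hypothetical, the initial centers are assumed to include both the minimum and the maximum, and a value that falls between two intervals is assumed to map to the interval with the nearest bound. If a cluster ends up empty, fewer than `k` intervals are returned.

```python
def discretize_intervals(values, k, max_iter=100):
    """1-D K-Means with evenly spaced initial centers; returns (min, max) per cluster."""
    lo, hi = min(values), max(values)
    # Step 2.2: evenly spaced initial centers between the minimum and maximum.
    if k == 1:
        centers = [(lo + hi) / 2]
    else:
        centers = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    clusters = [list(values)]
    for _ in range(max_iter):
        # Step 2.3: assign each value to its nearest center, then recompute centers.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # Steps 2.4 and 2.5: one (min, max) interval per non-empty cluster, as a list.
    return [(min(c), max(c)) for c in clusters if c]


def to_code(value, intervals):
    """Map a continuous value to the 1-based index of its interval."""
    for idx, (lo, hi) in enumerate(intervals, start=1):
        if lo <= value <= hi:
            return idx
    # Value lies in a gap: fall back to the interval with the nearest bound.
    return min(range(1, len(intervals) + 1),
               key=lambda i: min(abs(value - intervals[i - 1][0]),
                                 abs(value - intervals[i - 1][1])))


if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.5, 9.0, 9.4]
    intervals = discretize_intervals(data, 3)
    print(intervals)
    print([to_code(v, intervals) for v in data])
```

In the method above, each thread would run this kind of routine on the attributes assigned to it, and the resulting interval lists would populate the attribute discretization interval table.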
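Embodiment 4. This embodiment differs from the classification method of the ensemble classifier of Embodiment 1 or 2 in that the rough relation table in step three is a two-dimensional table expressing the degree of overlap between two attributes. A rough relation of 1 means the two attributes are most strongly related, and a rough relation of 0 means they are least related. The rough relation table is as follows: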
Embodiment 5. This embodiment differs from the classification method of the ensemble classifier of Embodiment 4 in that the specific steps by which each compute process in step three builds an attribute subset from the rough relation table are:
Step 3.1: in the rough relation table of the compute process, randomly select a pair of attributes whose rough relation marks them as unrelated and whose status is "unused"; add these attributes to the attribute subset of the compute process, the subset corresponding one-to-one with that compute process, and mark them "used";
the status of an attribute is either "used" or "unused";
Step 3.2: in the compute process, compute according to formula (8) the relation between each "unused" attribute and the attribute subset of the compute process;
the rough relation between an attribute and an attribute subset is given by formula (8), where b denotes the attribute subset of the compute process, $a_n$ denotes any "unused" attribute, and $RT(b, a_n)$ denotes the rough relation between the attribute subset of the compute process and that "unused" attribute;
Step 3.3: select the attribute with the smallest computed value, add it to the attribute subset of the compute process, and mark that attribute "used";
Step 3.4: compute according to formula (6) the relation between the attribute subset of the compute process and the full set of dimensions D;
where w denotes the attribute subset of the compute process; IND(w) is the indiscernibility relation corresponding to the subset w, that is, elements related under it are regarded as indistinguishable and incomparable; Card(U) is the cardinality of the set ("Card" abbreviates "cardinal"); and $\mathrm{POS}_D(X)$ is the positive region of X with respect to D, or more loosely, the part of X that is completely contained by D;
Step 3.5: when $\gamma_D(w) = 1$, output the attribute subset of the compute process;
Step 3.6: when $\gamma_D(w) = 0$, compute again in the compute process, according to formula (8), the relation between each "unused" attribute and the attribute subset of the compute process.
According to Pawlak's rough set theory, an information system S can be viewed as a data table. It is represented by the pair $S = (U, A)$, where the universe U is a non-empty finite set and A is a non-empty finite set of attributes. For each element $a \in A$ there is a mapping $a: U \to V_a$, where $V_a$ is the set of values that a can take. A decision table is an information system of the form $S = (U, A \cup \{d\})$, where $d \notin A$ is the decision attribute. For any attribute set $P \subseteq A$ there is an indiscernibility relation IND(P):
$$\mathrm{IND}(P) = \{\, (x, y) \in U \times U \mid \forall a \in P,\ a(x) = a(y) \,\} \tag{1}$$
where x and y are both multi-dimensional vectors in the multi-dimensional space.
The equivalence class based on the P-indiscernibility relation can be defined as:
$$[x]_P = \{\, y \in U \mid (x, y) \in \mathrm{IND}(P) \,\} \tag{2}$$
The lower and upper approximations can be defined from the indiscernibility relation. Let $X \subseteq U$; X can be approximated by the following two sets:
Lower approximation: $\underline{D}X = \{\, x \in U \mid [x]_D \subseteq X \,\}$. Upper approximation: $\overline{D}X = \{\, x \in U \mid [x]_D \cap X \neq \emptyset \,\}$. If $\underline{D}X \neq \overline{D}X$, the pair $(\underline{D}X, \overline{D}X)$ is called a rough set. The positive, negative, and boundary regions are then defined, where X is a set, $\underline{D}X$ denotes the lower approximation of X, and $\overline{D}X$, built from the intersections $[x]_D \cap X$, denotes the upper approximation of X:
$$\mathrm{POS}_D(X) = \underline{D}X \tag{3}$$
$$\mathrm{BND}_D(X) = \mathrm{NEG}_D(X) - \mathrm{POS}_D(X) \tag{5}$$
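As an illustration of equivalence classes and the two approximations, the sketch below uses the standard Pawlak definitions (a class belongs to the lower approximation if it lies entirely inside X, and to the upper approximation if it intersects X). Treating these as the definitions intended here is an assumption, and the table layout (a list of dictionaries) and the names `equivalence_classes` and `lower_upper` are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Group row indices that agree on every attribute in attrs (the classes of IND(attrs))."""
    groups = defaultdict(set)
    for i, row in enumerate(table):
        groups[tuple(row[a] for a in attrs)].add(i)
    return list(groups.values())


def lower_upper(table, attrs, target):
    """Return the (lower, upper) approximations of the row-index set `target`."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table, attrs):
        if cls <= target:   # class lies entirely inside the target set
            lower |= cls
        if cls & target:    # class overlaps the target set
            upper |= cls
    return lower, upper


if __name__ == "__main__":
    table = [
        {"a1": 1, "a2": 0, "d": "x"},
        {"a1": 1, "a2": 0, "d": "y"},
        {"a1": 2, "a2": 1, "d": "x"},
        {"a1": 3, "a2": 1, "d": "y"},
    ]
    rows_with_x = {i for i, r in enumerate(table) if r["d"] == "x"}
    print(lower_upper(table, ["a1"], rows_with_x))
```

In this toy table, rows 0 and 1 are indistinguishable on `a1` but carry different decisions, so they appear only in the upper approximation.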
An important concept in rough set theory is the degree of dependency between attributes. The degree to which attribute Q depends on attribute D is the attribute dependency, denoted γ and given by formula (6).
From formula (6), when the subset R is the union of all dimensions (attributes), the result of formula (6) is 1. With $D_1 = \{a_1\}$ and $D_2 = \{a_1, a_2\}$, compute $\mathrm{Diff}(D_2, D_1) = \gamma_{D_2}(Q) - \gamma_{D_1}(Q)$. If Diff is large, the decision regions covered by dimensions $a_1$ and $a_2$ differ and together span a wide range, so the two are well suited to being combined. If Diff is small, the decision regions covered by $a_1$ and $a_2$ are similar (in the extreme case $a_2 = a_1$, Diff = 0 and the two attributes are indistinguishable), so the two are not suited to being combined. Each entry of the rough relation table is therefore computed as:
$$RT(a_1, a_2) = 1 - \left(\gamma_{\{a_1, a_2\}}(Q) - \gamma_{a_1}(Q)\right) \tag{7}$$
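Formula (7) can be tried out on a small decision table. Because formula (6) is not legible in this text, the sketch assumes the standard Pawlak dependency degree: the fraction of rows whose equivalence class under the condition attributes falls inside a single decision class. The function names and the table layout are hypothetical.

```python
from collections import defaultdict


def partition(table, attrs):
    """Group row indices that agree on every attribute in attrs."""
    groups = defaultdict(set)
    for i, row in enumerate(table):
        groups[tuple(row[a] for a in attrs)].add(i)
    return list(groups.values())


def dependency(table, cond_attrs, decision):
    """Assumed form of gamma: size of the positive region divided by the number of rows."""
    decision_classes = partition(table, [decision])
    positive = 0
    for cls in partition(table, cond_attrs):
        if any(cls <= d for d in decision_classes):
            positive += len(cls)
    return positive / len(table)


def rough_relation(table, a1, a2, decision):
    """Formula (7): RT(a1, a2) = 1 - (gamma_{a1,a2}(Q) - gamma_{a1}(Q))."""
    return 1 - (dependency(table, [a1, a2], decision)
                - dependency(table, [a1], decision))


if __name__ == "__main__":
    table = [
        {"a1": 0, "a2": 0, "a3": 0, "d": "x"},
        {"a1": 0, "a2": 1, "a3": 0, "d": "y"},
        {"a1": 1, "a2": 0, "a3": 1, "d": "x"},
        {"a1": 1, "a2": 1, "a3": 1, "d": "y"},
    ]
    print(rough_relation(table, "a1", "a2", "d"))  # a2 adds new decision power
    print(rough_relation(table, "a1", "a3", "d"))  # a3 duplicates a1
```

In this toy table `a3` is a copy of `a1`, so adding it changes nothing and the pair gets the "most related" value, while `a2` completes the decision and the pair gets the "least related" value, matching the interpretation of 1 and 0 given for the rough relation table.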
For an attribute b and an attribute set $D = \{a_1, a_2, \ldots, a_n\}$, the rough relation is given by formula (8).
Computing the rough relations is expensive and needs to be parallelized as shown in Fig. 5. Once they are computed, the "attribute-to-set relation" can be obtained by traversing the rough relation table; this is a summation and counting process with a small computational cost. After the rough relation table has been built, the individual attribute subsets are obtained.
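The subset-building loop of steps 3.1 to 3.6 can be sketched as follows. Several points are assumptions, because formulas (6) and (8) are not legible in this text: the attribute-to-set relation is taken to be the sum of the pairwise table entries (the text only says it is a summation), the "unrelated" seed pair is taken to be a pair with the lowest table entry, and the loop continues while the dependency is below 1. The names `build_subset`, `rt`, and `gamma` are hypothetical.

```python
import random


def build_subset(attrs, rt, gamma, rng=random):
    """Greedily grow one attribute subset from a rough relation table.

    attrs: list of attribute names (at least two).
    rt:    dict mapping frozenset({a, b}) to the rough relation of a and b.
    gamma: callable returning the dependency degree of a candidate subset.
    """
    # Step 3.1: seed with a randomly chosen least-related pair.
    pairs = [(a, b) for i, a in enumerate(attrs) for b in attrs[i + 1:]]
    lowest = min(rt[frozenset(p)] for p in pairs)
    seed = rng.choice([p for p in pairs if rt[frozenset(p)] == lowest])
    subset = list(seed)
    unused = [a for a in attrs if a not in seed]
    # Steps 3.2 to 3.6: add the least-related unused attribute until gamma reaches 1.
    while unused and gamma(subset) < 1:
        best = min(unused,
                   key=lambda a: sum(rt[frozenset((a, s))] for s in subset))
        subset.append(best)
        unused.remove(best)
    return subset


if __name__ == "__main__":
    attrs = ["a1", "a2", "a3", "a4"]
    rt = {
        frozenset(("a1", "a2")): 0.0, frozenset(("a1", "a3")): 0.9,
        frozenset(("a2", "a3")): 0.8, frozenset(("a1", "a4")): 0.2,
        frozenset(("a2", "a4")): 0.3, frozenset(("a3", "a4")): 0.5,
    }
    # Toy dependency: any three attributes are assumed sufficient.
    print(build_subset(attrs, rt, lambda s: 1.0 if len(s) >= 3 else 0.0))
```

The sketch omits the shared "used" bookkeeping across compute processes that the method describes; each call here builds a single subset in isolation.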
Embodiment 6. The ensemble classifier of this embodiment comprises the following means:
Means for reading the raster data to be processed using a combination of multiple processes and multiple threads, comprising the following modules:
A module for inputting the number n of sub-classifiers of the ensemble classifier;
where n is the number of sub-classifiers and n is greater than or equal to 2; through the expectation algorithm, all spatial attributes of the raster data are divided into n parts according to decision capability, and each sub-classifier retains the classification capability of the full attribute set;
A module for starting n+1 processes;
where the n+1 processes are Rank 0, Rank 1 … Rank n; Rank 0 is the manager process, Rank 1 … Rank n are compute processes, and the compute processes Rank 1 … Rank n correspond one-to-one with the n sub-classifiers;
A module for, when the current process is the manager process Rank 0, constructing an empty rough relation table, dividing the raster data to be processed evenly among the compute processes, and starting n threads, each thread corresponding to exactly one compute process;
where the threads comprise thread 1, thread 2 … thread n;
A module for, when the current process is a compute process, having each compute process read the raster data to be processed simultaneously;
Means for the manager process Rank 0 to maintain the attribute discretization interval table and divide it evenly among multiple threads, which start simultaneously and discretize the raster data of their respective spatially continuous attributes;
Means for the manager process Rank 0 to distribute the spatial attributes evenly among the n compute processes, collect their results, build the complete rough relation table, and send that table to every compute process, each compute process building one attribute subset from the rough relation table;
Means for the manager process Rank 0 to have each compute process train, in parallel, a sub-classifier on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one with the compute processes.
Means for the manager process Rank 0 to have each compute process train, in parallel, a sub-classifier on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one with the compute processes, each sub-classifier predicting the class using its corresponding attribute subset according to fuzzy rough set theory, the predictions of all sub-classifiers being tallied, and the prediction with the most votes being selected by voting.
Embodiment 7. This embodiment differs from Embodiment 6 in that the raster data is high-dimensional raster data.
Embodiment 8. This embodiment differs from Embodiment 6 or 7 in that the means for the manager process Rank 0 to maintain the attribute discretization interval table and divide it evenly among multiple threads, which start simultaneously and discretize the raster data of their respective spatially continuous attributes, comprises the following modules:
A module for setting the number of clusters to ceil;
A module for computing evenly distributed initial cluster centers between the maximum and minimum values of the spatially continuous attribute handled by the thread;
A module for clustering with the K-Means algorithm, starting from the evenly distributed initial centers, to form ceil clusters;
A module for outputting the minimum and maximum values of each cluster, forming ceil value intervals;
A module for assembling the ceil value intervals into an interval list.
Embodiment 9. This embodiment differs from Embodiment 6 in that the rough relation table is a two-dimensional table expressing the degree of overlap between two attributes. A rough relation of 1 means the two attributes are most strongly related, and a rough relation of 0 means they are least related. The rough relation table is as follows:
具体实施方式十、本实施方式与具体实施方式六所述的集成分类器的分类方法的区别在于,用于管理进程Rank0将空间属性均匀分给n个运算进程处理并收集n个运算进程的处理结果、构建完整的粗糙关系表,将该粗糙关系表发给每个运算进程,每个运算进程根据粗糙关系表建立一个属性子集的装置,包括如下模块:Embodiment 10. The difference between this embodiment and the classification method of the integrated classifier described in Embodiment 6 is that the management process Rank0 evenly distributes the spatial attributes to n operation processes for processing and collects n operation processes. As a result, a complete rough relational table is constructed, and the rough relational table is sent to each operation process, and each operation process establishes an attribute subset device according to the rough relational table, including the following modules:
用于在所述运算进程的粗糙关系表中随机选择一对粗糙关系不相关的属性,该属性的状态为“未使用”,将该属性加入所述运算进程的属性子集中,该子集为与所述运算进程一一对应的子集,并将其标记为“已使用”的模块,It is used to randomly select a pair of attributes irrelevant to the rough relationship in the rough relationship table of the operation process, and the status of the attribute is "unused", and add the attribute to the attribute subset of the operation process, the subset is a subset of modules that correspond one-to-one to said operational processes and mark them as "used",
属性的状态为“已使用”或“未使用”;The status of the attribute is "used" or "not used";
用于在所述运算进程中,根据公式(8)计算每一对“未使用”的属性与所述运算进程的属性子集的关系的模块,A module for calculating the relationship between each pair of "unused" attributes and the attribute subset of the operation process according to formula (8) in the operation process,
属性与属性子集的粗糙关系为:The rough relationship between attributes and attribute subsets is:
其中,b表示所述运算进程的属性子集,an表示任意一对“未使用”的属性,RT(b,an)表示所述运算进程的属性子集与任意一对“未使用”的属性的粗糙关系;Wherein, b represents the attribute subset of the operation process, an represents any pair of "unused" attributes, RT(b, an) represents the attribute subset of the operation process and any pair of "unused" attributes rough relationship;
A module for selecting the attribute with the smallest calculated result, adding that attribute to the attribute subset of the operation process, and marking the attribute subset of the operation process as "used";
A module for calculating the relation between the attribute subset of the operation process and the full dimension set D according to formula (6);
where w denotes the attribute subset of the operation process, IND(w) is the indiscernibility relation corresponding to the subset w, Card(U) is the cardinality of the set U, and POS_D(X) is the positive region of X with respect to D;
A module for outputting the attribute subset of the operation process when γ_D(w) = 1;
A module for calculating, in the operation process, the relation between each pair of "unused" attributes and the attribute subset of the operation process according to formula (8) when γ_D(w) = 0.
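The per-process subset construction described by the modules above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation. Formulas (6) and (8) are not reproduced in this text, so two stand-ins are used: formula (6) is assumed to be the standard rough-set dependency degree γ_D(w) = Card(POS_IND(w)(D)) / Card(U), and formula (8) is replaced by a hypothetical mean of the pairwise rough relations between the candidate attribute and the members of the subset. The function names, the reading of "uncorrelated" as "lowest value in the table", and the rule of continuing until γ_D(w) = 1 or no "unused" attribute remains are likewise assumptions.

```python
import random
from collections import defaultdict


def dependency_degree(rows, subset, decision):
    """Assumed form of formula (6): share of objects in the positive region.

    rows     -- list of tuples of discrete attribute values (the universe U)
    subset   -- list of attribute (column) indices, the subset w
    decision -- list of class labels, one per row (the decision D)
    """
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        # Objects with equal values on every attribute in w are indiscernible.
        classes[tuple(row[a] for a in subset)].append(i)
    # An equivalence class lies in the positive region if it has one label only.
    positive = sum(len(members) for members in classes.values()
                   if len({decision[i] for i in members}) == 1)
    return positive / len(rows)


def subset_relation(rt, subset, candidate):
    """Hypothetical stand-in for formula (8): mean pairwise rough relation."""
    return sum(rt[b][candidate] for b in subset) / len(subset)


def build_attribute_subset(rt, rows, decision, seed=None):
    """Greedy construction of one attribute subset from the rough relation table rt."""
    rng = random.Random(seed)
    n = len(rt)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Start from a random pair among the least related pairs (assumption).
    lowest = min(rt[i][j] for i, j in pairs)
    first, second = rng.choice([p for p in pairs if rt[p[0]][p[1]] == lowest])
    subset = [first, second]
    used = {first, second}  # attributes in state "used"
    # Add the least related "unused" attribute until the subset decides D fully.
    while dependency_degree(rows, subset, decision) < 1.0 and len(used) < n:
        best = min((a for a in range(n) if a not in used),
                   key=lambda a: subset_relation(rt, subset, a))
        subset.append(best)
        used.add(best)
    return subset
```

In this sketch attributes are identified by column index and `rt` is a symmetric list of lists with values between 0 and 1. Each operation process would call `build_attribute_subset` on the table it received from Rank0; passing a different `seed` per process is one possible way to obtain different subsets.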
| Application Number | Publication Number | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN201210379640.9A | CN102930290B (en) | 2012-10-09 | 2012-10-09 | Classification method of an integrated classifier and device thereof |
| Publication Number | Publication Date |
|---|---|
| CN102930290A | 2013-02-13 |
| CN102930290B | 2015-08-19 |
| Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| CN201210379640.9A | Expired - Fee Related | CN102930290B (en) | 2012-10-09 | 2012-10-09 | Classification method of an integrated classifier and device thereof |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101251896A (en)* | 2008-03-21 | 2008-08-27 | 腾讯科技(深圳)有限公司 | Object detecting system and method based on multiple classifiers |
| US7562017B1 (en)* | 2003-05-29 | 2009-07-14 | At&T Intellectual Property Ii, L.P. | Active labeling for spoken language understanding |
| Title |
|---|
| 潘欣等: "粗集属性划分的集成遥感分类", 《遥感学报》, 31 December 2009 (2009-12-31)* |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104484404A (en)* | 2014-12-15 | 2015-04-01 | 中国科学院东北地理与农业生态研究所 | Improved processing method for geo-raster data file in distributed file system |
| CN104484404B (en)* | 2014-12-15 | 2017-11-07 | 中国科学院东北地理与农业生态研究所 | One kind improves geographical raster data document handling method in distributed file system |
| CN105303470A (en)* | 2015-11-26 | 2016-02-03 | 国网辽宁省电力有限公司大连供电公司 | Electric power project planning and construction method based on big data |
| CN107203775A (en)* | 2016-03-18 | 2017-09-26 | 阿里巴巴集团控股有限公司 | A kind of method of image classification, device and equipment |
| CN107203775B (en)* | 2016-03-18 | 2021-07-27 | 斑马智行网络(香港)有限公司 | A method, apparatus and device for image classification |
| CN111259273A (en)* | 2018-11-30 | 2020-06-09 | 顺丰科技有限公司 | Webpage classification model construction method, classification method and device |
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2015-08-19; termination date: 2018-10-09 |