CN102930290A

Info

Publication number: CN102930290A
Application number: CN2012103796409A
Authority: CN
Inventors: 张淑清; 潘欣; 张策; 姜春雷
Original assignee: Northeast Institute of Geography and Agroecology of CAS
Current assignee: Northeast Institute of Geography and Agroecology of CAS
Priority date: 2012-10-09
Filing date: 2012-10-09
Publication date: 2013-02-13
Anticipated expiration: 2032-10-09
Also published as: CN102930290B

Abstract

The invention relates to an ensemble classifier and a classification method for the classifier. It addresses problems in the supervised classification of spatial raster data: slow speed, low accuracy, biased attribute subsets, and the fact that selecting an attribute subset is a non-deterministic polynomial (NP) problem. The invention partitions the data by attribute, combines the resulting training-data subsets with parallel computing, and can be applied to high-dimensional raster data. It uses fuzzy rough set theory as the criterion for partitioning high-dimensional attributes in parallel, so that each subset has its own independent characteristics while preserving decision completeness, and it accommodates heterogeneous data of both discrete and continuous types. The invention applies to the fields of remote sensing and geographic information systems.

Description

Translated from Chinese

Ensemble classifier and classification method of the classifier

Technical field

The invention relates to the fields of remote sensing and geographic information systems.

Background

In the supervised classification of spatial raster data, the main techniques in use include neural networks, support vector machines (SVM), decision trees, Bayesian classifiers, and k-nearest neighbors (KNN). These algorithms learn a "classification model" from input training data, and the model is then used to predict the class information of location data. For high-dimensional data, an "attribute selection" algorithm is usually applied to reduce the dimensionality and improve speed.

Another important technique in current use is the "ensemble classifier", which combines multiple heterogeneous classifiers that vote, with the aim of achieving higher classification accuracy than a single classifier.

Processing spatial raster data often involves massive, ultra-high-dimensional data. For example, some spatial datasets contain more than 2,000 spatial attributes and amount to several terabytes or more. Processing such data quickly and effectively faces several difficulties:

(1) Speed: When the data volume is very large, and especially as the dimensionality grows, the cost of training a classification model grows as well. Popular C++-based SVM programs (such as LIBSVM) may fail to produce a training result after several hours, or may exhaust memory before the analysis results can be stored.

(2) Attribute subsets: To improve speed, many algorithms use "attribute selection". Choosing a suitable attribute subset from a very large attribute set is a non-deterministic polynomial (NP) problem, and the combinations are too numerous to enumerate exhaustively. Moreover, a near-optimal attribute subset is usually "biased", so prediction accuracy for some classes suffers.

(3) Accuracy: To improve accuracy, many algorithms use the "ensemble classifier" technique: the training data are divided into multiple subsets, a classifier is trained on each, and the classifiers vote. For high-dimensional raster data, on the one hand, the large data volume makes it hard to guarantee diversity among the sub-classifiers, and sub-classifiers that are too similar defeat the purpose of "ensembling" and "voting". On the other hand, pairing a large number of attributes with only part of the training data leads to "overfitting". Both problems reduce classification accuracy.

In summary, the supervised classification of spatial raster data currently suffers from slow speed, low accuracy, biased attribute subsets, and the NP nature of attribute subset selection.

Summary of the invention

To solve the problems of slow speed, low accuracy, biased attribute subsets, and the NP nature of attribute subset selection in the supervised classification of spatial raster data, the invention provides an ensemble classifier and a classification method for the classifier.

The classification method of the ensemble classifier comprises the following steps:

Step 1. Read the raster data to be processed using a combination of multiple processes and multiple threads. The specific procedure is as follows:

A. Input the number n of sub-classifiers of the ensemble classifier.

Here n is the number of sub-classifiers and n ≥ 2. An expectation algorithm divides all spatial attributes of the raster data into n parts according to decision-making capability, and each classifier retains the full classification capability of the complete attribute set.

B. Start n+1 processes.

The n+1 processes are Rank 0, Rank 1, …, Rank n. Rank 0 is the management process and Rank 1, …, Rank n are computing processes, which correspond one-to-one to the n sub-classifiers.

C. When the current process is the management process Rank 0, construct an empty rough relation table and divide the raster data to be processed evenly among the computing processes; start n threads, each corresponding to exactly one computing process.

The threads are thread 1, thread 2, …, thread n.

D. When the current process is a computing process, all computing processes read the raster data to be processed simultaneously.

Step 2. The management process Rank 0 maintains an attribute discretization interval table and divides it evenly among multiple threads, which simultaneously begin discretizing the raster data of their assigned spatially continuous attributes.

Step 3. The management process Rank 0 distributes the spatial attributes evenly among the n computing processes, collects their results, and builds the complete rough relation table. It sends this table to every computing process, and each computing process builds an attribute subset from the table.

Step 4. Under the management process Rank 0, each computing process trains its sub-classifier in parallel on its attribute subset to produce a model; the sub-classifiers correspond one-to-one to the computing processes. Each sub-classifier makes a prediction from its own attribute subset, the predictions of all sub-classifiers are tallied, and the prediction with the most votes is selected by voting.
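The division of labor between the management process and the computing processes in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the source does not name a parallel library, so Python threads stand in for the computing processes Rank 1 … Rank n, and the names `split_evenly`, `run_workers`, and `work` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks


def run_workers(attributes, n, work):
    """Manager role: give one chunk of attributes to each of n workers, gather results in order.

    Assumption: threads stand in for the computing processes described in the text.
    """
    chunks = split_evenly(attributes, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(work, chunks))
```

For example, `run_workers(list(range(6)), 3, sum)` hands the chunks `[0, 1]`, `[2, 3]`, and `[4, 5]` to three workers and collects one result per worker.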

The ensemble classifier comprises the following means:

A means for reading the raster data to be processed using a combination of multiple processes and multiple threads, comprising the following modules:

A module for inputting the number n of sub-classifiers of the ensemble classifier;

Here n is the number of sub-classifiers and n ≥ 2. An expectation algorithm divides all spatial attributes of the raster data into n parts according to decision-making capability, and each classifier retains the full classification capability of the complete attribute set.

A module for starting n+1 processes;

A module that, when the current process is the management process Rank 0, constructs an empty rough relation table, divides the raster data to be processed evenly among the computing processes, and starts n threads, each corresponding to exactly one computing process;

A module by which, when the current process is a computing process, all computing processes read the raster data to be processed simultaneously;

A means by which the management process Rank 0 maintains an attribute discretization interval table and divides it evenly among multiple threads, which simultaneously begin discretizing the raster data of their assigned spatially continuous attributes;

A means by which the management process Rank 0 distributes the spatial attributes evenly among the n computing processes, collects their results, builds the complete rough relation table, and sends it to every computing process, each of which builds an attribute subset from the table;

A means by which, under the management process Rank 0, each computing process trains its sub-classifier in parallel on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one to the computing processes; each sub-classifier makes a prediction from its own attribute subset, the predictions of all sub-classifiers are tallied, and the prediction with the most votes is selected by voting.

The invention has the following advantages:

(1) Training-data subsets are constructed by partitioning attributes rather than by partitioning samples.

(2) Training-data subsets are combined with parallel computing and applied to high-dimensional raster data.

(3) Fuzzy rough set theory serves as the criterion for partitioning high-dimensional attributes in parallel, so that each subset has its own independent characteristics while preserving decision completeness.

(4) The method accommodates heterogeneous data of both discrete and continuous types.

Description of the drawings

Fig. 1 is a flowchart of the classification method of the ensemble classifier;

Fig. 2 is a flowchart of the specific steps for reading the raster data to be processed using a combination of multiple processes and multiple threads;

Fig. 3 is a flowchart of the specific steps by which each thread discretizes the raster data of its spatially continuous attribute;

Fig. 4 shows the relationship among the threads during discretization, where 2 ≤ l ≤ n;

Fig. 5 shows the relationship between the construction of the rough relation table and the attribute usage table;

Fig. 6 is a flowchart of the model training phase.

Detailed description

Embodiment 1. This embodiment is described with reference to Fig. 1 and Fig. 2. The classification method of the ensemble classifier in this embodiment comprises the following steps:

Step 1. Read the raster data to be processed using a combination of multiple processes and multiple threads. The specific procedure is as follows:

B. Start n+1 processes.

The n+1 processes are Rank 0, Rank 1, …, Rank n. Rank 0 is the management process and Rank 1, …, Rank n are computing processes, which correspond one-to-one to the n sub-classifiers.

After step 3 of this embodiment, every process has obtained an "attribute subset", and each process trains a specified classifier in parallel on its subset (for example ID3, SVM, or a neural network; these models are conventional algorithms). Because each subset is relatively small (typically 10 to 20 attributes, compared with hundreds of dimensions in the full data, so the data volume shrinks by a factor of tens), the models can be trained quickly. In the decision process these models can be combined purely by voting, as shown in Fig. 6, which effectively prevents overfitting and improves classification accuracy.

The voting scheme of this embodiment is as follows. Suppose there are n classifiers and an object x to be predicted. Each of the n classifiers makes its own prediction: m1 classifiers decide that x is of type "A" and m2 classifiers decide that it is of type "B". The vote follows the majority rule: the decision endorsed by the larger number of classifiers becomes the decision of the ensemble as a whole. This is the voting process.
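The majority rule described above can be illustrated with a few lines of Python. This is a sketch only; the function name `majority_vote` is hypothetical, and the tie-breaking behavior (the label seen first among tied labels wins) is an assumption, since the source does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of sub-classifiers.

    Assumption: on a tie, the label that appears first in `predictions` wins.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    label, _count = Counter(predictions).most_common(1)[0]
    return label
```

With three sub-classifiers voting `["A", "B", "A"]`, two votes go to "A" and one to "B", so the ensemble decision is "A".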

Embodiment 2. This embodiment differs from the classification method of the ensemble classifier of Embodiment 1 in that the raster data in step A are high-dimensional raster data.

For massive high-dimensional raster data, conventional algorithms are slow and inaccurate, whereas the invention processes the raster data quickly to obtain classification models, and its heterogeneous decision mechanism also yields higher classification accuracy.

Embodiment 3. This embodiment is described with reference to Fig. 3. It differs from the classification method of the ensemble classifier of Embodiment 1 or 2 in that, in step 2, each thread discretizes the raster data of its spatially continuous attribute as follows:

Step 2.1. Set the number of clusters to ceil.

Step 2.2. Compute uniformly distributed initial cluster centers between the minimum and maximum values of the spatially continuous attribute handled by the thread.

Step 2.3. Starting from the uniformly distributed initial centers, run the K-Means algorithm to form ceil clusters.

Step 2.4. For each cluster, output its minimum and maximum values, giving ceil value intervals.

Step 2.5. Assemble the ceil value intervals into an interval list.

Discretization yields a set of intervals through which the original continuous data can be mapped to a finite set of integers such as 1, 2, 3, 4, which clarifies the relationships and speeds up comparison and analysis. For the multi-process case, the processing flow for all data is shown in Fig. 4.
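Steps 2.1 to 2.5 can be sketched for a single continuous attribute as follows. This is a minimal pure-Python illustration, not the patent's code: the names `discretize_attribute` and `to_code` are hypothetical, the parameter `k` plays the role of the cluster count "ceil", the initial centers are assumed to include the minimum and maximum themselves, and empty clusters are assumed to be dropped (so fewer than `k` intervals may be returned).

```python
def discretize_attribute(values, k, max_iter=100):
    """Cluster 1-D values with K-Means and return one (min, max) interval per cluster."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = min(values), max(values)
    # Step 2.2: uniformly spaced initial centers between the minimum and the maximum.
    centers = [(lo + hi) / 2] if k == 1 else [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Step 2.3: assign each value to its nearest center, then recompute the centers.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # Steps 2.4 and 2.5: the value range of every non-empty cluster forms the interval list.
    return [(min(c), max(c)) for c in clusters if c]


def to_code(value, intervals):
    """Map a continuous value to a 1-based interval number."""
    for number, (_lower, upper) in enumerate(intervals, start=1):
        if value <= upper:
            return number
    return len(intervals)  # values above every interval fall into the last one
```

Once the interval list exists, `to_code` replaces each continuous value with a small integer, which is the mapping to "1, 2, 3, 4" that the text describes.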

Embodiment 4. This embodiment differs from the classification method of the ensemble classifier of Embodiment 1 or 2 in that the rough relation table in step 3 is a two-dimensional table that expresses the degree of overlap between two attributes: a rough relation of 1 indicates that the attributes are most strongly correlated, and a rough relation of 0 indicates that they are least correlated. The rough relation table is as follows:

Embodiment 5. This embodiment differs from the classification method of the ensemble classifier of Embodiment 4 in that, in step 3, each computing process builds an attribute subset from the rough relation table as follows:

Step 3.1. In the rough relation table of the computing process, randomly select a pair of attributes whose rough relation indicates that they are uncorrelated and whose state is "unused". Add them to the attribute subset of the computing process (the subsets correspond one-to-one to the computing processes) and mark them as "used".

The state of an attribute is either "used" or "unused".

Step 3.2. In the computing process, use formula (8) to compute the relation between each pair of "unused" attributes and the attribute subset of the computing process.

The rough relation between an attribute and an attribute subset is:

$RTD = \sum_{1}^{n} RT_{(b,\,a_n)} \qquad (8)$

where b denotes the attribute subset of the computing process, a_n denotes any pair of "unused" attributes, and RT_(b, a_n) denotes the rough relation between the attribute subset of the computing process and any pair of "unused" attributes.

Step 3.3. Select the attribute with the smallest computed result, add it to the attribute subset of the computing process, and mark it as "used".

Step 3.4. Use formula (6) to compute the relation between the attribute subset of the computing process and the full dimension set D:

$\gamma_{D}(w) = \frac{Card\left(\bigcup_{X \Subset IND(w)} POS_{D}(X)\right)}{Card(U)} \qquad (6)$

where w denotes the attribute subset of the computing process; IND(w) is the indiscernibility relation corresponding to the subset w, meaning that the elements it relates are regarded as indistinguishable and incomparable; Card(U) is the cardinality of the set U (Card abbreviates "cardinal"); and POS_D(X) is the positive region of X with respect to D, or, put more loosely, the set X is completely contained in the set D.

Step 3.5. When γ_D(w) = 1, output the attribute subset of the computing process.

Step 3.6. When γ_D(w) = 0, the computing process again uses formula (8) to compute the relation between each pair of "unused" attributes and its attribute subset, as in step 3.2.
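The loop formed by steps 3.1 to 3.6 can be sketched as a greedy selection. This is one possible reading under stated assumptions, not the patent's implementation: `build_subset`, `rt`, and `gamma` are hypothetical names; `rt` is assumed to be a nested mapping holding the rough relation table; `gamma` is assumed to be a caller-supplied function that evaluates formula (6); the "uncorrelated pair" of step 3.1 is read as a pair with the lowest table entry; candidates in step 3.2 are scored as single attributes although the source speaks of "pairs"; and the loop continues while γ is below 1 rather than only when it equals 0.

```python
import random
from itertools import combinations


def build_subset(attrs, rt, gamma, used=None, rng=None):
    """Greedily grow an attribute subset until gamma(subset) reaches 1."""
    rng = rng or random.Random()
    used = set() if used is None else used
    free = [a for a in attrs if a not in used]
    if len(free) < 2:
        raise ValueError("at least two unused attributes are required")
    # Step 3.1: pick at random among the unused pairs with the lowest rough relation.
    pairs = list(combinations(free, 2))
    lowest = min(rt[p][q] for p, q in pairs)
    first = rng.choice([pq for pq in pairs if rt[pq[0]][pq[1]] == lowest])
    subset = list(first)
    used.update(first)
    # Steps 3.2 to 3.6: add the least related unused attribute until gamma reaches 1.
    while gamma(subset) < 1.0:
        free = [a for a in attrs if a not in used]
        if not free:
            break  # nothing left to add
        # Formula (8): sum of the table entries between the subset members and the candidate.
        best = min(free, key=lambda a: sum(rt[b][a] for b in subset))
        subset.append(best)
        used.add(best)
    return subset
```

Passing a shared `used` set lets several calls avoid reusing the same attributes, which is one way to read the "used"/"unused" states in the text.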

According to Pawlak's rough set theory, an information system S can be regarded as a data table. It can be represented by the pair S = (U, A), where the universe U is a non-empty finite set and A is a non-empty finite set of attributes. For every element a ∈ A there is a mapping a: U → V_a, where V_a is the set of values that a can take. A decision table is an information system of the form S = (U, A ∪ {d}), where d is the decision attribute. For any attribute set P there is an indiscernibility relation IND(P):

$IND(P) = \{(x, y) \in U^{2} \mid \forall a \in P,\ a(x) = a(y)\} \qquad (1)$

where x and y are both multi-dimensional vectors in a multi-dimensional space.

An equivalence class based on the indiscernibility relation of P is defined as:

$[x]_{P} = \{y \in U \mid (x, y) \in IND(P)\} \qquad (2)$
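Formulas (1) and (2) amount to grouping objects that agree on every attribute in P. A minimal sketch, assuming the decision table is held as a list of dictionaries (a representation chosen here for illustration; `equivalence_classes` is a hypothetical name):

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Group row indices whose values coincide on every attribute in attrs."""
    groups = defaultdict(list)
    for index, row in enumerate(table):
        key = tuple(row[a] for a in attrs)  # rows with equal keys are indiscernible
        groups[key].append(index)
    return list(groups.values())
```

Each returned list is one equivalence class [x]_P, and together the lists partition the universe U.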

From the indiscernibility relation, lower and upper approximation sets can be defined. Let X be a subset of U. X can be approximated by two sets, the lower approximation and the upper approximation [formulas missing in source]. If [condition missing in source], the pair [formula missing in source] is called a rough set. The positive, negative, and boundary regions are defined as follows, where X is a set, $\underline{D}X$ denotes the lower approximation of X, and [x]_D ∩ X denotes the upper approximation of X:

$POS_{D}(X) = \underline{D}X \qquad (3)$

$NEG_{D}(X) = 1 - \overline{D}X \qquad (4)$

$BND_{D}(X) = NEG_{D}(X) - POS_{D}(X) \qquad (5)$
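For orientation, the standard textbook definitions of the two approximations in Pawlak's theory are given below. They are supplied here as an assumption about what the passage intends, since the corresponding formulas did not survive in the source text; they use the equivalence classes [x]_P of formula (2).

```latex
\underline{P}X = \{\, x \in U \mid [x]_{P} \subseteq X \,\}, \qquad
\overline{P}X = \{\, x \in U \mid [x]_{P} \cap X \neq \varnothing \,\}
```

In the standard theory, X is called rough with respect to P when these two sets differ.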

An important concept in rough set theory is the dependency between attributes. The degree to which attribute Q depends on attribute D (the attribute dependency, denoted γ) is defined by formula (6).

According to formula (6), when the subset R is the union of all dimensions (attributes), formula (6) evaluates to 1. With D₁ = {a₁} and D₂ = {a₁, a₂}, compute Diff(D₂, D₁) = γ_{D₂}(Q) − γ_{D₁}(Q). If Diff is large, the dimensions a₁ and a₂ cover different and wide decision regions and are well suited to being combined. If Diff is small, a₁ and a₂ cover similar decision regions (in the extreme case a₂ = a₁, Diff = 0 and the two attributes are indistinguishable), and they are not suited to being combined. Each entry of the rough relation table is therefore computed as:

$RT_{(a_1,\,a_2)} = 1 - \left(\gamma_{\{a_1,\,a_2\}}(Q) - \gamma_{a_1}(Q)\right) \qquad (7)$

For an attribute b and an attribute set D = {a₁, a₂, …, a_n}, the rough relation is:

$RTD = \sum_{1}^{n} RT_{(b,\,a_n)} \qquad (8)$

Computing the rough relations is expensive and needs to be parallelized as shown in Fig. 5. Once they are computed, the "attribute-to-set relation" can be obtained by traversing the rough relation table; this is a summation and is cheap. After the rough relation table has been built, the attribute subsets are obtained.
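Formulas (7) and (8) can be made concrete with a small worked sketch. It is an illustration under stated assumptions rather than the patent's code: the dependency γ is computed with the standard crisp rough set definition (the share of objects whose equivalence class carries a single decision value), the decision table is a list of dictionaries, and the names `dependency`, `rough_relation`, and `relation_to_set` are hypothetical.

```python
from collections import defaultdict


def dependency(rows, attrs, decision):
    """Share of rows whose attrs-equivalence class has exactly one decision value."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row[decision])
    consistent = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return consistent / len(rows)


def rough_relation(rows, a1, a2, decision):
    """Formula (7): RT = 1 - (gamma of {a1, a2} minus gamma of {a1})."""
    gain = dependency(rows, [a1, a2], decision) - dependency(rows, [a1], decision)
    return 1 - gain


def relation_to_set(rows, b, attr_set, decision):
    """Formula (8): sum of the rough relations between b and each attribute in attr_set."""
    return sum(rough_relation(rows, b, a, decision) for a in attr_set)
```

On an XOR-style table, where neither attribute alone determines the decision but both together do, the entry for (a1, a2) is 0 (least related, a good pair to combine), while pairing a1 with itself gives 1, matching the extreme case a₂ = a₁ described in the text. Note that formula (7) as written need not be symmetric in its two attributes.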

Embodiment 6. The classification method of the ensemble classifier of this embodiment comprises the following:

A module for starting n+1 processes;

A means by which, under the management process Rank 0, each computing process trains its sub-classifier in parallel on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one to the computing processes;

A means by which, under the management process Rank 0, each computing process trains its sub-classifier in parallel on its attribute subset to produce a model, the sub-classifiers corresponding one-to-one to the computing processes; each sub-classifier makes a prediction from its own attribute subset according to fuzzy rough set theory, the predictions of all sub-classifiers are tallied, and the prediction with the most votes is selected by voting.

Embodiment 7. This embodiment differs from the classification method of the ensemble classifier of Embodiment 6 in that the raster data are high-dimensional raster data.

Embodiment 8. This embodiment differs from the classification method of the ensemble classifier of Embodiment 6 or 7 in that the means by which the management process Rank 0 maintains the attribute discretization interval table and divides it evenly among multiple threads, which simultaneously begin discretizing the raster data of their assigned spatially continuous attributes, comprises the following modules:

A module for setting the number of clusters to ceil;

A module for computing uniformly distributed initial cluster centers between the minimum and maximum values of the spatially continuous attribute handled by the thread;

A module for running the K-Means algorithm from the uniformly distributed initial centers to form ceil clusters;

A module for outputting the minimum and maximum values of each cluster, giving ceil value intervals;

A module for assembling the ceil value intervals into an interval list.

Embodiment 9. This embodiment differs from the classification method of the ensemble classifier of Embodiment 6 in that the rough relation table is a two-dimensional table that expresses the degree of overlap between two attributes: a rough relation of 1 indicates that the attributes are most strongly correlated, and a rough relation of 0 indicates that they are least correlated. The rough relation table is as follows:

Embodiment 10. This embodiment differs from the classification method of the ensemble classifier of Embodiment 6 in that the means by which the management process Rank 0 distributes the spatial attributes evenly among the n computing processes, collects their results, builds the complete rough relation table, and sends it to every computing process, each of which builds an attribute subset from the table, comprises the following modules:

A module for randomly selecting, in the rough relation table of the computing process, a pair of attributes whose rough relation indicates that they are uncorrelated and whose state is "unused", adding them to the attribute subset of the computing process (the subsets correspond one-to-one to the computing processes), and marking them as "used";

A module for computing, in the computing process and according to formula (8), the relation between each pair of "unused" attributes and the attribute subset of the computing process:

$RTD = \sum_{1}^{n} RT_{(b,\,a_n)} \qquad (8)$

A module for selecting the attribute with the smallest computed result, adding it to the attribute subset of the computing process, and marking it as "used";

A module for computing, according to formula (6), the relation between the attribute subset of the computing process and the full dimension set D,

where w denotes the attribute subset of the computing process, IND(w) is the indiscernibility relation corresponding to the subset w, Card(U) is the cardinality of the set U, and POS_D(X) is the positive region of X with respect to D;

A module for outputting the attribute subset of the computing process when γ_D(w) = 1;

A module for computing, in the computing process and according to formula (8), the relation between each pair of "unused" attributes and the attribute subset of the computing process when γ_D(w) = 0.

Claims

1. A classification method of an integrated classifier, characterized in that it comprises the following steps:

Step 1: read the raster data to be processed using a combination of multiple processes and multiple threads, the detailed procedure comprising the following steps:

A. Input the number n of sub-classifiers of the integrated classifier;

where n is the number of sub-classifiers and n is greater than or equal to 2; all spatial attributes of the raster data are divided into n parts according to decision-making capability by the expectation algorithm, and each classifier possesses the complete classification capability of the full set,

B. Start n+1 processes;

where the n+1 processes are Rank 0, Rank 1, ..., Rank n; Rank 0 is the managing process, Rank 1, ..., Rank n are the computing processes, and the computing processes Rank 1, ..., Rank n correspond one-to-one with the n sub-classifiers,

C. When the current process is the managing process Rank 0, construct an empty rough relation table and evenly distribute the raster data to be processed among the computing processes; start n threads, each thread corresponding to one computing process;

where the threads comprise the 1st thread, the 2nd thread, ..., the n-th thread,

D. When the current process is a computing process, all computing processes read the raster data to be processed simultaneously;

Step 2: the managing process Rank 0 maintains an attribute discretization interval table and evenly distributes this table among a plurality of threads; the threads simultaneously begin discretizing the raster data of their corresponding continuous spatial attributes;

Step 3: the managing process Rank 0 evenly assigns the spatial attributes to the n computing processes for processing, collects the results of the n computing processes, constructs the complete rough relation table, and sends this rough relation table to each computing process; each computing process builds an attribute subset according to the rough relation table;

Step 4: the managing process Rank 0 has each computing process train its sub-classifier in parallel on the corresponding attribute subset to produce a model, the sub-classifier being the one in one-to-one correspondence with that process; each sub-classifier predicts the class from the attribute subset corresponding to that sub-classifier; the predictions of all sub-classifiers are tallied, and the prediction receiving the most votes is selected by voting.
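The voting in step 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `majority_vote` and `vote_per_cell` are hypothetical, each sub-classifier is assumed to return one label per raster cell in the same cell order, and ties are broken in favour of the label that appears first, a policy the claim does not specify.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the label predicted by the most sub-classifiers.

    Ties go to the label that appears first (an assumed policy)."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: some label always has the top count")


def vote_per_cell(per_classifier: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Combine one label list per sub-classifier into one label per cell."""
    return [majority_vote(cell) for cell in zip(*per_classifier)]
```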

2. The classification method of an integrated classifier according to claim 1, characterized in that the raster data in step A is high-dimensional raster data.

3. The classification method of an integrated classifier according to claim 1 or 2, characterized in that the specific steps by which each thread in step 2 discretizes the raster data of its corresponding continuous spatial attribute are:

Step 2.1: set the number of clusters to ceil;

Step 2.2: compute uniformly distributed initial cluster centers between the maximum and minimum values of the continuous spatial attribute handled by this thread;

Step 2.3: cluster from the uniformly distributed initial cluster centers according to the K-Means algorithm, forming ceil clusters;

Step 2.4: for each cluster, output its minimum and maximum values, forming ceil value-range intervals;

Step 2.5: assemble the ceil value-range intervals into an interval list.
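The discretization steps of this claim can be sketched for a single continuous attribute as follows. This is a hedged illustration in plain Python rather than the patented implementation: the names `uniform_centers` and `discretize` are hypothetical, `k` stands for the cluster number ceil, and placing the initial centers at the midpoints of k equal-width bins is only one way to obtain "uniformly distributed" centers.

```python
from typing import List, Sequence, Tuple


def uniform_centers(lo: float, hi: float, k: int) -> List[float]:
    """k initial centers evenly spaced between lo and hi (bin midpoints)."""
    step = (hi - lo) / k
    return [lo + step * (i + 0.5) for i in range(k)]


def discretize(values: Sequence[float], k: int,
               max_iter: int = 100) -> List[Tuple[float, float]]:
    """1-D K-Means from uniform initial centers; returns the (min, max)
    interval of every non-empty cluster, in center order."""
    centers = uniform_centers(min(values), max(values), k)
    clusters: List[List[float]] = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return [(min(c), max(c)) for c in clusters if c]
```

A cluster that ends up empty is dropped here, so fewer than k intervals may be returned; how the patent handles that case is not stated.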

4. The classification method of an integrated classifier according to claim 1 or 2, characterized in that the rough relation table in step 3 is a two-dimensional table representing the degree of direct overlap between two attributes; a rough relation of 1 indicates that the attributes are most strongly directly correlated, a rough relation of 0 indicates that they are least correlated, and the rough relation table is as follows:

5. The classification method of an integrated classifier according to claim 4, characterized in that the specific steps by which each computing process builds an attribute subset according to the rough relation table in step 3 are:

Step 3.1: in the rough relation table of the computing process, randomly select a pair of attributes whose rough relation is uncorrelated and whose state is "unused", add the attribute to the attribute subset of the computing process, this subset being the one in one-to-one correspondence with the computing process, and mark it as "used",

where the state of an attribute is either "used" or "unused";

Step 3.2: in the computing process, calculate the relationship between each pair of "unused" attributes and the attribute subset of the computing process according to formula (8),

The rough relation between an attribute and the attribute subset is:

$RTD = \sum_{1}^{n} RT_{(b, an)} \quad (8)$

where b denotes the attribute subset of the computing process, an denotes any pair of "unused" attributes, and RT_{(b, an)} denotes the rough relation between the attribute subset of the computing process and any pair of "unused" attributes;

Step 3.3: select the attribute with the smallest calculation result, add this attribute to the attribute subset of the computing process, and mark the attribute subset of the computing process as "used";

Step 3.4: calculate the relationship between the attribute subset of the computing process and the full set of dimensions D according to formula (6);

$\gamma_D(w) = \frac{Card\left(\bigcup_{X \Subset IND(w)} POS_D(X)\right)}{Card(U)} \quad (6)$

where w denotes the attribute subset of the computing process, IND(w) is the indiscernibility relation corresponding to the subset w, Card(U) denotes the cardinality of the set U, and POS_D(X) is the positive region of X with respect to D;

Step 3.5: when γ_D(w) = 1, output the attribute subset of the computing process;

Step 3.6: when γ_D(w) = 0, in the computing process, calculate the relationship between each pair of "unused" attributes and the attribute subset of the computing process according to formula (8).
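To make formula (6) concrete, the sketch below computes a dependency degree in the standard crisp rough-set sense: objects are grouped into indiscernibility classes by their values on the subset w, and the positive region collects the classes whose members all share one decision value. This is an assumption about how the formula is meant to be read, it ignores the fuzzy extension mentioned in the abstract, and the function name `dependency_degree` and the row layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence

Row = Dict[str, Hashable]  # assumed layout: attribute name -> discretized value


def dependency_degree(rows: Sequence[Row], subset: Sequence[str],
                      decision: str) -> float:
    """gamma_D(w): fraction of objects whose indiscernibility class under
    the attributes in `subset` maps to exactly one decision value."""
    if not rows:
        raise ValueError("the universe U must not be empty")
    decisions = defaultdict(set)  # class key -> decision values seen
    sizes = defaultdict(int)      # class key -> number of objects
    for row in rows:
        key = tuple(row[a] for a in subset)
        decisions[key].add(row[decision])
        sizes[key] += 1
    # Positive region: union of the consistent classes.
    positive = sum(sizes[k] for k, d in decisions.items() if len(d) == 1)
    return positive / len(rows)
```

A value of 1 means the subset alone separates every decision class, which corresponds to the condition under which step 3.5 outputs the subset.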

6. An integrated classifier, characterized in that it comprises the following devices:

A device for reading the raster data to be processed using a combination of multiple processes and multiple threads, this device comprising the following modules:

A module for inputting the number n of sub-classifiers of the integrated classifier;

where n is the number of sub-classifiers and n is greater than or equal to 2; all spatial attributes of the raster data are divided into n parts according to decision-making capability by the expectation algorithm, and each classifier possesses the complete classification capability of the full set,

A module for starting n+1 processes;

A module for, when the current process is the managing process Rank 0, constructing an empty rough relation table, evenly distributing the raster data to be processed among the computing processes, and starting n threads, each thread corresponding to one computing process;

where the threads comprise the 1st thread, the 2nd thread, ..., the n-th thread,

A module for, when the current process is a computing process, having all computing processes read the raster data to be processed simultaneously;

A device by which the managing process Rank 0 maintains an attribute discretization interval table and evenly distributes this table among a plurality of threads, the threads simultaneously beginning to discretize the raster data of their corresponding continuous spatial attributes;

A device by which the managing process Rank 0 evenly assigns the spatial attributes to the n computing processes for processing, collects the results of the n computing processes, constructs the complete rough relation table, and sends this rough relation table to each computing process, each computing process building an attribute subset according to the rough relation table;

A device by which the managing process Rank 0 has each computing process train its sub-classifier in parallel on the corresponding attribute subset to produce a model, the sub-classifier being the one in one-to-one correspondence with that process; each sub-classifier predicts the class from the attribute subset corresponding to that sub-classifier, the predictions of all sub-classifiers are tallied, and the prediction receiving the most votes is selected by voting.

7. The integrated classifier according to claim 6, characterized in that the raster data is high-dimensional raster data.

8. The integrated classifier according to claim 6 or 7, characterized in that the device by which the managing process Rank 0 maintains an attribute discretization interval table and evenly distributes this table among a plurality of threads, the threads simultaneously beginning to discretize the raster data of their corresponding continuous spatial attributes, comprises the following modules:

A module for setting the number of clusters to ceil;

A module for computing uniformly distributed initial cluster centers between the maximum and minimum values of the continuous spatial attribute handled by this thread;

A module for clustering from the uniformly distributed initial cluster centers according to the K-Means algorithm, forming ceil clusters;

A module for outputting the minimum and maximum values of each cluster, forming ceil value-range intervals;

A module for assembling the ceil value-range intervals into an interval list.

9. The integrated classifier according to claim 6, characterized in that the rough relation table is a two-dimensional table representing the degree of direct overlap between two attributes; a rough relation of 1 indicates that the attributes are most strongly directly correlated, a rough relation of 0 indicates that they are least correlated, and the rough relation table is as follows:

10. The classification method of the integrated classifier for raster data classification according to claim 9, characterized in that the device by which the managing process Rank 0 evenly assigns the spatial attributes to the n computing processes for processing, collects the results of the n computing processes, constructs the complete rough relation table, and sends this rough relation table to each computing process, each computing process building an attribute subset according to the rough relation table, comprises the following modules:

A module for randomly selecting, in the rough relation table of the computing process, a pair of attributes whose rough relation is uncorrelated and whose state is "unused", adding the attribute to the attribute subset of the computing process, this subset being the one in one-to-one correspondence with the computing process, and marking it as "used",

where the state of an attribute is either "used" or "unused";

A module for calculating, in the computing process, the relationship between each pair of "unused" attributes and the attribute subset of the computing process according to formula (8),

The rough relation between an attribute and the attribute subset is:

$RTD = \sum_{1}^{n} RT_{(b, an)} \quad (8)$

A module for selecting the attribute with the smallest calculation result, adding this attribute to the attribute subset of the computing process, and marking the attribute subset of the computing process as "used";

A module for calculating the relationship between the attribute subset of the computing process and the full set of dimensions D according to formula (6);

$\gamma_D(w) = \frac{Card\left(\bigcup_{X \Subset IND(w)} POS_D(X)\right)}{Card(U)} \quad (6)$

where w denotes the attribute subset of the computing process, IND(w) is the indiscernibility relation corresponding to the subset w, Card(U) denotes the cardinality of the set U, and POS_D(X) is the positive region of X with respect to D;

A module for outputting the attribute subset of the computing process when γ_D(w) = 1;

A module for calculating, in the computing process when γ_D(w) = 0, the relationship between each pair of "unused" attributes and the attribute subset of the computing process according to formula (8).