CN102567508A


Info

Publication number: CN102567508A
Application number: CN2011104417369A
Authority: CN
Inventors: 李飞雪; 张帅; 李满春; 陈振杰; 蒲英霞; 魏金标; 陈冲; 王亚飞; 陈东
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2011-12-20
Filing date: 2011-12-27
Publication date: 2012-07-11
Anticipated expiration: 2031-12-27
Also published as: CN102567508B

Abstract

The invention discloses a parallel method, based on an abstract data model, for converting the format of massive raster data, and belongs to the field of raster data format conversion methods. The steps of the invention are: interpret the raster data file with GDAL library functions; partition the raster data by rows so that each block holds the total number of rows divided by the number of computing processes; build a stack of raster data blocks awaiting processing and a statistics table of block processing; and obtain a queue of idle computing processes. A pending data block is popped from the stack and paired with a computing process taken from the idle queue to form an operation instruction, which is sent to that computing process. The computing process receives the instruction and performs the format conversion of that data block. The block files in the target data format sent by the computing processes are written to the corresponding positions in a frame file until the whole frame file is filled. By combining raster data format conversion with parallel computing, the invention greatly improves conversion efficiency and shortens the time spent on format conversion.

Description

Translated from Chinese

Parallel method for format conversion of massive raster data based on an abstract data model

Technical field

The present invention relates to a raster data format conversion method, and more specifically to a parallel method for converting the format of massive raster data based on an abstract data model. It is particularly suited to format conversion of massive raster data.

Background

With the development of spatial information technology, the application of geographic information systems (GIS) has gradually shifted from academic research toward industrial and public use. Meanwhile, most GIS software vendors have introduced their own data formats, and different industries have chosen spatial data formats suited to their own data needs, so the spatial data formats on the market have become highly diverse. How to resolve the heterogeneity of spatial data, make access more transparent, and share geographic information effectively has long been a concern of researchers.

Spatial data format conversion has passed through three stages of development. In the first stage, before GIS technology was widely adopted, demand for sharing spatial information was weak, and conversion was performed directly between the data formats of different GIS systems. Direct conversion is usually efficient and loses little information, but it requires detailed knowledge of both formats, and its pairwise structure makes data sharing increasingly cumbersome as the number of formats grows. If a software vendor does not publish its data format, or changes it, direct conversion demands substantial labor and funding, becomes much harder, and cannot guarantee conversion quality.

The second stage uses a spatial data exchange format standard as an intermediary between different GIS. Because of the technical problems of direct conversion, many countries defined their own spatial data exchange standards, such as the Spatial Data Transfer Standard (SDTS) in the United States; in 1999 China promulgated its own national standard, the Geospatial Data Exchange Format. The Geography Markup Language (GML) format introduced by the International Organization for Standardization and the OGC (Open GIS Consortium) can also serve as an exchange standard. However, converting data from one software package to another then generally takes two conversions, from source data to standard data and from standard data to target data, which can generate a large amount of redundant data and increase disk load. Moreover, because the standards differ, interfaces and converters for the standard formats cannot be implemented in step with one another, and the proliferation of standards has undermined the original purpose of standardization: standards set by different countries and regions are commonly incompatible, and differences in geographic model and data structure persist between them.

Raster data format conversion based on an abstract data model uses a unified abstract data model as the conversion medium. Conversion takes place in memory, no intermediate format file is generated, support for new formats is easy to add, and any two supported formats can be converted into each other, which makes it a comparatively sound conversion technique. For example, FME Suite, a GIS data conversion platform developed by the North American company Safe, can convert among hundreds of data formats; it gives GIS features a uniform structure and offers users a component model for data processing, meeting the need to convert between different data formats.
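The scaling argument behind this comparison can be written as a short count. Assuming each conversion direction needs its own converter, and writing N for the number of formats (a symbol introduced here for illustration), direct pairwise conversion and conversion through an abstract data model require:

```latex
C_{\text{direct}}(N) = N(N-1), \qquad C_{\text{model}}(N) = 2N
```

Here the second count is one read driver and one write driver per format. For N = 10 formats this gives 90 direct converters against 20 drivers.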

At present, beyond the construction and expression of abstract data models, research on raster data format conversion increasingly builds on existing data model libraries to design and implement format conversions and format extensions that meet the needs of particular industries. In 1999, Chen Changsong published "GIS Semantic Data Model for Data Sharing Purpose" in the Journal of Image and Graphics, vol. 4, no. 1, proposing a GIS semantic data model intended to solve, at the technical level, the problem of data sharing between departments. In 2000, Song Guanfu et al. published "Research on Seamless Integration of Multi-source Spatial Data" in Progress in Geographical Sciences, vol. 19, no. 2, describing the architecture of seamless multi-source spatial data integration and its application in GIS software development. In 2011, Xu Yang et al. published "Design and Implementation of an Exchange Format Data Driver Based on OGR" in the Surveying and Mapping Bulletin, no. 6, proposing an exchange-format data driver built on the OGR library to convert freely between exchange-format data and commercial GIS formats. Also in 2011, Chen Zhen et al. published "Expansion and Application of Custom Spatial Data Format" in the Journal of Surveying and Mapping Science and Technology, vol. 28, no. 2, proposing the use of custom OGR data sources to let GIS software support custom spatial data formats. In many cases, however, as data volumes grow, ordinary serial conversion algorithms can no longer meet the demand for format conversion of massive data.

With advances in computer hardware and software, computation has become a way to carry out research and design in many fields of science and engineering, and a source of interdisciplinary innovation. In GIS, high-performance computing is now widely applied to raster image processing, especially large-scale remote sensing image processing, with good results. In 2003, Zhou Haifang, in the doctoral dissertation "Research and Application of Parallel Processing Algorithms for Remote Sensing Images", proposed parallel algorithms for geometric correction, automatic registration, and watershed segmentation of remote sensing images. In 2004, Yi Faling published "Scheduling Algorithms for Parallel Conversion of GIS Spatial Data Formats" in Computer Engineering, vol. 30, no. 23, analyzing the feasibility of converting two common GIS spatial data formats (GRID and TIN) in parallel and giving two task-scheduling algorithms for parallel conversion on a cluster architecture.

GDAL (Geospatial Data Abstraction Library) is an open-source library for spatial data conversion. It uses an abstract data model (the GDAL Data Model, which follows the raster data standard defined by OpenGIS) to parse the spatial data file formats it supports, and provides read and write interfaces for a range of raster formats, including ERDAS Imagine (*.img), GeoTIFF (*.tif), JPEG, and BMP.

Summary of the invention

Technical problem to be solved by the invention

The invention aims to overcome the shortcomings of the prior art. Faced with data volumes that are growing geometrically, it applies high-performance parallel computing to raster data format conversion and proposes a parallel method for converting the format of massive raster data based on an abstract data model. Building on existing serial conversion algorithms based on the abstract data model, it redesigns the algorithm for parallel hardware and software environments so that raster format conversion is processed in parallel, greatly improving conversion efficiency.

Technical solution

To achieve the above object, the invention provides the following technical solution.

Principle of the invention:

The parallel raster data format conversion method based on an abstract data model relies on GDAL's abstract data model and its spatial data read and write interfaces, and on the MPI (Message Passing Interface) library. It follows a master-slave parallel pattern with one management process, one data collection process, and several computing processes. The management process partitions the data, maintains the stack of pending raster blocks, the queue of idle computing processes, and the block-processing statistics table, and sends instructions. The computing processes perform the actual format conversion using the raster conversion technique based on the abstract data model. The collection process stitches and merges the conversion results. The invention first reads the source data through the driver for the source raster format, interprets the data elements with the GDAL abstract data model, establishes the conversion mapping between the source raster data and the target raster format, and partitions the data. It then distributes instructions to the computing processes, each of which performs its format conversion. Finally, the collection process uses the driver for the target format to write the converted raster data into the target-format file. Raster format conversion is thereby processed in parallel, which makes the method particularly suitable for massive raster data.

The steps of the parallel method of the invention for converting the format of massive raster data based on an abstract data model are:

Step 1: Interpret the raster data file with GDAL library functions and obtain the bands, row and column dimensions, spatial reference information, and pixel data type of the source raster. Partition the raster data by rows so that each block holds the total number of rows divided by the number of computing processes, subject to each block being no larger than 50 MB. Build the stack of pending raster data blocks and the block-processing statistics table, and obtain the queue of idle computing processes;

Step 2: Create the target-format raster frame file from the bands, row and column dimensions, spatial reference information, and pixel data type of the source raster, and send the frame file information to all processes. Then pop pending data blocks from the block stack until it is empty: each popped block is paired with a computing process taken from the idle queue to form an operation instruction, which is sent to that computing process, and the start timestamp of the process run is recorded in the block-processing statistics table;

Step 3: The computing process receives the instruction and performs the format conversion of the data block. After receiving the instruction from the management process, it first reads the corresponding block from the source data according to the block's start position and size, then converts it with the serial raster conversion algorithm based on the abstract data model. That algorithm first creates a virtual dataset under the abstract data model from the source raster's spatial reference information, band information, ground control points, metadata, and the user-specified clipping rectangle, and at the same time establishes the conversion mapping between the source raster data and the target raster format; it then writes the data to the target raster in the target format. The whole conversion is thus flexible and controllable, and conversion between raster formats preserves the full information content;

Step 4: When a computing process finishes, it sends its end timestamp to the management process and sends the information about the generated target-format block file to the collection process. On receiving the success feedback, the management process returns the process to the idle queue for reuse and records its end time. If the management process receives a failure message from a computing process, or receives no feedback within a set time, it forces that computing process to terminate, reclaims the computing resources, and assigns the data block to computing resources again;

Step 5: Using the target-format raster frame file information sent by the management process, the collection process writes each target-format block file sent by a computing process to its corresponding position in the frame file. When the whole frame file is filled and no computing process produces further block files, the raster data format conversion is complete.
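The row-wise partition of Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names RowBlock and partitionRows, and the idea of passing the size of one raster row in bytes, are assumptions made here. The block height starts at the total row count divided by the number of computing processes (rounded up) and is reduced when a block of that height would exceed the size cap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One row-wise block of the raster: rows [startRow, startRow + numRows).
struct RowBlock {
    int startRow;
    int numRows;
};

// Splits totalRows into row blocks. The target block height is
// ceil(totalRows / numWorkers), reduced if needed so that one block
// stays within maxBlockBytes (50 MB in Step 1).
std::vector<RowBlock> partitionRows(int totalRows, std::int64_t bytesPerRow,
                                    int numWorkers, std::int64_t maxBlockBytes) {
    assert(totalRows > 0 && bytesPerRow > 0 && numWorkers > 0);
    assert(maxBlockBytes >= bytesPerRow);  // at least one row must fit in a block

    int rowsPerBlock = (totalRows + numWorkers - 1) / numWorkers;  // ceiling division
    const std::int64_t maxRows = maxBlockBytes / bytesPerRow;      // rows allowed by the cap
    if (rowsPerBlock > maxRows) {
        rowsPerBlock = static_cast<int>(maxRows);
    }

    std::vector<RowBlock> blocks;
    for (int start = 0; start < totalRows; start += rowsPerBlock) {
        // The last block may be shorter than the others.
        blocks.push_back({start, std::min(rowsPerBlock, totalRows - start)});
    }
    return blocks;
}
```

With a hypothetical row size of 10000 bytes, 7000 rows, and two computing processes, the cap is not reached and the call returns two blocks of 3500 rows each; with larger rows the cap takes over and more, smaller blocks are produced.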

Further, the whole parallel conversion method follows a master-slave parallel pattern with one management process, one data collection process, and several computing processes. The management process partitions the data, maintains the stack of pending raster blocks, the queue of idle computing processes, and the block-processing statistics table, and sends instructions; the computing processes perform the actual format conversion using the raster conversion technique based on the abstract data model; the collection process stitches and merges the conversion results.

Further, Steps 1 and 4 use queue, stack, and table data structures for process scheduling. The stack manages pending data blocks; pushing unsuccessfully processed blocks back onto it ensures that the conversion runs to completion. The queue manages idle processes, so the computing process that became idle first is used first, which helps balance load across processes. The table records how long each process takes, which makes it possible to monitor execution and keeps stalled processes from holding system resources for long periods. Together they keep process scheduling effective and make the algorithm efficient and reliable.
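The interplay of the stack and the queue can be sketched without MPI. The class below is an illustrative, single-process model of the management process's bookkeeping; BlockScheduler and its member names are assumptions made here, and message passing is replaced by direct method calls.

```cpp
#include <cassert>
#include <queue>
#include <stack>
#include <utility>
#include <vector>

// Manager-side bookkeeping: a stack of pending block ids and a FIFO queue
// of idle worker ranks. All names here are illustrative.
class BlockScheduler {
public:
    BlockScheduler(int numBlocks, const std::vector<int>& workerRanks) {
        for (int b = numBlocks; b >= 1; --b) pending_.push(b);  // block 1 ends up on top
        for (int r : workerRanks) idle_.push(r);
    }

    // Pairs the top pending block with the longest-idle worker.
    // Returns false when no block or no idle worker is available.
    bool dispatch(std::pair<int, int>& assignment) {
        if (pending_.empty() || idle_.empty()) return false;
        assignment = {idle_.front(), pending_.top()};  // {worker rank, block id}
        idle_.pop();
        pending_.pop();
        return true;
    }

    // Worker feedback: the worker becomes idle again; a failed block is
    // pushed back onto the stack so that it is converted later.
    void report(int workerRank, int blockId, bool success) {
        idle_.push(workerRank);
        if (success) {
            ++done_;
        } else {
            pending_.push(blockId);
        }
    }

    int completed() const { return done_; }
    bool hasPending() const { return !pending_.empty(); }

private:
    std::stack<int> pending_;
    std::queue<int> idle_;
    int done_ = 0;
};
```

A manager loop would call dispatch() until it returns false, wait for feedback, call report(), and repeat until completed() equals the number of blocks. Because failed blocks return to the stack, every block is eventually converted as long as some worker succeeds.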

Beneficial effects

Compared with the known prior art, the technical solution provided by the invention has the following notable effects:

(1) Compared with existing serial processing, the invention suits both serial machines and multi-core or multi-processor parallel environments: it can run on a single-core CPU, or as multiple processes on a multi-processor machine or a computing cluster. The method therefore adapts well across platforms;

(2) Compared with existing serial processing, the invention combines raster format conversion with parallel computing to realize a parallel conversion algorithm for massive raster data. It makes full use of multi-core parallel processing on current computers, greatly improving conversion efficiency and shortening conversion time;

(3) Compared with existing serial processing, the invention uses multi-machine parallel processing, which reduces the computing load on each node and, while improving efficiency, increases the volume of data the algorithm can handle. This makes the algorithm particularly suitable for massive raster format conversion tasks and substantially raises processing capacity;

(4) Compared with existing serial processing, the invention adopts a master-slave parallel pattern. The whole process runs under the unified direction and supervision of one management process, data blocks are of uniform size, and idle processes are used first come, first served, so processing is well balanced and stable.

Brief description of the drawings

Figure 1 is a schematic diagram of the parallel raster data format conversion method based on an abstract data model according to the invention;

Figure 2 is a flowchart of the computing process in the raster data format conversion based on an abstract data model according to the invention;

Figure 3 is a plot of the time taken by multi-process parallel processing according to the invention, where the horizontal axis is the number of processes and the vertical axis is the time taken by format conversion.

Detailed description

The invention is further described below with reference to an embodiment.

Embodiment

This embodiment uses raster data in GeoTIFF (.tif) format as the source and converts the full extent of the raster file source.tif to the Microsoft Windows Device Independent Bitmap format. The data covers longitude 102°10′E to 103°2′E and latitude 25°17′N to 24°13′N, is about 600 MB in size, has a grid of 7000 rows × 10000 columns, and uses the WGS 84 / UTM zone 48N spatial reference system.

This example was developed in standard C++ on the Visual Studio 2008 platform, calling the GDAL open-source library's read and write drivers and using OpenMPI as the parallel library. The program ran on an IBM System x3500-M3X series server with two Intel Xeon Quad Core E5620 CPUs (2.4 GHz, 12 MB cache, quad-core); 8 GB of memory (two 4 GB DDR3 1333 MHz LP RDIMM modules); 2 TB of disk (four 500 GB 7.2K 6 Gbps NL SAS 2.5-inch SFF Slim-HS drives); and integrated dual-port Gigabit Ethernet. Software configuration: the operating system was CentOS Linux 6.0, the MPI implementation was OpenMPI 1.4.1, and GDAL was version 1.8.1.

With reference to Figure 1, the implementation steps of this embodiment are as follows:

Step 1: Call the read and write interface for GTiff-format raster data in the GDAL library, open the source raster file with GDALOpenShared(), and obtain the bands, row and column dimensions, spatial reference information, and pixel data type of the source raster. Every 10 rows form one data block. Process 0 is the management process, process 1 is the collection process, and the rest are computing processes. The management process also builds the idle process queue, the stack of pending data blocks, and the block-processing statistics table: with 4 processes, for example, the idle queue holds process 2 and process 3, and if the data is split into 4 blocks, the pending stack holds blocks 1 to 4. The main code is as follows:

int np, cp;                      // np: number of processes; cp: rank of the current process
int nRasterXSize, nRasterYSize;  // raster size: columns (X) and rows (Y)
int anSrcWin[4];                 // source window: x offset, y offset, x size, y size
int nYChunkSize, iY;             // rows per block and first row of this block
nYChunkSize = (int) ceil( nRasterYSize / (double) np );  // ceil (from <cmath>) rounds up
iY = cp * nYChunkSize;
if( nYChunkSize + iY > nRasterYSize )   // the last block may be shorter
    nYChunkSize = nRasterYSize - iY;
anSrcWin[0] = 0;
anSrcWin[1] = iY;
anSrcWin[2] = nRasterXSize;
anSrcWin[3] = nYChunkSize;

Step 2: Using the bands, row and column dimensions, spatial reference information, and pixel data type of the source raster as parameters, call the CreateOutputDataset() function to create the target-format raster frame file, and broadcast the file's full path to all processes. Each pending data block popped from the stack is packed into a message together with an idle process taken from the queue and sent to that process. At the same time, the block-processing statistics table records the current instruction number PID, the target computing process RID, the target pending data block BID, and the current time Start. Messages are issued until the pending block stack is empty. The main code is as follows:

MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &myid );
MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
master = 0;
// Broadcast the frame file's full path. destfilename is assumed to be a char
// buffer of nPathLen bytes (nPathLen is an illustrative name).
MPI_Bcast( destfilename, nPathLen, MPI_CHAR, master, MPI_COMM_WORLD );
// Stack_BID and Queue_RID are assumed to be std::stack<int> and std::queue<int>.
// The loop assumes an idle process is available; otherwise the manager
// waits for a computing process to report back before continuing.
while( !Stack_BID.empty() )
{
    RID = Queue_RID.front(); Queue_RID.pop();   // next idle computing process
    BID = Stack_BID.top();   Stack_BID.pop();   // next pending data block
    MPI_Send( &BID, 1, MPI_INT, RID, 0, MPI_COMM_WORLD );   // tag 0 chosen for illustration
}
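The statistics table that accompanies each dispatch can be sketched as a plain record per instruction. The fields PID, RID, BID, and Start follow the text; the end field, the convention that a negative end time means "still running", and the function findTimedOut are assumptions introduced here to illustrate how the manager might detect computing processes that have not reported back in time.

```cpp
#include <cassert>
#include <vector>

// One row of the block-processing statistics table. PID, RID, BID and Start
// follow the text; end and the timeout logic are illustrative additions.
struct TaskRecord {
    int pid;        // instruction number
    int rid;        // rank of the computing process
    int bid;        // data block id
    double start;   // start time in seconds (for example from MPI_Wtime())
    double end;     // end time in seconds; negative while still running
};

// Returns the instruction numbers of tasks that are still running and
// started more than timeoutSeconds before now.
std::vector<int> findTimedOut(const std::vector<TaskRecord>& table,
                              double now, double timeoutSeconds) {
    std::vector<int> overdue;
    for (const TaskRecord& rec : table) {
        const bool running = rec.end < 0.0;
        if (running && now - rec.start > timeoutSeconds) {
            overdue.push_back(rec.pid);
        }
    }
    return overdue;
}
```

The manager could call findTimedOut periodically and, for each returned instruction, push the corresponding block back onto the pending stack and reclaim the worker, as Step 4 of the method describes.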

Step 3: The computing process performs the actual raster format conversion. The algorithm first creates a virtual dataset under the abstract data model from the source raster's spatial reference information, band information, ground control points, metadata, and the user-specified clipping rectangle, and at the same time establishes the conversion mapping between the source raster data and the target raster format. It then writes the data to the target raster in the target format, achieving full-information conversion between raster formats; see Figure 2, the flowchart of the computing process. After converting a block, the process generates a BMP-format data file, sends a success message to the management process, passes the information about the result file to the collection process, releases its computing resources, and waits for a new task. The main code is as follows:

    VRTDataset *poVDS;
    poVDS = (VRTDataset *) VRTCreate( nOXSize, nOYSize );
    // Copy the dataset-level metadata to the virtual dataset.
    poVDS->SetMetadata( ((GDALDataset*)hDataset)->GetMetadata() );
    for( i = 0; i < nBandCount; i++ )  // nBandCount: number of bands
    {
        VRTSourcedRasterBand *poVRTBand;
        GDALRasterBand *poSrcBand;
        // GDAL band numbers are 1-based.
        poSrcBand = SourceDataset->GetRasterBand( i+1 );
        eBandType = poSrcBand->GetRasterDataType();
        poVDS->AddBand( eBandType, NULL );
        poVRTBand = (VRTSourcedRasterBand *) poVDS->GetRasterBand( i+1 );
        // Map the source window onto the full output band.
        poVRTBand->AddComplexSource( poSrcBand,
                                     anSrcWin[0], anSrcWin[1],
                                     anSrcWin[2], anSrcWin[3],
                                     0, 0, nOXSize, nOYSize,
                                     dfOffset, dfScale,
                                     VRT_NODATA_UNSET,
                                     nComponent );
        CopyBandInfo( poSrcBand, poVRTBand,
                      !bStats && !bFilterOutStatsMetadata,
                      !bUnscale,
                      !bSetNoData && !bUnsetNoData );
        // Copy the source band's mask when it is a real per-band mask
        // (not per-dataset, all-valid, or nodata-derived).
        if (eMaskMode == MASK_AUTO &&
            (GDALGetMaskFlags(GDALGetRasterBand(hDataset, 1))
             & GMF_PER_DATASET) == 0 &&
            (poSrcBand->GetMaskFlags() & (GMF_ALL_VALID |
                                          GMF_NODATA)) == 0)
        {
            if (poVRTBand->CreateMaskBand(
                    poSrcBand->GetMaskFlags()) == CE_None)
            {
                VRTSourcedRasterBand* hMaskVRTBand =
                    (VRTSourcedRasterBand*) poVRTBand->GetMaskBand();
                hMaskVRTBand->AddMaskBandSource(poSrcBand,
                                                anSrcWin[0], anSrcWin[1],
                                                anSrcWin[2], anSrcWin[3],
                                                0, 0, nOXSize, nOYSize );
            }
        }
    }
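
The source window `anSrcWin` passed to `AddComplexSource` in the listing above selects the part of the source raster that one computing process converts. As a minimal sketch, assuming the row-wise division described in the abstract and a window ordered as x offset, y offset, width, height, a row block could be mapped to such a window as shown here. `RowBlock`, `partitionRows`, and `blockToSrcWin` are hypothetical names, and assigning the remainder rows to the last block is an assumption that the text does not specify.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one row block (illustrative, not from the patent).
struct RowBlock {
    int startRow;  // first source row covered by the block
    int rowCount;  // number of rows in the block
};

// Split totalRows into nBlocks contiguous row blocks of totalRows / nBlocks
// rows each. Assumption: the last block also takes any remainder rows so
// that every row of the raster is covered exactly once.
std::vector<RowBlock> partitionRows(int totalRows, int nBlocks) {
    assert(totalRows > 0 && nBlocks > 0 && nBlocks <= totalRows);
    std::vector<RowBlock> blocks;
    const int base = totalRows / nBlocks;
    for (int b = 0; b < nBlocks; ++b) {
        const int rows = (b == nBlocks - 1) ? totalRows - b * base : base;
        blocks.push_back({b * base, rows});
    }
    return blocks;
}

// Fill a 4-element source window for one block:
// x offset, y offset, width, height (full raster width, block rows only).
void blockToSrcWin(const RowBlock& block, int rasterWidth, int anSrcWin[4]) {
    anSrcWin[0] = 0;
    anSrcWin[1] = block.startRow;
    anSrcWin[2] = rasterWidth;
    anSrcWin[3] = block.rowCount;
}
```

With this layout each computing process reads full-width strips, so the blocks never overlap and the output height `nOYSize` of a block would simply equal its row count when no resampling is requested.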

Step 5: Using the information in the target-format raster frame file, the collection process writes each target-format block file sent by a computing process to its corresponding position in the frame file. When the entire frame file has been filled and no computing process is generating further block files, the raster data format conversion is complete.
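
The placement performed by the collection process can be illustrated with a small sketch. It assumes a simplified frame: a flat, row-major, single-band, one-byte-per-pixel buffer held in memory, and it ignores file headers and any format-specific row ordering or padding that a real target format such as BMP would require. `Frame`, `writeBlock`, and `isComplete` are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory stand-in for the target-format frame file.
struct Frame {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // width * height bytes, row-major
    std::vector<bool> rowFilled;       // which rows have been written so far

    Frame(int w, int h)
        : width(w), height(h),
          pixels(static_cast<std::size_t>(w) * h, 0),
          rowFilled(h, false) {}
};

// Copy one converted block (rowCount rows starting at startRow) to its
// position in the frame and mark those rows as filled.
void writeBlock(Frame& frame, int startRow, int rowCount,
                const std::vector<std::uint8_t>& block) {
    assert(startRow >= 0 && rowCount > 0 &&
           startRow + rowCount <= frame.height);
    assert(block.size() == static_cast<std::size_t>(rowCount) * frame.width);
    const std::size_t offset = static_cast<std::size_t>(startRow) * frame.width;
    std::copy(block.begin(), block.end(), frame.pixels.begin() + offset);
    for (int r = startRow; r < startRow + rowCount; ++r) {
        frame.rowFilled[r] = true;
    }
}

// The conversion is finished once every row of the frame has been filled.
bool isComplete(const Frame& frame) {
    return std::all_of(frame.rowFilled.begin(), frame.rowFilled.end(),
                       [](bool filled) { return filled; });
}
```

Because each block's position depends only on its start row, blocks can arrive in any order, which is what lets the computing processes finish independently of one another.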

The above procedure was executed on the IBM System x3500-M3X used in the experiment with different numbers of processes (3, 4, ..., 16). The longest execution time for each run is given in Table 1, and the time curve is shown in Figure 3.

Table 1. Time spent in multi-process parallel processing


Compared with existing serial processing techniques, the present invention combines raster data format conversion with parallel computing to realize a parallel conversion algorithm for massive raster data. It makes full use of the multi-core parallel processing environment of current computers, greatly improves data conversion efficiency, and shortens the time spent on format conversion.

Claims


1. A parallel method for massive raster data format conversion based on an abstract data model, comprising the following steps:

Step 2: Create a target-format raster data frame file according to the bands, row and column dimensions, spatial reference information, and pixel data type of the source raster data, and send the target-format raster frame file information to all processes. Then pop pending data blocks from the block stack until the stack is empty: each popped block is paired with a computing process taken from the idle computing process queue to form an operation instruction, which is sent to that computing process, and the start timestamp of the process run is recorded in the data block processing statistics table;

Step 3: The computing process receives the instruction and completes the format conversion of the data block. After receiving the instruction sent by the management process, the computing process first reads the corresponding data block from the source data according to the block's start position and size, and then converts it using the serial raster data format conversion algorithm based on the abstract data model. That algorithm first creates a virtual dataset according to the abstract data model from the source raster's spatial reference information, band information, ground control point information, metadata, and user-specified clipping rectangle, while establishing the conversion mapping between the source raster data and the target raster data format, and finally writes the target raster in the target raster data format;

2. The parallel method for massive raster data format conversion based on an abstract data model according to claim 1, characterized in that the entire method is based on a master-slave parallel model comprising one management process, one data collection process, and several computing processes; the management process is responsible for partitioning the data, maintaining the stack of pending raster blocks, maintaining the idle computing process queue, maintaining the data block processing statistics table, and sending instructions; the computing processes perform the actual format conversion using the raster data format conversion technique based on the abstract data model; and the collection process is responsible for stitching and merging the conversion results.

3. The parallel method for massive raster data format conversion based on an abstract data model according to claim 2, characterized in that steps 1 and 4 use queue, stack, and table data structures for process scheduling: the stack manages pending data blocks, and pushing unsuccessfully processed blocks back onto the stack ensures that the format conversion is carried out completely; the queue manages idle processes, ensuring that the computing process that becomes idle first is used first and balancing the load among processes; and the table records the time each process spends on processing and is used to monitor process execution.
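
The interplay of the stack, queue, and table named in the claims can be sketched as follows. This is a minimal single-process illustration, not the patented implementation: blocks and computing processes are represented by plain integer IDs, message passing between processes is omitted, and `Scheduler`, `dispatch`, and `report` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <queue>
#include <stack>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical bookkeeping kept by the management process.
struct Scheduler {
    std::stack<int> pendingBlocks;               // blocks waiting for conversion
    std::queue<int> idleWorkers;                 // first idle, first used
    std::map<int, Clock::time_point> startedAt;  // statistics table: block -> start time

    // Pair pending blocks with idle computing processes until either runs
    // out. Each (block, worker) pair stands for one operation instruction.
    std::vector<std::pair<int, int>> dispatch() {
        std::vector<std::pair<int, int>> instructions;
        while (!pendingBlocks.empty() && !idleWorkers.empty()) {
            const int block = pendingBlocks.top();
            pendingBlocks.pop();
            const int worker = idleWorkers.front();
            idleWorkers.pop();
            startedAt[block] = Clock::now();  // record the start timestamp
            instructions.emplace_back(block, worker);
        }
        return instructions;
    }

    // Handle a completion report: the worker becomes idle again, and a
    // block that was not converted successfully is pushed back on the stack.
    void report(int block, int worker, bool success) {
        assert(startedAt.count(block) == 1);
        idleWorkers.push(worker);
        if (!success) {
            pendingBlocks.push(block);
        }
    }
};
```

Calling `dispatch` again after each `report` keeps every idle computing process busy while blocks remain, and a failed block is retried simply because it reappears on top of the stack.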