CN107679146A

Info

Publication number: CN107679146A
Application number: CN201710876201.1A
Authority: CN
Inventors: 黄文琦; 许爱东; 陈晓; 陈华军; 李果; 蒋屹新; 杨航; 张福铮
Original assignee: China South Power Grid International Co ltd
Current assignee: China South Power Grid International Co ltd
Priority date: 2017-09-25
Filing date: 2017-09-25
Publication date: 2018-02-09

Abstract

The invention relates to a method and system for verifying the quality of power grid data. Raw power grid data records are acquired and stored in a distributed storage system, which also stores a first query index table for those records. Multiple parallel tasks are created. In each parallel task, the check field of a target verification rule is obtained; the first query index table is searched by that check field to obtain the corresponding first raw power grid data record; comparison data and reference data are extracted from that record; and the extracted comparison data is verified against the extracted reference data. Storing the power grid data records in a distributed manner makes the verification process scalable, and the relationship between the check fields used by the verification rules and the query index of the data records supports efficient queries when the rules are executed.

Description

Method and system for verifying power grid data quality

Technical Field

The invention relates to the technical field of power grids, and in particular to a method and system for verifying power grid data quality.

Background

Traditional relational data management systems pursue a high degree of consistency and correctness. When faced with the need to analyze massive data, they scale up, that is, they increase the capacity of a single node by upgrading its hardware (CPU, memory, hard disk, and so on), which greatly limits their scalability and performance.

As the volume of power grid business data and the complexity of data quality monitoring rules keep growing, existing data quality monitoring systems built on traditional data management and computing platforms have hit a serious processing bottleneck. Their processing efficiency is low, they cannot complete data quality monitoring and verification quickly, and they increasingly fail to meet the needs of daily production management and business decision-making.

Summary of the Invention

In view of this, it is necessary to provide a method and system for verifying power grid data quality that address the low efficiency with which data quality monitoring systems based on traditional data management and computing platforms monitor and verify data quality.

A method for verifying power grid data quality comprises the following steps:

acquiring raw power grid data records and storing them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;

creating multiple parallel tasks and, in each parallel task, performing the following operations: obtaining the check field of a target verification rule; searching the first query index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field; extracting comparison data and reference data from the first raw power grid data record; and verifying the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;

outputting the verification results of the multiple parallel tasks.

A system for verifying power grid data quality comprises:

a data storage unit, configured to acquire raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;

a task creation unit, configured to create multiple parallel tasks;

a query index unit, configured to, in each parallel task, obtain the check field of a target verification rule and search the first query index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field;

an extraction and comparison unit, configured to extract comparison data and reference data from the first raw power grid data record and verify the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;

a result output unit, configured to output the verification results of the multiple parallel tasks.

In the method and system described above, raw power grid data records are acquired and stored in a distributed storage system, which also stores the first query index table for those records. Multiple parallel tasks are created, and in each one the check field of the target verification rule is obtained, the first query index table is searched by that check field to obtain the corresponding first raw power grid data record, comparison data and reference data are extracted from that record, and the extracted comparison data is verified against the extracted reference data. In this scheme, distributed storage of the power grid data records makes the verification process scalable; the relationship between the check fields used by the verification rules and the query index of the data records supports efficient queries when the rules are executed; and running multiple parallel tasks lets the verification rules be processed in parallel, which improves the efficiency of power grid data quality verification.

A readable storage medium stores an executable program that, when executed by a processor, implements the steps of the above method for verifying power grid data quality.

A verification device comprises a memory, a processor, and an executable program stored in the memory and runnable on the processor; when the processor executes the program, it implements the steps of the above method for verifying power grid data quality.

In accordance with the above method for verifying power grid data quality, the invention thus also provides a readable storage medium and a verification device for implementing the method through a program.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of the method for verifying power grid data quality in one embodiment of the invention;

FIG. 2 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the invention;

FIG. 3 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the invention;

FIG. 4 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the invention;

FIG. 5 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the invention;

FIG. 6 is an overall schematic diagram of verification in a specific embodiment of the invention;

FIG. 7 is a schematic diagram of incremental data storage and indexing in a specific embodiment of the invention;

FIG. 8 is a schematic diagram of batch historical data storage and indexing in a specific embodiment of the invention;

FIG. 9 is a schematic diagram of parallel processing of verification rules in a specific embodiment of the invention.

Detailed Description

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit its scope of protection.

FIG. 1 is a schematic flowchart of a method for verifying power grid data quality according to one embodiment of the invention. The method in this embodiment includes the following steps:

Step S110: acquire raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;

In this step, the distributed storage system stores the raw power grid data in a distributed manner, which makes it easy to add or delete power grid data and makes the verification process scalable. The reference data records used for verification are the standard against which the comparison data records to be verified are checked;

Step S120: create multiple parallel tasks and, in each parallel task, perform the following operations: obtain the check field of a target verification rule; search the first query index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field; extract comparison data and reference data from the first raw power grid data record; and verify the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;

In this step, each parallel task performs verification according to its target verification rule. Searching the query index yields the comparison data and reference data corresponding to the check field, which are then used for verification;

Step S130: output the verification results of the multiple parallel tasks.

In this embodiment, raw power grid data records are acquired and stored in a distributed storage system, which also stores the first query index table for those records. Multiple parallel tasks are created, and in each one the check field of the target verification rule is obtained, the first query index table is searched by that check field to obtain the corresponding first raw power grid data record, comparison data and reference data are extracted from that record, and the extracted comparison data is verified against the extracted reference data. In this scheme, distributed storage of the power grid data records makes the verification process scalable; the relationship between the check fields used by the verification rules and the query index of the data records supports efficient queries when the rules are executed; and running multiple parallel tasks lets the verification rules be processed in parallel, which improves the efficiency of power grid data quality verification.
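The flow of steps S110 to S130 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: in-memory dictionaries stand in for the distributed storage system and its first query index table, a thread pool stands in for the parallel tasks, and the field names (`meter_id`, `measured_kwh`, `reference_kwh`), the rule layout, and the tolerance-based comparison are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the raw records held in the distributed store,
# keyed by each record's primary key.
RECORDS = {
    "r1": {"meter_id": "M-001", "measured_kwh": 120.0, "reference_kwh": 120.0},
    "r2": {"meter_id": "M-002", "measured_kwh": 95.5, "reference_kwh": 90.0},
    "r3": {"meter_id": "M-001", "measured_kwh": 118.0, "reference_kwh": 120.0},
}

# Stand-in for the first query index table:
# check-field value -> primary keys of the matching records.
FIRST_INDEX = {"M-001": ["r1", "r3"], "M-002": ["r2"]}


def run_rule(rule):
    """One parallel task: index lookup, extraction, then verification."""
    results = []
    for primary_key in FIRST_INDEX.get(rule["check_value"], []):
        record = RECORDS[primary_key]
        compared = record[rule["compare_field"]]
        reference = record[rule["reference_field"]]
        # Assumed comparison: pass when the values differ by at most the tolerance.
        passed = abs(compared - reference) <= rule["tolerance"]
        results.append((primary_key, passed))
    return rule["name"], results


def verify_all(rules, max_workers=4):
    """Run every rule as its own task and collect the results by rule name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_rule, rules))
```

Each rule touches only the records its check field selects through the index, so adding rules adds tasks rather than full scans of the record store.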

Optionally, the distributed storage system may be an HBase distributed storage system. HBase provides column-oriented management of very large data tables: it can store and manage billions of data records, each of which may contain more than a million data columns. HBase provides random, real-time read and write access to data, and offers high scalability, high availability, fault tolerance, load balancing, and real-time data queries.

In one embodiment, the method for verifying power grid data quality further includes the following step:

establishing the first query index table in the distributed storage system, wherein the primary key of the first query index table is the check field of each verification rule, and the column values of the first query index table are the primary keys of the raw power grid data records.

In this embodiment, the first query index table may be established in the distributed storage system, with the check fields of the verification rules as its primary key and the primary keys of the raw power grid data records as its column values. Through the first query index table, the corresponding first raw power grid data record can be found quickly from a check field.

It should be noted that once the primary key of the raw power grid data record corresponding to the check field has been obtained, that primary key is used to look up the corresponding first raw power grid data record, from which the comparison data and reference data are extracted.

Optionally, the check field may be the primary key or any attribute column of a raw power grid data record. The extracted comparison data is the actual field corresponding to the check field, which may be the check field itself or another data field.
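The two-step lookup described above (check field to primary keys, then primary keys to records) can be illustrated with a short Python sketch. Plain dictionaries stand in for the index table and the record table, and the field names (`substation`, `voltage_kv`, `rated_kv`) are hypothetical examples rather than fields named in the source.

```python
from collections import defaultdict


def build_first_index(records, check_field):
    """Map each check-field value to the primary keys of the records holding it."""
    index = defaultdict(list)
    for primary_key, record in records.items():
        index[record[check_field]].append(primary_key)
    return dict(index)


def extract_pairs(records, index, check_value, compare_field, reference_field):
    """Index lookup -> primary keys -> records -> (key, comparison, reference)."""
    pairs = []
    for primary_key in index.get(check_value, []):
        record = records[primary_key]
        pairs.append((primary_key, record[compare_field], record[reference_field]))
    return pairs


# Hypothetical sample data for illustration only.
records = {
    "r1": {"substation": "S-01", "voltage_kv": 110.2, "rated_kv": 110.0},
    "r2": {"substation": "S-02", "voltage_kv": 35.4, "rated_kv": 35.0},
    "r3": {"substation": "S-01", "voltage_kv": 109.7, "rated_kv": 110.0},
}
index = build_first_index(records, "substation")
pairs = extract_pairs(records, index, "S-01", "voltage_kv", "rated_kv")
```

Here the comparison field differs from the check field; passing the check field itself as `compare_field` covers the other case the text allows.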

In each parallel task, the timestamp range of the target verification rule is obtained; the second query index table of the raw power grid data records is searched by that timestamp range to obtain the second raw power grid data record corresponding to the range; comparison data and reference data are extracted from the second raw power grid data record; and the extracted comparison data is verified against the extracted reference data, wherein the second query index table is stored in the distributed storage system.

In this embodiment, the second raw power grid data record can also be looked up by timestamp, and its comparison data and reference data extracted for verification, which enables verification of power grid data quality over a time window.

It should be noted that when the second raw power grid data record is looked up by timestamp, the timestamp corresponds to the comparison data in that record, and the reference data corresponds to the comparison data; the reference data has no direct relationship with the timestamp.

The second query index table is established in the distributed storage system, wherein the primary key of the second query index table is the timestamp of each verification rule, and the column values of the second query index table are the primary keys of the raw power grid data records.

In this embodiment, the second query index table may be established in the distributed storage system, with the timestamps of the verification rules as its primary key and the primary keys of the raw power grid data records as its column values. Through the second query index table, the corresponding second raw power grid data record can be found quickly from a timestamp.

It should be noted that once the primary key of the raw power grid data record corresponding to the timestamp has been obtained, that primary key is used to look up the corresponding second raw power grid data record, from which the comparison data and reference data are extracted.
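A time-window lookup against a timestamp-keyed index can be sketched as below. The sketch assumes integer timestamps and a half-open window `[start, end)`, and uses a sorted in-memory list as a stand-in for an index table whose keys are kept in sorted order; the class name `TimestampIndex` is hypothetical.

```python
import bisect


class TimestampIndex:
    """Sorted (timestamp, primary_key) entries standing in for the second index table."""

    def __init__(self):
        self._entries = []

    def add(self, timestamp, primary_key):
        # insort keeps the list ordered by timestamp, then by primary key.
        bisect.insort(self._entries, (timestamp, primary_key))

    def keys_in_range(self, start, end):
        """Return primary keys whose timestamp satisfies start <= timestamp < end."""
        # A 1-tuple sorts before any 2-tuple with the same first element,
        # so (start,) locates the first entry at or after `start`.
        lo = bisect.bisect_left(self._entries, (start,))
        hi = bisect.bisect_left(self._entries, (end,))
        return [primary_key for _, primary_key in self._entries[lo:hi]]


idx = TimestampIndex()
idx.add(100, "r1")
idx.add(200, "r2")
idx.add(150, "r3")
idx.add(300, "r4")
window = idx.keys_in_range(100, 300)
```

The returned primary keys would then be used to fetch the records themselves, exactly as with the first query index table.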

An index file of a distributed file system is built from the first query index table and the second query index table. The index file is read into memory; the distributed file system's operation log file for the index file is read; the operation records in the log file are applied to the in-memory index; and the in-memory index is written back to the index file. Batches of raw power grid data records are then loaded according to the updated index file, and the comparison data in each batch is verified against the reference data in that batch.

In this embodiment, an index file may be created in the distributed file system. When the quality of raw power grid data is verified, the index file is read into memory, the operation log is read and applied to the in-memory index, the in-memory index is written back to the index file, and verification proceeds from the updated index file. This allows verification data to be loaded quickly when the quality of raw power grid data is verified in batches, which improves verification performance.

Optionally, the distributed file system may be HDFS (Hadoop Distributed File System). HDFS provides a sound multi-replica data storage mechanism, along with robust data node error detection and node failure recovery mechanisms.

Optionally, after the in-memory index has been written to the index file, the operation log file may be deleted to free storage space and speed up verification.
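The read, replay, write-back, and delete sequence can be sketched as follows. The source does not specify file formats, so this example assumes a JSON index file and a JSON-lines operation log with hypothetical `put` and `delete` operations, and it uses the local file system in place of a distributed one.

```python
import json
import os


def apply_operation(index, op):
    """Apply one logged operation to the in-memory index (assumed log format)."""
    keys = index.get(op["key"], [])
    if op["op"] == "put" and op["pk"] not in keys:
        keys = keys + [op["pk"]]
    elif op["op"] == "delete":
        keys = [k for k in keys if k != op["pk"]]
    if keys:
        index[op["key"]] = keys
    else:
        index.pop(op["key"], None)


def checkpoint_index(index_path, log_path):
    """Load the index file, replay the operation log, and write the result back."""
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)

    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    apply_operation(index, json.loads(line))

    # Write to a temporary file first so a crash cannot leave a half-written index.
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    os.replace(tmp_path, index_path)

    # Optional step from the text: drop the log once its records are in the index.
    if os.path.exists(log_path):
        os.remove(log_path)
    return index
```

After `checkpoint_index` returns, the index file alone describes the current state, so a batch verification run can load it without replaying any log.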

When an incremental power grid data record is detected, a first query index based on the check field is generated for that record and added to the first query index table, and a second query index based on the timestamp is generated for that record and added to the second query index table.

In this embodiment, when an incremental power grid data record is detected, the corresponding first query index is added to the first query index table and the corresponding second query index is added to the second query index table. This keeps the index tables complete and allows the power grid data quality to be verified comprehensively.

Optionally, because the timestamps of incremental data differ clearly from those of the original data, the incremental data can be verified by querying the power grid data records for a timestamp range that covers only the incremental data.
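Index maintenance for an incremental record can be sketched as below. Dictionaries again stand in for the two index tables, and the function names and the `timestamp` field name are assumptions introduced for the example.

```python
def index_incremental_record(primary_key, record, check_field,
                             first_index, second_index,
                             timestamp_field="timestamp"):
    """Add one newly detected record to both query index tables."""
    first_index.setdefault(record[check_field], []).append(primary_key)
    second_index.setdefault(record[timestamp_field], []).append(primary_key)


def keys_in_window(second_index, start, end):
    """Primary keys of records with start <= timestamp < end, in time order."""
    return [
        primary_key
        for timestamp in sorted(second_index)
        if start <= timestamp < end
        for primary_key in second_index[timestamp]
    ]


# Existing indexes built from the original data (hypothetical values).
first_index = {"M-001": ["r1"]}
second_index = {1000: ["r1"]}

# A newly arrived record is indexed under both its check field and its timestamp.
index_incremental_record(
    "r2", {"meter_id": "M-001", "timestamp": 2000},
    "meter_id", first_index, second_index,
)
incremental_keys = keys_in_window(second_index, 1500, 3000)
```

Choosing a window that starts after the last original timestamp selects only the incremental records, which is the selection strategy the text describes.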

In one embodiment, the step of creating multiple parallel tasks includes the following step:

creating multiple parallel tasks in the MapReduce parallel computing framework, creating instruction files in the distributed file system for all verification rules, reading the corresponding instruction file into each parallel task, and configuring each parallel task's parameters and processing logic for executing verification rules according to its instruction file.

In this embodiment, the MapReduce parallel computing framework may be used to create the parallel tasks. All Map nodes in the framework can execute different verification rules concurrently. If a failure occurs during execution, the framework automatically starts a new task on another node to retry the failed verification rule, which addresses load balancing and fault tolerance across the whole parallel process. The parameters and processing logic of the verification rules are stored in the instruction files, which can be read from the distributed file system, so parallel tasks can be set up quickly from the instruction files.

Optionally, a configuration file may be read before the parallel tasks are executed. The configuration file specifies the verification type and the timestamp range to verify, so the specific verification type and timestamp range are determined at verification time.

In one embodiment, an instruction file corresponds to one or more verification rules.

In this embodiment, an instruction file may correspond to one verification rule or to several. If it corresponds to one rule, the parallel task verifies against that rule; if it corresponds to several, the parallel task verifies against all of them in parallel, which improves the efficiency of processing verification rules.

Optionally, the multiple verification rules in one instruction file are rules of the same attribute type.

In one embodiment, an instruction file corresponds to one parallel task.

In this embodiment, each instruction file corresponds to one parallel task and is processed by that task, so the instruction files can be processed in parallel, which improves the efficiency of processing them.
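The relationship between instruction files, verification rules, and parallel tasks can be illustrated with a small Python sketch. It does not use the Hadoop MapReduce API; a thread pool stands in for the Map tasks, and the JSON instruction-file layout, the `CHECKS` registry, and the rule names are all assumptions made for the example.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry of processing logic, keyed by the name an
# instruction file uses to select it.
CHECKS = {
    "equals": lambda compared, reference, params: compared == reference,
    "within": lambda compared, reference, params:
        abs(compared - reference) <= params["tolerance"],
}


def run_instruction_file(text, records):
    """One task: configure itself from one instruction file, then run its rules."""
    instruction = json.loads(text)
    results = {}
    for rule in instruction["rules"]:
        check = CHECKS[rule["logic"]]
        params = rule.get("params", {})
        results[rule["name"]] = [
            check(r[rule["compare_field"]], r[rule["reference_field"]], params)
            for r in records
        ]
    return results


def run_all(instruction_texts, records):
    """Process each instruction file in its own task, preserving file order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda t: run_instruction_file(t, records),
                             instruction_texts))
```

Keeping parameters and logic names in the instruction file means a new rule can be scheduled by adding a file rather than changing task code; in a real MapReduce deployment the framework, not this sketch, would handle retries of failed tasks.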

In accordance with the above method, the invention also provides a system for verifying power grid data quality. Embodiments of the system are described in detail below.

FIG. 2 is a schematic structural diagram of a system for verifying power grid data quality according to one embodiment of the invention. The system in this embodiment includes:

a data storage unit 210, configured to acquire raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;

a task creation unit 220, configured to create multiple parallel tasks;

a query index unit 230, configured to, in each parallel task, obtain the check field of a target verification rule and search the first query index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field;

an extraction and comparison unit 240, configured to extract comparison data and reference data from the first raw power grid data record and verify the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;

a result output unit 250, configured to output the verification results of the multiple parallel tasks.

在其中一个实施例中，如图3所示，电网数据质量的校验系统还包括索引建立单元260，用于在所述分布式存储系统中建立所述第一查询索引表，所述第一查询索引表的主键为各种校验规则的校验字段，所述第一查询索引表的列值为所述电网原始数据记录的主键。In one of the embodiments, as shown in FIG. 3 , the power grid data quality verification system further includes an index establishment unit 260, configured to establish the first query index table in the distributed storage system, and the first The primary key of the query index table is the verification field of various verification rules, and the column value of the first query index table is the primary key of the grid original data record.

在其中一个实施例中，查询索引单元230还用于在每个并行任务中，获取目标校验规则的时间戳范围，根据所述时间戳范围在所述电网原始数据记录的第二查询索引表中进行查找，获取与所述时间戳范围对应的第二电网原始数据记录，提取所述第二电网原始数据记录中的比对数据和基准数据，根据提取的基准数据对提取的比对数据进行校验；其中，所述第二查询索引表存储在所述分布式存储系统中。In one of the embodiments, the query index unit 230 is also used to obtain the time stamp range of the target verification rule in each parallel task, and according to the time stamp range in the second query index table recorded in the grid raw data Search in, obtain the second power grid raw data record corresponding to the time stamp range, extract the comparison data and benchmark data in the second power grid raw data record, and perform the comparison on the extracted comparison data according to the extracted benchmark data Verification; wherein, the second query index table is stored in the distributed storage system.

在其中一个实施例中，索引建立单元260还用于在所述分布式存储系统中建立所述第二查询索引表，所述第二查询索引表的主键为各种校验规则的时间戳，所述第二查询索引表的列值为所述电网原始数据记录的主键。In one of the embodiments, the index establishment unit 260 is further configured to establish the second query index table in the distributed storage system, the primary key of the second query index table is the timestamp of various verification rules, The column value of the second query index table is the primary key of the grid original data record.

在其中一个实施例中，如图4所示，电网数据质量的校验系统还包括文件索引单元270，用于根据所述第一查询索引表和所述第二查询索引表建立分布式文件系统的索引文件，读取所述索引文件至内存，读取所述分布式文件系统对所述索引文件的操作日志文件，将所述操作日志文件中的操作记录应用到内存索引中，将所述内存索引写入所述索引文件中，根据写入内存索引的索引文件加载批量的电网原始数据记录，分别根据批量的电网原始数据记录中的基准数据对比对数据进行校验。In one of the embodiments, as shown in FIG. 4 , the power grid data quality verification system further includes a file index unit 270, configured to establish a distributed file system according to the first query index table and the second query index table index file, read the index file to the memory, read the operation log file of the distributed file system to the index file, apply the operation records in the operation log file to the memory index, and apply the The memory index is written into the index file, the batch of grid raw data records is loaded according to the index file written into the memory index, and the data is verified according to the benchmark data in the batch of grid raw data records.

在其中一个实施例中，如图5所示，电网数据质量的校验系统还包括索引调整单元280，用于在检测到电网增量数据记录时，生成所述电网增量数据记录的基于校验字段的第一查询索引并添加至所述第一查询索引表，生成所述电网增量数据记录的基于时间戳的第二查询索引并添加至所述第二查询索引表。In one of the embodiments, as shown in FIG. 5 , the power grid data quality verification system further includes an index adjustment unit 280, configured to generate a calibration-based The first query index of the verification field is added to the first query index table, and the second query index based on the time stamp of the power grid incremental data record is generated and added to the second query index table.

In one embodiment, the task creation unit 220 creates multiple parallel tasks in the MapReduce parallel computing framework, reads the instruction files of the verification rules, which are established in the distributed file system, into the parallel tasks, and configures the parameters and processing logic for executing the verification rules in each parallel task according to its instruction file.

The grid data quality verification system of the present invention corresponds one-to-one to the grid data quality verification method of the present invention. The technical features and beneficial effects set out in the embodiments of the method apply equally to the embodiments of the system.

Based on the grid data quality verification method described above, embodiments of the present invention further provide a readable storage medium and a verification device. The readable storage medium stores an executable program which, when executed by a processor, implements the steps of the grid data quality verification method. The verification device includes a memory, a processor, and an executable program stored in the memory and runnable on the processor; the processor implements the steps of the grid data quality verification method when it executes the program.

In a specific embodiment, the grid data quality verification method is based on distributed storage and parallel processing. It addresses the high computation latency, poor scalability, and low cost-effectiveness of existing methods based on relational database systems.

The main ideas of the technical solution adopted by the present invention are as follows:

A distributed storage method is used to store all raw data records;

The verification fields are indexed with a non-primary-key indexing method. During verification, the index table is looked up by the verification field involved in the verification rule to obtain the primary keys of the corresponding raw data records; the raw data record table is then looked up by those primary keys to obtain the raw data records, and the comparison fields are extracted and compared;

HBase is used to build a timestamp index over the raw data records. For incremental data quality verification, or for fine-grained data quality verification based on a time window, the raw data record table is queried by timestamp range to determine the range of data to be verified, and verification is then performed;

HDFS is used to store the auxiliary index file and the operation log file of the data records, so that verification data can be loaded quickly during full raw data quality verification, which improves verification performance. During full raw data quality verification, the auxiliary index file is read into memory, the operation log is read and applied to the in-memory index, and verification is then performed on the in-memory index;

A MapReduce-based parallelization scheme is used to execute the verification rules quickly.
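The two-step lookup in the second idea above (index table first, raw data record table second) can be illustrated with a minimal Python sketch. It uses plain dictionaries as stand-ins for the HBase tables; the field names `meter_id` and `voltage`, the record keys, and both helper functions are hypothetical and do not come from the patent.

```python
# Hypothetical in-memory stand-in for the raw data record table,
# keyed by record primary key. A real deployment would use HBase tables.
raw_records = {
    "r1": {"meter_id": "M-001", "voltage": 220.0},
    "r2": {"meter_id": "M-002", "voltage": 219.5},
    "r3": {"meter_id": "M-001", "voltage": 221.0},
}


def build_field_index(records, field):
    """Map each value of `field` to the primary keys of the records holding it."""
    index = {}
    for key, row in records.items():
        index.setdefault(row[field], []).append(key)
    return index


def lookup_comparison_values(records, index, field_value, comparison_field):
    """Index table -> primary keys -> raw records -> comparison field values."""
    return [records[key][comparison_field] for key in index.get(field_value, [])]


meter_index = build_field_index(raw_records, "meter_id")
print(lookup_comparison_values(raw_records, meter_index, "M-001", "voltage"))
# prints [220.0, 221.0]
```

The point of the index is that a rule keyed on a non-primary-key field never has to scan the whole record table; it touches only the rows whose keys the index returns.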

Further, the distributed storage method is an HBase-based distributed storage method, which supports the storage of massive verification data and can be scaled conveniently as required. Further, the verification rules are MapReduce-based parallelized verification rules, which can be scaled conveniently with the volume of verification data and the number of verification rules, with controllable response performance and high cost-effectiveness.

Further, the verification fields are indexed with a non-primary-key indexing method, so that verification rules based on non-primary-key fields can be handled by index queries.

Further, the verification field is the primary key of the raw data record or any attribute column. The comparison field is a field corresponding to the verification field, and may be the verification field itself or another field.

Further, a timestamp index is built over the raw data records. For incremental data quality verification, or for fine-grained data quality verification based on a time window, the timestamp index table is queried by timestamp to obtain the primary keys of the raw data records, and the raw data record table is then queried to obtain the raw data records for verification.

Further, an HDFS auxiliary index file is built for the full raw data, and an operation log is kept for the incremental data. When the full historical data is verified, the HDFS auxiliary index file is read into memory, the operation log is applied to the in-memory index, and verification is then performed on the in-memory index.

Further, instruction files are created for all verification rules. An instruction file contains every parameter needed to execute a verification rule, including the rule name, the rule execution logic identifier, the input data table, and the output data table. A Map task reads its instruction file, obtains the parameters needed to execute the corresponding verification rule, and invokes the corresponding processing logic to perform the verification.

Further still, each instruction file corresponds to one or more verification rules, and the execution parameters of the verification rules are written in the instruction file. The execution parameters include the verification rule name, the rule execution logic identifier, the input data table, and the output data table.

Further still, each instruction file is processed by one Map task.
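The patent does not specify the on-disk format of an instruction file. As a rough illustration only, the sketch below assumes a `key=value` text format and hypothetical keys (`rule_name`, `logic_id`, `input_table`, `output_table`), and shows how a task could pick its processing logic from the rule execution logic identifier. The two check functions are placeholders, not real verification logic.

```python
def parse_instruction_file(text):
    """Parse 'key=value' lines into a dict of rule parameters (assumed format)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


# Placeholder processing modules; each only reports what it would check.
def check_not_empty(params):
    return f"{params['rule_name']}: not-empty check on {params['input_table']}"


def check_equal(params):
    return f"{params['rule_name']}: equality check on {params['input_table']}"


# Hypothetical mapping from rule execution logic identifier to processing logic.
RULE_LOGIC = {"NOT_EMPTY": check_not_empty, "EQUAL": check_equal}


def run_instruction_file(text):
    """What one task does: read parameters, then call the selected logic."""
    params = parse_instruction_file(text)
    return RULE_LOGIC[params["logic_id"]](params)


EXAMPLE = """
# hypothetical instruction file
rule_name=meter_voltage_match
logic_id=EQUAL
input_table=meter_readings
output_table=meter_voltage_result
"""
print(run_instruction_file(EXAMPLE))
# prints meter_voltage_match: equality check on meter_readings
```

Keeping the dispatch table inside the job means that adding a rule of an existing kind only requires a new instruction file, not new code.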

The solution of the present invention verifies grid data quality efficiently and scalably. First, grid data is stored in a distributed manner, which gives the system good scalability. Second, auxiliary query indexes are built for the fields involved in the verification rules, which supports efficient query processing when the rules are executed. Third, a MapReduce-based method for processing verification rules in parallel is designed, so that every verification rule can be processed in parallel, which effectively improves system response performance.

HBase is a distributed storage system in the Hadoop ecosystem. The distributed file system HDFS (Hadoop Distributed File System) lacks storage and access support for structured and semi-structured data, as well as random read and write capability. To address this, HBase provides a distributed data management system on top of HDFS that solves the problem of storing and accessing large-scale structured and semi-structured data. HBase provides large-table management based on a column-oriented storage model; it can store and manage billions of data records, each of which can contain millions of data columns. HBase aims to provide random, real-time read and write access to data, together with high scalability, high availability, fault tolerance, load balancing, and real-time data query capability.

The underlying data of HBase is stored in HDFS, so HBase depends entirely on HDFS to operate. Because HDFS has a sound multi-replica data storage mechanism and robust mechanisms for detecting data node errors and recovering from node failures, HBase naturally inherits the high reliability and fault tolerance of HDFS data storage.

Hadoop MapReduce provides a large but well-designed software framework for distributed data storage and parallel computing. It automatically manages the storage of massive distributed data, partitions the data to be computed and schedules computing tasks, assigns and executes subtasks on cluster nodes, and collects the results. Many complex details of parallel computing, such as distributed data storage, data communication, and fault handling, are left to the system, which greatly reduces the burden on software developers.

As shown in FIG. 6, the present invention stores data in the distributed data storage and management system HBase. Raw data records are stored in HBase so that they can be queried quickly by primary key. A query index is built for the verification fields involved in the verification rules so that records can be queried quickly by verification field value. A timestamp-based auxiliary index is built for the raw data records to support data quality verification based on a time window. For the full data accumulated historically, an index file is also built and stored on the distributed file system HDFS, so that data can be loaded quickly for batch data quality verification without a full table scan of HBase. For incremental data arriving in real time, an operation log is kept, which solves the problem of maintaining the index file when data records are added, deleted, or modified. The operation log and the index file are merged periodically to reduce the merge overhead during batch data quality verification. Verification rules are executed in parallel, with each parallel task processing one or more verification rules.

The flow for storing and indexing batch data includes the following steps:

(1) Store the reference data table and the comparison data table to be verified, both in CSV format, in HBase. The primary key of a raw data record serves as the primary key of the HBase table, and each non-primary-key attribute of the raw data record serves as a column of the HBase table. Different columns belong to different column families, so that the column-oriented storage of HBase (data in the same column family is stored together) improves response performance when a particular column is queried;

(2) Store the query index table based on the verification field of the verification rules in HBase. The verification field serves as the primary key of the HBase query index table, and the primary keys of the raw data records serve as the column names of the query index table. All of these primary keys belong to the same column family. This data layout makes it convenient to add, delete, modify, and query entries in the query index table;

(3) Store the query index table based on the data record timestamp in HBase. The data record timestamp serves as the primary key of the HBase query index table, and the primary key of the raw data record is stored as the column value of the query index table;

(4) When the query index table based on the verification field of the verification rules is stored in HBase, also store the query index table in the index file on HDFS.
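The table layouts in steps (1) to (3) above can be sketched in Python as nested dictionaries built from a small CSV sample. This is only a model of the layouts, not HBase client code; the column names (`id`, `ts`, `meter_id`, `voltage`) and the sample rows are hypothetical, and storing several primary keys under one timestamp as a list is an assumption for records that share a timestamp.

```python
import csv
import io

# Hypothetical CSV sample standing in for a data table to be verified.
CSV_TEXT = """id,ts,meter_id,voltage
r1,1000,M-001,220.0
r2,1005,M-002,219.5
r3,1010,M-001,221.0
"""


def load_tables(csv_text, key_col, ts_col, check_col):
    """Build the raw table, the field index table, and the timestamp index table."""
    raw, field_index, ts_index = {}, {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pk = row[key_col]
        # Step (1): row key = record primary key, other attributes = columns.
        raw[pk] = {k: v for k, v in row.items() if k != key_col}
        # Step (2): row key = verification field value,
        # column names = primary keys of the matching records (values unused).
        field_index.setdefault(row[check_col], {})[pk] = ""
        # Step (3): row key = timestamp, column value = record primary key(s).
        ts_index.setdefault(int(row[ts_col]), []).append(pk)
    return raw, field_index, ts_index


raw, field_index, ts_index = load_tables(CSV_TEXT, "id", "ts", "meter_id")
print(field_index)  # {'M-001': {'r1': '', 'r3': ''}, 'M-002': {'r2': ''}}
print(ts_index)     # {1000: ['r1'], 1005: ['r2'], 1010: ['r3']}
```

Using the primary key as the column name in the field index means that adding or removing one record touches a single cell of the index row rather than rewriting a list.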

The flow for storing and indexing incremental data includes the following steps:

(1) Insert the incremental data records into the raw data record table in HBase;

(2) Insert the query index entries for the incremental data records, based on the verification field of the verification rules, into the query index in HBase;

(3) Insert the query index entries for the incremental data records, based on the data record timestamp, into the auxiliary index in HBase;

(4) Append the operation log entries for the incremental data records to the operation log file on HDFS.

The flow for merging the operation log into the index file includes the following steps:

(1) Read the index file on HDFS into memory;

(2) Read the operation log file on HDFS and apply the operations to the in-memory index one by one;

(3) Write the in-memory index back to the index file on HDFS;

(4) Delete the operation log file on HDFS.
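The four merge steps above can be sketched as follows. This is a minimal sketch that uses local files in a temporary directory in place of HDFS, and it assumes JSON for the index file and one JSON array per line, `[operation, field_value, primary_key]`, for the operation log. Neither format is specified by the patent, and a modification is modeled here as a delete followed by an add.

```python
import json
import os
import tempfile


def merge_log_into_index(index_path, log_path):
    """Merge an operation log into an index file, following steps (1) to (4)."""
    # Step (1): read the index file into memory.
    with open(index_path) as f:
        index = {value: set(keys) for value, keys in json.load(f).items()}
    # Step (2): apply the logged operations one by one, in order.
    with open(log_path) as f:
        for line in f:
            if not line.strip():
                continue
            op, field_value, pk = json.loads(line)
            keys = index.setdefault(field_value, set())
            if op == "add":
                keys.add(pk)
            elif op == "delete":
                keys.discard(pk)
            else:
                raise ValueError(f"unknown operation: {op}")
            if not keys:
                del index[field_value]  # drop index rows with no records left
    # Step (3): write the in-memory index back to the index file.
    with open(index_path, "w") as f:
        json.dump({value: sorted(keys) for value, keys in index.items()}, f)
    # Step (4): delete the operation log file.
    os.remove(log_path)
    return index


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "index.json")
    log_path = os.path.join(tmp, "oplog.jsonl")
    with open(index_path, "w") as f:
        json.dump({"M-001": ["r1"], "M-002": ["r2"]}, f)
    with open(log_path, "w") as f:
        f.write(json.dumps(["add", "M-001", "r3"]) + "\n")
        f.write(json.dumps(["delete", "M-002", "r2"]) + "\n")
    merged = merge_log_into_index(index_path, log_path)
    log_removed = not os.path.exists(log_path)
    with open(index_path) as f:
        on_disk = json.load(f)

print(merged)
```

The log is deleted only after the merged index has been written, so a failure before step (4) leaves a log that can be replayed.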

The parallelized verification rule processing flow is as follows:

(1) Write the verification type and the verification timestamp range into the configuration file;

(2) Start the MapReduce job to begin data quality verification;

(3) Each Map task reads one instruction file and obtains parameters such as the rule name, the rule execution logic identifier, the input data table, and the output data table; it also reads the verification type and the verification timestamp range from the configuration file;

(4) For batch verification, verify according to the batch-data single-rule verification flow;

(5) For verification based on a time window, verify according to the incremental-data single-rule verification flow, using the timestamp range.

The batch-data single-rule verification flow is as follows:

(1) Read the query index table on HDFS into memory, read the operation log and apply it to the in-memory query index table, and delete the operation log file;

(2) Traverse the in-memory query index table to perform rule verification.

The incremental-data single-rule verification flow is as follows:

(1) Using the start timestamp and the end timestamp, query the timestamp index table to obtain the IDs of all records within the incremental time window; then query the raw data record table to obtain the corresponding set of verification fields;

(2) Using the field values in the verification field set, query the auxiliary index table to obtain the comparison field values, and perform the verification.

The incremental-data single-rule verification flow above also applies to the verification of raw data.
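The time-window flow above can be sketched in Python as follows. The sketch uses a sorted list with binary search in place of a timestamp range scan on the HBase index table, and a dictionary keyed by verification field value in place of the auxiliary index over the reference table. The record layout, the `meter_id` and `voltage` fields, and the tolerance rule are all hypothetical and only stand in for whatever a concrete verification rule would check.

```python
from bisect import bisect_left, bisect_right

# Hypothetical comparison records, keyed by record ID.
raw = {
    "r1": {"ts": 1000, "meter_id": "M-001", "voltage": 220.0},
    "r2": {"ts": 1005, "meter_id": "M-002", "voltage": 219.5},
    "r3": {"ts": 1010, "meter_id": "M-001", "voltage": 221.0},
}
# Timestamp index: (timestamp, record ID) pairs sorted by timestamp.
ts_index = sorted((row["ts"], pk) for pk, row in raw.items())
timestamps = [ts for ts, _ in ts_index]
# Hypothetical reference values, indexed by the verification field.
reference_by_meter = {"M-001": 220.0, "M-002": 230.0}


def record_ids_in_window(start, end):
    """Step (1): IDs of all records with start <= timestamp <= end."""
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return [pk for _, pk in ts_index[lo:hi]]


def verify_window(start, end, tolerance):
    """Step (2): compare each record in the window with its reference value."""
    failures = []
    for pk in record_ids_in_window(start, end):
        row = raw[pk]
        expected = reference_by_meter.get(row["meter_id"])
        if expected is None or abs(row["voltage"] - expected) > tolerance:
            failures.append(pk)
    return failures


print(record_ids_in_window(1004, 1010))  # ['r2', 'r3']
print(verify_window(1004, 1010, 5.0))    # ['r2']
```

Only the records inside the window are read, which is what lets incremental verification avoid touching the historical data.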

As shown in FIG. 7, the distributed storage and indexing method of the present invention is implemented as follows. To process a large number of data records and a large number of verification rules quickly, the original data tables are stored in HBase and, in addition, dedicated fast data index tables are designed for the fields involved in the verification rules and also stored in HBase. For example, in original data table 1 and original data table 2, the primary key (the rowkey field) is the ID of each record. If field A of original data table 1 and field B of original data table 2 need to be verified, index tables are built for field A and for field B respectively, so that lookups are fast during verification. To support incremental data quality verification based on a time window and fine-grained data quality verification, a timestamp query index is built for the raw data record table, so that the range of data to be verified can be delimited by a timestamp range. As shown in FIG. 8, to improve the quality verification performance for the full historical data, an auxiliary HDFS index file and an operation log are built for the data record table, so that the verification data can be loaded into memory quickly for full data verification.

The parallel processing of verification rules in the present invention is implemented as follows. To process a large number of data records and a large number of verification rules quickly, a MapReduce-based parallel execution mechanism is used. As shown in FIG. 9, the ID and parameters of each verification rule are first written to separate HDFS files (called instruction files), and the MapReduce job contains the implementations of the processing modules for all of these verification rules. Under the default operating mechanism of Hadoop MapReduce, each Map task reads and processes only one instruction file, and the choice of processing module is determined by the instruction file that the task reads.

In this way, all Map nodes in the cluster execute different verification rules concurrently. If a failure occurs during execution, Hadoop MapReduce automatically starts new Map tasks on other nodes to retry the affected verification rules. Load balancing and fault tolerance for the whole parallel process are handled by the Hadoop MapReduce framework.
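The execution model described above (one task per instruction file, with failed tasks retried) can be imitated locally with a thread pool. This is only an analogy: it does not use any Hadoop API, threads stand in for Map tasks, and the retry loop stands in for the framework rescheduling a task on another node. The rule IDs and the simulated failure are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

attempts = {}  # number of attempts per rule, for illustration only


def run_rule(rule_id):
    """Stand-in for one task executing the rule named in its instruction file."""
    attempts[rule_id] = attempts.get(rule_id, 0) + 1
    if rule_id == "rule-2" and attempts[rule_id] == 1:
        raise RuntimeError("simulated task failure")  # fails on first attempt only
    return f"{rule_id}: done"


def run_with_retry(rule_id, max_attempts=3):
    """Retry a failed task, re-raising the error after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_rule(rule_id)
        except RuntimeError:
            if attempt == max_attempts:
                raise


rule_ids = ["rule-1", "rule-2", "rule-3"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Each rule runs in its own worker; map() returns results in input order.
    results = list(pool.map(run_with_retry, rule_ids))

print(results)   # ['rule-1: done', 'rule-2: done', 'rule-3: done']
print(attempts)
```

Because the rules do not depend on one another, a failure in one task delays only that rule while the others proceed.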

A prototype system has been implemented for the present invention on top of existing open-source software. HBase is used for distributed storage and indexing, and HDFS and MapReduce are used for the parallel processing of verification rules; these three software packages are not themselves part of the present invention. The prototype system and an existing relational data management system were tested and compared using real grid business data and verification rules. The prototype system outperformed the traditional relational data management system in response performance and scalability, which demonstrates the effectiveness of the grid data quality verification method of the present invention based on distributed storage and parallel processing.

The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.

Those of ordinary skill in the art will understand that all or some of the steps of the methods in the embodiments above can be carried out by a program instructing the relevant hardware. The program may be stored in a readable storage medium and, when executed, performs the steps of the methods described above. The storage medium includes ROM/RAM, magnetic disks, optical discs, and the like.

The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and all of these fall within the protection scope of the present invention. The protection scope of this patent is therefore defined by the appended claims.