Disclosure of Invention
Aiming at the technical problems existing at present, the invention provides a data quality guarantee method under a big data scene, which provides one-stop data quality assurance throughout the entire life cycle of the data.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the data quality guarantee method under the big data scene comprises the following steps:
step 1, performing quality evaluation on data in the pre-online stage based on field information and data volume;
step 2, performing push verification on data in the pre-online stage based on the total data volume before and after pushing;
step 3, analyzing and repairing core data in the online stage based on automated testing of report interfaces;
and step 4, monitoring and analyzing data in the post-online stage.
Further, the method further comprises: acquiring, before step 1, information on all databases in the data warehouse and information on all data tables in each database.
Further, step 1 includes: step 11, acquiring any data table of a database by using a model analysis tool, and counting the names and types of all fields of the current data table; step 12, looping over all fields of the current data table and counting field information: the null value count and null value rate for fields of all types, the de-duplicated (distinct) count for character-type fields, and the maximum value, minimum value and value range for numeric-type fields; step 13, counting the data volume of the current data table; step 14, looping over all data tables of the database, repeating steps 11-13, and writing the statistical results into an Excel file, with the statistical result of each data table stored in its own worksheet; and step 15, carrying out quality evaluation on the data in the database according to the statistical results.
Further, step 2 includes: step 21, using a data comparison tool to judge whether the total data volume of a source data table in the data warehouse before pushing is consistent with that of the target data table after pushing; and step 22, if the total data volumes of the source data table and the target data table are consistent, randomly selecting N rows of data from the source data table and comparing them field by field with the corresponding rows in the target data table, where N is a positive integer; if every field comparison matches, the push verification succeeds.
Further, the following monitoring methods are adopted in step 4 for monitoring and analysis: method one, enumeration value monitoring is adopted to analyze and repair the data tables of the original data layer in the data warehouse; method two, interface result monitoring is adopted to analyze and repair the reports generated by the report interfaces; method three, layered data monitoring is adopted to monitor the data extraction of the original data layer in the data warehouse and the scheduling tasks of the data detail layer in the data warehouse; method four, incremental monitoring is adopted to monitor the data changes of the original data layer in the data warehouse; method five, global transaction table monitoring is adopted to monitor misuse of transaction tables; methods six to eight, consistency monitoring of the total data volume before and after pushing a data table in the data warehouse is adopted to monitor the consistency of data pushed to ES, CK and MYSQL respectively; method nine, specified-table monitoring is adopted to monitor whether the current-day partition of a specified data table contains data; method ten, dimension-table uniqueness monitoring is adopted to monitor the data uniqueness of data tables with a uniqueness configuration; and method eleven, index null-value-rate monitoring is adopted to monitor the null values of data tables with configured core fields.
Further, the enumeration value monitoring method comprises: monitoring fixed fields of the data tables of the original data layer, writing the enumeration values of the fixed fields into a database, periodically checking whether the enumeration values have changed, triggering an alarm when they change, analyzing the influence of the change on the data table, and repairing the data table.
Further, the interface result monitoring method comprises: establishing a test plan on the automation platform, invoking and testing all automation cases registered for the interfaces on the automation platform, monitoring the test results, triggering an alarm when an automation case fails, analyzing the test results and repairing the reports.
Further, in step 3, interface result monitoring is adopted to test only the automation cases registered for the interfaces of the core reports on the automation platform, and the test results are analyzed and the core reports are repaired.
Further, in step 3, if no automation case has been registered for the interface of a core report, methods one to eleven of step 4 are adopted to analyze and repair the core data in the online stage.
Further, the method further comprises the following steps: the method comprises the steps of obtaining SQL logic of interface configuration by analyzing interface configuration associated with a report, wherein the SQL logic is configured by a single or a plurality of associated data tables, extracting interfaces associated with the abnormal data tables when the abnormal data tables are obtained by a monitoring method from a first method to a method eleventh, extracting the report of the interface, and analyzing the affected range of the report.
Compared with the prior art, the invention has the following beneficial effects:
the invention takes the life cycle of data as a timeline and effectively correlates a plurality of systems, thereby providing a set of data quality assurance methods covering different stages and different aspects. Through data monitoring in the pre-online, online and post-online stages respectively, the invention realizes data quality assurance of prior prevention, rapid in-process handling and post-hoc remediation.
Detailed Description
Term interpretation:
ES, short for Elasticsearch, is a distributed, scalable, real-time search and analytics engine.
CK, short for ClickHouse, is an open-source column-oriented database.
MYSQL is an open-source relational database management system.
Doris is an MPP analytical database.
SQL logic refers to the rules and statements used for data querying and filtering.
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the data quality guarantee method under a big data scene provided by the invention comprises the following steps:
step 1, performing quality evaluation on data in the pre-online stage based on field information and data volume;
step 2, performing push verification on data in the pre-online stage based on the total data volume before and after pushing;
step 3, analyzing and repairing core data in the online stage based on automated testing of report interfaces;
and step 4, monitoring and analyzing data in the post-online stage.
The invention takes the life cycle of data as a timeline and effectively correlates a plurality of systems, thereby providing a set of data quality assurance methods covering different stages and different aspects. Through data monitoring in the pre-online, online and post-online stages respectively, the invention realizes data quality assurance of prior prevention, rapid in-process handling and post-hoc remediation.
Before step 1, information on all databases and all data tables in each database is acquired. The database information comprises the name of the database, the data tables under the database and the remark information of the database; the data table information comprises the name of the database to which the data table belongs, the name of the data table, the responsible person of the data table, whether the data table is a partition table, the number of partitions, the partition type, the partition field type, whether the data table is a transaction table, and the remark information of the data table. Together, the database information and the data table information form the data warehouse information. The asset status of the data platform, such as the number of databases and data tables, can be known in full from the data warehouse information. Meanwhile, the subsequent monitoring configuration can use the database names and data table names directly, which reduces the response time of the interfaces that query the database. Some necessary information about a database or data table cannot be obtained directly and needs to be supplemented and maintained at a later stage; for example, the responsible person recorded in the data warehouse may change after the data table structure is edited. Some data tables also require manual maintenance and correction, because the lack of early data specifications left the partition type and partition format non-standard.
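For illustration only, the following minimal sketch shows one way such data warehouse information could be collected, assuming a MySQL-compatible metadata store queried through its information_schema; the connection parameters and the pymysql library are assumptions, not part of the invention.

```python
# A minimal sketch of collecting data-warehouse metadata, assuming a
# MySQL-compatible metadata store; connection parameters are hypothetical.
import pymysql

conn = pymysql.connect(host="warehouse-host", user="reader",
                       password="***", database="information_schema")

with conn.cursor() as cur:
    # Database-level information: database names.
    cur.execute("SELECT schema_name FROM schemata")
    databases = [row[0] for row in cur.fetchall()]

    tables = []
    for db in databases:
        # Table-level information: owning database, table name, remarks.
        cur.execute(
            "SELECT table_schema, table_name, table_comment "
            "FROM tables WHERE table_schema = %s", (db,))
        tables.extend(cur.fetchall())

# Partition type/fields, responsible person, transaction-table flag, etc.
# would be merged in from other systems or maintained manually.
print(f"{len(databases)} databases, {len(tables)} tables collected")
conn.close()
```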
First, data correctness must be verified in the pre-online stage, namely the research and development and testing stage, in which both quality evaluation and push verification are carried out.
The quality evaluation is an analysis of the data scale, and specifically comprises the following steps: step 11, acquiring any data table of a database by using a model analysis tool, and counting the names and types of all fields of the current data table; step 12, looping over all fields of the current data table and counting field information: the null value count and null value rate for fields of all types, the de-duplicated (distinct) count for character-type fields, and the maximum value, minimum value and value range for numeric-type fields; step 13, counting the data volume of the current data table; step 14, looping over all data tables of the database, repeating steps 11-13, and writing the statistical results into an Excel file, with the statistical result of each data table stored in its own worksheet; and step 15, carrying out quality evaluation on the data in the database according to the statistical results, for example, judging whether fields representing amounts and quantities reasonably use numeric field types and whether the value range of an order-amount field is consistent with the business, and repairing the data that fails the quality evaluation.
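For illustration, a minimal sketch of steps 11-14 follows, assuming each data table can be loaded as a pandas DataFrame; the pandas-based approach and the table names are assumptions, not a prescribed implementation.

```python
# A minimal sketch of the quality-evaluation statistics (steps 11-14),
# assuming each table is available as a pandas DataFrame.
import pandas as pd

def field_statistics(df: pd.DataFrame) -> pd.DataFrame:
    rows, total = [], len(df)
    for name, series in df.items():
        info = {"field": name, "type": str(series.dtype),
                "null_count": int(series.isna().sum()),
                "null_rate": float(series.isna().mean()) if total else 0.0}
        if pd.api.types.is_numeric_dtype(series):
            # Numeric fields: value range (minimum / maximum).
            info["min"], info["max"] = series.min(), series.max()
        else:
            # Character fields: de-duplicated (distinct) count.
            info["distinct_count"] = int(series.nunique())
        rows.append(info)
    stats = pd.DataFrame(rows)
    stats["row_count"] = total          # data volume of the current table
    return stats

tables = {"orders": pd.DataFrame({"amount": [10.5, None, 3.2],
                                  "status": ["A", "B", "A"]})}
with pd.ExcelWriter("quality_report.xlsx") as writer:
    for table_name, df in tables.items():
        # One worksheet per data table, as in step 14.
        field_statistics(df).to_excel(writer, sheet_name=table_name, index=False)
```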
Because data is used in different ways, the data in the data warehouse may also be pushed by a pushing tool to different types of data storage media (MYSQL, CK, ES, Doris, etc.), and the correctness of the pushing needs to be verified. This specifically comprises the following steps: step 21, using a data comparison tool to judge whether the total data volume of the source data table (the data to be pushed) in the data warehouse before pushing is consistent with the total data volume of the target data table (the pushed data in the target storage medium and database) after pushing; the total data volumes are obtained by connecting to the databases and executing SQL count statements, and are then compared; step 22, if the total data volumes of the source and target data tables are consistent, randomly selecting N rows of data from the source data table through a configured unique key (that is, one or more fields that identify exactly one row in the source data table) and querying the corresponding data in the target data table according to the unique key; if the query fails, the push verification fails; if the query succeeds, the fields of the N rows are compared one by one, and if every field comparison matches, the push verification succeeds; otherwise the push verification fails, indicating an abnormality in the pushing program.
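For illustration, a minimal sketch of steps 21-22 follows, assuming DB-API cursors for the source data warehouse and the target storage medium; the table name, unique key and sample size are hypothetical.

```python
# A minimal sketch of push verification (steps 21-22); MYSQL/CK/ES adapters
# would differ in practice, the unique key and N are illustrative.
import random

def push_verify(src_cur, dst_cur, table, unique_key, n=100):
    # Step 21: compare the total data volume before and after pushing.
    src_cur.execute(f"SELECT COUNT(*) FROM {table}")
    dst_cur.execute(f"SELECT COUNT(*) FROM {table}")
    if src_cur.fetchone()[0] != dst_cur.fetchone()[0]:
        return False                      # totals differ -> verification fails

    # Step 22: randomly sample N rows from the source by the unique key.
    src_cur.execute(f"SELECT {unique_key} FROM {table}")
    keys = random.sample([r[0] for r in src_cur.fetchall()], n)
    for key in keys:
        src_cur.execute(f"SELECT * FROM {table} WHERE {unique_key} = %s", (key,))
        dst_cur.execute(f"SELECT * FROM {table} WHERE {unique_key} = %s", (key,))
        src_row, dst_row = src_cur.fetchone(), dst_cur.fetchone()
        if dst_row is None or src_row != dst_row:
            return False                  # missing row or field mismatch
    return True                           # every field matches -> success
```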
The scheduling platform then executes the scheduling tasks at fixed times every day, processing the previous day's data in the early morning. In this stage, because the scheduling tasks of the scheduling platform are highly complex and place high demands on all dependencies such as the environment, network and device resources, failure at any link can cause the day's data processing tasks to fail, thereby affecting the on-time, high-quality output of all reports. In this scenario, the whole big data team needs to prioritize the correctness of core reports and ensure the accuracy of core data. Therefore, the scheduling platform invokes the interface result monitoring method, which ensures that the report system functions correctly and that its logic is unchanged, specifically as follows: interface result monitoring is adopted to test the automation cases registered for the interfaces of the core reports on the automation platform, in order to judge whether the core data is processed normally, and finally the test results are analyzed and the core reports are repaired. If automation cases fail, an alarm is triggered; the alarm includes the number and proportion of failed automation cases as well as a link to the detailed query results of the interface result monitoring. This ensures that problems are found before users notice them, leaving sufficient time for repair.
After the scheduling platform has executed all tasks, the post-online stage begins; at this point, the online data are monitored and analyzed by the following monitoring methods:
Method one, enumeration value monitoring is adopted to analyze and repair the data tables of the original data layer in the data warehouse. A data warehouse can be roughly divided into an original data layer (ods layer), a data detail layer (dwd layer), a data service layer (dws layer), a data application layer (ads layer), and so on. The data sources of the original data layer (ods layer) are synchronized from the service system databases into the data warehouse by various synchronization programs, and all big data development starts from the ods layer. When an enumeration value in the service system changes, the correctness of the data is affected; for example, in a case when ... else statement, if a changed or newly added enumeration value is not handled specially, the abnormal data will affect the correctness of the report results.
The enumeration value monitoring method comprises: monitoring fixed fields of the data tables of the original data layer (ods layer), writing the enumeration values of the fixed fields into a database, periodically checking whether the enumeration values have changed, triggering an alarm when they change, analyzing the influence of the change on the data table, and repairing the data table. For example, the enumeration values (A/B/C) of a fixed field are written into the database; when they change to (A/B/C/E), a new enumeration value E has been added, and the responsible development engineer must analyze, based on the fixed field and the new enumeration value E, what influence the addition will cause and how to repair the data.
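For illustration, a minimal sketch of enumeration value monitoring follows, assuming a helper that returns the current distinct values of the monitored field and a previously stored baseline; the alarm is represented by a simple print and the table and field names are hypothetical.

```python
# A minimal sketch of enumeration-value monitoring; query_distinct() is a
# placeholder for a real SELECT DISTINCT against the ods-layer table.
def query_distinct(table: str, field: str) -> set:
    # Placeholder for: SELECT DISTINCT field FROM table
    return {"A", "B", "C", "E"}

def check_enum(table: str, field: str, baseline: set) -> set:
    current = query_distinct(table, field)
    added, removed = current - baseline, baseline - current
    if added or removed:
        # Trigger an alarm so the responsible engineer can analyse the
        # impact of the change (e.g. a new value E) and repair the data.
        print(f"ALERT {table}.{field}: added={added}, removed={removed}")
    return current                        # becomes the new stored baseline

baseline = {"A", "B", "C"}                # enumeration values written earlier
baseline = check_enum("ods_order", "order_status", baseline)
```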
Method two, interface result monitoring is adopted to analyze and repair the reports generated by the report interfaces. A test plan is established on the automation platform, all automation cases registered for the interfaces on the automation platform are invoked and tested, the test results are monitored, an alarm is triggered when an automation case fails, and the test results are analyzed and the reports are repaired. The automation cases ensure that the report interfaces can be invoked normally and that the report and data logic have not changed. The interface logic works as follows: the scheduling platform passes in the interface paths, the corresponding automation cases are located through the interface definition dolphinRunApi(List<String> apiList) and then executed; when an automation case fails, the information of the on-duty person is obtained directly from a data table stored in the data warehouse, and finally an alarm is triggered.
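For illustration, a minimal sketch of interface result monitoring follows; the helpers run_case() and on_duty() are hypothetical stand-ins for executing one automation case and for looking up the on-duty person from a data warehouse table, and the interface paths are illustrative.

```python
# A minimal sketch of interface-result monitoring; the api_list parameter
# plays the role of the interface paths passed by the scheduling platform.
def run_case(api_path: str) -> bool:
    return True                            # placeholder test execution

def on_duty() -> str:
    return "engineer_a"                    # placeholder warehouse lookup

def monitor_interfaces(api_list):
    failed = [api for api in api_list if not run_case(api)]
    if failed:
        ratio = len(failed) / len(api_list)
        # The alarm carries the failure count, the failure proportion and a
        # link to the detailed results, and is sent to the on-duty person.
        print(f"ALERT to {on_duty()}: {len(failed)} case(s) failed "
              f"({ratio:.0%}), details: <result link>")

monitor_interfaces(["/report/core/sales", "/report/core/inventory"])
```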
Method three, layered data monitoring is adopted to monitor the data extraction of the original data layer (ods layer) in the data warehouse and the scheduling tasks of the data detail layer (dwd layer) in the data warehouse. If the ods layer is monitored, it can be discovered whether the source service system data has been extracted into big data normally, verifying data integrity; if the dwd layer is configured, it can be discovered whether the scheduling tasks of the data warehouse are running normally, verifying link integrity.
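For illustration, a minimal sketch of layered data monitoring follows, assuming the ods/dwd tables are partitioned by a date field named dt and accessed through a DB-API cursor; the field and table names are assumptions.

```python
# A minimal sketch of layered data monitoring for ods/dwd tables.
from datetime import date

def partition_has_data(cur, table: str, dt: str) -> bool:
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE dt = %s", (dt,))
    return cur.fetchone()[0] > 0

def check_layer(cur, tables, layer):
    today = date.today().strftime("%Y-%m-%d")
    for table in tables:
        if not partition_has_data(cur, table, today):
            # ods: extraction from the service system may have failed;
            # dwd: the scheduling task may not have run normally.
            print(f"ALERT [{layer}] {table}: no data for partition {today}")

# e.g. check_layer(cur, ["ods_order", "ods_user"], "ods")
#      check_layer(cur, ["dwd_order_detail"], "dwd")
```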
Method four, incremental monitoring is adopted to monitor the data changes of the original data layer (ods layer) in the data warehouse. This method is generally used for ods-layer data tables whose data is synchronized by a full synchronization scheme, and it aims to monitor how the data changes. The service system does not allow physical deletion, so the data volume should only increase; if the data volume decreases, existing reports may be affected, and for tables with rapid business growth, no growth or very little growth also needs to be investigated. The purpose is to ensure data consistency.
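For illustration, a minimal sketch of incremental monitoring follows, assuming yesterday's row count was recorded by the previous monitoring run; the thresholds and table names are illustrative.

```python
# A minimal sketch of incremental monitoring for fully-synchronized ods tables.
def check_increment(table: str, today_count: int, yesterday_count: int,
                    min_growth: int = 0):
    if today_count < yesterday_count:
        # Data decreased although physical deletion is not allowed:
        # existing reports may be affected, investigate immediately.
        print(f"ALERT {table}: rows dropped {yesterday_count} -> {today_count}")
    elif today_count - yesterday_count <= min_growth:
        # A fast-growing business table with little or no growth also
        # needs investigation.
        print(f"WARN {table}: growth only {today_count - yesterday_count}")

check_increment("ods_order", today_count=100_000, yesterday_count=100_500)
```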
Method five, global transaction table monitoring is adopted to detect misuse of transaction tables, so as to prevent transaction tables from being misused.
Methods six to eight, consistency monitoring of the data pushed to ES, CK and MYSQL is adopted to monitor the consistency of the total data volume of a data table in the data warehouse before and after pushing. These three kinds of monitoring verify that the total data volume is consistent before and after pushing.
Method nine, specified-table monitoring is adopted to monitor whether the partition of a specified data table contains data. This method is an extension of layered data monitoring: the data table information does not have to follow the layering rules, and whether the current-day partition of the specified data table contains data can be verified directly.
Method ten, dimension-table uniqueness monitoring is adopted to monitor the data uniqueness of data tables with a uniqueness configuration. The uniqueness configuration may be a single field or multiple fields, and its purpose is to ensure that the data is unique and contains no duplicate records.
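For illustration, a minimal sketch of dimension-table uniqueness monitoring follows, assuming a DB-API cursor and a uniqueness configuration expressed as a list of field names; the table and field names are illustrative.

```python
# A minimal sketch of uniqueness monitoring via a GROUP BY ... HAVING check.
def check_uniqueness(cur, table: str, unique_fields: list[str]):
    cols = ", ".join(unique_fields)
    cur.execute(
        f"SELECT {cols}, COUNT(*) AS cnt FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1")
    duplicates = cur.fetchall()
    if duplicates:
        # Duplicate key combinations violate the uniqueness configuration.
        print(f"ALERT {table}: {len(duplicates)} duplicated key(s) on ({cols})")

# e.g. a single-field or multi-field uniqueness configuration:
# check_uniqueness(cur, "dim_store", ["store_id"])
# check_uniqueness(cur, "dim_sku", ["store_id", "sku_id"])
```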
Method eleven, index null-value-rate monitoring is adopted to monitor the null values of data tables with configured core fields. The core fields of some data tables are configured: if no null value rate is configured, null values must not appear at all; if a null value rate is configured, the actual null value rate must be smaller than the configured value. The null value rate is calculated as: (number of rows in which the field is null) / (total number of rows in the table); if the field is of character type, the formula is: (number of rows in which the field is null + number of rows in which the field is empty) / (total number of rows in the table).
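For illustration, a minimal sketch of the null-value-rate calculation follows, assuming a MySQL-compatible SQL dialect and a DB-API cursor; the table, field and configured threshold are illustrative.

```python
# A minimal sketch of index null-value-rate monitoring.
def null_rate(cur, table: str, field: str, is_char: bool) -> float:
    if is_char:
        # Character fields: (null rows + empty-string rows) / total rows.
        cur.execute(f"SELECT SUM({field} IS NULL OR {field} = ''), COUNT(*) "
                    f"FROM {table}")
    else:
        # Other fields: null rows / total rows.
        cur.execute(f"SELECT SUM({field} IS NULL), COUNT(*) FROM {table}")
    nulls, total = cur.fetchone()
    return float(nulls or 0) / total if total else 0.0

def check_null_rate(cur, table, field, is_char, configured_rate=None):
    rate = null_rate(cur, table, field, is_char)
    # No configured rate means null values are not allowed at all; otherwise
    # the actual rate must stay below the configured value.
    limit = 0.0 if configured_rate is None else configured_rate
    if rate > limit:
        print(f"ALERT {table}.{field}: null rate {rate:.2%} exceeds {limit:.2%}")
```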
The above monitoring methods can be configured to run at fixed times according to actual needs, and multiple monitoring tasks of each type can be configured according to the actual situation.
In the online stage, if no automation case has been registered for the interface of a core report, the above eleven monitoring methods are adopted to analyze and repair the core data.
The invention also includes: obtaining the SQL logic of an interface configuration by analyzing the interface configuration associated with a report, the SQL logic being configured from a single data table or a plurality of associated data tables; when an abnormal data table is detected by any of the monitoring methods one to eleven, extracting the interfaces associated with the abnormal data table, extracting the reports of those interfaces, and analyzing the affected range of the reports.
The report processing logic comprises interface registration and report development. Interface registration means that a BI development engineer, according to an explicit report requirement, completes the development of an SQL script (SQL logic) on the interface platform based on the single or multiple associated data tables being monitored, registers the SQL script as an interface, and establishes the relation between the data and the interface, which is stored in the data warehouse. Report development means that a front-end development engineer obtains a report by applying front-end low-code configuration to the data returned by the registered interface on the report platform, and establishes the relation between the report and the interface, which is stored in the data warehouse. Through these operations, the relations among data, interfaces and reports are constructed; when data is monitored to be abnormal, the affected interfaces are found on the interface platform according to the abnormal data, the affected reports are then quickly located on the report platform according to the affected interfaces, and the affected report range can be analyzed, as sketched below.
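For illustration, a minimal sketch of this impact analysis follows, assuming the data-to-interface and interface-to-report relations stored in the data warehouse can be read as simple mappings; all table, interface and report names are hypothetical.

```python
# A minimal sketch of locating the affected reports from an abnormal data table.
table_to_interfaces = {
    "ods_order": ["/api/order_summary", "/api/order_trend"],
}
interface_to_reports = {
    "/api/order_summary": ["Daily Sales Report"],
    "/api/order_trend": ["Sales Trend Dashboard"],
}

def affected_reports(abnormal_table: str) -> set[str]:
    reports = set()
    for api in table_to_interfaces.get(abnormal_table, []):
        reports.update(interface_to_reports.get(api, []))
    return reports

# When monitoring flags ods_order as abnormal, the affected report range is:
print(affected_reports("ods_order"))
```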
After all monitoring tasks are executed, the monitoring results are recorded and the monitoring details are output by day, forming a dashboard for the whole quality platform, which facilitates problem localization and troubleshooting.
Finally, it should be noted that: the above embodiments are merely preferred embodiments of the present invention, intended to illustrate rather than limit the technical solution of the present invention; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the corresponding technical solutions; that is, even if the main design concept and spirit of the present invention are modified or refined in an insubstantial manner, as long as the technical problem solved remains consistent with the present invention, the modification falls within the protection scope of the present invention; in addition, application of the technical scheme of the invention, directly or indirectly, to other related technical fields is likewise included in the scope of the invention.