Disclosure of Invention
In view of the defects in the prior art, the present application provides a data cleaning processing method and apparatus, an electronic device, and a storage medium, so as to solve the problem that existing data cleaning approaches are too complicated.
In order to achieve the above object, the present application provides the following technical solutions:
The first aspect of the present application provides a method for data cleaning processing, including:
acquiring data of an original data source;
converting the data of the original data source into data to be processed with uniform caliber according to a preset conversion rule;
cleaning the data to be processed through a flow calculation model according to a pre-configured logic flow arrangement and a data logic configuration, to obtain result data;
based on pre-configured alarm information and verification information, analyzing whether data which does not meet a corresponding expected result exists in the result data or not;
if it is analyzed that no data which does not meet the corresponding expected result exists in the result data, outputting the result data to a target table;
and visually displaying the result data in the target table.
Optionally, in the above method for data cleansing processing, the acquiring data of the original data source includes:
Collecting data of a plurality of data sources;
and storing the data of each data source according to a preset storage format to obtain a plurality of original data sources.
Optionally, in the above method for data cleaning processing, the converting the data of the original data source into the data to be processed with a uniform caliber according to a preset conversion rule includes:
Screening the original data sources to be processed from a plurality of original data sources;
Determining the data type of the data of the original data source to be processed according to the metadata of the data of the original data source to be processed;
And converting the data types of the data of each original data source to be processed according to the pre-established mapping relation between each field of each original data source to be processed and the data types of each field of the target data source, so as to obtain the data to be processed with uniform caliber.
Optionally, in the above method for data cleaning processing, after converting the data of the original data source into the data to be processed with a uniform caliber according to a preset conversion rule, the method further includes:
and carrying out local persistence processing on the data to be processed.
Optionally, in the above method for data cleaning processing, before the cleaning the data to be processed through the flow calculation model according to a pre-configured logic flow arrangement and a data logic configuration to obtain result data, the method further includes:
configuring the logic flow arrangement, the data logic configuration and the alarm information of the flow calculation model in response to configuration operation of a user;
Triggering any flow node of the flow computing model, and cleaning the debugging data according to pre-configured logic flow arrangement and data logic configuration to obtain a cleaning result of the any flow node;
Displaying the cleaning result of any flow node, and analyzing whether the cleaning result of any flow node meets the corresponding expected result or not based on the alarm information and the verification information;
and if the cleaning result of any one of the flow nodes does not meet the corresponding expected result, generating alarm prompt information based on the cleaning result of any one of the flow nodes, and visualizing the alarm prompt information.
The second aspect of the present application provides an apparatus for data cleansing processing, comprising:
The acquisition unit is used for acquiring the data of the original data source;
The conversion unit is used for converting the data of the original data source into data to be processed with uniform caliber according to a preset conversion rule;
The data processing model is used for cleaning the data to be processed according to a preset logic flow arrangement and a data logic configuration through the stream calculation model to obtain result data;
the data monitoring unit is used for analyzing whether data which does not meet the corresponding expected result exists in the result data or not based on the pre-configured alarm information and the verification information;
The output unit is used for outputting the result data to a target table if the result data is analyzed to have no data which does not meet the corresponding expected result;
and the visualization unit is used for carrying out visual display on the result data in the target table.
Optionally, in the apparatus for data cleansing processing described above, the acquiring unit includes:
a collection unit, configured to collect data of a plurality of data sources;
And the storage unit is used for storing the data of each data source according to a preset storage format to obtain a plurality of original data sources.
Optionally, in the apparatus for data cleansing processing described above, the conversion unit includes:
the screening unit is used for screening the original data sources to be processed from a plurality of original data sources;
a type determining unit, configured to determine a data type of the data of the original data source to be processed according to metadata of the data of the original data source to be processed;
The conversion subunit is used for converting the data types of the data of each original data source to be processed according to the pre-established mapping relation between the data types of each field of each original data source to be processed and each field of the target data source, so as to obtain the data to be processed with uniform caliber.
Optionally, the apparatus for data cleansing processing further includes:
And the persistence unit is used for carrying out local persistence processing on the data to be processed.
Optionally, the apparatus for data cleansing processing further includes:
a configuration unit, configured to configure the logic flow arrangement, the data logic configuration, and the alarm information of the flow calculation model in response to a configuration operation of a user;
the debugging unit is used for triggering any flow node of the flow calculation model, and cleaning the debugging data according to the pre-configured logic flow arrangement and data logic configuration to obtain a cleaning result of the any flow node;
the analysis unit is used for displaying the cleaning result of any one of the flow nodes and analyzing whether the cleaning result of any one of the flow nodes meets the corresponding expected result or not based on the alarm information and the verification information;
And the prompting unit is used for generating alarm prompting information based on the cleaning result of any one of the flow nodes and visualizing the alarm prompting information if the cleaning result of any one of the flow nodes does not meet the corresponding expected result.
A third aspect of the present application provides an electronic device, comprising:
A memory and a processor;
Wherein the memory is used for storing programs;
the processor is configured to execute the program, and when the program is executed, the program is specifically configured to implement a method for cleaning data according to any one of the above-described methods.
A fourth aspect of the present application provides a computer storage medium storing a computer program which, when executed, is used to implement the method for data cleaning processing according to any one of the above.
According to the data cleaning processing method provided by the present application, the data of the original data source is acquired and then converted into data to be processed with uniform caliber according to the preset conversion rule, which simplifies the analysis of data types during program execution. The data to be processed is then cleaned through the stream calculation model according to the pre-configured logic flow arrangement and data logic configuration to obtain result data. Because the stream calculation is divided into flow nodes that each clean the data, only the flow nodes concerned need to be adjusted when a result is not as expected. Next, based on the pre-configured alarm information and verification information, it is analyzed whether the result data contains data that does not meet the corresponding expected result. If no such data exists, the result data is output to the target table, which ensures the accuracy of the result. Finally, the result data in the target table is visually displayed. A simple and accurate data cleaning processing method is thereby realized.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the present application, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The embodiment of the application provides a data cleaning processing method, as shown in fig. 1, comprising the following steps:
S101, acquiring data of an original data source.
Specifically, in the embodiment of the present application, the data source is accessed first, and the data of the data source is then processed.
Optionally, data of one or more original data sources may be acquired, for example from a collector or by manual entry. The data of the respective original data sources may differ in format.
As shown in fig. 2, one embodiment of step S101 includes the following steps:
S201, collecting data of a plurality of data sources.
In particular, the corresponding data may be collected from a plurality of data sources that produce the data.
S202, storing the data of each data source according to a preset storage format to obtain a plurality of original data sources.
The data of each data source is stored according to a preset storage format; for example, data of the same type is stored in the same field, which facilitates data type mapping and other subsequent processing.
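As a minimal sketch of steps S201 and S202, the snippet below stores collected records so that data of the same kind lands in the same field. The alias table, the field names, and the function names are illustrative assumptions; the application does not specify the preset storage format.

```python
# Hypothetical alias table: maps source-specific field names to the field
# used by the assumed preset storage format, so same-type data shares a field.
FIELD_ALIASES = {
    "temp": "temperature",
    "ts": "timestamp",
    "time": "timestamp",
}


def to_storage_format(record):
    """Rename the fields of one collected record; unknown names are kept."""
    return {FIELD_ALIASES.get(name, name): value for name, value in record.items()}


def store_sources(collected):
    """Store the data of each data source, giving one original data source each."""
    return {
        source_name: [to_storage_format(record) for record in records]
        for source_name, records in collected.items()
    }
```

With this layout, two sources that name the same quantity differently end up with identical field names, which is what makes a later per-field type mapping straightforward.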
S102, converting the data of the original data source into the data to be processed with the uniform caliber according to a preset conversion rule.
It should be noted that the formats of different original data sources, or even of data within the same data source, may not be uniform, which makes uniform processing inconvenient. Therefore, in the embodiment of the present application, a conversion rule is set in advance, and the data of the original data source is converted into data to be processed with uniform caliber according to this preset conversion rule.
Specifically, the preset conversion rules may include conversion rules corresponding to different types of data, so as to respectively convert the different types of data.
Optionally, in another embodiment of the present application, after performing step S102, the method may further perform local persistence processing on the data to be processed.
In an embodiment of the application, the access to the data source has four parts including acquisition, storage, access, and local persistence of the data source.
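The local persistence step could look like the sketch below, which writes the data to be processed to a local JSON Lines file and reads it back. The file format and the function names are assumptions chosen for illustration; the application does not say how the data is persisted.

```python
import json
from pathlib import Path


def persist_locally(records, path):
    """Write the data to be processed to a local file, one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_persisted(path):
    """Read the persisted data back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```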
Optionally, in another embodiment of the present application, there are a plurality of original data sources. Correspondingly, one implementation of step S102 in the embodiment of the present application, as shown in fig. 3, includes:
S301, screening out to-be-processed original data sources from a plurality of original data sources.
It should be noted that, to avoid having to dock data sources separately for different requirements, all the original data sources are usually docked in advance. However, the data that actually needs to be processed differs with business and other requirements, so the original data sources to be processed need to be screened out from the plurality of original data sources as required.
S302, determining the data type of the data of the original data source to be processed according to the metadata of the data of the original data source to be processed.
Metadata is data that describes data, namely descriptive information about the data set as an information resource. In the embodiment of the present application, data is converted according to its data type, so the data type of the data of the original data source to be processed is first determined from the metadata of that data.
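Steps S301 and S302 can be sketched as follows. The dictionary layout, with a "metadata" entry listing columns and their declared types, is a hypothetical structure assumed for illustration only.

```python
# Hypothetical layout: each original data source carries a "metadata" entry
# listing its columns, plus the stored "rows".
def screen_sources(original_sources, required_names):
    """Keep only the original data sources that the current task needs (S301)."""
    return {
        name: source
        for name, source in original_sources.items()
        if name in required_names
    }


def field_types(source):
    """Read each field's declared data type from the source metadata (S302)."""
    return {col["name"]: col["type"] for col in source["metadata"]["columns"]}
```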
S303, converting the data types of the data of each original data source to be processed according to the pre-established mapping relation between each field of each original data source to be processed and the data types of each field of the target data source, so as to obtain the data to be processed with uniform caliber.
In the embodiment of the application, the mapping relation between each field of each original data source and each field of the target data source is pre-established. The target data source refers to a data source for storing generated data to be processed.
The purpose of establishing the mapping relation is to convert, by means of the conversion rules, data in different record formats or on different storage media into a data source with uniform caliber.
Optionally, for structured data, such as data from MySQL or Oracle, data type conversion rules are used, for example the MySQL data type conversions shown in Table 1.
TABLE 1
| Data type of original data source | Conversion rule | Data type of target data source |
| char | ... | String |
| Date | ... | Date |
| BOOL | ... | Boolean |
For semi-structured data, such as JSON and XML, a visual operation interface is used to display the semi-structured data in serialized form, and the user specifies the data type of each data field, thereby mapping the data types of the original data source to the data types of the target data source.
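The field-level type mapping of step S303 might be sketched as below. The type names follow Table 1, but the conversion rules themselves are elided in the table, so the converter functions here are illustrative assumptions rather than the rules of the application.

```python
from datetime import date


def _to_date(value):
    """Assumed rule: accept a date object or an ISO 8601 date string."""
    return value if isinstance(value, date) else date.fromisoformat(str(value))


def _to_bool(value):
    """Assumed rule: treat common truthy strings and non-zero numbers as True."""
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "t", "yes"}
    return bool(value)


# Source type (lower-cased) -> (target type name from Table 1, assumed converter).
CONVERTERS = {
    "char": ("String", str),
    "date": ("Date", _to_date),
    "bool": ("Boolean", _to_bool),
}


def convert_record(record, source_types):
    """Convert every field of a record according to its declared source type."""
    converted = {}
    for field, value in record.items():
        _, convert = CONVERTERS[source_types[field].lower()]
        converted[field] = None if value is None else convert(value)
    return converted
```

Keeping the mapping in one table means a new source type only requires a new entry, not a change to the conversion loop.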
S103, cleaning the data to be processed through a flow calculation model according to a pre-configured logic flow arrangement and data logic configuration, to obtain result data.
In the embodiment of the present application, after the stream calculation model has been configured in advance, it performs stream calculation processing on the data to be processed. Specifically, the logic flow arrangement, the data logic configuration and the alarm information need to be configured.
It should be noted that, when the flow calculation model cleans the data to be processed, the data generally needs to pass through multiple flow nodes, for example filtering, sorting, and missing value processing. Therefore, to ensure the accuracy of the result, each flow node of the flow calculation model can be debugged when the model is configured. That is, in the embodiment of the present application, the flow calculation model processing comprises data logic flow arrangement, data logic configuration, alarm information configuration and data flow debugging.
Optionally, in another embodiment of the present application, a method for configuring a stream computation model, as shown in fig. 4, includes:
S401, responding to configuration operation of a user, and configuring logic flow arrangement, data logic configuration and alarm information of a flow calculation model.
The data logic configuration is essentially the individual processing rules applied to the data to be processed. The data flow arrangement refers to the flow in which a target data source with uniform caliber passes through a series of logic settings made by the user, so as to achieve the purpose of data cleaning. It specifically comprises three parts: data input, arrangement of data rule processing, and data output.
Data input means selecting a target data source with uniform caliber that has been accessed and processed according to the metadata. After field-name deduplication is performed on the input target data source, the data flows to the specified flow step, where the data to be processed is processed using the corresponding processing rules. Optionally, the input may be one or more target data sources of different types.
Data rule processing means that the input data to be processed is cleaned according to a given processing rule and then flows on to the next flow node for processing. The processing rules refer to the manner of data cleaning, such as data filtering, sorting, and missing value processing.
The alarm information configuration is mainly used to control the quality and security of the input data to be processed: the data results produced by the configured data logic flow and logic configuration are checked against the configured alarm information and the data verification mechanism inside the system.
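Field-name deduplication across several input target data sources could be done as in the sketch below. Prefixing a clashing field with its source name is an assumed strategy; the application only states that deduplication takes place.

```python
from collections import Counter


def dedupe_field_names(sources):
    """Map each source's field names to names that are unique across all inputs.

    `sources` maps a source name to its list of field names. A field name that
    occurs in more than one source is prefixed with the source name (assumed rule).
    """
    counts = Counter(field for fields in sources.values() for field in fields)
    return {
        name: {
            field: f"{name}.{field}" if counts[field] > 1 else field
            for field in fields
        }
        for name, fields in sources.items()
    }
```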
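The arrangement of data rule processing can be pictured as a list of flow nodes, each applying one processing rule and handing its output to the next node. The sketch below is one possible shape under that assumption; the node constructors and `run_flow` are hypothetical names, and the three rules are the examples named in the text (filtering, sorting, missing value processing).

```python
def filter_node(predicate):
    """Flow node that keeps only the rows satisfying the predicate."""
    return lambda rows: [row for row in rows if predicate(row)]


def fill_missing_node(field, default):
    """Flow node that replaces a missing (None or absent) field with a default."""
    return lambda rows: [
        {**row, field: default} if row.get(field) is None else row for row in rows
    ]


def sort_node(field):
    """Flow node that sorts rows by one field."""
    return lambda rows: sorted(rows, key=lambda row: row[field])


def run_flow(nodes, rows):
    """Pass the rows through each flow node in order, keeping every node's output."""
    outputs = []
    for node in nodes:
        rows = node(rows)
        outputs.append(rows)
    return outputs
```

Keeping each node's output makes it possible to inspect an intermediate result without rerunning the whole flow.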
S402, triggering any flow node of the flow calculation model, cleaning the debug data according to the pre-configured logic flow arrangement and the data logic configuration, obtaining the cleaning result of the any flow node and displaying the cleaning result of the any flow node.
Specifically, during data flow debugging, triggering the debugging operation of any node processes the debug data in real time. The cleaning result of that flow node can be returned in real time through WebSocket, and the returned result is then visually displayed to the debugging personnel.
S403, analyzing whether the cleaning result of any one of the process nodes meets the corresponding expected result or not based on the alarm information and the verification information.
The alarm information includes user-defined thresholds, upper and lower limits, precision, and the like. The verification information includes rules such as the verification mode.
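A check against alarm information of this kind might look like the following sketch, which tests one field against a lower limit, an upper limit, and a maximum number of decimal places. The `AlarmRule` structure and the reading of "precision" as a decimal-place limit are assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class AlarmRule:
    field: str
    lower: float
    upper: float
    max_decimals: int  # assumed meaning of "precision"


def decimal_places(value):
    """Count the decimal places in the shortest text form of a finite number."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(0, -exponent)


def check(rows, rule):
    """Return (row index, message) for every row that violates the rule."""
    violations = []
    for index, row in enumerate(rows):
        value = row[rule.field]
        if not rule.lower <= value <= rule.upper:
            violations.append(
                (index, f"{rule.field}={value} outside [{rule.lower}, {rule.upper}]")
            )
        elif decimal_places(value) > rule.max_decimals:
            violations.append(
                (index, f"{rule.field}={value} has more than {rule.max_decimals} decimals")
            )
    return violations
```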
Optionally, in order to enable the user to know the current alarm information and verification information, the alarm information and verification information may also be displayed on the user interface.
It should be noted that, if the cleaning result of any one of the process nodes does not meet the corresponding expected result, step S404 is executed.
S404, generating alarm prompt information based on the cleaning result of any process node, and visualizing the alarm prompt information.
Optionally, the alarm prompt information may include the analysis result in step S403, specific error data, etc., so that the user can accurately understand the existing problem and adjust the flow node. After the process node is adjusted, subsequent debugging can be continued from the process node until the final result reaches the expected effect.
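The debugging loop of steps S402 to S404 can be sketched as below: one flow node is run on debug data, an alarm prompt is produced if the result is not as expected, and processing resumes from that node after adjustment. Function names and the message format are hypothetical, and the WebSocket transport mentioned above is left out.

```python
def debug_node(nodes, index, debug_rows, meets_expectation):
    """Run only flow node `index` on the debug data (S402) and check it (S403).

    Returns the cleaning result and an alarm prompt string, or None when the
    result meets the expected result (S404 is only needed otherwise).
    """
    result = nodes[index](debug_rows)
    alarm = None
    if not meets_expectation(result):
        alarm = f"node {index}: result {result!r} does not meet the expected result"
    return result, alarm


def resume_from(nodes, index, rows):
    """Continue the flow from node `index` onward, e.g. after adjusting a node."""
    for node in nodes[index:]:
        rows = node(rows)
    return rows
```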
Optionally, the result data obtained by executing step S103 may include only the final cleaning result, that is, only the output of the last flow node, or may include the output of each flow node.
S104, analyzing whether data which does not meet the corresponding expected result exists in the result data or not based on the pre-configured alarm information and the verification information.
To ensure the accuracy of the result data that is finally displayed, it is necessary to analyze, based on the pre-configured alarm information and verification information, whether data that does not meet the corresponding expected result exists in the result data.
If it is analyzed that no such data exists in the result data, step S105 is executed. If it is analyzed that the result data includes data that does not meet the corresponding expected result, the flow nodes that produced the non-conforming results are adjusted in the manner shown in fig. 4.
S105, outputting the result data to a target table.
Specifically, the result data is output to the target table to be stored according to a specified format, so that charts with different formats can be generated according to the requirement for subsequent display.
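Steps S104 and S105 together act as a gate: the result data reaches the target table only when no row fails the check. The sketch below uses an in-process SQLite table as a stand-in for the target table; the table name, its columns, and the `is_expected` callback are assumptions.

```python
import sqlite3


def output_if_valid(conn, rows, is_expected):
    """Write rows to the target table only if every row meets expectations.

    Returns the non-conforming rows; an empty list means the rows were written.
    """
    bad = [row for row in rows if not is_expected(row)]
    if bad:
        return bad  # nothing is written; the flow nodes should be adjusted first
    conn.execute("CREATE TABLE IF NOT EXISTS target (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO target (id, amount) VALUES (:id, :amount)", rows)
    conn.commit()
    return []
```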
S106, visually displaying the result data in the target table.
Optionally, the result data in the target table is visually displayed, which may specifically be in a report form, a table form, a chart form, or the like.
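For the table form of display, a plain-text rendering is enough to illustrate the idea. The sketch below assumes the result rows are a non-empty list of dictionaries with identical keys; `render_table` is a hypothetical helper, not part of the application.

```python
def render_table(rows):
    """Render result rows (non-empty list of dicts with identical keys) as text."""
    headers = list(rows[0])
    widths = [
        max(len(str(header)), max(len(str(row[header])) for row in rows))
        for header in headers
    ]

    def fmt(cells):
        # Pad each cell to its column width, then drop trailing spaces.
        return " | ".join(str(c).ljust(w) for c, w in zip(cells, widths)).rstrip()

    lines = [fmt(headers), "-+-".join("-" * w for w in widths)]
    lines += [fmt([row[header] for header in headers]) for row in rows]
    return "\n".join(lines)
```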
According to the data cleaning processing method provided by the embodiment of the present application, the data of the original data source is acquired and then converted into data to be processed with uniform caliber according to the preset conversion rule, so that corresponding data rules need not be configured for different data sources and the configuration process is simplified. The data to be processed is then cleaned through the stream calculation model according to the pre-configured logic flow arrangement and data logic configuration to obtain result data. Because the stream calculation is divided into flow nodes that each clean the data, only the flow nodes concerned need to be adjusted when a result is not as expected. Next, based on the pre-configured alarm information and verification information, it is analyzed whether the result data contains data that does not meet the corresponding expected result. If no such data exists, the result data is output to the target table, which ensures the accuracy of the result. Finally, the result data in the target table is visually displayed. A simple and accurate data cleaning processing method is thereby realized.
Another embodiment of the present application provides an apparatus for data cleansing processing, as shown in fig. 5, including:
an acquisition unit 501, configured to acquire data of an original data source.
The conversion unit 502 is configured to convert the data of the original data source into data to be processed with a uniform caliber according to a preset conversion rule.
The data processing model 503 is configured to clean the data to be processed according to a preset logic flow arrangement and a data logic configuration through the stream computing model, and obtain result data.
And a data monitoring unit 504, configured to analyze whether data that does not satisfy the corresponding expected result exists in the result data based on the pre-configured alarm information and the verification information.
And an output unit 505, configured to output the result data to the target table if it is analyzed that there is no data that does not satisfy the corresponding expected result in the result data.
And the visualization unit 506 is configured to perform visual display on the result data in the target table.
Optionally, in the apparatus for data cleaning processing provided in another embodiment of the present application, the acquiring unit includes:
A collection unit, configured to collect data of a plurality of data sources.
The storage unit is used for storing the data of each data source according to a preset storage format to obtain a plurality of original data sources.
Optionally, in the apparatus for data cleansing processing provided in another embodiment of the present application, the converting unit includes:
And the screening unit is used for screening the raw data sources to be processed from the plurality of raw data sources.
And the type determining unit is used for determining the data type of the data of the original data source to be processed according to the metadata of the data of the original data source to be processed.
The conversion subunit is used for converting the data types of the data of each original data source to be processed according to the pre-established mapping relation between the data types of each field of each original data source to be processed and each field of the target data source, so as to obtain the data to be processed with uniform caliber.
Optionally, in the apparatus for data cleaning processing provided in another embodiment of the present application, the apparatus further includes:
And the persistence unit is used for carrying out local persistence processing on the data to be processed.
Optionally, in the apparatus for data cleaning processing provided in another embodiment of the present application, the apparatus further includes:
and the configuration unit is used for responding to the configuration operation of a user, and configuring logic flow arrangement, data logic configuration and alarm information of the stream calculation model.
The debugging unit is used for triggering any flow node of the flow calculation model, and cleaning the debugging data according to the pre-configured logic flow arrangement and the data logic configuration to obtain a cleaning result of any flow node.
And the analysis unit is used for displaying the cleaning result of any flow node and analyzing whether the cleaning result of any flow node meets the corresponding expected result or not based on the alarm information and the verification information.
And the prompting unit is used for generating alarm prompting information based on the cleaning result of any one of the flow nodes and visualizing the alarm prompting information if the cleaning result of any one of the flow nodes does not meet the corresponding expected result.
It should be noted that, the specific working process of each unit provided in the above embodiment of the present application may refer to the corresponding steps in the above method embodiment, which is not described herein.
Another embodiment of the present application provides an electronic device, as shown in fig. 6, including:
A memory 601 and a processor 602.
Wherein the memory 601 is used for storing programs.
The processor 602 is configured to execute a program stored in the memory 601, and when the program is executed, the program is specifically configured to implement a method of data cleansing processing provided in any one of the above embodiments.
It should be noted that, the specific implementation process may refer to the specific steps of the data cleaning processing method provided in each embodiment, which are not described herein.
Another embodiment of the present application provides a computer storage medium storing a computer program for implementing a method of data cleansing processing as provided in any one of the above embodiments when the computer program is executed.
Computer storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.