Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. While specific embodiments of the invention are shown in the drawings, it is to be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "comprising" and variations thereof as used herein are open-ended, i.e., "including but not limited to"; "based on" means "based at least in part on"; "one embodiment" means "at least one embodiment"; "another embodiment" means "at least one additional embodiment"; "some embodiments" means "at least some embodiments"; and "optional" denotes an optional embodiment. Related definitions of other terms will be given in the description below. It should be noted that the concepts of "first", "second", etc. mentioned in this disclosure are only used to distinguish between different devices, modules or units, and are not intended to limit the order or interdependence of functions performed by these devices, modules or units.
It should be noted that the modifiers "a" and "an" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals involved in the present application are all authorized by the user or fully authorized by the parties; the collection, use and processing of the related data comply with the relevant laws, regulations and standards of the relevant countries and regions; and corresponding operation entries are provided for the user to choose to authorize or refuse.
Referring to fig. 1, a report data anomaly monitoring and quality assessment system provided by an embodiment of the present invention includes:
The data acquisition module is used for acquiring report data of enterprises and analyzing the report data to obtain data points corresponding to each data type in the report data.
Specifically, the data acquisition module serves as the basic link of the whole system, and is used for collecting report data from a plurality of data sources inside an enterprise, where the data sources include an Enterprise Resource Planning (ERP) system, a financial database, a business operation system and the like. The data acquisition module communicates with the data sources through data interfaces (such as API interfaces, database connections, etc.), and periodically or in real time acquires report data. After the data is obtained, the report data is subjected to preliminary cleaning and preprocessing, including operations such as removing duplicate data, correcting data format errors and filling missing values, so as to ensure the integrity and consistency of the data. The preprocessed report data is then analyzed to identify the data point corresponding to each data type in the report. For example, in a financial statement, assets, liabilities, revenues, costs, etc. are all different data types, each corresponding to a particular series of data points, such as fixed asset amounts and current liability amounts. By distinguishing and identifying the data types and the data points, a clear data structure is provided for subsequent outlier analysis.
The abnormal point analysis module is used for analyzing the data points of the data types and judging whether abnormal points exist in the data points according to preset judgment rules corresponding to the data types.
The abnormal point analysis module performs anomaly detection by applying various data analysis techniques and algorithms to the data points of each data type. First, corresponding preset judgment rules are set for each data type according to its characteristics. The preset judgment rules can be based on statistical methods, such as setting statistical indexes of the mean, standard deviation, percentiles and the like of the data as judgment bases. For example, for some item of cost data, if the value of its data point is beyond the range of the mean plus or minus 3 times the standard deviation calculated over a past period of time, the data point may be preliminarily determined as an outlier. In addition, certain thresholds or conditions may be set in connection with business rules, such as based on historical experience of the business and business logic. For example, with respect to sales data, sales on a day may be considered abnormal if they are significantly reduced (e.g., by over 50%) compared to the previous day or the same period, and there is no reasonable business explanation (e.g., promotional event, holiday, etc.). In the analysis process, the abnormal point analysis module checks each data point one by one, judges according to the preset judgment rules, finally determines whether abnormal points exist, records the abnormal points, and provides clues for subsequent influence path tracking.
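As a minimal illustrative sketch (not part of the claimed embodiment), the mean ± 3σ check and the day-over-day sales-drop rule described above could be expressed as follows; the 50% threshold comes from the example in the text, while the function names and sample values are assumptions:

```python
import statistics

def is_statistical_outlier(value, history, k=3.0):
    """Flag a value outside mean +/- k standard deviations of its history."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(value - mean) > k * std

def is_business_outlier(today_sales, prev_sales, drop_threshold=0.5,
                        has_business_explanation=False):
    """Flag a sales figure that drops by more than the threshold versus the
    previous day (or the same period) without a known business cause."""
    if prev_sales <= 0 or has_business_explanation:
        return False
    return (prev_sales - today_sales) / prev_sales > drop_threshold

cost_history = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9]
print(is_statistical_outlier(135.0, cost_history))              # True
print(is_business_outlier(today_sales=40.0, prev_sales=100.0))  # True: 60% drop
```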
And the influence path tracking module is used for tracking the path of the abnormal point through a path analysis algorithm when the abnormal point exists in the data point, and determining a propagation path corresponding to the abnormal point.
Specifically, after the outlier analysis module detects the outlier, the influence path tracking module determines a propagation path of the outlier by using a path analysis algorithm, where the path analysis algorithm constructs a data dependency graph based on the data flow and business logic of the enterprise. For example, in a manufacturing enterprise, anomalies in raw material procurement costs may propagate along a path such as "raw material procurement - manufacturing process - product cost - sales cost", affecting the data of subsequent links. In a specific implementation, the path analysis algorithm starts from the determined abnormal points, and gradually tracks other data points possibly affected by them according to the association relations between the data (such as calculation relationships between data and the sequence in the business process). Taking the data in a financial statement as an example, if an anomaly in an asset item involves depreciation calculations, the anomaly may propagate along the path of "original asset value - depreciation expense - accumulated depreciation - net asset value". Through this algorithm, the influence path tracking module can comprehensively and systematically determine the propagation path of the abnormal point in the report data, reveal the potential associations between the abnormal point and other data points, and provide the direction and range for subsequent affected data analysis.
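For illustration only, the data dependency graph that such a path analysis algorithm operates on could be represented as a simple adjacency list; the node names follow the manufacturing-cost example above and are assumed:

```python
# A minimal data dependency graph: each key lists the data items it feeds.
# Node names are illustrative, following the manufacturing-cost example.
dependency_graph = {
    "raw_material_procurement": ["manufacturing_process"],
    "manufacturing_process": ["product_cost"],
    "product_cost": ["sales_cost"],
    "sales_cost": [],
}

def direct_successors(graph, node):
    """Return the data items directly affected by an anomaly at `node`."""
    return graph.get(node, [])

print(direct_successors(dependency_graph, "raw_material_procurement"))
# ['manufacturing_process']
```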
And the affected data analysis module is used for analyzing the affected degree of the data points in the propagation path and determining the affected data points.
Specifically, the affected data analysis module establishes an evaluation model of the affected degree according to the relationship among different data points and the characteristics of data propagation. For example, for data points that depend directly on outliers, the degree of influence may be high, while for data points that depend indirectly or are related to outliers through multiple layers of computational relationships, the degree of influence may be relatively low due to factors such as attenuation of data propagation. In the analysis process, various methods can be employed to quantify the degree of influence, such as calculating the deviation rate of the data points (deviation amplitude from the expected value under normal conditions), correlation analysis (analyzing the strength of correlation between the data points and the outliers), and the like. Through the analysis, the influence of abnormal points and the influence degree of the abnormal points on the data points can be accurately determined, so that the affected data points are accurately identified from a plurality of data, and a key basis is provided for further evaluating the overall quality of report data.
And the data evaluation module is used for evaluating the overall quality level of the report data according to the affected data points and combining the abnormal points, and generating a quality evaluation report of the report data according to the overall quality level.
Specifically, the data evaluation module comprehensively considers the abnormal points and the affected data points, and evaluates the overall quality level of report data from multiple dimensions. In the evaluation process, according to a preset quality evaluation index and weight system, factors such as the severity of abnormal points and the range and degree of affected data points are quantified and synthesized. For example, the severity of an outlier may be scored according to its deviation from the normal range, the range of affected data points may be quantified according to the number of data types and data points involved, and the affected extent may be measured according to the deviation rate, etc., from the previous analysis. The evaluation result for the overall quality level of the report data is obtained by means of weighted calculation and the like. Based on this evaluation result, the system is able to generate a detailed, accurate quality assessment report. The report content not only includes the description of the overall quality level, but also includes specific information on abnormal points (such as abnormal positions, abnormal types, abnormal degrees, etc.), the distribution of affected data points (which data types and specific data points are involved, etc.), and the basis and method of the quality assessment. In addition, the report may also provide corresponding improvement suggestions, such as further investigation directions for outliers and affected data points and data corrective measures, providing powerful support for enterprise data management optimization and decision-making.
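A minimal sketch of the weighted aggregation described above; the three factors, their normalization to [0, 1] and the weights are all assumptions chosen for illustration, since the actual index and weight system would be configured per enterprise:

```python
def overall_quality_score(outlier_severity, affected_range, affected_degree,
                          weights=(0.4, 0.3, 0.3)):
    """Combine normalized factors (each in [0, 1], higher = worse) into a
    quality score in [0, 100], higher = better. Weights are illustrative."""
    w_sev, w_rng, w_deg = weights
    badness = (w_sev * outlier_severity + w_rng * affected_range
               + w_deg * affected_degree)
    return round(100 * (1 - badness), 1)

# Example: severe outliers, moderate spread, moderate impact.
print(overall_quality_score(0.8, 0.4, 0.5))  # 41.0
```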
According to the report data anomaly monitoring and quality assessment system, the report data of an enterprise is acquired through the data acquisition module, and the report data is analyzed to obtain data points corresponding to each data type, so that a foundation is laid for subsequent analysis. Analyzing data points of the data types through an abnormal point analysis module, judging whether abnormal points exist in the data points according to preset judging rules corresponding to the data types, primarily identifying local abnormal conditions, and when the abnormal points exist, carrying out path tracking on the abnormal points by using a path analysis algorithm by an influence path tracking module to determine a propagation path corresponding to the abnormal points. By tracking the propagation path, other data points that may be affected by the outlier can be found, not just the outlier itself. The affected data analysis module analyzes the affected degree of the data points in the transmission path, determines the affected data points, and further digs the actual influence of the abnormal condition on other data, so that the problem caused by neglecting the whole association by only looking at the local characteristics is avoided. The data evaluation module evaluates the overall quality level of the report data according to the affected data points and the abnormal points, and generates a quality evaluation report. The evaluation result is more comprehensive and accurate, and the mutual influence among the data is comprehensively considered.
According to the method, the associations between the data are comprehensively considered, from the discovery of the abnormal points to the tracking of the influence path and then to the analysis of the affected data. Compared with the traditional statistical method based on local features, the method can effectively avoid missed detections and false alarms, thereby greatly improving the accuracy of report data quality evaluation results. Meanwhile, not only the abnormal condition of the data point itself is considered, but also the influence of the abnormal point on other data, so that the report data quality is evaluated more comprehensively, and the reliability and effectiveness of the data in actual business scenarios can be reflected more truly. Accurate quality assessment results can provide more powerful support for enterprise decisions. The problems of missed detections and false alarms caused by judging anomalies only from local features in the traditional technology are effectively solved, and the accuracy of report data quality evaluation results is improved.
Optionally, the data acquisition module is specifically configured to:
acquiring initial report data from a database of the enterprise;
performing format conversion and preprocessing on the initial report data to obtain standardized report data;
identifying, according to metadata information of the report data, a plurality of data types in the report data;
and extracting the data points corresponding to the data types according to the data types.
Specifically, the data acquisition module first acquires initial report data by establishing a connection with the enterprise database. In embodiments of the present invention, this may be accomplished in a variety of ways, such as using a database connection protocol (e.g., JDBC, ODBC, etc.), or by calling an API interface internal to the enterprise. Specifically, the system establishes communication with the database server using the corresponding database driver according to pre-configured database connection information (including database address, port number, user name, password, etc.). Report data is then extracted from the specified database table or view by executing the SQL query statement. The report data is stored in a financial database, a business operation database or other related databases of the enterprise, and covers information in various business fields, such as financial reports, sales data, inventory data, production data and the like.
Since the acquired initial report data may come from different data sources, a variety of data formats exist. For example, some data may be stored in the form of text files, where the data fields are separated by commas, and some data may be stored in XML or JSON format. In order to unify the data formats for subsequent processing, the system needs to convert these different formats of data. For data stored in the form of text files, in this embodiment, the data is read row by row by parsing the format rules of the text file, split into individual fields according to the field separators, and then stored in a unified data structure, such as a data table or a data frame. For XML or JSON format data, the hierarchical structure of the data is parsed by using a corresponding parsing library, and data elements are extracted and converted into a flat or structured format consistent with the other data.
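As an illustrative sketch of this format unification step, assuming pandas is available and the layouts shown (comma-separated text and JSON); XML would be handled analogously with a corresponding parsing library:

```python
import json
import pandas as pd

def load_report(path):
    """Load report data from CSV or JSON into a single pandas DataFrame."""
    if path.endswith(".csv"):
        return pd.read_csv(path)           # comma-separated text file
    if path.endswith(".json"):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)         # list of {field: value} objects
        return pd.json_normalize(records)  # flatten any nested structure
    raise ValueError(f"unsupported report format: {path}")
```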
After format conversion, the data is preprocessed to improve data quality; the preprocessing step includes data cleaning and data verification. Data cleaning mainly removes duplicate data, corrects data format errors (such as date format inconsistencies, numerical format errors, etc.), and fills in missing values. For example, for duplicate data records, duplicate items may be identified and deleted by comparing unique identifiers or other key fields of the data records. For missing values, appropriate methods may be used to fill them in according to business rules and data characteristics, such as using the mean, median, mode, or other reasonable estimates. Data verification checks whether the data meets business rules and data integrity constraints. For example, it is checked whether the data is within a reasonable range (e.g., an age cannot be negative, sales cannot exceed the capacity of the business, etc.), and whether the logical relationships between the data are correct (e.g., inventory cannot exceed the purchase amount, etc.). For data that does not satisfy the verification rules, the system records the related error information and processes it according to a preset handling strategy, such as notifying a data manager to correct it or correcting the data automatically.
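A minimal pandas-based sketch of the cleaning and validation operations described above; the column names and the two validation rules are assumptions for illustration:

```python
import pandas as pd

def clean_and_validate(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, fill missing values, and check simple business rules."""
    df = df.drop_duplicates(subset=["record_id"])             # remove duplicates
    df["sales"] = df["sales"].fillna(df["sales"].median())    # fill missing values
    df["date"] = pd.to_datetime(df["date"], errors="coerce")  # fix date formats
    # Validation: record rows violating assumed business rules for review.
    invalid = df[(df["sales"] < 0) | (df["inventory"] > df["purchased"])]
    if not invalid.empty:
        invalid.to_csv("validation_errors.csv", index=False)
    return df
```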
Metadata information of report data is descriptive information about the data, including the name of the data, the definition of the data type, the source of the data, the meaning of the data, and the like. The system utilizes the metadata information to identify different data types in the report data. For example, metadata may define that a certain field in a report is "sales", its data type is numerical and its unit is yuan, and that another field is "sales date", whose data type is date. In the implementation process, the system can read the metadata file of the report data or acquire the metadata information from the database, and analyze the report data according to these metadata definitions. By analyzing metadata information such as the name and data type of each data field, the system is able to accurately identify the plurality of data types in report data, such as financial data types (including assets, liabilities, incomes, costs, etc.), sales data types (including sales amount, sales volume, sales area, etc.), and production data types (including yield, raw material consumption, production efficiency, etc.).
After determining each data type in the report data, the system extracts the corresponding data point according to the definition of each data type. For each data type, the report data is traversed, looking up the data records that conform to the data type definition. For example, for the data type "sales," the system looks up all sales related data fields in the sales report data and extracts the values of these fields as the data points for that data type. At the same time, the extracted data points may be stored in a separate data structure, such as a dictionary or list, for subsequent analysis and processing.
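For illustration, the metadata-driven grouping of report fields into data types and the extraction of their data points might look like the following sketch; the metadata layout and field names are assumptions:

```python
# Illustrative metadata: field name -> data type it belongs to.
metadata = {
    "fixed_assets": "assets", "current_liabilities": "liabilities",
    "sales_amount": "sales", "sales_volume": "sales",
}

def extract_data_points(rows, metadata):
    """Group field values from report rows into {data_type: [points...]}."""
    points = {}
    for row in rows:                       # each row is a {field: value} dict
        for field, value in row.items():
            dtype = metadata.get(field)
            if dtype is not None:
                points.setdefault(dtype, []).append((field, value))
    return points

rows = [{"sales_amount": 120.0, "sales_volume": 30, "fixed_assets": 500.0}]
print(extract_data_points(rows, metadata))
```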
In the optional embodiment, the method realizes the integration and unification of data with different sources and different formats by acquiring the initial report data from the enterprise database and performing format conversion and preprocessing, and provides a consistent data basis for subsequent anomaly monitoring and quality evaluation. Data analysis errors or difficulties caused by inconsistent data formats are avoided, and the efficiency and accuracy of data processing are improved. The data cleaning and verifying steps in the preprocessing process effectively improve the quality of report data, remove repeated data and correct format errors, reduce data redundancy and inconsistency, fill missing values and verify data, ensure the integrity and accuracy of the data, so that subsequent abnormal analysis and quality assessment can be performed on the basis of more reliable data, and improve the reliability of assessment results.
The metadata information is utilized to identify the data types, so that the report data can be accurately divided into a plurality of data types, and a clear data category is provided for subsequent targeted analysis. The system can perform more accurate abnormal point detection and affected data point analysis according to different data characteristics and business rules, and the accuracy and the effectiveness of abnormal monitoring are improved. Corresponding data points are extracted according to the data types, specific data required by analysis can be rapidly positioned and acquired, and the processing efficiency of the system is improved. The method saves time and calculation resources for subsequent operations such as outlier analysis, path tracking influence and the like, and enables the whole report data anomaly monitoring and quality evaluation system to operate more efficiently.
Optionally, the outlier analysis module is specifically configured to:
determining the preset judgment rule according to the data type, wherein the preset judgment rule comprises a statistical sub-rule, a business logic sub-rule and a machine learning sub-rule;
screening the data points of the data type according to the statistical sub-rule, the business logic sub-rule and the machine learning sub-rule in sequence;
and judging whether abnormal points exist in the data points of the data type according to the screening result.
Specifically, the outlier analysis module first determines a corresponding preset judgment rule according to the different data types, where the preset judgment rule includes a statistical sub-rule, a business logic sub-rule and a machine learning sub-rule.
Specifically, for each data type, the statistical sub-rule calculates statistical indexes such as the mean, standard deviation and percentiles from historical data, and sets a normal range for the data. For example, for normally distributed data, the normal range may be defined as the mean ± 3 standard deviations, and for asymmetrically distributed data such as sales data, the normal interval may be set using percentiles. Specific business logic sub-rules are formulated according to the business characteristics and experience of the enterprise; for inventory data, for example, a rule may require that the end-of-month inventory quantity fall within the safety stock range. Such rules are grounded in business reality and provide a business basis for abnormal point judgment. The machine learning sub-rule utilizes machine learning algorithms (e.g., isolation forests, support vector machines, etc.) to train models to distinguish between normal and abnormal data patterns. Taking production data as an example, an isolation forest may be trained based on historical production data to identify outliers that deviate from the normal production pattern.
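As a sketch of the machine learning sub-rule, an isolation forest could be trained on historical production data with scikit-learn (assumed to be available); the feature columns and sample values are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative historical production data: [output, raw_material_usage].
history = np.array([[100, 50], [102, 51], [98, 49], [101, 50], [99, 50]])

model = IsolationForest(contamination=0.1, random_state=0).fit(history)

# predict() returns 1 for normal points and -1 for anomalies.
new_points = np.array([[100, 50], [60, 90]])
print(model.predict(new_points))  # e.g., [ 1 -1 ]
```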
In the screening process, the data points are first checked against the statistical sub-rule: if a data point falls outside the set statistical normal range, for example beyond the mean plus or minus 3 standard deviations, it is marked as a suspected abnormal point. The preliminarily screened data points are then passed through the business logic sub-rule; if, for example, cost data marked as suspected abnormal is still higher than the corresponding income data, it is further confirmed as a business logic abnormal point. Finally, the screened data points are input into the machine learning model, which judges them according to the data point features and historical patterns; if a data point is judged to deviate from the normal pattern, it is determined to be a machine learning abnormal point.
The statistical, business logic and machine learning screening results are then integrated to judge whether abnormal points exist among the data points. If a data point is determined to be abnormal in the statistical, business logic or machine learning screening, it is determined to be an abnormal point; for example, if the data point is out of the normal range in the statistical screening and violates the business rule in the business logic screening, it is directly determined to be an outlier. Alternatively, weights may be set for the different types of rules according to their strictness and importance; for example, the business logic sub-rule weight is set to 0.5, and the statistical sub-rule and machine learning sub-rule weights to 0.25 each. After the weighted combination, if the total score exceeds a threshold, the data point is determined to be an abnormal point.
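The weighted combination described above could be sketched as follows; the weights (business logic 0.5, statistical and machine learning 0.25 each) come from the text, while the threshold value of 0.5 is an assumption for illustration:

```python
def combined_anomaly_score(stat_hit, logic_hit, ml_hit,
                           weights=(0.25, 0.5, 0.25)):
    """Weighted vote over the three sub-rules; each *_hit is True/False."""
    w_stat, w_logic, w_ml = weights
    return w_stat * stat_hit + w_logic * logic_hit + w_ml * ml_hit

THRESHOLD = 0.5  # illustrative decision threshold

score = combined_anomaly_score(stat_hit=True, logic_hit=True, ml_hit=False)
print(score, score >= THRESHOLD)  # 0.75 True -> judged an abnormal point
```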
In the optional embodiment, by combining three sub-rules of statistics, business logic and machine learning, the statistical characteristics, business characteristics and data modes of the data are comprehensively considered, erroneous judgment and missed judgment caused by a single rule are avoided, and the accuracy of abnormal point judgment is improved. The machine learning sub-rule enables the system to automatically learn and update according to the historical data, adapt to the change of data distribution and business logic, and improve the adaptability of the system to new data and new business conditions. The data points are screened from different angles, so that the abnormal points are judged more comprehensively and deeply, a reliable basis is provided for follow-up influence path tracking and affected data analysis, and the whole quality of report data can be estimated more accurately.
Optionally, the outlier analysis module is specifically further configured to:
judging, according to the statistical sub-rule, whether there are deviated data points among the data points that fall outside a preset normal distribution range;
if deviated data points exist among the data points, judging, according to the business logic sub-rule, whether the deviated data points exceed a business range corresponding to the data type;
if a deviated data point exceeds the business range, carrying out in-depth analysis on the deviated data point according to the machine learning sub-rule to obtain an abnormal grade of the deviated data point;
and judging whether the deviated data point is the abnormal point according to the abnormal grade.
Specifically, the outlier analysis module applies statistical principles to identify deviated data points among the data points. First, the historical data of each data type in the report data is analyzed, and key statistical indexes are calculated. For data types that fit a normal distribution, the mean (μ) and standard deviation (σ) are calculated, and according to the empirical rule (about 99.7% of the data lies within the mean ± 3σ), data points outside this range are initially determined to be deviated data points. For example, for the monthly sales data of an enterprise product, if the calculated mean is 1,000,000 yuan and the standard deviation is 100,000 yuan, then when the sales for a month reach 1,300,000 yuan (100 + 3 × 10, in units of 10,000 yuan) or more, or fall below 700,000 yuan (100 - 3 × 10), the data point is regarded as a deviated data point. For data that does not conform to a normal distribution, such as skewed customer complaint data, the system uses a percentile method. Typically, the 25th percentile (Q1) and the 75th percentile (Q3) are taken, the interquartile range (IQR = Q3 - Q1) is calculated, and data points below Q1 - k × IQR (k typically 1.5 or 3) or above Q3 + k × IQR are determined to be deviated data points.
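A minimal sketch of the two screening rules in this paragraph: the 3σ rule for approximately normal data and the IQR rule for skewed data (numpy is assumed for the percentile computation):

```python
import numpy as np

def deviates_3sigma(value, history):
    """3-sigma rule for approximately normal data."""
    mu, sigma = np.mean(history), np.std(history)
    return abs(value - mu) > 3 * sigma

def deviates_iqr(value, history, k=1.5):
    """IQR rule for skewed data; k is typically 1.5 or 3."""
    q1, q3 = np.percentile(history, [25, 75])
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

# Monthly sales in units of 10,000 yuan: 130 clearly deviates here.
sales_history = [95, 100, 105, 98, 102, 100, 99, 101]
print(deviates_3sigma(130, sales_history))  # True
print(deviates_iqr(130, sales_history))     # True
```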
If deviated data points exist, the abnormal point analysis module further judges, according to the business logic sub-rule, whether the deviated data points exceed the business range corresponding to the data type, where the business range is a reasonable interval set based on the actual business operation rules and experience of the enterprise. Taking cost data as an example, the business logic of an enterprise may specify that the cost of a product should not exceed 70% of its sales price. Assuming that the sales price of a product is 100 yuan and its cost data shows a deviated data point of 80 yuan, then since 80/100 = 80% > 70%, the deviated data point exceeds the business range. For another example, if the business logic specifies that the end-of-month inventory for a product should not be less than the safety stock of 100 pieces, then a deviated data point of 50 pieces also exceeds the lower limit of the business range.
For deviated data points beyond the business range, the module calls a model trained under the machine learning sub-rule for in-depth analysis to determine their anomaly level. First, the system collects a large amount of historical data, including normal data and labeled abnormal data, performs feature engineering on the data, and extracts data features that help distinguish anomalies. Taking production data as an example, the features may include production time, raw material usage, equipment operating parameters, and the like. Then, a suitable machine learning algorithm, such as a support vector machine (SVM), random forest or neural network, is selected and trained on the data. During training, the model learns the feature patterns of normal and abnormal data, and establishes a classification boundary or prediction model. When a new deviated data point is input, the model analyzes it according to its features, and calculates the probability that the data point is abnormal or its degree of deviation from the normal pattern, thereby obtaining an anomaly score. The anomaly level of the deviated data point is determined according to a predetermined anomaly level classification criterion (e.g., an anomaly score of 0-0.3 is slight, 0.3-0.6 is moderate, and 0.6-1 is severe).
Finally, whether the deviated data point is an abnormal point is determined according to its anomaly level. The system sets an anomaly level threshold; for example, data points with an anomaly level of moderate or above (anomaly score ≥ 0.3) are determined to be abnormal points. If the anomaly level of a deviated data point reaches or exceeds the threshold, it is determined to be an abnormal point, and its relevant information is recorded, such as the data type, the specific value of the data point, the anomaly level and the triggered judgment rule, to facilitate subsequent analysis and processing. Otherwise, if it is below the threshold, the deviated data point is considered to deviate in statistics and business logic, but after the in-depth machine learning analysis its degree of abnormality is insufficient to constitute a real abnormal point; it may instead be caused by reasonable business fluctuation or other normal factors. The system marks and archives these below-threshold deviated data points so as to continuously observe their trends during subsequent data monitoring. If such a data point shows a rising anomaly level or other abnormal characteristics in the future, the system can promptly re-evaluate its state, avoiding potential missed-detection risks and helping to continuously optimize the judgment rules and parameter settings of the abnormal point analysis module.
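The grading and confirmation logic described above could be sketched as follows, using the example score bands (0-0.3 slight, 0.3-0.6 moderate, 0.6-1 severe) and the moderate-or-above (score ≥ 0.3) confirmation threshold from the text:

```python
def anomaly_grade(score):
    """Map a model anomaly score in [0, 1] to a grade."""
    if score < 0.3:
        return "slight"
    if score < 0.6:
        return "moderate"
    return "severe"

def confirm_outlier(score, threshold=0.3):
    """Confirm as an abnormal point if the grade is moderate or above."""
    return score >= threshold

print(anomaly_grade(0.45), confirm_outlier(0.45))  # moderate True
```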
In the optional embodiment, from the general rule of statistics, the actual business rule of the enterprise is combined for screening, and then the powerful mode recognition capability of machine learning is utilized for deep analysis, so that the real abnormal point can be accurately recognized. Misjudgment and missed judgment caused by judgment according to only a single dimension are avoided, for example, some data points can deviate from a normal range statistically, but are reasonable in business logic (such as sales surge caused by holidays), or even though the business logic looks abnormal, the data points can be actually the start of a new business trend, and whether the data points are truly abnormal or not can be judged more accurately through deep analysis of machine learning.
Meanwhile, a plurality of factors including statistics, business logic and machine learning are comprehensively considered, so that the system can adapt to complex and changeable business scenes. The business logic of different enterprises is quite different, the data distribution is quite different, and the module can be customized and configured according to the actual conditions of the enterprises. For example, for the financial industry, the business logic has extremely high requirements on risk control, the abnormality of data can involve huge economic loss, so that stricter abnormality level thresholds and business range rules can be set, while for some innovative internet enterprises, the business data fluctuation is large, the business mode update is quick, and new data modes and business logic changes can be adapted more quickly through machine learning sub-rules.
By assigning anomaly levels to the offset data points, quantitative support is provided for enterprise decisions. The enterprise manager can reasonably allocate resources to carry out data checking and problem processing according to the level of the abnormal level. For example, for data points with serious anomalies, professional team can be organized preferentially to conduct deep investigation, analyze the reasons of anomalies and take corresponding corrective measures, for data points with slight anomalies, observation can be conducted first, and then whether to take action or not can be determined according to the change condition of subsequent data. The quantification mode is beneficial to improving the operation efficiency of enterprises and avoiding the problem of blindly processing data anomalies.
Optionally, the influence path tracking module is specifically configured to:
acquiring a business association rule and a data association rule of the data type corresponding to the abnormal point;
taking the abnormal point as an initial node and adding the initial node into a propagation path queue;
determining a downstream associated node of the initial node according to the business association rule and the data association rule;
sequentially adding the downstream associated nodes into the propagation path queue to obtain a preliminary propagation path;
and recursively expanding the preliminary propagation path until it cannot be expanded further, so as to obtain the propagation path.
Specifically, the influence path tracking module firstly extracts a business association rule and a data association rule of a data type corresponding to an abnormal point from business logic documents, data dictionaries, data flowcharts and other materials of an enterprise. Business association rules are formulated based on the actual business flow and management requirements of enterprises, and describe the interrelationship of different data types at a business level. For example, in a financial business, revenue data together with cost data affects profit data, i.e., profit = revenue-cost, i.e., a business association rule. From the perspective of data storage and processing, the data association rule describes the association relation of data in structures such as database tables, data warehouses and the like, for example, in the database, an order table is associated with a client table through a client number field, and the order table is associated with a product table through a product number field.
When the service association rule and the data association rule are acquired, the influence path tracking module takes the determined abnormal point as an initial node and puts the initial node into a propagation path queue. The propagation path queue is a data structure for storing and managing propagation path nodes, and adopts a first-in first-out (FIFO) principle. For example, it is assumed that the monthly sales data of a certain product is found as an abnormal point in sales data, and the initial node corresponding to the abnormal point contains information such as a data type (sales), a data point (specific sales value), an association rule (association rule with sales, sales price, etc. data), and the like, and after the information is added to the propagation path queue, only the initial node is contained in the queue at this time. Next, the downstream association node of the initial node is analyzed according to the business association rule and the data association rule. The downstream associated node refers to other data nodes affected by the initial node data in the business process or the data processing process. Based on the business association rule, profit data is affected by sales data, so that a node corresponding to the profit data is a downstream association node of sales abnormal points. From the data association rule perspective, in the database, if sales line data is stored in a sales detail table, and the sales detail table is associated with a sales summary table through a date field and a product number field, the corresponding profit data node in the sales summary table is a downstream association node. The module searches the downstream association nodes of the initial nodes on the service and data layers by analyzing the association rules, and records the related information of the nodes, such as data types, data points, association paths and the like.
After determining the downstream associated nodes of the initial node, the downstream associated nodes are added to the propagation path queue in turn. Taking the sales abnormal point as an example, after the downstream associated node corresponding to the profit data is added to the queue, the propagation path queue includes the sales abnormal point (initial node) and the profit data node (downstream associated node). At this time, the influence path tracking module checks whether the order of the nodes in the queue accords with the actual order of the business process and the data processing process, and appropriately adjusts the nodes to form a preliminary propagation path. For example, according to the business process, sales affect profit first, profit then other financial indicators (e.g., net profit, etc.), nodes in the queue are arranged in this order, resulting in a preliminary propagation path from sales outliers, sequentially through profit data nodes.
Finally, the influence path tracking module recursively expands the preliminary propagation path, wherein the recursion expansion refers to the process of determining the downstream associated node and adding the downstream associated node into the queue by taking each node in the propagation path queue as a new initial node. Taking profit data nodes in the above example as an example, the influence path tracking module searches downstream association nodes of the profit data nodes, such as tax payment data nodes (profit influences tax payment amount) and the like, according to the business association rules and the data association rules, and adds the downstream association nodes to the propagation path queue, and continues the process until a new downstream association node cannot be found. At this time, the propagation path queue includes all relevant nodes extending from the initial abnormal point along the association relationship between the service and the data, so as to form a complete propagation path. For example, the final propagation path comprises sales abnormal points, profit data nodes, tax data nodes, financial statement summary data nodes and the like, and all potential influence ranges of the abnormal points in the business process and the data processing process are covered.
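A minimal sketch of the queue-based recursive expansion described above, implemented as a breadth-first traversal over an adjacency-list dependency graph; the node names follow the sales example and are illustrative:

```python
from collections import deque

def trace_propagation_path(graph, outlier_node):
    """Breadth-first expansion from the outlier over business/data links."""
    queue = deque([outlier_node])   # FIFO propagation path queue
    path, visited = [], {outlier_node}
    while queue:
        node = queue.popleft()
        path.append(node)
        for successor in graph.get(node, []):  # downstream associated nodes
            if successor not in visited:       # avoid cycles / re-processing
                visited.add(successor)
                queue.append(successor)
    return path

graph = {
    "sales_amount": ["profit"],
    "profit": ["tax_payable", "net_profit"],
    "tax_payable": ["financial_summary"],
    "net_profit": ["financial_summary"],
    "financial_summary": [],
}
print(trace_propagation_path(graph, "sales_amount"))
# ['sales_amount', 'profit', 'tax_payable', 'net_profit', 'financial_summary']
```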
In this alternative embodiment, by combining the business association rule and the data association rule, not only the logical relationships of the data at the business level are considered, but also the technical associations of the data in storage and processing, so that all data nodes possibly affected by abnormal points are accurately tracked. The omission of parts of the influence range caused by focusing only on local associations is avoided, and a more reliable basis is provided for enterprises to comprehensively evaluate data quality.
The embodiment starts from an initial abnormal point and gradually determines the downstream associated nodes through recursive expansion, so that each data node affected by the abnormal point can be accurately located. Compared with traditional influence analysis methods based on simple association rules, this method can more finely comb through the complex relations among data; in particular, for multi-level, net-shaped data association structures, the affected nodes at each level can be accurately identified, improving the accuracy of influence path tracking. By constructing the propagation path queue and processing the nodes in a fixed order, the influence path tracking process proceeds with higher efficiency. The first-in first-out queue principle ensures the order of node processing, avoids repeated processing and confusion, and accelerates the generation of the propagation path. When facing large-scale report data and complex association relations, this orderly and efficient tracking approach can significantly improve the efficiency of data quality evaluation and provide timely support for enterprise decision-making.
Optionally, the influence path tracking module is specifically further configured to:
determining a node corresponding to a subsequent link of the initial node in a business flow according to the business association rule;
determining nodes corresponding to relevant data points of the initial node in a data structure according to the data association rule;
and taking the node corresponding to the subsequent link and the nodes corresponding to the relevant data points as the downstream associated nodes of the initial node.
Specifically, the influence path tracking module extracts business association rules from business flow documents and business operation specifications of an enterprise, wherein the rules describe the sequence and interrelationships among different business links in detail. For example, in the production flow of an enterprise, after the raw material purchasing link is completed, a production and processing link is followed, after the production and processing are completed, a quality inspection link is followed, and finally, a product warehousing link is followed. There is a definite business association relationship between these links. In particular, the module invokes an interface provided by an Enterprise Resource Planning (ERP) system or a Business Process Management (BPM) system to obtain detailed information of the business process. These systems typically define the various links of a business process and their flow rules in the form of a workflow. By analyzing the workflow definitions, the module can determine the business links where the initial nodes are located, and further determine the nodes corresponding to the subsequent links. For example, if the initial node is a purchase order data abnormal point in the raw material purchasing link, the node corresponding to the subsequent link is a production plan data node in the production processing link according to the business process definition, because the production plan is affected by the raw material purchasing condition.
After determining the node corresponding to the subsequent link of the initial node in the business flow, the module further determines the nodes corresponding to the relevant data points of the initial node in the data structure according to the data association rule. The data association rules are mainly derived from the database design documents and data dictionaries of the enterprise. Taking the financial database of an enterprise as an example, in the data structure, association relationships are established among the sales order table, the customer information table and the product information table through foreign key constraints: the customer number field in the sales order table is associated with the customer number field in the customer information table, and the product number field is associated with the product number field in the product information table. Meanwhile, inside the sales order table, there is a calculation relationship between the sales amount field and the sales quantity and product unit price fields, that is, sales amount = sales quantity × product unit price. Specifically, the module retrieves the data points associated with the initial node through database query statements. For example, if the initial node is an abnormal point of sales order data in the sales order table, the module may execute SQL query statements to find all other data records with the same customer number as the sales order (such as the customer's historical orders, customer credit, etc.), as well as data points such as product inventory information and product cost information related to the product number, thereby determining the nodes corresponding to the relevant data points. After the node corresponding to the subsequent link and the nodes corresponding to the related data points are determined from the perspectives of the business association rule and the data association rule respectively, the module unifies them into the downstream associated nodes, which cover all related nodes possibly influenced by the initial node in the two dimensions of business process progression and data structure association.
Taking sales data abnormal points as an example, the nodes corresponding to the follow-up links comprise production plan data nodes (follow-up links of the business process), and the nodes corresponding to the related data points comprise customer history order data nodes, customer credit line data nodes, product inventory data nodes, product cost data nodes and the like (data structure association). These downstream associated nodes are consolidated and stored in a data structure for subsequent recursive expansion processing.
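For illustration, merging the two dimensions might look like the following sketch; the workflow successor table, the database schema and all table and column names are assumptions (shown with Python's built-in sqlite3):

```python
import sqlite3

# Assumed business-flow successors, e.g. read from the workflow definition.
workflow_successors = {"purchase_order": ["production_plan"]}

def downstream_nodes(conn, customer_no, product_no, initial_link):
    """Union of business-flow successors and data-association lookups."""
    nodes = list(workflow_successors.get(initial_link, []))
    cur = conn.cursor()
    # Data-association dimension: orders sharing the same customer number...
    cur.execute("SELECT order_id FROM sales_orders WHERE customer_no = ?",
                (customer_no,))
    nodes += [f"order:{r[0]}" for r in cur.fetchall()]
    # ...and inventory records for the same product number.
    cur.execute("SELECT sku FROM product_inventory WHERE product_no = ?",
                (product_no,))
    nodes += [f"inventory:{r[0]}" for r in cur.fetchall()]
    return nodes

# Minimal in-memory demo with the assumed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_orders (order_id TEXT, customer_no TEXT);
    CREATE TABLE product_inventory (sku TEXT, product_no TEXT);
    INSERT INTO sales_orders VALUES ('SO-2', 'C001');
    INSERT INTO product_inventory VALUES ('P-9', 'PRD7');
""")
print(downstream_nodes(conn, "C001", "PRD7", "purchase_order"))
# ['production_plan', 'order:SO-2', 'inventory:P-9']
```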
In this alternative embodiment, combining the business association rule and the data association rule makes it possible to comprehensively cover the influence scope of the initial node at both the business process and data structure levels. The business association rule ensures that nodes possibly affected by abnormal points in subsequent business links are identified from the perspective of the actual business operation flow, while the data association rule uncovers, from the perspective of data storage and processing, the nodes corresponding to all data points directly or indirectly associated with the initial node. The application of these dual rules avoids the omission of parts of the influence range that single-dimension analysis might cause, and improves the completeness of influence path tracking.
By determining the nodes corresponding to the subsequent links and the relevant data points in the two dimensions of business process and data structure respectively, each downstream associated node possibly affected by the initial node can be accurately located. Compared with a method that analyzes only the business flow or only the data structure, the comprehensive method of this embodiment can more accurately identify the affected nodes in complex business and data environments. For example, for an abnormal point in sales order data, its potential influence on the subsequent production plan can be found through the business association rule, and its influence on aspects such as customer credit assessment, inventory management and cost accounting can be further found through the data association rule, providing a reliable basis for subsequent precise analysis and processing.
And the nodes corresponding to the follow-up links and the nodes corresponding to the related data points are used as downstream related nodes, so that powerful guarantee is provided for the accuracy of data quality assessment. In the data quality evaluation process, the influence degree of the abnormal points on the overall quality of the report data can be comprehensively evaluated from multiple angles. By comprehensively considering the downstream associated nodes, the influence of the data abnormality on business decisions, financial reports, operation performance and the like can be evaluated more accurately, so that an enterprise manager is helped to make more scientific and reasonable decisions, the data management and quality control strategies of the enterprise are optimized, and the operation efficiency and the competitiveness of the enterprise are improved.
Optionally, the affected data analysis module is specifically configured to:
acquiring the data points in the propagation path according to the propagation path;
analyzing the affected degree of the data points to obtain an affected degree value for each data point;
determining the affected data points according to the affected degree values;
and labeling the affected data points and generating a set of affected data points.
Specifically, the affected data analysis module obtains the data points in the propagation path according to the propagation path generated by the influence path tracking module, where the propagation path comprises a series of data nodes possibly affected by the abnormal point. According to the detailed information of each node recorded in the propagation path (such as data type, data point position, association rules, etc.), the affected data analysis module extracts the corresponding data points from the database or data warehouse of the enterprise by calling the data access interface. For example, if the propagation path includes sales data nodes, inventory data nodes and financial data nodes, the module will extract sales data points from the sales database table, inventory data points from the inventory database table, and financial data points from the financial database table, respectively. These data points are the subjects of the subsequent influence degree analysis. After the data points in the propagation path are acquired, the module performs a degree-of-influence analysis on each data point.
In a preferred embodiment of the invention, the analysis method comprises a plurality of ways of quantifying the extent of influence, in particular comprising:
Deviation calculation, which calculates the deviation between the actual value and the expected value of each data point. The expected value may be determined based on an average of historical data, trend prediction values, or reasonable values defined by business rules. For example, for inventory data points, the expected value may be an ideal inventory level calculated from historical sales data and inventory turnover rates. The difference between the actual value and the expected value is the deviation, and the absolute value or the relative value of the deviation can be used as an index for measuring the affected degree.
Correlation analysis for analyzing the correlation strength between data points and outliers. The linear or nonlinear relationship between the data points and outlier data is quantified by calculating their correlation coefficients (e.g., pearson correlation coefficients, spearman correlation coefficients, etc.). The closer the absolute value of the correlation coefficient is to 1, the greater the probability that the data point is affected by the outlier, and the higher the degree of the influence. For example, if the correlation coefficient between the sales data outlier and the profit data point is 0.8, indicating a strong correlation between the two, the profit data point is likely to be significantly affected by the sales data outlier.
In some cases, causal analysis methods (e.g., granger causal test, structural equation model, etc.) are utilized to determine whether causal relationships exist between data points and outliers. If causal relationships exist, the strength of the causal relationships is further evaluated to determine the degree of influence. For example, in production data, raw material quality anomalies may affect product yield data points by causal relationships, and by Granger causal inspection, it may be verified whether such causal relationships exist and the extent of their impact quantified. Combining the above analysis methods, a degree of influence value is calculated for each data point. This value is a composite score that is weighted by a number of factors such as bias, relevance, causality, etc. For example, a deviation weight of 0.4, a correlation weight of 0.3, and a causal relation weight of 0.3 are set, and a final affected degree value is calculated from the quantized result of each factor.
Based on the calculated affected degree values, the module sets an affected degree threshold to determine the affected data points; in this embodiment, the threshold may be based on historical data analysis, business experience, or statistical principles. For example, a reasonable threshold is determined by analyzing the distribution of affected degree values under both normal and abnormal situations in the historical data. If the affected degree threshold is set to 0.5, data points with a calculated affected degree value greater than or equal to 0.5 are judged to be affected data points, while data points below 0.5 are considered only slightly affected and are temporarily not classified as affected data points. The determined affected data points are labeled, and the labeling information includes details such as the identification of the data point, its affected degree value, and its position on the propagation path. For example, an affected profit data point may be labeled "affected data point - profit - affected degree value 0.6 - located at the 2nd link of the propagation path". These labeled affected data points are then collected to generate a set of affected data points. This set may be stored in the form of a list, data table, or other data structure that facilitates the overall quality assessment and generation of quality assessment reports by the subsequent data evaluation module.
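A sketch of the composite affected-degree score and thresholding described above, using the example weights (deviation 0.4, correlation 0.3, causality 0.3) and the 0.5 threshold from the text; all inputs are assumed to be pre-normalized to [0, 1]:

```python
def affected_degree(deviation, correlation, causality,
                    weights=(0.4, 0.3, 0.3)):
    """Weighted composite of normalized deviation, correlation strength,
    and causal-effect strength, each in [0, 1]."""
    w_dev, w_cor, w_cau = weights
    return w_dev * deviation + w_cor * correlation + w_cau * causality

def label_affected(points, threshold=0.5):
    """Return labeled affected data points: those scoring >= threshold."""
    return [(name, round(score, 2)) for name, score in points
            if score >= threshold]

points = [("profit", affected_degree(0.7, 0.8, 0.4)),
          ("inventory", affected_degree(0.2, 0.3, 0.1))]
print(label_affected(points))  # [('profit', 0.64)]
```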
In this alternative embodiment, the degree of influence of the outliers on each data point in the propagation path can be accurately quantified by comprehensively calculating the influence degree value through a plurality of analysis methods (deviation calculation, correlation analysis, causality inference, etc.). Compared with a single analysis method, the method of the embodiment reflects the actual affected situation of the data more comprehensively and objectively and provides a reliable data basis for subsequent data quality evaluation. The affected data points are determined according to the affected degree threshold, so that key data points which are affected by abnormal points more and data points which are affected less can be effectively distinguished. This helps the enterprise concentrate on and handle those data problems that actually have a significant impact on business and decision making, improving the efficiency and pertinence of data management. The affected data points are labeled and a set is generated, providing a clear view of the data impact. Through the set, the propagation influence range and degree of the abnormal points in the report data can be intuitively known, and reasonable data correction strategies and business countermeasures can be formulated conveniently.
Optionally, the data evaluation module is specifically configured to:
establishing a data quality evaluation model;
determining, by the data quality assessment model, a severity value for the outlier and each of the affected data points and a range of influence of the outlier;
performing an overall quality assessment on the report data using the outlier and the severity value for each of the affected data points and the range of influence of the outlier;
determining an abnormal degree value of the report data according to the evaluation result, and generating an abnormal detail list of the report data;
and generating the quality assessment report according to the abnormal detail list and the quality score.
Specifically, the historical information of report data is collected and arranged, multidimensional characteristics such as accuracy, completeness, consistency, timeliness and the like of the data are covered, and a database containing a large number of data samples is constructed and used for training and optimizing a data quality assessment model. The samples are learned and analyzed by using machine learning algorithms, such as decision trees, neural networks, etc., so that the model can identify the rules and patterns of data quality.
In a preferred embodiment of the present invention, the collected report data samples are first preprocessed before the machine learning algorithm is applied. The preprocessing comprises data cleaning, data integration, data conversion and data reduction. Data cleaning removes noise from the data and handles missing values; data integration merges data from different data sources; data conversion transforms the data into a format suitable for algorithmic processing, for example by normalization or standardization; and data reduction reduces the data volume to improve algorithm efficiency. Preprocessing ensures the quality and consistency of the data and provides a reliable data basis for the subsequent application of the machine learning algorithm.
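A minimal preprocessing sketch under assumed inputs is given below; it uses pandas, and the column name "amount" and the concrete choices (median fill, min-max normalization, random down-sampling) are illustrative stand-ins for the steps named above, not the patented procedure.

    import pandas as pd

    def preprocess_samples(frames):
        """frames: list of DataFrames from different sources; column names are hypothetical."""
        df = pd.concat(frames, ignore_index=True)                    # data integration
        df = df.drop_duplicates()                                    # data cleaning: remove duplicates
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # data conversion: fix format errors
        df["amount"] = df["amount"].fillna(df["amount"].median())    # data cleaning: fill missing values
        span = df["amount"].max() - df["amount"].min()
        if span > 0:                                                 # data conversion: min-max normalization
            df["amount_norm"] = (df["amount"] - df["amount"].min()) / span
        if len(df) > 10000:                                          # data reduction: simple down-sampling
            df = df.sample(frac=0.5, random_state=0)
        return df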
Feature selection and feature engineering are then performed on the preprocessed data to extract the key features related to data quality. These features may include an accuracy indicator (e.g., the deviation of the data from the actual value), a completeness indicator (e.g., the proportion of missing values), a consistency indicator (e.g., the degree of contradiction between data items), and a timeliness indicator (e.g., the update frequency of the data). According to the business requirements and data characteristics, the features with the greatest influence on the data quality evaluation are selected using methods such as statistical analysis and correlation analysis, and these features are combined and constructed so as to improve the performance and generalization capability of the model.
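The sketch below computes one simple stand-in formula for each of the four indicators; these formulas are illustrative choices made for this sketch, not the patented definitions.

    import numpy as np

    def quality_features(values, reference, last_update_age):
        """Return the four illustrative quality indicators for one data series."""
        values = np.asarray(values, dtype=float)
        reference = np.asarray(reference, dtype=float)
        return {
            "accuracy": float(np.nanmean(np.abs(values - reference))),  # deviation from actual values
            "completeness": 1.0 - float(np.mean(np.isnan(values))),     # share of non-missing values
            "consistency": float(np.nanstd(values)),                    # rough contradiction measure
            "timeliness": float(last_update_age),                       # e.g. hours since last update
        }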
The processed data samples are then learned and analyzed using a machine learning algorithm, such as a decision tree or a neural network, to train the data quality evaluation model.
In one embodiment, the selected machine learning algorithm is a decision tree, a supervised learning algorithm based on a tree structure. During training, the data set is repeatedly divided into subsets by selecting appropriate features and split points until a stop condition is met. For data quality assessment, the decision tree can determine, from the feature values, whether a data point is abnormal and what its severity value is. For example, a decision tree model is constructed using features such as data accuracy and completeness as the split basis: each internal node represents a test on a feature, each branch represents a test outcome, and each leaf node represents either a class (e.g., normal data, mildly abnormal data, severely abnormal data) or a continuous output (e.g., a severity value). The tree is learned from the training data set, and its structure and parameters are adjusted so that the model can accurately recognize the rules and patterns of data quality.
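A decision-tree embodiment along these lines might be trained as in the sketch below, here using scikit-learn on synthetic features and toy labels; the feature layout, labels, and depth are assumptions for illustration only.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X_train = rng.random((200, 4))   # columns: accuracy, completeness, consistency, timeliness
    # Toy labels: 0 = normal, 1 = mildly abnormal, 2 = severely abnormal
    y_train = (X_train[:, 0] > 0.7).astype(int) + (X_train[:, 0] > 0.9).astype(int)

    tree = DecisionTreeClassifier(max_depth=5, random_state=0)  # internal nodes test features
    tree.fit(X_train, y_train)
    classes = tree.predict(rng.random((3, 4)))  # one class per data point
    # A DecisionTreeRegressor could output a continuous severity value instead.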
In another embodiment, the selected machine learning algorithm is a neural network, an algorithm that mimics the structure of the neural networks of the human brain. It is composed of a large number of neurons (nodes) through which information is transferred and processed. In the data quality evaluation, the feature vector serves as the input of the neural network, and after nonlinear transformations through several layers of neurons the network outputs the severity value or quality class of the data. For example, a multi-layer perceptron (MLP) model is constructed comprising an input layer, hidden layers and an output layer: the input layer receives the data-quality-related features, the hidden layers extract and combine the features through activation functions, and the output layer gives the evaluation result. The weights of the neural network are iteratively updated on the training data set by the back-propagation algorithm, optimizing the performance of the model so that it can learn the complex rules and patterns of data quality.
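A corresponding MLP sketch, again with scikit-learn and synthetic data, might look as follows; the layer sizes and the toy severity targets are assumptions made for this sketch.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X_train = rng.random((200, 4))                      # quality-related feature vectors
    y_train = X_train @ np.array([0.5, 0.2, 0.2, 0.1])  # toy severity values

    # Input layer -> two hidden layers (ReLU activations) -> severity output;
    # the weights are updated iteratively by backpropagation inside fit().
    mlp = MLPRegressor(hidden_layer_sizes=(16, 8), activation="relu",
                       max_iter=2000, random_state=0)
    mlp.fit(X_train, y_train)
    severity = mlp.predict(rng.random((1, 4)))          # predicted severity value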
During model training, methods such as cross-validation are used to validate the model and to evaluate its accuracy and generalization capability. By adjusting algorithm parameters, such as the depth of the decision tree or the number of layers and neurons of the neural network, over-fitting or under-fitting of the model is avoided and its predictive performance is improved.
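For instance, the tree depth could be tuned by cross-validation as sketched below, using scikit-learn's GridSearchCV; the parameter grid and the synthetic data are illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X_train = rng.random((200, 4))
    y_train = (X_train[:, 0] > 0.7).astype(int)

    # 5-fold cross-validation over the tree depth to avoid over- or under-fitting.
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid={"max_depth": [3, 5, 8, None]},
                          cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    best_model = search.best_estimator_   # model with the best validated depth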
After the report data is input into the data quality evaluation model, the model uses the learned anomaly-feature recognition rules to quickly locate the outliers in the data. For each outlier, a severity value is determined by computing its degree of deviation from the normal data using a standardized formula, such as a statistics-based standard-deviation or mean-square-deviation method, so as to quantify the influence of the outlier on data quality. At the same time, the range of influence of the outlier on surrounding data points is analyzed using the association relationships and propagation paths between the data. For example, for an abnormal revenue data point in a financial report, analyzing its association with other data points such as cost and profit determines that the range of influence includes the directly related cost-calculation and profit-accounting data points, as well as the influence hierarchy and range boundaries throughout the report data. This process accurately identifies which data points are affected by the outliers, providing a precise basis for the further quality assessment. The severity values of the outlier and of each affected data point and the range of influence of the outlier are then considered together, and the overall quality of the report data is assessed by weighted summation or another mathematical method. According to a preset evaluation algorithm, the severity value of each outlier is multiplied by its corresponding influence-range weight, and the quality weights of the remaining normal data points are added, to obtain a comprehensive quality score. For example, for a report containing several outliers, the deduction of each outlier from the overall quality is calculated from its severity value and range of influence, and these deductions are subtracted from the overall quality score. A detailed quality evaluation report is also generated, listing the problems in the report data, the positions of the outliers, their ranges of influence, and so on, so that the user can conveniently grasp the overall picture of the data quality.
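The severity quantification and weighted scoring described above might be sketched as follows; the z-score formula, the base score of 100, and the deduction rule are assumptions chosen to illustrate one possible weighted-summation scheme.

    import numpy as np

    def severity_zscore(value, normal_values):
        """Severity as the standardized deviation of the outlier from normal data."""
        mu, sigma = np.mean(normal_values), np.std(normal_values)
        return abs(value - mu) / sigma if sigma > 0 else 0.0

    def overall_quality_score(outliers, base_score=100.0):
        """outliers: (severity_value, influence_range_weight) pairs; each deduction is
        severity times influence-range weight, subtracted from the base score."""
        return max(base_score - sum(s * w for s, w in outliers), 0.0)

    score = overall_quality_score([(2.5, 4.0), (1.2, 2.0)])  # 100 - 12.4 = 87.6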
According to the result of the overall quality evaluation, the anomaly degree value of the report data is calculated through further data analysis and statistics. It can be obtained by summarizing and normalizing the severity values of all outliers, so that it intuitively reflects how anomalous the report data is. At the same time, according to the detailed information of each outlier, including its anomaly type, position, severity value and range of influence, an anomaly detail list is generated following a given ordering rule and format requirement. The list presents the outliers in the report clearly, in the form of a table or a list, so that the user can quickly locate and inspect the detailed information of the anomalous data, providing a solid basis for subsequent data processing and correction.
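One possible normalization and ordering is sketched below; the saturating normalization and the record fields are illustrative assumptions, not the patented definitions.

    def anomaly_degree(severities):
        """Normalize the summed severity values into [0, 1]; the saturating form
        used here is one illustrative choice among possible normalizations."""
        total = sum(severities)
        return total / (1.0 + total)

    def build_detail_list(records):
        """Sort anomaly records by severity (descending) into the anomaly detail list."""
        fields = ("type", "position", "severity", "influence_range")
        return [{k: r[k] for k in fields}
                for r in sorted(records, key=lambda r: r["severity"], reverse=True)]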
Combining the anomaly detail list and the quality score, a complete and detailed data quality assessment report is generated using a report-generation template and formatting tools. The report comprises an overview of the data quality assessment, the assessment methods and models, the outlier analysis, the overall quality score, improvement advice, and other parts. The outlier analysis section elaborates the details of each outlier; the overall quality scoring section presents the comprehensive level of data quality as an intuitive chart or numerical value; and the improvement advice section provides the user with reference opinions based on the evaluation result. Through the quality evaluation report, the user can comprehensively understand the quality of the report data, discover and resolve data quality problems in time, and improve the reliability and usability of the data.
The report content includes the following parts (a minimal assembly sketch follows the list):
A data quality overview, summarizing the overall quality of the report data, including key indicators such as the data quality score, the quality grade and the anomaly degree value.
An anomaly detail analysis, detailing each outlier and affected data point in the anomaly detail list and setting out the specific circumstances and possible causes of the anomaly or influence. For example, if the value of an abnormal sales data point is far higher than the historical value for the same period, the cause may be a sudden change in the market environment or a data entry error, and the abnormal sales data in turn produces a larger deviation in the calculation result of the affected profit data point.
An assessment of the range of influence of each outlier, explaining the business links and data types involved and the potential impact on other business and data. For example, a sales data anomaly affects not only profit calculation but also inventory management, production planning, and so on.
Improvement advice, namely corresponding suggestions for the discovered data quality problems, such as strengthening data-entry auditing, optimizing data monitoring rules, and carrying out root-cause analysis of data anomalies, to help the enterprise raise its level of data quality management.
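A minimal plain-text assembly of these sections might look as follows; the section headings and formatting are assumptions, and the inputs are the detail list and scores from the sketches above.

    def generate_quality_report(detail_list, quality_score, degree):
        """Assemble the report sections from a simple plain-text template."""
        lines = ["== Data Quality Overview ==",
                 f"quality score: {quality_score:.1f}  anomaly degree value: {degree:.2f}",
                 "== Anomaly Detail Analysis =="]
        for row in detail_list:
            lines.append(f"- {row['type']} at {row['position']}: severity {row['severity']:.2f}, "
                         f"influence range: {row['influence_range']}")
        lines += ["== Influence Range Assessment ==",
                  "(business links and data types affected by each outlier)",
                  "== Improvement Advice ==",
                  "(e.g. strengthen entry auditing; optimize monitoring rules)"]
        return "\n".join(lines)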
In this optional embodiment, the established data quality evaluation model can rapidly and accurately identify the outliers in report data, avoiding the complexity and error-proneness of manual inspection and greatly improving the efficiency and accuracy of data anomaly identification. By determining the severity values and ranges of influence of the outliers and the affected data points and performing an overall quality assessment of the report data, a comprehensive quantitative analysis of data quality is achieved. This makes the assessment of data quality more objective and accurate, providing reliable data support for data management and decision making. The generated anomaly detail list and quality evaluation report give users detailed information on anomalous data together with improvement advice, helping them to discover data problems in time and take effective corrective measures, improving the quality and credibility of the report data and, in turn, the data management level and decision-making efficiency of the enterprise. Through this strict data quality evaluation and supervision mechanism, the quality of the report data is effectively guaranteed, system failures and business risks caused by data quality problems are reduced, and the reliability and stability of the whole system are enhanced.
The invention also provides a report data anomaly monitoring and quality assessment method (an end-to-end sketch follows the steps below), which comprises the following steps:
acquiring report data of an enterprise, and analyzing the report data to obtain data points corresponding to each data type in the report data;
analyzing the data points of each data type, and judging whether abnormal points exist in the data points according to preset judging rules corresponding to the data types;
when an abnormal point exists in the data points, performing path tracking on the abnormal point through a path analysis algorithm, and determining a propagation path corresponding to the abnormal point;
performing an affected-degree analysis on the data points in the propagation path to determine the affected data points;
evaluating an overall quality level of the report data based on the affected data points in combination with the abnormal points; and generating a quality evaluation report of the report data according to the overall quality level.
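The sketch below strings these steps together; the inline rule checks, propagation links, and toy quality formula are hypothetical stand-ins for the modules described above, included only so that the sketch executes, and do not represent the claimed implementations.

    def monitor_and_assess(report_data, rules, links):
        """End-to-end sketch of the claimed steps with stand-in logic."""
        points = dict(report_data)                                    # step 1: parse data points
        outliers = [k for k, v in points.items() if not rules[k](v)]  # step 2: judging rules
        results = []
        for o in outliers:
            path = links.get(o, [])                                   # step 3: propagation path
            affected = [p for p in path if p in points]               # step 4: affected points
            results.append((o, affected))
        level = 100.0 - 10.0 * sum(len(a) + 1 for _, a in results)    # step 5: toy quality level
        return {"quality_level": max(level, 0.0), "anomalies": results}

    # Usage with toy data: a negative revenue value violates its judging rule and
    # propagates to the linked profit and cost points.
    report = monitor_and_assess(
        {"revenue": -5.0, "cost": 3.0, "profit": -8.0},
        rules={"revenue": lambda v: v >= 0, "cost": lambda v: v >= 0, "profit": lambda v: True},
        links={"revenue": ["profit", "cost"]},
    )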
The advantages of the report data anomaly monitoring and quality assessment method of the present invention over the prior art are the same as those of the report data anomaly monitoring and quality assessment system described above, and are not repeated here.
As shown in fig. 3, the present invention further provides an electronic device, including a memory and a processor;
the memory is used for storing a computer program;
the processor is used for implementing the above report data anomaly monitoring and quality assessment method when executing the computer program.
The advantages of the electronic device of the present invention over the prior art are the same as those of the report data anomaly monitoring and quality assessment system over the prior art, and are not repeated here.
Although the present disclosure is described above, the scope of protection of the present disclosure is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such changes and modifications fall within the scope of protection of the disclosure.