Disclosure of Invention
The invention aims to provide a heterogeneous data source fusion analysis system so as to solve the problems in the background art.
In order to achieve the above purpose, the invention provides a heterogeneous data source fusion analysis system, which comprises:
the data integration module is used for acquiring data quality parameters of distributed storage nodes associated with heterogeneous data sources in a plurality of data acquisition periods, carrying out multidimensional analysis on the data quality parameters to generate data quality evaluation values, judging the data fusion reliability of the distributed storage nodes based on the data quality evaluation values, and generating priority instructions, wherein the instructions comprise high-priority adjustment instructions and low-priority adjustment instructions;
comparing the data quality assessment value with a data quality assessment threshold;
if the data quality evaluation value is larger than the data quality evaluation threshold value, generating a high-priority adjustment instruction;
if the data quality evaluation value is smaller than or equal to the data quality evaluation threshold value, generating a low-priority adjustment instruction;
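The threshold comparison above can be sketched in code. This is a minimal illustration only; the function name, the instruction strings, and the example threshold of 0.8 are assumptions for demonstration, not values prescribed by the invention.

```python
# Illustrative sketch of the priority-instruction decision (assumed names/threshold).
QUALITY_THRESHOLD = 0.8  # assumed data quality evaluation threshold

def priority_instruction(quality_score):
    """Map a data quality evaluation value to an adjustment instruction."""
    if quality_score > QUALITY_THRESHOLD:
        return "HIGH_PRIORITY_ADJUST"
    return "LOW_PRIORITY_ADJUST"
```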
The characteristic analysis module is used for extracting time nodes with abnormal data quality of the distributed storage nodes in a plurality of data acquisition periods based on the high-priority adjustment instruction, carrying out association analysis on the time nodes and the data quality parameters to generate a data characteristic rule value, judging whether the data quality abnormality presents periodic characteristics based on the data characteristic rule value, and generating a rule adjustment instruction if the periodic characteristics exist;
the time node with abnormal data quality comprises a specific time point when the abnormality occurs;
And the fusion evaluation module is used for determining an abnormal influence period in the data fusion period based on the rule adjustment instruction, matching the abnormal influence period with the current fusion time node, dynamically adjusting the fusion flow of the heterogeneous data source according to the matching result, and generating an adjustment instruction, wherein the adjustment instruction comprises a pause fusion instruction and a continuous fusion instruction.
Preferably, the data quality parameter comprises a data consistency index and a data integrity index;
Based on independent analysis of the data consistency index and the data integrity index, respectively generating a consistency evaluation value and an integrity evaluation value;
and carrying out weighted fusion on the consistency evaluation value and the integrity evaluation value to generate a data quality evaluation value.
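The weighted fusion step can be sketched as follows. The default weights of 0.6 and 0.4 mirror the worked example given later in the description; they are illustrative assumptions, not fixed parameters of the system.

```python
def quality_evaluation(consistency_value, integrity_value,
                       w_consistency=0.6, w_integrity=0.4):
    """Weighted fusion of the consistency and integrity evaluation values
    into a single data quality evaluation value (weights are illustrative)."""
    return consistency_value * w_consistency + integrity_value * w_integrity
```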
Preferably, data consistency indexes of the distributed storage nodes in a plurality of data acquisition periods are integrated and an average value is calculated, a consistency average value is generated, the consistency average value and the total duration of the data acquisition periods are processed in proportion to obtain consistency frequency, and the consistency frequency and a consistency frequency threshold value are normalized to generate a consistency evaluation value.
Preferably, the data integrity indexes of the distributed storage nodes in a plurality of data acquisition periods are accumulated to generate an integrity sum, the integrity sum and the number of the data acquisition periods are processed in proportion to obtain an integrity average value, and the integrity average value and the total duration of the data acquisition periods are normalized to generate an integrity evaluation value.
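Since the specification leaves the proportional and normalization steps open ("set by the system"), the sketch below adopts one plausible choice, a ratio clamped to [0, 1]; both functions and their scaling rules are assumptions, not the claimed formulas.

```python
def consistency_evaluation(indexes, total_minutes, freq_threshold):
    """Consistency mean -> consistency frequency -> normalized evaluation value."""
    mean = sum(indexes) / len(indexes)           # consistency mean
    frequency = mean / total_minutes             # assumed proportional processing
    return min(frequency / freq_threshold, 1.0)  # assumed normalization, clamped

def integrity_evaluation(indexes, total_minutes):
    """Integrity sum -> integrity mean -> normalized evaluation value."""
    total = sum(indexes)                         # integrity sum
    mean = total / len(indexes)                  # integrity mean
    return min(mean / total_minutes, 1.0)        # assumed normalization by duration
```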
Preferably, the generation mode of the data characteristic rule value is as follows:
Generating a data stability representation value and a data change representation value;
and linearly combining the data stable representation value and the data change representation value through a preset weight coefficient to generate a data characteristic rule value.
Preferably, the data stability representation value is generated by:
screening historical data acquisition periods of which the data consistency index and the data integrity index are in normal ranges, marking the historical data acquisition periods as stable periods, and carrying out serialization marking on data quality abnormality time points in each stable period according to a time sequence;
Calculating the time interval deviation of adjacent time points based on the time points after the serialization marking;
comparing the time interval deviation with a preset interval threshold, and marking the time interval deviation as a stable interval if the time interval deviation is smaller than or equal to the threshold;
counting the proportion of the number of the stable intervals to the total number of intervals, and marking the stable period as a comprehensive stable period if the proportion exceeds a preset proportion threshold;
and counting the proportion of the number of the comprehensive stable periods to the total number of the historical data acquisition periods, and generating a data stable representation value.
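The screening and counting steps above can be sketched as follows. Time points are plain numbers (e.g. minutes into a period), and the treatment of stable periods containing fewer than two anomaly points is an assumption, since the specification does not address that case.

```python
def is_comprehensive_stable(anomaly_times, interval_threshold, ratio_threshold):
    """Check whether one stable period qualifies as a comprehensive stable period."""
    times = sorted(anomaly_times)                # serialization marking in time order
    intervals = [b - a for a, b in zip(times, times[1:])]
    if not intervals:
        return True                              # assumed: no adjacent pairs counts as stable
    stable = sum(1 for gap in intervals if gap <= interval_threshold)
    return stable / len(intervals) > ratio_threshold

def stability_value(stable_periods, total_periods, interval_threshold, ratio_threshold):
    """Proportion of comprehensive stable periods among all historical periods."""
    comprehensive = sum(
        1 for times in stable_periods
        if is_comprehensive_stable(times, interval_threshold, ratio_threshold)
    )
    return comprehensive / total_periods
```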
Preferably, the data change representation value is generated by:
extracting interval time of adjacent data quality abnormal time points in all comprehensive stable periods to form an interval time set;
and calculating the variance of each group of interval time in the interval time set, calculating an average value, generating a variance average value, and marking the variance average value as a data change representation value.
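A sketch of the variance mean computation; population variance is assumed, since the specification does not distinguish population from sample variance.

```python
def variance(xs):
    """Population variance of one group of interval times (assumed interpretation)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def change_value(interval_groups):
    """Mean of the per-group variances: the data change representation value."""
    return sum(variance(g) for g in interval_groups) / len(interval_groups)
```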
Preferably, comparing the characteristic rule value of the data with a characteristic rule threshold value;
If the characteristic rule value of the data is larger than the characteristic rule threshold value, a rule adjustment instruction is generated;
if the characteristic rule value of the data is smaller than or equal to the characteristic rule threshold value, no operation is triggered.
Preferably, the abnormal influence period in the data fusion period is determined in the following manner:
extracting a starting time point and a terminating time point of a data fusion period, and marking on a time axis;
Acquiring the latest data quality abnormality time point before the starting time point of the data fusion period, and taking the latest data quality abnormality time point as a reference point;
calculating the average value of the historical data quality abnormal interval time, generating an interval reference value, taking the reference point as a starting point, and marking the predicted abnormal starting point on a time axis according to the interval reference value in sequence;
superposing preset abnormal duration time after each predicted abnormal starting point to generate a predicted abnormal ending point;
and marking the time period between the predicted abnormal starting point and the corresponding predicted abnormal ending point as a predicted abnormal time period, and marking the overlapping part of the time range of the data fusion period and the predicted abnormal time period as an abnormal influence time period.
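The marking steps above can be sketched as follows. Times are plain numbers on a shared axis (e.g. minutes), the periods are half-open tuples, and all names are illustrative assumptions.

```python
def abnormal_influence_periods(fusion_start, fusion_end,
                               reference_point, interval_ref, anomaly_duration):
    """Mark predicted anomaly windows on the time axis and intersect them
    with the data fusion period to obtain the abnormal influence periods."""
    periods = []
    start = reference_point + interval_ref   # first predicted abnormal starting point
    while start <= fusion_end:
        end = start + anomaly_duration       # predicted abnormal ending point
        overlap = (max(start, fusion_start), min(end, fusion_end))
        if overlap[0] < overlap[1]:          # nonempty overlap with the fusion period
            periods.append(overlap)
        start += interval_ref                # next predicted abnormal starting point
    return periods
```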
Preferably, when the data fusion task is executed, the current fusion time node is obtained and matched with the abnormal influence period;
if the current fusion time node is positioned in the abnormal influence period, generating a suspension fusion instruction;
If the current fusion time node is located outside the abnormal influence period, generating an instruction to be matched;
Based on the instruction to be matched, positioning a predicted abnormal starting point closest to the current fusion time node on a time axis, and calculating the interval duration between the two;
If the interval time length is greater than or equal to a preset safety threshold value, generating a continuous fusion instruction;
If the interval duration is smaller than the preset safety threshold, generating a pause fusion instruction.
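The matching logic of this paragraph can be sketched as one decision function; the instruction strings and the use of absolute distance to locate the nearest predicted abnormal starting point are assumptions.

```python
def fusion_instruction(now, influence_periods, predicted_starts, safety_threshold):
    """Decide whether to pause or continue fusion at time `now`."""
    # current fusion time node inside an abnormal influence period: pause
    if any(start <= now <= end for start, end in influence_periods):
        return "PAUSE_FUSION"
    # instruction-to-be-matched branch: gap to nearest predicted abnormal start
    nearest_gap = min(abs(now - s) for s in predicted_starts)
    if nearest_gap >= safety_threshold:
        return "CONTINUE_FUSION"
    return "PAUSE_FUSION"
```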
Compared with the prior art, the invention has the beneficial effects that:
At the data quality evaluation level, the system acquires data quality parameters, such as data consistency indexes and data integrity indexes, of distributed storage nodes associated with heterogeneous data sources in a plurality of data acquisition periods and performs multidimensional analysis on the data quality parameters. Firstly, respectively generating a consistency evaluation value and an integrity evaluation value, and then, obtaining a data quality evaluation value through weighted fusion. Compared with the traditional simple judgment method, the method fully considers different characteristics of the data and avoids the unilateral performance of single index evaluation. Taking data processing in the financial industry as an example, in the fusion analysis of customer asset data and transaction data, accurate quality assessment can ensure the accuracy of the data, reduce risk assessment errors caused by data errors, provide reliable basis for the decision of financial institutions, and reduce potential economic losses.
Based on the data quality assessment results, the system can generate priority instructions. A high-priority adjustment instruction is generated when the data quality evaluation value is larger than the data quality evaluation threshold, and a low-priority adjustment instruction is generated when the evaluation value is smaller than or equal to the threshold. The system can thus adopt different processing strategies in time according to the data quality condition and allocate system resources reasonably. When an electronic commerce platform processes a large amount of commodity data, fusion analysis is carried out on high-quality data preferentially, accurate sales trend reports are output rapidly to help merchants adjust operation strategies in time, and low-quality data is scheduled for processing when resources are relatively idle, thereby improving the overall operation efficiency of the system.
The feature analysis module further mines the data value. Based on a high-priority adjustment instruction, it extracts the time nodes of data quality abnormalities, carries out association analysis combining the time nodes and the data quality parameters, generates a data characteristic rule value, and judges whether the data quality abnormality presents a periodic characteristic; if so, a rule adjustment instruction is generated. This function helps to gain in-depth knowledge of the intrinsic rules of the data and to discover potential problems in advance. In power system monitoring, the time of equipment failure can be predicted by analyzing the periodic characteristics of abnormal power data quality, maintenance work can be arranged in advance, power failure accidents caused by equipment failure are reduced, the stability of power supply is ensured, and the maintenance cost is reduced.
The fusion evaluation module determines an abnormal influence period in a data fusion period according to the rule adjustment instruction, matches the abnormal influence period with a current fusion time node, dynamically adjusts the fusion process of the heterogeneous data source, and generates a pause fusion instruction or a continuous fusion instruction. The dynamic adjustment mechanism effectively avoids fusion operation in the data quality abnormal period, and ensures the accuracy of fusion results. In the meteorological data fusion analysis, when the meteorological sensor data is abnormal, the system pauses fusion, so that the error data is prevented from being mixed into an analysis result, the reliability of meteorological prediction is ensured, and accurate meteorological information is provided for the production and life of people.
The data stability expression value and the generation mode of the data change expression value characterize the data from different angles. The data stability representation value is obtained through the steps of screening a stability period, analyzing time interval deviation and the like, the stability of the data is reflected, and the data change representation value is generated through calculating an interval time variance mean value, so that the change condition of the data is reflected. The data characteristic rule value generated by the linear combination of the two provides a comprehensive basis for judging the abnormal characteristics of the data quality. In the logistics data processing, the feature analysis can help enterprises to better master the rule of logistics distribution, discover abnormal fluctuation in time, optimize logistics routes and distribution plans, improve logistics efficiency and reduce operation cost.
The method for determining the abnormal influence time period in the data fusion period comprehensively considers factors such as the average value of the quality abnormal interval time of the historical data, so that the determination of the abnormal time period is more scientific and reasonable. In practical application, when the data fusion task is executed, the system can accurately make a decision according to the matching condition of the current fusion time node and the abnormal influence period, and the adaptability and the reliability of the system are improved. In the industrial production process monitoring, the function can ensure the accuracy of production data fusion analysis, discover abnormal conditions in the production process in time, ensure the smooth production and improve the product quality.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-5, the present invention provides a heterogeneous data source fusion analysis system, which comprises the following specific implementation steps:
The data integration module continuously acquires data quality parameters of the distributed storage nodes associated with the heterogeneous data sources in a plurality of data acquisition periods during system operation. These data quality parameters play a key role in assessing data quality. After the data quality parameters are acquired, the data integration module performs multidimensional analysis on them and, by comprehensively considering the factors of each dimension, generates a data quality evaluation value capable of accurately reflecting the data quality condition. Based on the data quality evaluation value, the data integration module can judge the data fusion reliability of the distributed storage nodes. Specifically, the data quality evaluation value is compared with a preset data quality evaluation threshold. If the data quality evaluation value is larger than the data quality evaluation threshold, the data quality is better and the fusion reliability is higher, and a high-priority adjustment instruction is generated; otherwise, if the data quality evaluation value is smaller than or equal to the data quality evaluation threshold, a low-priority adjustment instruction is generated. These instructions provide a basis for the prioritization of data processing by the subsequent system.
The feature analysis module starts working after receiving the high priority adjustment instruction. The method can extract time nodes with abnormal data quality of the distributed storage nodes in a plurality of data acquisition periods, and the time nodes are accurate to the specific moment when the abnormality occurs. Then, association analysis is performed by combining the time nodes and the data quality parameters. Through analysis and processing of a large amount of data, data characteristic rule values are generated. After the data characteristic rule value is obtained, the characteristic analysis module can judge whether the data quality abnormality presents periodic characteristics. If the data characteristic rule value shows that the periodic characteristic exists, a rule adjustment instruction is generated, and a basis is provided for adjustment of a fusion process by a subsequent fusion evaluation module.
And after receiving the rule adjustment instruction, the fusion evaluation module determines an abnormal influence period in the data fusion period. The abnormal influence time period is matched with the current fusion time node, and the fusion flow of the heterogeneous data source is dynamically adjusted according to the matching result. If the current fusion time node is outside the abnormal influence period, whether the fusion can be continued or not is further judged, and a continuous fusion instruction or other related instructions are generated according to different conditions.
The technical scheme of the invention is further described in detail below with reference to specific embodiments.
Example 1:
In the system, the data quality parameter encompasses a data consistency indicator and a data integrity indicator. For the data consistency index, the system integrates the data consistency index of the distributed storage nodes in a plurality of data acquisition periods. Summarizing the data consistency indexes in each period, and calculating the average value of the data consistency indexes to obtain a consistency average value. This consistency mean can reflect the average level of data consistency over multiple cycles. And then, carrying out proportional processing on the consistency mean value and the total duration of the data acquisition period. Through the proportional processing, the change frequency condition of data consistency in the whole data acquisition time period can be known, and the consistency frequency is obtained. And then, normalizing the consistency frequency and a consistency frequency threshold value. The normalization processing is to unify the data of different orders to one standard scale, so that comparison and analysis are convenient. Through such processing, a consistency evaluation value is finally generated.
For the data integrity index, the system accumulates the data integrity indexes of the distributed storage nodes in a plurality of data acquisition periods to obtain an integrity sum. And then, the total integrity and the number of data acquisition cycles are subjected to proportional processing, so that the average level of the data integrity in each cycle can be calculated, and an integrity average value is obtained. And finally, normalizing the integrity mean value and the total time length of the data acquisition period, thereby generating an integrity evaluation value.
In generating the data quality assessment values, the system performs weighted fusion of the consistency assessment values and the integrity assessment values. And respectively giving different weights to the consistency evaluation value and the integrity evaluation value according to the actual requirements and the importance of the data. And comprehensively considering two factors of data consistency and integrity through weighted calculation, and finally generating a data quality evaluation value capable of comprehensively reflecting the data quality. The generation mode of the evaluation value can evaluate the data quality more scientifically and accurately, and provides a reliable basis for subsequent data processing.
It is assumed that there is a medical data management scenario in which a plurality of different types of medical devices are involved, which store data as heterogeneous data sources in distributed storage nodes. Each device performs data acquisition at regular intervals, for example, every 10 minutes, which is one data acquisition cycle.
In this scenario, the data consistency index is mainly used to measure whether the data collected by different medical devices are consistent in logic and numerical terms. For example, multiple devices measuring the vital signs of a patient should produce similar heart rate measurement data for the same patient; if a large deviation occurs, there is a data consistency problem. The data integrity index pays attention to whether the data is complete: in the medical record information of a patient, for instance, every required field should be filled in completely, and if any required field is missing, the data integrity is incomplete.
After the system starts to operate, over a period of time (covering a plurality of data acquisition periods), the relevant information on the data consistency of the different devices in each period is integrated for the data consistency index. For example, the consistency of the heart rate data of each device is recorded over 10 data acquisition cycles (i.e., 100 minutes). Assuming the consistency index values are 85, 88, 90, 86, 87, 89, 91, 88, 87 and 86 respectively, summing them and dividing the sum by the number of periods, 10, gives a consistency mean of 87.7.
The consistency mean of 87.7 is then proportionally processed against the total duration of the data acquisition cycles, 100 minutes. Here, a value reflecting the consistency variation frequency, i.e. the consistency frequency, may be obtained by a reasonable calculation (the specific calculation is set by the system and no fixed formula is prescribed). Assume that the resulting consistency frequency is 0.877. Then, the consistency frequency of 0.877 is normalized against a preset consistency frequency threshold (assumed to be 0.8). The normalization process measures the consistency frequency and the threshold under the same standard according to rules established by the system, and finally generates a consistency evaluation value. For example, after this processing, a consistency evaluation value of 0.9 is obtained.
For the data integrity indicator, the medical data integrity related values for each cycle are also accumulated over the 10 data acquisition cycles. Assuming that the data integrity index values in each cycle are 90, 88, 92, 89, 91, 90, 87, 88, 90, 93, respectively, these values are accumulated to obtain an integrity sum of 898. And proportional processing is carried out on the integrity sum 898 and the number 10 of the data acquisition cycles, so that the integrity average value is 89.8. And finally, carrying out normalization processing on the integrity mean value 89.8 and the total time length of the data acquisition period for 100 minutes to generate an integrity evaluation value. Assume that the final integrity assessment value is 0.89.
When generating the data quality assessment value, the system can give different weights to the consistency assessment value and the integrity assessment value according to actual conditions. For example, according to the characteristics and importance of medical data, data consistency is considered relatively more important, so the consistency evaluation value is given a weight of 0.6 and the integrity evaluation value a weight of 0.4. Through weighted calculation (a conceptual illustration rather than a prescribed formula), the consistency evaluation value 0.9 is multiplied by 0.6, the integrity evaluation value 0.89 is multiplied by 0.4, and the two products are added, giving a data quality evaluation value of 0.9 × 0.6 + 0.89 × 0.4 = 0.54 + 0.356 = 0.896. This data quality evaluation value comprehensively considers data consistency and integrity, can reflect the data quality condition more comprehensively, and provides a reliable basis for subsequent data processing and analysis.
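The arithmetic of this worked example can be reproduced directly; the variable names are illustrative only.

```python
consistency_indexes = [85, 88, 90, 86, 87, 89, 91, 88, 87, 86]
integrity_indexes = [90, 88, 92, 89, 91, 90, 87, 88, 90, 93]

consistency_mean = sum(consistency_indexes) / len(consistency_indexes)  # 87.7
integrity_total = sum(integrity_indexes)                                # 898
integrity_mean = integrity_total / len(integrity_indexes)               # 89.8
quality_value = 0.9 * 0.6 + 0.89 * 0.4                                  # 0.896
```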
Example 2:
The generation process of the data stability expression value is complex and precise. The system screens historical data acquisition periods in which both the data consistency index and the data integrity index are in the normal range, and marks the periods as stable periods. This step enables the determination of a period of time during which the data quality is relatively stable, providing a reliable data basis for subsequent analysis. Then, the data quality abnormality time points in each stabilization period are sequentially marked in time order so as to analyze the relationship between the time points later. Based on the time points after the serialization marking, the time interval deviation of the adjacent time points is calculated. By this calculation, the time interval change between the abnormal data quality time points can be known. Then, the time interval deviation is compared with a preset interval threshold. If the deviation is less than or equal to the threshold, this time interval is indicated as relatively stable and is marked as a stable interval. Thereafter, the ratio of the number of stable intervals to the total number of intervals is counted. When the ratio exceeds a preset ratio threshold, the interval of the data quality abnormal time points is stable in the stable period, and the stable period is marked as a comprehensive stable period. And finally, counting the proportion of the number of the comprehensive stable periods to the total number of the historical data acquisition periods, wherein the proportion value is the data stable representation value. The data stability representation value can reflect the stability degree of the data in a long time, and provides an important reference for analyzing the characteristic rule of the data.
It is assumed that in an electronic commerce data processing system, there are a plurality of distributed storage nodes for storing sales data of different stores, each data acquisition period being 1 hour. The data consistency index is used for measuring the consistency of various sales data (such as order quantity, sales amount and the like) counted by different shops at the same time in logic and numerical values, and the data integrity index pays attention to whether all necessary filling information (such as commodity name, sales quantity, customer information and the like) in the sales data of the shops is complete or not.
The system has sales data for a plurality of historical data collection periods over a period of time. Firstly, the system screens historical data acquisition periods in which both the data consistency index and the data integrity index are in a normal range. For example, upon examination, during the past 24 data acquisition cycles (i.e., 24 hours), it was found that 15 cycles of data consistency and integrity indicators met the normal criteria, and these 15 cycles were marked as stable cycles.
For each stabilization period, the system will time sequentially mark the time points where the data quality is abnormal. Assuming that two data quality anomalies occur during a certain stabilization period (the 10 th hour acquisition period), the first anomaly occurs at 10:15 and the second anomaly occurs at 10:45, then 10:15 is marked as anomaly time point number 1 and 10:45 is marked as anomaly time point number 2.
The time interval deviation of adjacent time points is calculated based on the time points after the serialization marking. In this example, the time interval between abnormal time points No. 1 and No. 2 is 30 minutes. If the system presets an interval threshold of 40 minutes, then since 30 minutes is less than 40 minutes, this interval is marked as a stable interval.
The system then counts the proportion of the number of stable intervals to the total number of intervals. Across these 15 stable periods there are 20 abnormal time point intervals in total, of which 16 are smaller than or equal to the preset interval threshold, so the ratio of the number of stable intervals to the total number of intervals is 16 ÷ 20 = 0.8. If the preset ratio threshold is 0.7, then since 0.8 is greater than 0.7, the corresponding stable period is marked as a comprehensive stable period.
Finally, the proportion of the number of comprehensive stable periods to the total number of historical data acquisition periods is counted to generate the data stability representation value. Of the 24 historical data acquisition cycles, 12 are marked as comprehensive stable periods, so the data stability representation value is 12 ÷ 24 = 0.5. This value reflects the stability of the intervals between data quality abnormality time points in the electronic commerce sales data over a longer time, and provides an important reference basis for the subsequent analysis of the data characteristic rule.
Example 3:
The generation of the data change representation value is based on the data over all integrated stability periods. The system extracts the interval time of adjacent data quality abnormal time points in all comprehensive stable periods, and collects the interval time to form an interval time set. The set includes a plurality of interval time data reflecting the variation of the intervals of the data quality anomaly time points in the integrated stabilization period. Then, the variance of each group of interval times in the interval time set is calculated and the average value is obtained. The variance can measure the discrete degree of the data, and a numerical value capable of reflecting the fluctuation condition of the interval time change can be obtained by calculating the average value of the variance and is marked as the data change representation value. The data change representation value supplements information of the data characteristic rule from another angle, and provides more comprehensive data support for generating the data characteristic rule value together with the data stability representation value. When generating the data characteristic rule value, the system can linearly combine the data stable expression value and the data change expression value through a preset weight coefficient. And presetting a weight coefficient of a data stable representation value and a data change representation value according to the characteristics of the actual data and analysis requirements. And (3) integrating the two values according to weights through linear combination calculation, and finally generating a data characteristic rule value capable of accurately reflecting the data characteristic rule.
Taking an urban traffic flow monitoring system as an example, a plurality of sensor nodes are distributed in the system and used for collecting traffic flow data of different road sections, the sensor nodes form heterogeneous data sources, data collection is carried out every 15 minutes, and the 15 minutes are a data collection period.
In the process described in this example, the system first determines all integrated stability periods according to the method of example 2. Assuming that, through analysis, 30 integrated stability periods are determined in the monitoring data for a certain day (96 data acquisition periods total).
The system extracts the interval time of adjacent data quality abnormal time points in the 30 comprehensive stable periods to form an interval time set. For example, in one of the integrated stabilization periods, the first occurrence of a data quality anomaly is at 10:15, the second occurrence is at 10:45, the interval is 30 minutes, and in the other integrated stabilization period, the interval between two adjacent data quality anomalies is 25 minutes, etc. All such intervals are collected to form a set containing a plurality of interval data.
The system calculates the variance of each group of interval times in the interval time set and takes the average value. Variance is a statistic that measures the degree of dispersion of a set of data and reflects the fluctuation in these intervals. Assuming that the resulting mean of the variances is 15 (merely exemplary data), this value is marked as the data change representation value. It reflects the fluctuation of the intervals between data quality abnormality time points within the comprehensive stable periods.
When generating the data characteristic rule value, the system linearly combines the data stability representation value and the data change representation value using preset weight coefficients. Suppose that, according to the characteristics and analysis requirements of the urban traffic flow data, the weight coefficient of the data stability representation value is preset to 0.6 and that of the data change representation value to 0.4. If the data stability representation value obtained in Example 2 is 0.7, the linear combination gives 0.7 × 0.6 + 15 × 0.4 = 0.42 + 6 = 6.42, so the final data characteristic rule value is 6.42. This value jointly accounts for the stability and variability of the data, more comprehensively and accurately reflects the characteristic rule of data quality anomalies in the urban traffic flow data, and provides important data support for subsequent traffic flow analysis and system decisions.
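The arithmetic above can be checked directly; the weights and values below are the hypothetical figures from this example:

```python
# Hypothetical figures from this example, not measured data.
w_stable, w_change = 0.6, 0.4
stability_value, change_value = 0.7, 15

# Linear combination by preset weights.
rule_value = w_stable * stability_value + w_change * change_value
print(round(rule_value, 2))  # 6.42
```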
Example 4:
When determining the abnormal influence period within the data fusion period, the system first extracts the starting and ending time points of the data fusion period and explicitly marks them on a time axis. This clearly delimits the time range of the data fusion period and provides a basis for subsequent analysis. Next, the latest data quality abnormality time point before the starting time point of the data fusion period is acquired and taken as a reference point. This reference point is the key reference for predicting subsequent anomaly times.
Then the average of the historical data quality abnormality interval times is calculated to generate an interval reference value: analysis of the historical data yields an average abnormality interval time, which serves as the basis for predicting future anomaly times. Starting from the reference point, predicted anomaly starting points are marked sequentially on the time axis at the interval reference value, estimating when future data quality anomalies may begin. After each predicted anomaly starting point, a preset anomaly duration is superimposed to generate a predicted anomaly ending point; the preset anomaly duration is set from empirical or historical data and determines how long an anomaly is expected to last. Finally, the period between each predicted anomaly starting point and its corresponding predicted anomaly ending point is marked as a predicted anomaly period, and the overlap between the time range of the data fusion period and the predicted anomaly periods is marked as the abnormal influence period. In this way, the periods within the data fusion period that may be affected by data quality anomalies can be accurately determined, providing a precise basis for subsequently adjusting the fusion process.
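One way to sketch this projection procedure, assuming Python `datetime` values and treating the function name and signature as illustrative rather than part of the claimed system:

```python
from datetime import datetime, timedelta

def abnormal_influence_periods(fusion_start, fusion_end, reference_point,
                               interval_ref, anomaly_duration):
    """Project predicted anomaly periods forward from the reference point
    and keep their overlap with the data fusion period."""
    periods = []
    start = reference_point + interval_ref
    while start <= fusion_end:
        end = start + anomaly_duration
        lo = max(start, fusion_start)   # clip to the fusion period
        hi = min(end, fusion_end)
        if lo < hi:
            periods.append((lo, hi))
        start += interval_ref
    return periods

# Illustrative run with the financial example's figures (dates are arbitrary).
day = datetime(2024, 1, 2)
result = abnormal_influence_periods(
    fusion_start=day,
    fusion_end=day + timedelta(hours=23, minutes=59),
    reference_point=day - timedelta(hours=1, minutes=30),  # 22:30 the day before
    interval_ref=timedelta(hours=10),
    anomaly_duration=timedelta(hours=2),
)
for lo, hi in result:
    print(lo.strftime("%H:%M"), "-", hi.strftime("%H:%M"))
```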
Assume a financial transaction data processing system responsible for integrating transaction data from multiple financial institutions. The data are stored in distributed storage nodes, and the data fusion period is set to one day (00:00 to 23:59) for generating a daily comprehensive transaction report.
The system first extracts the start time point (00:00) and the end time point (23:59) of the data fusion period and marks the start time point and the end time point clearly on the time axis. This determines the time frame involved in the data fusion operation on the same day.
The system acquires the last data quality abnormality time point before the starting time point of the data fusion period. Assuming that a data quality anomaly last occurred at 22:30 yesterday (the day before the data fusion period), this 22:30 is taken as the reference point.
The system calculates the mean of the historical data quality abnormality interval times to generate the interval reference value. It collects the data quality abnormality time points over a period of time (e.g., the past 30 days) and computes the intervals between adjacent abnormality time points. Suppose these intervals are 12 hours, 8 hours, 10 hours, and so on; averaging them yields a mean interval of 10 hours, and this 10 hours is the interval reference value.
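For illustration only, if the historical interval set consisted of exactly the three values quoted (the example's "etc." elides the rest of the set), the averaging step would be:

```python
# Hypothetical three-value history; the real set would be larger.
intervals_hours = [12, 8, 10]
interval_reference = sum(intervals_hours) / len(intervals_hours)
print(interval_reference)  # 10.0
```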
Predicted anomaly starting points are then marked sequentially on the time axis, starting from the reference point 22:30 at intervals of 10 hours. The first predicted anomaly starting point is 22:30 plus 10 hours, i.e., 8:30 today; the second is 8:30 plus 10 hours, i.e., 18:30 (times wrap around the 24-hour clock).
And superposing a preset abnormality duration after each predicted abnormality starting point to generate a predicted abnormality ending point. Assuming that the preset anomaly duration is 2 hours, the predicted anomaly termination point corresponding to the first predicted anomaly start point 8:30 is 8:30 plus 2 hours, i.e., 10:30, and the predicted anomaly termination point corresponding to the second predicted anomaly start point 18:30 is 20:30.
The period between each predicted anomaly starting point and its corresponding predicted anomaly ending point is marked as a predicted anomaly period, i.e., 8:30-10:30 and 18:30-20:30 are the predicted anomaly periods. The overlap between the time range of the data fusion period (00:00-23:59 of the current day) and the predicted anomaly periods is then marked as the abnormal influence period. In this example, both 8:30-10:30 and 18:30-20:30 fall within the day's data fusion period, so both periods are abnormal influence periods. In this way, the system can accurately determine the periods within the data fusion period that may be affected by data quality anomalies, providing a precise basis for subsequently adjusting the fusion process and ensuring the reliable data quality of the generated daily comprehensive transaction report.
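The same projection can be checked with plain minute arithmetic; this is a self-contained sketch whose variable names and time encoding (minutes relative to 00:00 of the fusion day) are illustrative assumptions:

```python
# Times in minutes relative to 00:00 of the fusion day.
reference = -90                      # 22:30 the previous day
interval, duration = 600, 120        # 10 h interval, 2 h anomaly duration
fusion_start, fusion_end = 0, 23 * 60 + 59

periods = []
start = reference + interval
while start <= fusion_end:
    end = start + duration
    # Keep only the overlap with the fusion period.
    lo, hi = max(start, fusion_start), min(end, fusion_end)
    if lo < hi:
        periods.append((lo, hi))
    start += interval

fmt = lambda m: f"{m // 60:02d}:{m % 60:02d}"
print([f"{fmt(lo)}-{fmt(hi)}" for lo, hi in periods])  # ['08:30-10:30', '18:30-20:30']
```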
Example 5:
When the system executes a data fusion task, it acquires the current fusion time node; this time node is the key basis for deciding whether to continue the fusion operation. The system matches the current fusion time node against the abnormal influence periods. If the current fusion time node falls within an abnormal influence period, the data fusion may be affected by data quality anomalies, so to safeguard the quality of the fused data the system generates a suspension fusion instruction and suspends the data fusion operation. If the current fusion time node falls outside the abnormal influence periods, the system generates an instruction to be matched. Based on this instruction, the system locates the predicted anomaly starting point closest to the current fusion time node on the time axis and calculates the interval duration between the two. If the interval duration is greater than or equal to a preset safety threshold, a fusion operation performed at the current fusion time node will remain unaffected by data quality anomalies for some time, so the system generates a continuous fusion instruction and allows the data fusion operation to continue. Otherwise, if the interval duration is less than the preset safety threshold, the current fusion time node may be close to a predicted anomaly time; to avoid this risk, the system generates a suspension fusion instruction and suspends the data fusion operation. Through this matching and judging mechanism, the system can dynamically adjust the fusion flow of the heterogeneous data sources according to actual conditions, effectively improving the quality and reliability of data fusion.
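The matching and judging mechanism above can be sketched as a small decision function; the name, signature, and minute-based time encoding are assumptions of the sketch:

```python
def fusion_decision(now, influence_periods, predicted_starts, safety_threshold):
    """Decide 'pause' or 'continue' for the current fusion time node.
    All times are minutes since midnight."""
    # Inside an abnormal influence period: suspend fusion immediately.
    if any(lo <= now <= hi for lo, hi in influence_periods):
        return "pause"
    # Otherwise compare the gap to the nearest upcoming predicted anomaly start.
    upcoming = [s for s in predicted_starts if s >= now]
    if upcoming and min(upcoming) - now < safety_threshold:
        return "pause"
    return "continue"
```

A usage sketch with the figures of the next example: `fusion_decision(14 * 60 + 20, [(14 * 60, 14 * 60 + 30)], [16 * 60], 30)` yields a pause, while the same call at `15 * 60` yields a continue.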
Assume a data fusion system of an online education platform that integrates data from a plurality of data sources, such as student learning progress data and course feedback data, stored in different distributed storage nodes. The data fusion period is set to once per hour so as to generate a real-time analysis report of learning conditions.
When the system executes the data fusion task, the current fusion time node is acquired. For example, the current fusion time node is 14:20.
The system matches the current fusion time node against the previously determined abnormal influence periods. Suppose the method of Example 4 determines 14:00-14:30 to be an abnormal influence period. Because 14:20 falls within the abnormal influence period 14:00-14:30, the system generates a suspension fusion instruction and stops the current data fusion operation, avoiding the effects of data quality anomalies and ensuring the accuracy of the learning condition analysis report.
If the current fusion time node is not within the abnormal influence period, for example, the current fusion time node is 15:00, and the abnormal influence period is 14:00-14:30, the system generates an instruction to be matched. Based on the instruction to be matched, the system locates the predicted abnormal starting point closest to the current fusion time node on the time axis. Assume that the nearest predicted anomaly starting point to 15:00 is 16:00 based on previous calculations and predictions.
The system then calculates the interval duration between the two: interval duration = predicted anomaly starting point time − current fusion time node time, where the predicted anomaly starting point time is the predicted anomaly start nearest to the current fusion time node, and the current fusion time node time is the time point acquired when the system performs the data fusion task. In this example, 16:00 converts to 16 × 60 = 960 minutes and 15:00 to 15 × 60 = 900 minutes, so the interval duration = 960 − 900 = 60 minutes.
Suppose the preset safety threshold is 30 minutes. Because 60 minutes is greater than 30 minutes, the system generates a continuous fusion instruction and allows the data fusion operation to continue at 15:00, so that an accurate learning condition analysis report can be generated in time to support teaching decisions.
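The final threshold comparison in this example reduces to simple arithmetic; the figures below are the example's own:

```python
# Figures from this example, in minutes since midnight.
now, next_anomaly_start, safety_threshold = 15 * 60, 16 * 60, 30
gap = next_anomaly_start - now
decision = "continue" if gap >= safety_threshold else "pause"
print(gap, decision)  # 60 continue
```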
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.