Disclosure of Invention
The invention aims to provide a heterogeneous data source fusion analysis system so as to solve the problems in the background art.
In order to achieve the above purpose, the invention provides a heterogeneous data source fusion analysis system, which comprises:
the data integration module is used for acquiring data quality parameters of distributed storage nodes associated with heterogeneous data sources in a plurality of data acquisition periods, carrying out multidimensional analysis on the data quality parameters to generate data quality evaluation values, judging the data fusion reliability of the distributed storage nodes based on the data quality evaluation values, and generating priority instructions, wherein the instructions comprise high-priority adjustment instructions and low-priority adjustment instructions;
comparing the data quality assessment value with a data quality assessment threshold;
if the data quality evaluation value is larger than the data quality evaluation threshold value, generating a high-priority adjustment instruction;
if the data quality evaluation value is smaller than or equal to the data quality evaluation threshold value, generating a low-priority adjustment instruction;
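The threshold comparison above can be sketched in code. This is a minimal illustration only; the function name, the instruction strings, and the example threshold of 0.8 are assumptions for demonstration, not values prescribed by the invention.

```python
# Illustrative sketch of the priority-instruction decision (assumed names/threshold).
QUALITY_THRESHOLD = 0.8  # assumed data quality evaluation threshold

def priority_instruction(quality_score):
    """Map a data quality evaluation value to an adjustment instruction."""
    if quality_score > QUALITY_THRESHOLD:
        return "HIGH_PRIORITY_ADJUST"
    return "LOW_PRIORITY_ADJUST"
```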
The characteristic analysis module is used for extracting time nodes with abnormal data quality of the distributed storage nodes in a plurality of data acquisition periods based on the high-priority adjustment instruction, carrying out association analysis on the time nodes and the data quality parameters to generate a data characteristic rule value, judging whether the data quality abnormality presents periodic characteristics based on the data characteristic rule value, and generating a rule adjustment instruction if the periodic characteristics exist;
the time node with abnormal data quality comprises a specific time point when the abnormality occurs;
And the fusion evaluation module is used for determining an abnormal influence period in the data fusion period based on the rule adjustment instruction, matching the abnormal influence period with the current fusion time node, dynamically adjusting the fusion flow of the heterogeneous data source according to the matching result, and generating an adjustment instruction, wherein the adjustment instruction comprises a pause fusion instruction and a continuous fusion instruction.
Preferably, the data quality parameter comprises a data consistency index and a data integrity index;
Based on independent analysis of the data consistency index and the data integrity index, respectively generating a consistency evaluation value and an integrity evaluation value;
and carrying out weighted fusion on the consistency evaluation value and the integrity evaluation value to generate a data quality evaluation value.
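The weighted fusion step can be sketched as follows. The default weights of 0.6 and 0.4 mirror the worked example given later in the description; they are illustrative assumptions, not fixed parameters of the system.

```python
def quality_evaluation(consistency_value, integrity_value,
                       w_consistency=0.6, w_integrity=0.4):
    """Weighted fusion of the consistency and integrity evaluation values
    into a single data quality evaluation value (weights are illustrative)."""
    return consistency_value * w_consistency + integrity_value * w_integrity
```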
Preferably, data consistency indexes of the distributed storage nodes in a plurality of data acquisition periods are integrated and an average value is calculated, a consistency average value is generated, the consistency average value and the total duration of the data acquisition periods are processed in proportion to obtain consistency frequency, and the consistency frequency and a consistency frequency threshold value are normalized to generate a consistency evaluation value.
Preferably, the data integrity indexes of the distributed storage nodes in a plurality of data acquisition periods are accumulated to generate an integrity sum, the integrity sum and the number of the data acquisition periods are processed in proportion to obtain an integrity average value, and the integrity average value and the total duration of the data acquisition periods are normalized to generate an integrity evaluation value.
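Since the specification leaves the proportional and normalization steps open ("set by the system"), the sketch below adopts one plausible choice, a ratio clamped to [0, 1]; both functions and their scaling rules are assumptions, not the claimed formulas.

```python
def consistency_evaluation(indexes, total_minutes, freq_threshold):
    """Consistency mean -> consistency frequency -> normalized evaluation value."""
    mean = sum(indexes) / len(indexes)           # consistency mean
    frequency = mean / total_minutes             # assumed proportional processing
    return min(frequency / freq_threshold, 1.0)  # assumed normalization, clamped

def integrity_evaluation(indexes, total_minutes):
    """Integrity sum -> integrity mean -> normalized evaluation value."""
    total = sum(indexes)                         # integrity sum
    mean = total / len(indexes)                  # integrity mean
    return min(mean / total_minutes, 1.0)        # assumed normalization by duration
```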
Preferably, the generation mode of the data characteristic rule value is as follows:
Generating a data stability representation value and a data change representation value;
and linearly combining the data stable representation value and the data change representation value through a preset weight coefficient to generate a data characteristic rule value.
Preferably, the data stability representation value is generated by:
screening historical data acquisition periods of which the data consistency index and the data integrity index are in normal ranges, marking the historical data acquisition periods as stable periods, and carrying out serialization marking on data quality abnormality time points in each stable period according to a time sequence;
Calculating the time interval deviation of adjacent time points based on the time points after the serialization marking;
comparing the time interval deviation with a preset interval threshold, and marking the time interval deviation as a stable interval if the time interval deviation is smaller than or equal to the threshold;
counting the proportion of the number of the stable intervals to the total number of intervals, and marking the stable period as a comprehensive stable period if the proportion exceeds a preset proportion threshold;
and counting the proportion of the number of the comprehensive stable periods to the total number of the historical data acquisition periods, and generating a data stable representation value.
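The screening and counting steps above can be sketched as follows. Time points are plain numbers (e.g. minutes into a period), and the treatment of stable periods containing fewer than two anomaly points is an assumption, since the specification does not address that case.

```python
def is_comprehensive_stable(anomaly_times, interval_threshold, ratio_threshold):
    """Check whether one stable period qualifies as a comprehensive stable period."""
    times = sorted(anomaly_times)                # serialization marking in time order
    intervals = [b - a for a, b in zip(times, times[1:])]
    if not intervals:
        return True                              # assumed: no adjacent pairs counts as stable
    stable = sum(1 for gap in intervals if gap <= interval_threshold)
    return stable / len(intervals) > ratio_threshold

def stability_value(stable_periods, total_periods, interval_threshold, ratio_threshold):
    """Proportion of comprehensive stable periods among all historical periods."""
    comprehensive = sum(
        1 for times in stable_periods
        if is_comprehensive_stable(times, interval_threshold, ratio_threshold)
    )
    return comprehensive / total_periods
```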
Preferably, the data change representation value is generated by:
extracting interval time of adjacent data quality abnormal time points in all comprehensive stable periods to form an interval time set;
and calculating the variance of each group of interval time in the interval time set, calculating an average value, generating a variance average value, and marking the variance average value as a data change representation value.
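A sketch of the variance mean computation; population variance is assumed, since the specification does not distinguish population from sample variance.

```python
def variance(xs):
    """Population variance of one group of interval times (assumed interpretation)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def change_value(interval_groups):
    """Mean of the per-group variances: the data change representation value."""
    return sum(variance(g) for g in interval_groups) / len(interval_groups)
```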
Preferably, comparing the characteristic rule value of the data with a characteristic rule threshold value;
If the characteristic rule value of the data is larger than the characteristic rule threshold value, a rule adjustment instruction is generated;
if the characteristic rule value of the data is smaller than or equal to the characteristic rule threshold value, no operation is triggered.
Preferably, the abnormal influence period in the data fusion period is determined in the following manner:
extracting a starting time point and a terminating time point of a data fusion period, and marking on a time axis;
Acquiring the latest data quality abnormality time point before the starting time point of the data fusion period, and taking the latest data quality abnormality time point as a reference point;
calculating the average value of the historical data quality abnormal interval time, generating an interval reference value, taking the reference point as a starting point, and marking the predicted abnormal starting point on a time axis according to the interval reference value in sequence;
superposing preset abnormal duration time after each predicted abnormal starting point to generate a predicted abnormal ending point;
and marking the time period between the predicted abnormal starting point and the corresponding predicted abnormal ending point as a predicted abnormal time period, and marking the overlapping part of the time range of the data fusion period and the predicted abnormal time period as an abnormal influence time period.
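The marking steps above can be sketched as follows. Times are plain numbers on a shared axis (e.g. minutes), the periods are half-open tuples, and all names are illustrative assumptions.

```python
def abnormal_influence_periods(fusion_start, fusion_end,
                               reference_point, interval_ref, anomaly_duration):
    """Mark predicted anomaly windows on the time axis and intersect them
    with the data fusion period to obtain the abnormal influence periods."""
    periods = []
    start = reference_point + interval_ref   # first predicted abnormal starting point
    while start <= fusion_end:
        end = start + anomaly_duration       # predicted abnormal ending point
        overlap = (max(start, fusion_start), min(end, fusion_end))
        if overlap[0] < overlap[1]:          # nonempty overlap with the fusion period
            periods.append(overlap)
        start += interval_ref                # next predicted abnormal starting point
    return periods
```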
Preferably, when the data fusion task is executed, the current fusion time node is obtained and matched with the abnormal influence period;
if the current fusion time node is positioned in the abnormal influence period, generating a suspension fusion instruction;
If the current fusion time node is located outside the abnormal influence period, generating an instruction to be matched;
Based on the instruction to be matched, positioning a predicted abnormal starting point closest to the current fusion time node on a time axis, and calculating the interval duration between the two;
If the interval time length is greater than or equal to a preset safety threshold value, generating a continuous fusion instruction;
If the interval duration is smaller than the preset safety threshold, generating a pause fusion instruction.
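The matching logic of this paragraph can be sketched as one decision function; the instruction strings and the use of absolute distance to locate the nearest predicted abnormal starting point are assumptions.

```python
def fusion_instruction(now, influence_periods, predicted_starts, safety_threshold):
    """Decide whether to pause or continue fusion at time `now`."""
    # current fusion time node inside an abnormal influence period: pause
    if any(start <= now <= end for start, end in influence_periods):
        return "PAUSE_FUSION"
    # instruction-to-be-matched branch: gap to nearest predicted abnormal start
    nearest_gap = min(abs(now - s) for s in predicted_starts)
    if nearest_gap >= safety_threshold:
        return "CONTINUE_FUSION"
    return "PAUSE_FUSION"
```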
Compared with the prior art, the invention has the beneficial effects that:
At the data quality evaluation level, the system acquires data quality parameters, such as data consistency indexes and data integrity indexes, of distributed storage nodes associated with heterogeneous data sources in a plurality of data acquisition periods and performs multidimensional analysis on the data quality parameters. Firstly, respectively generating a consistency evaluation value and an integrity evaluation value, and then, obtaining a data quality evaluation value through weighted fusion. Compared with the traditional simple judgment method, the method fully considers different characteristics of the data and avoids the unilateral performance of single index evaluation. Taking data processing in the financial industry as an example, in the fusion analysis of customer asset data and transaction data, accurate quality assessment can ensure the accuracy of the data, reduce risk assessment errors caused by data errors, provide reliable basis for the decision of financial institutions, and reduce potential economic losses.
Based on the data quality assessment results, the system can generate priority instructions. A high-priority adjustment instruction is generated when the data quality evaluation value is larger than the data quality evaluation threshold, and a low-priority adjustment instruction is generated when the evaluation value is smaller than or equal to the threshold. The system can thus adopt different processing strategies in time according to the data quality condition and allocate system resources reasonably. When an electronic commerce platform processes a large amount of commodity data, fusion analysis is carried out on high-quality data preferentially, accurate sales trend reports are output rapidly to help merchants adjust operation strategies in time, and low-quality data is scheduled for processing when resources are relatively idle, thereby improving the overall operation efficiency of the system.
The feature analysis module further mines the data value. Based on a high-priority adjustment instruction, it extracts the time nodes of data quality abnormalities, carries out association analysis combining the time nodes and the data quality parameters, generates a data characteristic rule value, and judges whether the data quality abnormality presents a periodic characteristic; if so, a rule adjustment instruction is generated. This function helps to gain in-depth knowledge of the intrinsic rules of the data and to discover potential problems in advance. In power system monitoring, the time of equipment failure can be predicted by analyzing the periodic characteristics of abnormal power data quality, maintenance work can be arranged in advance, power failure accidents caused by equipment failure are reduced, the stability of power supply is ensured, and the maintenance cost is reduced.
The fusion evaluation module determines an abnormal influence period in a data fusion period according to the rule adjustment instruction, matches the abnormal influence period with a current fusion time node, dynamically adjusts the fusion process of the heterogeneous data source, and generates a pause fusion instruction or a continuous fusion instruction. The dynamic adjustment mechanism effectively avoids fusion operation in the data quality abnormal period, and ensures the accuracy of fusion results. In the meteorological data fusion analysis, when the meteorological sensor data is abnormal, the system pauses fusion, so that the error data is prevented from being mixed into an analysis result, the reliability of meteorological prediction is ensured, and accurate meteorological information is provided for the production and life of people.
The data stability expression value and the generation mode of the data change expression value characterize the data from different angles. The data stability representation value is obtained through the steps of screening a stability period, analyzing time interval deviation and the like, the stability of the data is reflected, and the data change representation value is generated through calculating an interval time variance mean value, so that the change condition of the data is reflected. The data characteristic rule value generated by the linear combination of the two provides a comprehensive basis for judging the abnormal characteristics of the data quality. In the logistics data processing, the feature analysis can help enterprises to better master the rule of logistics distribution, discover abnormal fluctuation in time, optimize logistics routes and distribution plans, improve logistics efficiency and reduce operation cost.
The method for determining the abnormal influence time period in the data fusion period comprehensively considers factors such as the average value of the quality abnormal interval time of the historical data, so that the determination of the abnormal time period is more scientific and reasonable. In practical application, when the data fusion task is executed, the system can accurately make a decision according to the matching condition of the current fusion time node and the abnormal influence period, and the adaptability and the reliability of the system are improved. In the industrial production process monitoring, the function can ensure the accuracy of production data fusion analysis, discover abnormal conditions in the production process in time, ensure the smooth production and improve the product quality.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-5, the present invention provides a heterogeneous data source fusion analysis system, which comprises the following specific implementation steps:
The data integration module continuously acquires data quality parameters of the distributed storage nodes associated with the heterogeneous data sources in a plurality of data acquisition periods during system operation. These data quality parameters play a key role in assessing data quality. After the data quality parameters are acquired, the data integration module performs multidimensional analysis on them and, by comprehensively considering the factors of each dimension, generates a data quality evaluation value capable of accurately reflecting the data quality condition. Based on the data quality evaluation value, the data integration module can judge the data fusion reliability of the distributed storage nodes. Specifically, the data quality evaluation value is compared with a preset data quality evaluation threshold. If the data quality evaluation value is larger than the data quality evaluation threshold, the data quality is better and the fusion reliability is higher, and a high-priority adjustment instruction is generated; otherwise, if the data quality evaluation value is smaller than or equal to the data quality evaluation threshold, a low-priority adjustment instruction is generated. These instructions provide a basis for the prioritization of data processing by the subsequent system.
The feature analysis module starts working after receiving the high priority adjustment instruction. The method can extract time nodes with abnormal data quality of the distributed storage nodes in a plurality of data acquisition periods, and the time nodes are accurate to the specific moment when the abnormality occurs. Then, association analysis is performed by combining the time nodes and the data quality parameters. Through analysis and processing of a large amount of data, data characteristic rule values are generated. After the data characteristic rule value is obtained, the characteristic analysis module can judge whether the data quality abnormality presents periodic characteristics. If the data characteristic rule value shows that the periodic characteristic exists, a rule adjustment instruction is generated, and a basis is provided for adjustment of a fusion process by a subsequent fusion evaluation module.
And after receiving the rule adjustment instruction, the fusion evaluation module determines an abnormal influence period in the data fusion period. The abnormal influence time period is matched with the current fusion time node, and the fusion flow of the heterogeneous data source is dynamically adjusted according to the matching result. If the current fusion time node is outside the abnormal influence period, whether the fusion can be continued or not is further judged, and a continuous fusion instruction or other related instructions are generated according to different conditions.
The technical scheme of the invention is further described in detail below with reference to specific embodiments.
Example 1:
In the system, the data quality parameter encompasses a data consistency indicator and a data integrity indicator. For the data consistency index, the system integrates the data consistency index of the distributed storage nodes in a plurality of data acquisition periods. Summarizing the data consistency indexes in each period, and calculating the average value of the data consistency indexes to obtain a consistency average value. This consistency mean can reflect the average level of data consistency over multiple cycles. And then, carrying out proportional processing on the consistency mean value and the total duration of the data acquisition period. Through the proportional processing, the change frequency condition of data consistency in the whole data acquisition time period can be known, and the consistency frequency is obtained. And then, normalizing the consistency frequency and a consistency frequency threshold value. The normalization processing is to unify the data of different orders to one standard scale, so that comparison and analysis are convenient. Through such processing, a consistency evaluation value is finally generated.
For the data integrity index, the system accumulates the data integrity indexes of the distributed storage nodes in a plurality of data acquisition periods to obtain an integrity sum. And then, the total integrity and the number of data acquisition cycles are subjected to proportional processing, so that the average level of the data integrity in each cycle can be calculated, and an integrity average value is obtained. And finally, normalizing the integrity mean value and the total time length of the data acquisition period, thereby generating an integrity evaluation value.
In generating the data quality assessment values, the system performs weighted fusion of the consistency assessment values and the integrity assessment values. And respectively giving different weights to the consistency evaluation value and the integrity evaluation value according to the actual requirements and the importance of the data. And comprehensively considering two factors of data consistency and integrity through weighted calculation, and finally generating a data quality evaluation value capable of comprehensively reflecting the data quality. The generation mode of the evaluation value can evaluate the data quality more scientifically and accurately, and provides a reliable basis for subsequent data processing.
It is assumed that there is a medical data management scenario in which a plurality of different types of medical devices are involved, which store data as heterogeneous data sources in distributed storage nodes. Each device performs data acquisition at regular intervals, for example, every 10 minutes, which is one data acquisition cycle.
In this scenario, the data consistency index is mainly used to measure whether the data collected by different medical devices are consistent in logic and numerical terms. For example, multiple devices measuring the vital signs of a patient should produce similar heart rate measurement data for the same patient; if a large deviation occurs, there is a data consistency problem. The data integrity index pays attention to whether the data is complete: in the medical record information of a patient, for instance, every required field should be filled in completely, and if any required field is missing, the data integrity is incomplete.
After the system starts to operate, over a period of time (covering a plurality of data acquisition periods), the relevant information on the data consistency of the different devices in each period is integrated for the data consistency index. For example, the consistency of the heart rate data of each device is recorded over 10 data acquisition cycles (i.e., 100 minutes). Assuming the consistency index values are 85, 88, 90, 86, 87, 89, 91, 88, 87 and 86 respectively, summing them and dividing the sum by the number of periods, 10, gives a consistency mean of 87.7.
The consistency mean of 87.7 is then proportionally processed against the total duration of the data acquisition cycles, 100 minutes. Here, a value reflecting the consistency variation frequency, i.e. the consistency frequency, may be obtained by a reasonable calculation (the specific calculation is set by the system and no fixed formula is prescribed). Assume that the resulting consistency frequency is 0.877. Then, the consistency frequency of 0.877 is normalized against a preset consistency frequency threshold (assumed to be 0.8). The normalization process measures the consistency frequency and the threshold under the same standard according to rules established by the system, and finally generates a consistency evaluation value. For example, after this processing, a consistency evaluation value of 0.9 is obtained.
For the data integrity indicator, the medical data integrity related values for each cycle are also accumulated over the 10 data acquisition cycles. Assuming that the data integrity index values in each cycle are 90, 88, 92, 89, 91, 90, 87, 88, 90, 93, respectively, these values are accumulated to obtain an integrity sum of 898. And proportional processing is carried out on the integrity sum 898 and the number 10 of the data acquisition cycles, so that the integrity average value is 89.8. And finally, carrying out normalization processing on the integrity mean value 89.8 and the total time length of the data acquisition period for 100 minutes to generate an integrity evaluation value. Assume that the final integrity assessment value is 0.89.
When generating the data quality assessment value, the system can give different weights to the consistency assessment value and the integrity assessment value according to actual conditions. For example, according to the characteristics and importance of medical data, data consistency is considered relatively more important, so the consistency evaluation value is given a weight of 0.6 and the integrity evaluation value a weight of 0.4. Through weighted calculation (a conceptual illustration rather than a prescribed formula), the consistency evaluation value 0.9 is multiplied by 0.6, the integrity evaluation value 0.89 is multiplied by 0.4, and the two products are added, giving a data quality evaluation value of 0.9 × 0.6 + 0.89 × 0.4 = 0.54 + 0.356 = 0.896. This data quality evaluation value comprehensively considers data consistency and integrity, can reflect the data quality condition more comprehensively, and provides a reliable basis for subsequent data processing and analysis.
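The arithmetic of this worked example can be reproduced directly; the variable names are illustrative only.

```python
consistency_indexes = [85, 88, 90, 86, 87, 89, 91, 88, 87, 86]
integrity_indexes = [90, 88, 92, 89, 91, 90, 87, 88, 90, 93]

consistency_mean = sum(consistency_indexes) / len(consistency_indexes)  # 87.7
integrity_total = sum(integrity_indexes)                                # 898
integrity_mean = integrity_total / len(integrity_indexes)               # 89.8
quality_value = 0.9 * 0.6 + 0.89 * 0.4                                  # 0.896
```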
Example 2:
The generation process of the data stability expression value is complex and precise. The system screens historical data acquisition periods in which both the data consistency index and the data integrity index are in the normal range, and marks the periods as stable periods. This step enables the determination of a period of time during which the data quality is relatively stable, providing a reliable data basis for subsequent analysis. Then, the data quality abnormality time points in each stabilization period are sequentially marked in time order so as to analyze the relationship between the time points later. Based on the time points after the serialization marking, the time interval deviation of the adjacent time points is calculated. By this calculation, the time interval change between the abnormal data quality time points can be known. Then, the time interval deviation is compared with a preset interval threshold. If the deviation is less than or equal to the threshold, this time interval is indicated as relatively stable and is marked as a stable interval. Thereafter, the ratio of the number of stable intervals to the total number of intervals is counted. When the ratio exceeds a preset ratio threshold, the interval of the data quality abnormal time points is stable in the stable period, and the stable period is marked as a comprehensive stable period. And finally, counting the proportion of the number of the comprehensive stable periods to the total number of the historical data acquisition periods, wherein the proportion value is the data stable representation value. The data stability representation value can reflect the stability degree of the data in a long time, and provides an important reference for analyzing the characteristic rule of the data.
It is assumed that in an electronic commerce data processing system, there are a plurality of distributed storage nodes for storing sales data of different stores, each data acquisition period being 1 hour. The data consistency index is used for measuring the consistency of various sales data (such as order quantity, sales amount and the like) counted by different shops at the same time in logic and numerical values, and the data integrity index pays attention to whether all necessary filling information (such as commodity name, sales quantity, customer information and the like) in the sales data of the shops is complete or not.
The system has sales data for a plurality of historical data collection periods over a period of time. Firstly, the system screens historical data acquisition periods in which both the data consistency index and the data integrity index are in a normal range. For example, upon examination, during the past 24 data acquisition cycles (i.e., 24 hours), it was found that 15 cycles of data consistency and integrity indicators met the normal criteria, and these 15 cycles were marked as stable cycles.
For each stabilization period, the system will time sequentially mark the time points where the data quality is abnormal. Assuming that two data quality anomalies occur during a certain stabilization period (the 10 th hour acquisition period), the first anomaly occurs at 10:15 and the second anomaly occurs at 10:45, then 10:15 is marked as anomaly time point number 1 and 10:45 is marked as anomaly time point number 2.
The time interval deviation of adjacent time points is calculated based on the time points after the serialization marking. In this example, the time interval between abnormal time points No. 1 and No. 2 is 30 minutes. If the system presets an interval threshold of 40 minutes, then since 30 minutes is less than 40 minutes, this interval is marked as a stable interval.
The system then counts the proportion of the number of stable intervals to the total number of intervals. Across these 15 stable periods there are 20 abnormal time point intervals in total, of which 16 are smaller than or equal to the preset interval threshold, so the ratio of the number of stable intervals to the total number of intervals is 16 ÷ 20 = 0.8. If the preset ratio threshold is 0.7, then since 0.8 is greater than 0.7, the corresponding stable period is marked as a comprehensive stable period.
Finally, the proportion of the number of comprehensive stable periods to the total number of historical data acquisition periods is counted to generate the data stability representation value. Of the 24 historical data acquisition cycles, 12 are marked as comprehensive stable periods, so the data stability representation value is 12 ÷ 24 = 0.5. This value reflects the stability of the intervals between data quality abnormality time points in the electronic commerce sales data over a longer time, and provides an important reference basis for the subsequent analysis of the data characteristic rule.
Example 3:
The generation of the data change representation value is based on the data over all integrated stability periods. The system extracts the interval time of adjacent data quality abnormal time points in all comprehensive stable periods, and collects the interval time to form an interval time set. The set includes a plurality of interval time data reflecting the variation of the intervals of the data quality anomaly time points in the integrated stabilization period. Then, the variance of each group of interval times in the interval time set is calculated and the average value is obtained. The variance can measure the discrete degree of the data, and a numerical value capable of reflecting the fluctuation condition of the interval time change can be obtained by calculating the average value of the variance and is marked as the data change representation value. The data change representation value supplements information of the data characteristic rule from another angle, and provides more comprehensive data support for generating the data characteristic rule value together with the data stability representation value. When generating the data characteristic rule value, the system can linearly combine the data stable expression value and the data change expression value through a preset weight coefficient. And presetting a weight coefficient of a data stable representation value and a data change representation value according to the characteristics of the actual data and analysis requirements. And (3) integrating the two values according to weights through linear combination calculation, and finally generating a data characteristic rule value capable of accurately reflecting the data characteristic rule.
Taking an urban traffic flow monitoring system as an example, a plurality of sensor nodes are distributed in the system and used for collecting traffic flow data of different road sections, the sensor nodes form heterogeneous data sources, data collection is carried out every 15 minutes, and the 15 minutes are a data collection period.
In the process described in this example, the system first determines all integrated stability periods according to the method of example 2. Assuming that, through analysis, 30 integrated stability periods are determined in the monitoring data for a certain day (96 data acquisition periods total).
The system extracts the interval time of adjacent data quality abnormal time points in the 30 comprehensive stable periods to form an interval time set. For example, in one of the integrated stabilization periods, the first occurrence of a data quality anomaly is at 10:15, the second occurrence is at 10:45, the interval is 30 minutes, and in the other integrated stabilization period, the interval between two adjacent data quality anomalies is 25 minutes, etc. All such intervals are collected to form a set containing a plurality of interval data.
The system calculates the variance of each group of interval times in the interval time set and takes the average value. Variance is a statistic that measures the degree of dispersion of a set of data and reflects the fluctuation in these intervals. Assuming that the resulting mean of the variances is 15 (merely exemplary data), this value is marked as the data change representation value. It reflects the fluctuation of the intervals between data quality abnormality time points within the comprehensive stable periods.
When generating the data characteristic rule value, the system linearly combines the data stability representation value and the data change representation value using preset weight coefficients. Suppose that, according to the characteristics and analysis requirements of the urban traffic flow data, the weight coefficient of the data stability representation value is preset to 0.6 and that of the data change representation value to 0.4. If the data stability representation value obtained in Example 2 is 0.7, the linear combination gives 0.7 × 0.6 + 15 × 0.4 = 0.42 + 6 = 6.42, so the final data characteristic rule value is 6.42. This value jointly accounts for the stability and variability of the data, more comprehensively and accurately reflects the characteristic rule of data quality anomalies in the urban traffic flow data, and provides important data support for subsequent traffic flow analysis and system decisions.
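The arithmetic above can be checked directly; the weights and values below are the hypothetical figures from this example:

```python
# Hypothetical figures from this example, not measured data.
w_stable, w_change = 0.6, 0.4
stability_value, change_value = 0.7, 15

# Linear combination by preset weights.
rule_value = w_stable * stability_value + w_change * change_value
print(round(rule_value, 2))  # 6.42
```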
Example 4:
When determining the abnormal influence period within the data fusion period, the system first extracts the starting and ending time points of the data fusion period and explicitly marks them on a time axis. This clearly delimits the time range of the data fusion period and provides a basis for subsequent analysis. Next, the latest data quality abnormality time point before the starting time point of the data fusion period is acquired and taken as a reference point. This reference point is the key reference for predicting subsequent anomaly times.
Then the average of the historical data quality abnormality interval times is calculated to generate an interval reference value: analysis of the historical data yields an average abnormality interval time, which serves as the basis for predicting future anomaly times. Starting from the reference point, predicted anomaly starting points are marked sequentially on the time axis at the interval reference value, estimating when future data quality anomalies may begin. After each predicted anomaly starting point, a preset anomaly duration is superimposed to generate a predicted anomaly ending point; the preset anomaly duration is set from empirical or historical data and determines how long an anomaly is expected to last. Finally, the period between each predicted anomaly starting point and its corresponding predicted anomaly ending point is marked as a predicted anomaly period, and the overlap between the time range of the data fusion period and the predicted anomaly periods is marked as the abnormal influence period. In this way, the periods within the data fusion period that may be affected by data quality anomalies can be accurately determined, providing a precise basis for subsequently adjusting the fusion process.
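One way to sketch this projection procedure, assuming Python `datetime` values and treating the function name and signature as illustrative rather than part of the claimed system:

```python
from datetime import datetime, timedelta

def abnormal_influence_periods(fusion_start, fusion_end, reference_point,
                               interval_ref, anomaly_duration):
    """Project predicted anomaly periods forward from the reference point
    and keep their overlap with the data fusion period."""
    periods = []
    start = reference_point + interval_ref
    while start <= fusion_end:
        end = start + anomaly_duration
        lo = max(start, fusion_start)   # clip to the fusion period
        hi = min(end, fusion_end)
        if lo < hi:
            periods.append((lo, hi))
        start += interval_ref
    return periods

# Illustrative run with the financial example's figures (dates are arbitrary).
day = datetime(2024, 1, 2)
result = abnormal_influence_periods(
    fusion_start=day,
    fusion_end=day + timedelta(hours=23, minutes=59),
    reference_point=day - timedelta(hours=1, minutes=30),  # 22:30 the day before
    interval_ref=timedelta(hours=10),
    anomaly_duration=timedelta(hours=2),
)
for lo, hi in result:
    print(lo.strftime("%H:%M"), "-", hi.strftime("%H:%M"))
```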
Assume a financial transaction data processing system responsible for integrating transaction data from multiple financial institutions. The data are stored in distributed storage nodes, and the data fusion period is set to one day (00:00 to 23:59) for generating a daily comprehensive transaction report.
The system first extracts the start time point (00:00) and the end time point (23:59) of the data fusion period and marks the start time point and the end time point clearly on the time axis. This determines the time frame involved in the data fusion operation on the same day.
The system acquires the last data quality abnormality time point before the starting time point of the data fusion period. Assuming that a data quality anomaly last occurred at 22:30 yesterday (the day before the data fusion period), this 22:30 is taken as the reference point.
The system calculates the mean of the historical data quality abnormality interval times to generate the interval reference value. It collects the data quality abnormality time points over a period of time (e.g., the past 30 days) and computes the intervals between adjacent abnormality time points. Suppose these intervals are 12 hours, 8 hours, 10 hours, and so on; averaging them yields a mean interval of 10 hours, and this 10 hours is the interval reference value.
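For illustration only, if the historical interval set consisted of exactly the three values quoted (the example's "etc." elides the rest of the set), the averaging step would be:

```python
# Hypothetical three-value history; the real set would be larger.
intervals_hours = [12, 8, 10]
interval_reference = sum(intervals_hours) / len(intervals_hours)
print(interval_reference)  # 10.0
```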
Predicted anomaly starting points are then marked sequentially on the time axis, starting from the reference point 22:30 at intervals of 10 hours. The first predicted anomaly starting point is 22:30 plus 10 hours, i.e., 8:30 today; the second is 8:30 plus 10 hours, i.e., 18:30 (times wrap around the 24-hour clock).
And superposing a preset abnormality duration after each predicted abnormality starting point to generate a predicted abnormality ending point. Assuming that the preset anomaly duration is 2 hours, the predicted anomaly termination point corresponding to the first predicted anomaly start point 8:30 is 8:30 plus 2 hours, i.e., 10:30, and the predicted anomaly termination point corresponding to the second predicted anomaly start point 18:30 is 20:30.
The period between each predicted anomaly starting point and its corresponding predicted anomaly ending point is marked as a predicted anomaly period, i.e., 8:30-10:30 and 18:30-20:30 are the predicted anomaly periods. The overlap between the time range of the data fusion period (00:00-23:59 of the current day) and the predicted anomaly periods is then marked as the abnormal influence period. In this example, both 8:30-10:30 and 18:30-20:30 fall within the day's data fusion period, so both periods are abnormal influence periods. In this way, the system can accurately determine the periods within the data fusion period that may be affected by data quality anomalies, providing a precise basis for subsequently adjusting the fusion process and ensuring the reliable data quality of the generated daily comprehensive transaction report.
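The same projection can be checked with plain minute arithmetic; this is a self-contained sketch whose variable names and time encoding (minutes relative to 00:00 of the fusion day) are illustrative assumptions:

```python
# Times in minutes relative to 00:00 of the fusion day.
reference = -90                      # 22:30 the previous day
interval, duration = 600, 120        # 10 h interval, 2 h anomaly duration
fusion_start, fusion_end = 0, 23 * 60 + 59

periods = []
start = reference + interval
while start <= fusion_end:
    end = start + duration
    # Keep only the overlap with the fusion period.
    lo, hi = max(start, fusion_start), min(end, fusion_end)
    if lo < hi:
        periods.append((lo, hi))
    start += interval

fmt = lambda m: f"{m // 60:02d}:{m % 60:02d}"
print([f"{fmt(lo)}-{fmt(hi)}" for lo, hi in periods])  # ['08:30-10:30', '18:30-20:30']
```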
Example 5:
When the system executes a data fusion task, it acquires the current fusion time node; this time node is the key basis for deciding whether to continue the fusion operation. The system matches the current fusion time node against the abnormal influence periods. If the current fusion time node falls within an abnormal influence period, the data fusion may be affected by data quality anomalies, so to safeguard the quality of the fused data the system generates a suspension fusion instruction and suspends the data fusion operation. If the current fusion time node falls outside the abnormal influence periods, the system generates an instruction to be matched. Based on this instruction, the system locates the predicted anomaly starting point closest to the current fusion time node on the time axis and calculates the interval duration between the two. If the interval duration is greater than or equal to a preset safety threshold, a fusion operation performed at the current fusion time node will remain unaffected by data quality anomalies for some time, so the system generates a continuous fusion instruction and allows the data fusion operation to continue. Otherwise, if the interval duration is less than the preset safety threshold, the current fusion time node may be close to a predicted anomaly time; to avoid this risk, the system generates a suspension fusion instruction and suspends the data fusion operation. Through this matching and judging mechanism, the system can dynamically adjust the fusion flow of the heterogeneous data sources according to actual conditions, effectively improving the quality and reliability of data fusion.
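The matching and judging mechanism above can be sketched as a small decision function; the name, signature, and minute-based time encoding are assumptions of the sketch:

```python
def fusion_decision(now, influence_periods, predicted_starts, safety_threshold):
    """Decide 'pause' or 'continue' for the current fusion time node.
    All times are minutes since midnight."""
    # Inside an abnormal influence period: suspend fusion immediately.
    if any(lo <= now <= hi for lo, hi in influence_periods):
        return "pause"
    # Otherwise compare the gap to the nearest upcoming predicted anomaly start.
    upcoming = [s for s in predicted_starts if s >= now]
    if upcoming and min(upcoming) - now < safety_threshold:
        return "pause"
    return "continue"
```

A usage sketch with the figures of the next example: `fusion_decision(14 * 60 + 20, [(14 * 60, 14 * 60 + 30)], [16 * 60], 30)` yields a pause, while the same call at `15 * 60` yields a continue.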
Assume a data fusion system of an online education platform that integrates data from a plurality of data sources, such as student learning progress data and course feedback data, stored in different distributed storage nodes. The data fusion period is set to once per hour so as to generate a real-time analysis report of learning conditions.
When the system executes the data fusion task, the current fusion time node is acquired. For example, the current fusion time node is 14:20.
The system matches the current fusion time node against the previously determined abnormal influence periods. Suppose the method of Example 4 determines 14:00-14:30 to be an abnormal influence period. Because 14:20 falls within the abnormal influence period 14:00-14:30, the system generates a suspension fusion instruction and stops the current data fusion operation, avoiding the effects of data quality anomalies and ensuring the accuracy of the learning condition analysis report.
If the current fusion time node is not within the abnormal influence period, for example, the current fusion time node is 15:00, and the abnormal influence period is 14:00-14:30, the system generates an instruction to be matched. Based on the instruction to be matched, the system locates the predicted abnormal starting point closest to the current fusion time node on the time axis. Assume that the nearest predicted anomaly starting point to 15:00 is 16:00 based on previous calculations and predictions.
The system then calculates the interval duration between the two: interval duration = predicted anomaly starting point time − current fusion time node time, where the predicted anomaly starting point time is the predicted anomaly start nearest to the current fusion time node, and the current fusion time node time is the time point acquired when the system performs the data fusion task. In this example, 16:00 converts to 16 × 60 = 960 minutes and 15:00 to 15 × 60 = 900 minutes, so the interval duration = 960 − 900 = 60 minutes.
Suppose the preset safety threshold is 30 minutes. Because 60 minutes is greater than 30 minutes, the system generates a continuous fusion instruction and allows the data fusion operation to continue at 15:00, so that an accurate learning condition analysis report can be generated in time to support teaching decisions.
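The final threshold comparison in this example reduces to simple arithmetic; the figures below are the example's own:

```python
# Figures from this example, in minutes since midnight.
now, next_anomaly_start, safety_threshold = 15 * 60, 16 * 60, 30
gap = next_anomaly_start - now
decision = "continue" if gap >= safety_threshold else "pause"
print(gap, decision)  # 60 continue
```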
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.