Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It is to be understood that these embodiments are merely discussed so that those skilled in the art may better understand and implement the subject matter described herein and that changes may be made in the function and arrangement of the elements discussed without departing from the scope of the disclosure herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described in some examples may be combined in other examples as well.
In at least one embodiment of the present invention, an enterprise project management method based on multi-source heterogeneous data integration is disclosed, as shown in fig. 1, comprising the following steps:
Step 1, receiving heterogeneous data from a plurality of data sources, wherein the heterogeneous data comprise data with different structures, formats and sources;
The method specifically comprises the following steps:
step 1.1, an adapter registry is created for managing various data source adapters;
the registry contains an adapter metadata repository, an adapter instance manager, and an adapter monitoring component.
The adapter metadata repository stores configuration information for the various adapters, including connection parameters, data structure descriptions and conversion rules;
The adapter instance manager is responsible for the creation, starting, stopping and destroying of the adapter;
the adapter monitoring component monitors the running state and performance index of each adapter in real time.
The adapter registry may be implemented in a centralized or distributed architecture.
In a centralized architecture, a single registry server manages all adapters;
in the distributed architecture, a plurality of registry nodes work cooperatively to improve the reliability and expansibility of the system.
In some embodiments, the registry may also support auto-discovery and registration functions for the adapter, simplifying the system configuration and deployment process.
Step 1.2, implementing a standard interface of the data source adapter, wherein the standard interface comprises a connection interface, a data extraction interface, a metadata acquisition interface and a health check interface.
Wherein, the connection interface defines a method for establishing and maintaining connection with the data source;
the data extraction interface defines a method for acquiring data from a data source and supports full extraction and incremental extraction;
the metadata acquisition interface defines a method for acquiring data source structure information;
The health check interface defines a method of checking the running state of the adapter.
In addition, the design of the standard interface follows the principle of minimization, ensuring that the interface is compact and easy to implement.
For example, the data extraction interface may comprise the following core methods (see the sketch after this list):
a method for acquiring full data;
a method for acquiring incremental data after a specified timestamp;
a method for acquiring data according to filtering conditions.
In some embodiments, additional methods may be provided, such as a method for obtaining real-time data streams, or a method for querying the feature set supported by the adapter.
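For illustration only, a minimal Python sketch of such a standard adapter interface is given below. The class and method names (DataSourceAdapter, extract_incremental, and so on) are hypothetical placeholders, not identifiers from the embodiment; they simply group the four interface families described above.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class DataSourceAdapter(ABC):
    """Hypothetical standard interface covering the four capabilities described above."""

    # Connection interface: establish and maintain a connection to the data source.
    @abstractmethod
    def connect(self, params: Dict[str, Any]) -> None: ...

    # Data extraction interface: full extraction.
    @abstractmethod
    def extract_full(self) -> Iterable[Dict[str, Any]]: ...

    # Data extraction interface: incremental extraction after a given timestamp.
    @abstractmethod
    def extract_incremental(self, since_ts: float) -> Iterable[Dict[str, Any]]: ...

    # Data extraction interface: extraction according to filtering conditions.
    @abstractmethod
    def extract_filtered(self, conditions: Dict[str, Any]) -> Iterable[Dict[str, Any]]: ...

    # Metadata acquisition interface: describe the data source structure.
    @abstractmethod
    def get_metadata(self) -> Dict[str, Any]: ...

    # Health check interface: report the adapter's running state.
    @abstractmethod
    def health_check(self) -> Dict[str, Any]: ...
```

Concrete adapter plug-ins (relational database, API, message queue, etc.) would subclass this interface and implement each method for their specific data source.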
Step 1.3, constructing adapter plug-ins for different types of data sources based on standard interfaces, wherein the adapter plug-ins comprise a relational database adapter, an API interface adapter, a message queue adapter, a file system adapter, a log system adapter and the like.
Each adapter optimizes its data extraction logic for a specific type of data source; for example, the relational database adapter adopts transaction log analysis to realize efficient incremental data extraction, and the API interface adapter implements request throttling and caching mechanisms to avoid API rate limiting caused by frequent requests.
In this way, efficient acquisition from various types of data sources can be ensured.
Step 1.4, constructing a unified data collection bus, receiving the data collected by each adapter and performing preliminary classification.
The unified data collection bus is realized by adopting a distributed message queue, supports data partition and data pipeline processing, and ensures high throughput and low delay.
The bus performs time stamping, source information stamping and data quality evaluation on the received data, and provides a basis for subsequent processing.
Step 2, adaptively classifying the heterogeneous data according to update frequency and data importance to form multidimensional different-rate data streams with differentiated processing priorities;
The method specifically comprises the following steps:
Step 2.1, classifying the uniformly collected data streams and dividing them, according to data update frequency, into fast streams, medium-speed streams and slow streams.
The partitioning is based on the specific update frequency range as follows:
the update period of the fast stream is from second level to minute level, such as system monitoring data and real-time log;
the update period of the medium speed flow is from hour level to day level, such as task state update and code submitting record;
the update period of the slow flow is from week level to month level, such as financial data and performance evaluation data.
In some embodiments, the data stream classification may employ an adaptive threshold approach, with classification thresholds dynamically adjusted based on historical update statistics of the data stream.
For example, the system may calculate an average update frequency for each data stream over different time periods, and automatically categorize the data streams into the most appropriate categories in combination with the traffic importance and access patterns of the data streams. This approach can adapt the system to seasonal or long-term trend changes in the data stream update pattern.
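As a minimal sketch of this adaptive classification, the Python function below classifies a stream from its observed update intervals and business importance. The threshold constants and the importance heuristic are illustrative assumptions, not values from the embodiment.

```python
from statistics import mean

# Hypothetical default thresholds (seconds); in the described system these would be
# adjusted dynamically from historical update statistics of each stream.
FAST_MAX_INTERVAL = 5 * 60          # up to minute-level updates -> fast stream
MEDIUM_MAX_INTERVAL = 24 * 3600     # up to day-level updates   -> medium-speed stream


def classify_stream(update_timestamps, importance: float = 0.5) -> str:
    """Classify a data stream from its recent update timestamps.

    update_timestamps: ascending physical timestamps of recent updates.
    importance: business importance in [0, 1]; a higher value nudges a
    borderline stream toward the faster category (illustrative heuristic).
    """
    if len(update_timestamps) < 2:
        return "slow"
    intervals = [b - a for a, b in zip(update_timestamps, update_timestamps[1:])]
    avg_interval = mean(intervals)
    # Important streams get an effectively stricter (lower) boundary.
    adjusted = avg_interval * (1.0 - 0.3 * importance)
    if adjusted <= FAST_MAX_INTERVAL:
        return "fast"
    if adjusted <= MEDIUM_MAX_INTERVAL:
        return "medium"
    return "slow"
```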
Step 2.2, establishing a dependency graph G = (V, E) between the data streams, where V represents the set of data streams and E represents the dependencies between the data streams.
For each pair of data streams (i, j) ∈ E with a dependency, the system maintains a dependency strength weight w_ij and a time sensitivity s_ij for use in the calculation of subsequent logical timestamps.
The dependency graph can be constructed by combining static definition and dynamic discovery.
Static definition establishes dependencies based on pre-configured business rules, explicitly specifying the dependency relationships between data streams;
The dynamic discovery of the dependency relationships automatically identifies potential dependency relationships by analyzing access patterns and dependencies between data streams.
For example, the system may apply an association rule mining algorithm, analyze the timing relationship of data flows in traffic processing, discover frequently co-occurring data flows, and establish dependencies.
Step 2.3, applying a logical timestamp algorithm to assign a logical timestamp to each data item.
The calculation formula of the logical timestamp is as follows:

LT(d) = f(PT(d), DT(d))

wherein LT(d) represents the logical timestamp of data item d, a unified time identifier assigned by the system to the data item and used to ensure the logical consistency of data at different rates; PT(d) represents the physical acquisition time of the data item, i.e. the point in time when the data is actually acquired, the actual timestamp recorded by the system; DT(d) represents the set of logical timestamps of other data items that have a dependency relationship with the data item, used to reflect the timing dependencies between data items; and f represents the logical timestamp calculation function, which comprehensively considers the physical time and the dependency relationships to generate the final logical timestamp value.
The specific calculation is as follows:

LT(d) = α · PT(d) + (1 − α) · g(DT(d))

wherein α is the physical time weight with value range [0, 1], dynamically adjusted according to the data stream type. When α approaches 1, the logical timestamp depends more on the physical acquisition time; when α approaches 0, the logical timestamp depends more on the timestamps of the associated data items. PT(d) is the physical acquisition time of the data item, i.e. the point in time when the data is actually acquired; DT(d) is the set of logical timestamps of other data items that have a dependency on the data item; LT(d) is the logical timestamp used to ensure the logical consistency of data at different rates; and g is the synthesis function of the dependent timestamps, used to compute a weighted synthesis value of the dependent data item timestamps, calculated as:

g(DT(d)) = Σ_{(i,j) ∈ E} w_ij · s_ij · LT(d_j)

wherein g(DT(d)) represents the synthesis function of the dependent timestamps, used to calculate the weighted timestamp sum of all dependent data items; Σ_{(i,j) ∈ E} represents summation over all pairs of data streams for which a dependency exists, (i, j) representing the dependency of data stream i on data stream j; w_ij represents the dependency strength weight from data stream i to data stream j, used to quantify the importance of the dependency; s_ij represents the time sensitivity from data stream i to data stream j, used to measure the sensitivity of the dependency to time variation; and LT(d_j) represents the logical timestamp of the dependent data stream, i.e. the logical time value of the upstream data item on which the current data item depends.
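A minimal Python sketch of this logical timestamp calculation is given below. The normalization of the dependency synthesis by the total weight is an assumption added to keep the result on the same scale as the timestamps; the embodiment itself only specifies the weighted sum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dependency:
    upstream_lt: float   # logical timestamp LT(d_j) of the upstream data item
    weight: float        # dependency strength weight w_ij
    sensitivity: float   # time sensitivity s_ij


def synthesize_dependencies(deps: List[Dependency]) -> float:
    """g(DT(d)): weighted synthesis of the dependent data items' timestamps."""
    if not deps:
        return 0.0
    total_weight = sum(d.weight * d.sensitivity for d in deps)
    weighted_sum = sum(d.weight * d.sensitivity * d.upstream_lt for d in deps)
    # Normalization is an added assumption; the formula above gives the weighted sum.
    return weighted_sum / total_weight if total_weight else 0.0


def logical_timestamp(physical_time: float, deps: List[Dependency], alpha: float) -> float:
    """LT(d) = alpha * PT(d) + (1 - alpha) * g(DT(d)), with alpha in [0, 1]."""
    if not deps:
        return physical_time
    return alpha * physical_time + (1.0 - alpha) * synthesize_dependencies(deps)
```

In line with the stream types described later, a fast stream would use an α close to 1, a medium-speed stream about 0.6, and a slow stream about 0.3.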
Step 2.4, constructing a data stream buffer management mechanism for coordinating the processing of data streams at different rates.
For fast streams, a sliding window processing mechanism is adopted to calculate the logical timestamps of the latest data in real time;
for medium-speed streams, a time-partitioned buffer is adopted to aggregate data according to configured time periods;
for slow streams, long-term storage is combined with incremental updating to ensure the consistency and availability of the data.
In a specific application scenario of enterprise project management, the implementation mode of the adaptive processing algorithm of the different-speed data stream is as follows:
taking software development project management as an example, the system processes data sources from different rates simultaneously:
the fast flow includes system performance monitoring data (updates per second) and development environment status data (updates per minute);
the medium speed stream includes task status updates (hourly or daily) and code submission records (unscheduled updates);
The slow flow includes weekly report data and monthly performance assessment data.
The logical timestamp algorithm assigns a reasonable physical time weight α to data at different rates according to the requirements of project management:
for real-time monitoring scenarios, the α value of fast streams is close to 1, prioritizing physical time;
for progress tracking scenarios, the α value of medium-speed streams is about 0.6, balancing physical time and dependencies;
for performance assessment scenarios, the α value of slow streams is about 0.3, giving more weight to dependencies.
In project milestone review, the system needs to integrate data at different rates for analysis. At this time, through the logical timestamp algorithm, the system can ensure the consistency of real-time monitoring data (fast stream), the current day task completion condition (medium speed stream) and the month plan completion condition (slow stream) in logic, and avoid analysis deviation caused by the difference of the data updating frequency. For example, when evaluating the quality of completion of a development task, the system correlates the status update of the task (medium speed flow) with the associated performance monitoring data (fast flow) and quality assessment report (slow flow) via logical time stamps, providing a comprehensive evaluation view.
Step 3, assigning business-semantics-based logical timestamps to the multidimensional different-rate data streams, and constructing a timing consistency framework;
The method specifically comprises the following steps:
step 3.1, constructing a unified data model for standardizing heterogeneous data of different sources;
The unified data model is represented by a graph structure and comprises two basic elements of entity nodes and relationship edges.
Entity nodes represent business objects (e.g., items, tasks, resources, etc.), and relationship edges represent associations (e.g., dependencies, etc.) between entities.
Each entity node and relationship edge contains a set of attributes for storing specific business data.
The unified data model may be implemented in different configurations. In addition to the above-described graph structure, a hierarchical structure (tree model), a relational structure (table model), or a hybrid structure may be used, and the most suitable model structure may be selected according to specific service requirements and data characteristics.
For example, a hierarchy may be employed for organizational data having a well-defined hierarchical relationship, and a relationship structure may be employed for transaction-centric financial data.
In some implementations, the unified data model may also support dynamic expansion of the model, allowing new entity types and relationship types to be added during system operation.
Step 3.2, implementing an incremental data stream conversion algorithm, wherein the algorithm dynamically selects an optimal conversion strategy according to the nature of the data changes.
The incremental data stream conversion algorithm comprises the following calculation steps:
For the incoming data stream D_t, calculate the change set relative to the last processed data stream D_{t-1}:

ΔD = D_t \ D_{t-1}

wherein D_t represents the data stream input at the current moment, including the latest data set; D_{t-1} represents the historical data stream of the last processing, used as the comparison baseline; ΔD represents the change set, i.e. the portion of the current data stream that differs from the historical data stream; and \ represents the set difference operation, used to extract the data items in the current data stream that differ from the historical data stream.
The change set ΔD is classified into newly added data ΔD_add, updated data ΔD_update and deleted data ΔD_delete;
The conversion operations are applied separately for the different types of change data:
for newly added data ΔD_add, a complete conversion function is applied;
for updated data ΔD_update, a differential conversion function is applied;
for deleted data ΔD_delete, a reverse conversion function is applied;
The conversion results are merged with the existing unified data model to obtain an updated data model:

M_new = Merge(M_old, T(ΔD))

wherein M_new represents the updated data model, i.e. the latest unified data model after the conversion processing is completed; M_old represents the data model before the update, i.e. the unified data model of the previous version; Merge represents the model merging operation, used to integrate the converted change data with the existing model; T(ΔD) represents the combined result of the conversion functions, i.e. the integrated result after the conversion functions are respectively applied to the newly added data, the updated data and the deleted data; and ΔD represents the change set, i.e. the portion of the current data stream that differs from the historical data stream.
In the scenario where the data is frequently changed but of small magnitude, the following optimization variant algorithm is optionally employed:
To improve the processing efficiency of small-scale frequent changes, the system can implement a change batch processing mechanism. The mechanism collects a plurality of small-scale changes within a short time to form a change batch B and then performs a merging process:

ΔD_batch = ΔD_1 ∪ ΔD_2 ∪ ... ∪ ΔD_n

wherein ΔD_batch represents the merged batch change set, i.e. the result of merging a plurality of small-scale changes; ∪ represents the merge operator for change sets, used to integrate a plurality of change sets into one set; the index runs from the first change set to n, the total number of change sets; ΔD_i represents the i-th change set, corresponding to a single data change; ΔD_1, ΔD_2, ..., ΔD_n respectively represent the independent change sets collected within the short time window; and n indicates the number of changes included in the batch B.
The merged change set ΔD_batch is then processed according to the standard algorithm, significantly reducing the number of processing passes and the system overhead.
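A minimal Python sketch of the change-set computation, batch merging and conversion dispatch is shown below. Snapshots are modeled as dictionaries keyed by record identifier, and the three conversion functions are passed in as callables; these representation choices are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Model = Dict[str, Any]  # unified data model keyed by entity id (simplified)


def compute_change_set(current: Dict[str, Tuple], previous: Dict[str, Tuple]):
    """Split the difference between two snapshots into added / updated / deleted parts."""
    added = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {k: current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]}
    return added, updated, deleted


def merge_batches(batches: List[Dict[str, Tuple]]) -> Dict[str, Tuple]:
    """Change batch mechanism: later changes to the same key override earlier ones."""
    merged: Dict[str, Tuple] = {}
    for batch in batches:
        merged.update(batch)
    return merged


def apply_changes(model: Model, added, updated, deleted,
                  t_full: Callable, t_diff: Callable, t_reverse: Callable) -> Model:
    """Apply the three conversion functions and merge the results into the model."""
    for key, value in added.items():
        model[key] = t_full(value)                  # complete conversion for new data
    for key, value in updated.items():
        model[key] = t_diff(model.get(key), value)  # differential conversion for updates
    for key in deleted:
        t_reverse(model.pop(key, None))             # reverse conversion / safe removal
    return model
```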
Step 3.3, constructing a data conversion rule engine supporting declarative rule definition and automatic rule derivation.
The rule engine comprises a rule warehouse, a rule executor and a rule learning module;
The rule warehouse stores predefined conversion rules and rule templates;
The rule executor is responsible for interpreting and executing the conversion rule;
the rule learning module automatically derives new conversion rules based on the historical conversion data, and improves system adaptability.
Step 3.4, establishing a data quality evaluation and repair mechanism, monitoring the data quality problem in the conversion process in real time and automatically repairing;
the data quality assessment includes integrity checking, consistency checking, accuracy checking, and timeliness checking.
When the data quality problem is found, the system automatically repairs according to a preset repair strategy, and records a repair log for subsequent analysis and improvement.
In the actual application scenario of enterprise project management, the implementation manner and application examples of the incremental data stream conversion algorithm are as follows:
Taking an enterprise project management environment integrated by multiple systems as an example, an enterprise uses multiple systems such as a project management system, a code management system, a CI/CD system, a human resource system, and the like. Incremental data stream conversion algorithms, when processing these heterogeneous system data, take different strategies according to the nature of the data changes:
when the task state in the project management system changes from "in progress" to "completed" (belonging to update data) The system only processes the state change part, applies the difference conversion functionThe state changes are converted to corresponding updates in the unified data model. This incremental processing approach significantly reduces the amount of computation and data transfer compared to full processing of all task data.
When a code warehouse is newly added (belonging to newly added data) The system applies a complete transfer functionThe complete metadata and initial code analysis results of the repository are converted to new nodes and relationships in the unified data model. In this case, the system needs to perform a complete conversion process because of the completely new data.
When a member of a project leaves (belonging to deleted data) The system applies a reverse transfer functionThe association relationship between the member and the item is safely removed from the unified data model, and the history contribution record is kept. This way of processing ensures the integrity and consistency of the data.
Step 4, intelligent conversion, context awareness aggregation and predictive analysis are carried out on the multidimensional different-speed data stream based on the logic timestamp, and the logic integrity of the cross-data source is ensured according to the data stream processing model of dynamic evolution;
The method specifically comprises the following steps:
step 4.1, constructing a business process monitoring system, and collecting and analyzing business activity logs in real time;
The system includes a log collector, a flow identification engine, and a change detector.
The log collector collects activity logs from each business system;
the process identification engine builds a current business process model based on the collected logs;
the change detector detects a change in the business process by comparing the process models at different points in time.
The detailed composition of the business process monitoring system is as follows:
the log collector adopts a distributed acquisition architecture, and comprises a log agent, a log transmission channel and a log aggregation service.
The log agent is deployed on each service system and is responsible for the acquisition and preliminary processing of local logs;
the log transmission channel is realized based on the message queue, so that reliable transmission of log data is ensured;
the log aggregation service receives all log data and performs unified storage and indexing.
The process recognition engine extracts a process model from the business activity log based on a process mining technique.
The engine comprises a preprocessing module, a flow discovery module and a flow optimization module.
The preprocessing module cleans, filters and converts the original log;
the flow discovery module applies an improved Alpha algorithm to construct a flow model from the active sequence;
the process optimization module perfects the process model by combining similar paths, removing noise and calculating frequency information.
The change detector identifies changes in the business process by calculating the similarity and the difference of the process models. The detector includes a model comparison algorithm, a change classifier, and a change impact analyzer.
The model comparison algorithm calculates the structural difference and attribute difference between the two flow models;
The change classifier classifies the detected change into types of new nodes, deleted nodes, path change and the like;
The change impact analyzer evaluates the extent of potential impact of the change on the data stream processing.
Step 4.2, establishing a mapping relation between the business process and data stream processing;
The mapping relationship is expressed by a function:

R = F(P, S, C)

wherein R represents the complete rule set describing how the system processes, converts and routes data; P represents the business process model, comprising a structured representation of business activity nodes, decision points, circulation paths and their relationships; S represents the data source characteristics, the parameter set describing data source attributes such as data format, update frequency, reliability and integrity; and C represents the context environment parameters that affect data stream processing, such as system load, network conditions, user preferences and security policies.
The specific mapping calculation is as follows:

R = F(P, S, C) = (R_route, R_transform, R_aggregate, R_priority)

wherein R represents the data stream processing configuration; R_route represents the data routing rule set, defining how data flows through the system, including the source, destination, transmission path and conditional branches of the data; R_transform represents the data conversion rule set, defining how raw data is converted into the standard format, including field mapping, format conversion, data cleaning and validation rules; R_aggregate represents the data aggregation rule set, defining how related data are combined and aggregated, including data grouping, summary calculation, time-window aggregation and association merging rules; and R_priority represents the data processing priority rules, defining the processing order of different data streams, including priority allocation mechanisms based on business importance, timeliness, resource consumption and dependencies.
Each rule set is generated from the nodes and edges of the business process model P; for example, decision points in the business process are mapped to data routing rules, and business activities are mapped to data conversion rules.
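A minimal sketch of such a mapping function is given below in Python. The process model is assumed, purely for illustration, to be a dictionary of typed nodes; the field names and the priority heuristic are placeholders, not part of the embodiment.

```python
from typing import Any, Dict, List


def map_process_to_rules(process: Dict[str, Any],
                         source_props: Dict[str, Any],
                         context: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """F(P, S, C) -> (routing, conversion, aggregation, priority) rule sets."""
    routing, conversion, aggregation, priority = [], [], [], []

    for node in process.get("nodes", []):
        if node["type"] == "decision":
            # Decision points in the business process become data routing rules.
            routing.append({"when": node["condition"], "route_to": node["branches"]})
        elif node["type"] == "activity":
            # Business activities become data conversion rules.
            conversion.append({"activity": node["name"], "mapping": node.get("fields", {})})
            if node.get("aggregates"):
                aggregation.append({"activity": node["name"], "inputs": node["aggregates"]})

    # Context parameters (e.g. current load) influence processing priority.
    base_priority = "high" if context.get("load", 0.0) < 0.7 else "normal"
    for stream, props in source_props.items():
        priority.append({"stream": stream,
                         "priority": "highest" if props.get("critical") else base_priority})

    return {"routing": routing, "conversion": conversion,
            "aggregation": aggregation, "priority": priority}
```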
Step 4.3, implementing an adaptive data stream reconfiguration mechanism. When a business process change is detected, the system performs data stream reconfiguration as follows:
Identifying the changed business process nodes and relationships, and calculating the change set ΔP;
calculating the data stream processing configuration to be updated according to the mapping function:

ΔR = F(ΔP, S, C)

wherein ΔR represents the data stream processing configuration that needs to be updated, i.e. the set of data processing rules that must be adjusted due to the business process change; F represents the mapping function, used to convert the business process change into the corresponding data stream processing configuration change; ΔP represents the change set of the business process, including the newly added, modified or deleted business process nodes and relationships; S represents the data source characteristics, the parameter set describing data source attributes such as data format, update frequency, reliability and integrity; and C represents the context environment parameters that affect data stream processing, such as system load, network conditions, user preferences and security policies.
Generating a data stream reconfiguration plan comprising a configuration update sequence and verification test cases;
And carrying out data stream reconfiguration according to a plan, monitoring the state and influence of the reconfiguration process, and ensuring the stability of the system.
Step 4.4, constructing a visual process-data association monitoring tool, providing managers with visual views of business processes and data stream processing, and supporting manual intervention and adjustment.
The tool displays the current business process model, the data flow processing configuration and the mapping relation among the business process model and the data flow processing configuration, and provides historical change records and performance index analysis at the same time, so as to assist management staff in knowing the running state and the optimizing direction of the system.
When the enterprise organization changes, an example of application of the business process aware data stream processing technique is as follows:
the function type organization of a certain manufacturing enterprise is changed into the matrix organization, so that the project approval process is changed from the original step-by-step approval to the parallel approval mode.
The business process monitoring system detects an obvious change in the approval path by analyzing recent activity logs: the original serial path of "project manager → department manager → director → vice president" is converted into a parallel hybrid path of "project manager → (department manager, functional expert) → comprehensive review board".
The change detector recognizes this significant flow change and triggers a data flow reconfiguration mechanism. The system calculates the data flow processing configuration which needs to be updated based on the mapping relation, wherein the data routing rule is changed from serial forwarding to parallel distribution, the data aggregation rule is changed from gradual summarization to a comprehensive scoring mechanism, and the processing priority is changed from the hierarchical priority to the time priority.
The data stream reconfiguration plan is executed in two phases:
in the first phase, the new approval route and data processing logic are established but not yet activated;
in the second phase, the new flow is activated; approval projects already in transit continue to be processed by the old flow, while new projects are handled entirely by the new flow.
By this smooth transition, project management data processing during enterprise organization revolution maintains continuity and consistency, avoiding data processing confusion and business disruption during changes in conventional systems.
Step 5, applying the analysis result to a multi-level decision support system for enterprise project management through a visual decision matrix;
The method specifically comprises the following steps:
Step 5.1, designing a process data consistency model and defining consistency constraints between the business process state and the data stream processing state.
The consistency model is expressed by the following formula:

C(BP_t, DP_t): ∀n ∈ N, View_n(BP_t) ≡ View_n(DP_t)

wherein C(BP_t, DP_t) represents the consistency constraint function between the business process state and the data stream processing state; BP_t represents the business process state at time point t, i.e. the complete description of the configuration, rules and execution conditions of all business processes in the system at the specific time point t; DP_t represents the data stream processing state at time point t, i.e. the complete description of the routing rules, conversion logic, processing priorities and execution states of all data streams in the system at the specific time point t; N represents the set of nodes in the system, including all servers, application instances and processing units that participate in data processing and business process execution; View_n(X) represents node n's view of state X, i.e. the information about state X perceived by node n, including all attributes and parameters of state X accessible to node n; ≡ represents the equivalence relation, indicating that the two views are semantically identical and ensuring that a node's understanding of the business process and the data stream remains synchronized; and ∀n ∈ N indicates that the relation holds for all nodes in the system, ensuring global consistency.
The consistency model ensures that each node in the system has equivalent views on the business flow state and the data flow processing state, and avoids data processing errors caused by inconsistent states.
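As a minimal sketch of such a consistency check, the Python snippet below verifies that all nodes report equivalent views of the business process state and the data stream processing state. The node interface (a view_of method) and the interpretation of "≡" as equality of serialized views are assumptions for illustration.

```python
from typing import Any, Dict, List


def views_equivalent(view_a: Dict[str, Any], view_b: Dict[str, Any]) -> bool:
    """'≡' is interpreted here as semantic equality of the serialized views."""
    return view_a == view_b


def check_consistency(nodes: List[Any], bp_state_id: str, dp_state_id: str) -> bool:
    """Check that every node's views of BP_t and DP_t agree with a reference node.

    Each node object is assumed to expose view_of(state_id), returning its local
    view of the named state; the first node's views serve as the reference.
    """
    if not nodes:
        return True
    ref_bp = nodes[0].view_of(bp_state_id)
    ref_dp = nodes[0].view_of(dp_state_id)
    for node in nodes[1:]:
        if not views_equivalent(node.view_of(bp_state_id), ref_bp):
            return False
        if not views_equivalent(node.view_of(dp_state_id), ref_dp):
            return False
    return True
```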
Step 5.2, realizing a distributed consensus algorithm, and achieving global consensus when the business flow is changed;
the distributed consensus algorithm is based on a two-phase commit protocol extension, comprising the steps of:
the process change coordinator sends a pre-commit message to all participating nodes, including the business process change description ΔP and the data stream reconfiguration plan ΔR;
each node verifies the feasibility of the change and returns a ready or refuse response to the coordinator;
if all nodes return ready, the coordinator sends a commit message; otherwise, an abort message is sent;
finally, each node executes or abandons the change according to the coordinator's decision and returns the execution result (a minimal sketch of this flow follows).
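The sketch below illustrates this two-phase commit flow in Python. The Participant class and its prepare/commit/abort methods are illustrative stubs under stated assumptions, not the embodiment's actual node implementation.

```python
from enum import Enum
from typing import List


class Vote(Enum):
    READY = "ready"
    REFUSE = "refuse"


class Participant:
    """A node taking part in the process-change consensus (illustrative stub)."""

    def __init__(self, name: str, can_apply: bool = True):
        self.name = name
        self.can_apply = can_apply
        self.applied = False

    def prepare(self, change_desc: dict, reconfig_plan: dict) -> Vote:
        # Verify the feasibility of the change locally (stubbed as a flag here).
        return Vote.READY if self.can_apply else Vote.REFUSE

    def commit(self) -> None:
        self.applied = True

    def abort(self) -> None:
        self.applied = False


def run_two_phase_commit(participants: List[Participant],
                         change_desc: dict, reconfig_plan: dict) -> bool:
    """Phase 1: pre-commit / vote.  Phase 2: commit only if every node is ready."""
    votes = [p.prepare(change_desc, reconfig_plan) for p in participants]
    if all(v == Vote.READY for v in votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```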
Step 5.3, constructing a business process version management system, and supporting the parallel operation of new and old business processes;
The version management system comprises a process version library, a version router and a version compatibility checker.
The process version library stores business process models of different versions and corresponding data stream processing configurations;
the version router directs the data stream to the appropriate version of processing logic based on the data characteristics and the context information;
the version compatibility checker ensures that data exchange between different versions does not result in data loss or errors.
Step 5.4, constructing a flow data switching manager which is responsible for coordinating the smooth transition between new and old business flows;
the switching manager implements the following functions:
calculating a data mapping relation between new and old processes to generate a data migration plan;
The state synchronization control is to maintain the state synchronization of the new and old processes in the switching process, and ensure the data consistency;
the rollback mechanism can safely rollback to the previous stable state when an abnormality occurs in the switching process;
and the switching progress monitoring is used for monitoring the progress and the state of the switching process in real time and providing visual display and alarm functions.
The specific implementation of the flow data synchronization protocol and its details are as follows:
The distributed consensus algorithm is implemented based on an improved version of the two-phase commit protocol, adding a pre-verification phase and a dynamic participant management mechanism.
The pre-verification stage comprehensively tests the configuration change before formal submission to ensure the feasibility of the change;
The dynamic participant management mechanism allows for dynamic addition or subtraction of participating nodes in the consensus process, improving the flexibility and fault tolerance of the system.
The algorithm decides the acceptance or rejection of the configuration change through a voting mechanism, and adopts a weighted majority principle, so that the key nodes have higher voting weights.
The consistency model is implemented with the same constraint:

C(BP_t, DP_t): ∀n ∈ N, View_n(BP_t) ≡ View_n(DP_t)

wherein C(BP_t, DP_t) represents the consistency constraint function between the business process state and the data stream processing state, implemented using a hierarchical state synchronization mechanism; BP_t represents the business process state at time point t, including the configuration, rules and execution conditions of all business processes; DP_t represents the data stream processing state at time point t, including the routing rules, conversion logic and processing states of all data streams; N represents the set of all nodes in the system, including all servers and application instances involved in processing; View_n(BP_t) represents node n's view or understanding of the business process state; View_n(DP_t) represents node n's view or understanding of the data stream processing state; ≡ represents the equivalence relation, ensuring that the two views are semantically identical; and ∀n ∈ N indicates that the equivalence must be satisfied for all nodes in the system.
The realization of the model adopts a layered state synchronization mechanism:
at the bottom layer, tracking the state update condition of each node by using a vector clock;
in the middle layer, recording all state change events by adopting an event tracing technology, and supporting state reconstruction and backtracking;
at the upper layer, view consistency check is realized, and view equivalence of all nodes to the business flow state and the data flow processing state is ensured.
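For the bottom layer of this mechanism, a minimal vector clock sketch is given below in Python; the event-log type hints at the middle layer's event sourcing. The data structures are illustrative and not taken from the embodiment.

```python
from typing import Dict, List, Tuple

NodeId = str
VectorClock = Dict[NodeId, int]


def bump(clock: VectorClock, node: NodeId) -> VectorClock:
    """Bottom layer: a node increments its own component on every local state update."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock


def merge(local: VectorClock, remote: VectorClock) -> VectorClock:
    """On receiving a remote update, take the component-wise maximum."""
    return {n: max(local.get(n, 0), remote.get(n, 0)) for n in set(local) | set(remote)}


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """a -> b iff every component of a is <= that of b and at least one is strictly smaller."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))


# Middle layer (event sourcing): append (event, clock) pairs to a log so that any
# state can be rebuilt by replay; the upper layer then compares the replayed views.
EventLog = List[Tuple[str, VectorClock]]
```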
In the business process version management system, the process version library stores process models of different versions in a graph database and supports version branching, merging and comparison;
the version router is realized based on a context-aware rule engine, and dynamically selects a processing version according to factors such as data characteristics, processing stages, system loads and the like;
the version compatibility checker evaluates the data exchange compatibility between different versions through semantic equivalence analysis and data flow simulation testing.
In the multi-place collaborative project management scenario, an example of an application of the flow data synchronization protocol is as follows:
The development projects of a multinational enterprise involve three development centers in Asia, Europe and North America, each having its own project management processes and data processing logic. The enterprise decides to unify the global development processes while maintaining business continuity during the transition.
The application of the flow data synchronization protocol in this scenario includes:
Each node agrees with a new global unified flow through a distributed consensus algorithm, and meanwhile, the length of a transition period and a switching strategy are determined;
The business process version management system configures process variants adapting to local characteristics for each region, and meanwhile, the compatibility with global standard processes is maintained;
The flow data switching manager makes a personalized transition plan for each region, and determines the optimal switching time according to the project period and the business characteristics of each region.
In the process of implementing a new process in an asian development center, some ongoing critical project requires continued use of the old process until completion. At this time, the flow data synchronization protocol ensures that the data processing of the project is normally performed under the old flow, and simultaneously, the project state data is kept consistent in the new and old systems through the state synchronization control mechanism, so that the continuity and consistency of the data processing during the business flow change are realized.
Through implementation of the five main steps, the technical scheme of the application realizes the enterprise project management method based on multi-source heterogeneous data integration, and effectively solves the technical problems of data time consistency, business process change adaptability, real-time processing efficiency, data source integration complexity and the like faced by multi-source heterogeneous data integration in enterprise project management.
Application example of the present embodiment:
The actual application process and effect of the method are shown by the following project management system transformation case of a multinational software research and development enterprise. The enterprise manages multiple software product lines simultaneously, involves research, testing, operations and other departments, uses multiple sets of information systems, and faces the typical challenges of multi-source heterogeneous data integration.
Application scenario description:
The enterprise operates 6 development centers globally, runs 35 software product projects simultaneously, and employs about 2000 staff. It faces the following specific challenges in project management:
Data source diversity: the enterprise uses 12 different information systems, including a JIRA task management system, a GitLab code management platform, a Jenkins CI/CD system, a SonarQube code quality analysis tool, a Prometheus monitoring system, an employee attendance system, a financial system and a customer feedback system. These systems employ different data structures and formats, spanning multiple database types (MySQL, MongoDB, PostgreSQL, etc.) and data exchange modes (API, message queues, file exports, etc.).
Significant differences in data update frequency: the system monitoring data is updated every 10 seconds, code commit records are generated irregularly (tens of times per hour on average), task states are updated several times per day, and financial data and personnel assessment data are updated weekly or monthly. A traditional unified batch processing mode either delays the processing of real-time data or causes resource waste.
Frequent business process adjustment: the enterprise adjusts its product development strategy quarterly according to market demands, and project management processes such as approval processes and resource allocation processes change accordingly. These changes require manual adjustment of the data processing logic, causing the project management system to lag the actual business process by 1-2 weeks and resulting in data inconsistencies and decision bias.
System performance bottleneck: as the number of projects and the data volume increase, the original batch data integration scheme cannot meet the performance requirements; the data update delay averages 4 hours and exceeds 12 hours at peak times, seriously affecting real-time decision-making and anomaly early-warning capabilities.
Enterprises hope to construct a unified project management platform, can integrate all heterogeneous data sources in real time, automatically adapt to business process changes, and provide comprehensive and accurate project state views and decision support.
Core step implementation example:
Construction of a lightweight multi-source data acquisition adapter system:
The enterprise first builds an adapter registry, deployed in a distributed architecture across the 6 development centers, with the main center set as the coordination node. The registry is configured with the following key components:
The adapter metadata base records the configuration information of each adapter by using a distributed key value storage system, and comprises the following steps:
an adapter ID, a unique identifier, in the form of "AD-2023-JIRA-001";
connection parameters, namely data source access credentials, access addresses, access modes (API/database/message queues);
The data structure describes field mapping relation and data type conversion rule;
the acquisition frequency configuration is that acquisition interval setting dynamically adjusted according to data updating frequency;
The standard interface implementation designs a universal adapter interface based on the abstract factory pattern, covering 4 core functions:
Connection management, namely establishing, maintaining and releasing connection with a data source;
data extraction, namely supporting full extraction, incremental extraction and change capture;
metadata acquisition, namely automatically discovering and mapping a data source structure;
health monitoring, self-diagnosis and recovery ability;
Adapter plug-in development: specific adapter plug-ins were developed for each of the 12 information systems, each plug-in being optimized for the characteristics of a particular data source:
For the JIRA and GitLab systems, the adapter implements API rate-limit control and response caching, keeping API usage within limits;
for MySQL and other relational databases, the adapter adopts binlog parsing to realize zero-delay change capture;
for the SonarQube analysis tool, the adapter implements an incremental data extraction algorithm that only acquires newly added or changed analysis results;
for the financial and personnel systems, the adapter integrates file monitoring and automatic parsing capabilities to process regularly generated spreadsheet reports;
In the actual deployment, the enterprise also developed an Adapter Development Kit (ADK), which significantly simplifies the development of new adapters. With this toolkit, a developer can complete the adapter for a new system in only 4 hours on average, whereas the conventional integration approach requires 3 to 5 days on average.
All adapters send the collected data to a unified data collection bus, the bus is realized based on a distributed message queue, and the bus is configured into a multi-level theme structure to realize preliminary classification of the data.
Metadata tags are automatically added when data enters the bus (as illustrated in the sketch after this list), comprising:
A data source identifier, which is to record a source system of data;
a physical timestamp, namely recording the accurate time when the data is acquired;
collecting node information, namely recording a server node for executing collection;
a data quality assessment, namely recording the preliminary evaluation results of integrity and accuracy.
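A minimal Python sketch of this bus-level tagging is shown below. The field names and the simple completeness score are illustrative assumptions; the example source identifier reuses the adapter ID format given earlier.

```python
import time
from typing import Any, Dict


def tag_record(record: Dict[str, Any], source_id: str, node_id: str) -> Dict[str, Any]:
    """Wrap a collected record with the bus-level metadata tags listed above."""
    # Preliminary data quality evaluation: fraction of non-empty fields.
    completeness = sum(v is not None for v in record.values()) / max(len(record), 1)
    return {
        "payload": record,
        "source_id": source_id,              # originating system, e.g. "AD-2023-JIRA-001"
        "physical_ts": time.time(),          # exact time when the data was collected
        "collector_node": node_id,           # server node that performed the collection
        "quality": {"completeness": round(completeness, 3)},
    }
```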
Application of the adaptive processing algorithm of the different-speed data flow:
Enterprises divide the collected data streams into three tiers based on data update frequency characteristics:
Fast stream configuration:
Covering data, namely monitoring data by a system, logging user activities and constructing a pipeline state;
Update frequency from 10 seconds to 5 minutes;
the processing strategy is that real-time processing is carried out, and the size of a sliding window is set to be 30 seconds;
physical time weight α = 0.85, giving priority to physical time;
Medium-speed stream configuration:
Covering data, namely changing task state, submitting codes to record and analyzing quality results;
updating the frequency from 1 hour to 1 day;
the processing strategy is quasi-real-time processing, and the time partition is 1 hour;
physical time weight α = 0.6, balancing physical time and dependency relationships;
Slow stream configuration:
coverage data, weekly report data, financial data, personnel assessment data;
the update frequency is 1 week to 1 month;
the processing strategy is to process regularly and keep the long-term consistency of the data;
physical time weight α = 0.3, giving priority to dependency relationships;
the enterprise analyzes the data flow dependency relationship of each product line item to construct a dependency relationship graph. The following is an example of the partial dependencies of core product a:
Dependency 1: task completion status → build status, with a configured dependency strength weight and time sensitivity, indicating that the build process should be triggered after a task is completed;
Dependency 2: code commit → quality analysis, with a configured dependency strength weight and time sensitivity, indicating that quality analysis should be performed after code is submitted;
Dependency 3: build status → deployment status, with a configured dependency strength weight and time sensitivity, indicating that the deployment process should be triggered after the build is completed;
The system calculates a logical timestamp for each data item based on these configurations and the dependency graph. Taking a code submission event as an example, the calculation process is as follows:
the physical acquisition time PT(d) is the actual time at which the commit was collected;
the logical timestamps of the dependent data items form the set DT(d);
according to the formula LT(d) = α · PT(d) + (1 − α) · g(DT(d)):
α = 0.6 (medium-speed stream weight);
g(DT(d)) is the dependency timestamp synthesis result;
the final logical timestamp is LT(d) = 0.6 · PT(d) + 0.4 · g(DT(d)).
This logical time stamping mechanism enables the system to coordinate data processing at different rates. For example, when processing the release state analysis of a certain product version in a certain week, the system can ensure that all data (from real-time monitoring data to weekly report data) used are consistent in logic time, and inconsistent analysis results caused by different data updating frequencies in the traditional system are avoided.
In practical application, the enterprise configures corresponding processing mechanisms for three types of data streams respectively:
The fast streams use an in-memory stream processing engine with a sliding window mechanism whose window size is dynamically adjusted according to data stream characteristics, ranging from 10 seconds to 5 minutes;
the medium-speed streams use a near-real-time batch engine with a micro-batch mechanism and a batch interval of 15 minutes;
the slow streams use periodic batch processing combined with an incremental update mechanism, with the processing period set to once per day;
The layering processing mechanism remarkably optimizes the utilization rate of system resources while ensuring the logical consistency of the data, and avoids the resource waste or the real-time reduction caused by the application of a uniform processing period to all data.
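The tier parameters described above can be summarized in a configuration sketch such as the following; the keys and engine names are placeholders, not identifiers from the deployed system.

```python
# Illustrative configuration mirroring the three tiers described above.
STREAM_TIERS = {
    "fast": {
        "engine": "in_memory_stream",
        "window_seconds": (10, 300),       # sliding window, 10 s to 5 min
        "alpha": 0.85,                     # physical-time weight
    },
    "medium": {
        "engine": "micro_batch",
        "batch_interval_seconds": 15 * 60,  # micro-batch every 15 minutes
        "alpha": 0.6,
    },
    "slow": {
        "engine": "daily_batch_incremental",
        "period_seconds": 24 * 3600,        # daily batch plus incremental updates
        "alpha": 0.3,
    },
}
```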
Application of incremental data stream conversion algorithm:
enterprises build a unified data model based on graph structures, comprising the following core entities and relationships:
core entity node:
project, representing a product development Project;
task, representing development Task and work item;
member represents a project team Member;
code, representing source Code and repository;
building, namely representing CI/CD building tasks;
Resource, representing project resources;
Relationship edge:
Contains: a project contains tasks;
Responsible: a member is responsible for a task;
Created: a member created code;
Triggers: code triggers a build;
Uses: a task uses resources;
The system implements a special conversion rule for each data source, such as JIRA task data conversion rules:
field mapping: JIRA task ID → unified model task ID;
state mapping: JIRA state "InProgress" → unified model state "in progress";
relationship mapping: JIRA task assignment relationship → unified model "Responsible" relationship;
enterprises implement incremental data stream conversion algorithms, the following are examples of handling development task changes in practical cases:
The system detects a change to task TASK-1024 in JIRA: the state changes from "to be developed" to "in progress", and the assignee changes from team member A to team member B;
Change set calculation: the system calculates the change set:
"Change set= {
Update item [
{ Field: "status", old value: "to be developed", new value: "in progress" },
{ Field: "assignee", old value: "Member A", new value: "Member B" }
]
}”
Change classification: the system classifies the change set as updated data ΔD_update (no additions or deletions);
Differential conversion: the system applies the conversion function only to the changed fields and does not process unchanged fields such as description and priority:
"differential conversion result= {
Entity update: {
Entity type: "task",
Entity ID "UNIFIED-TASK-1024",
Update properties {
The current state is in progress,
"State Change time" 2023-10-16T09:30:45"
}
},
Relation update: {
Delete relations: {
Type "responsible",
The source node is UNIFIED-MEMBER-A,
Target node 'UNIFIED-TASK-1024'
},
The new relation is {
Type "responsible",
The source node is UNIFIED-MEMBER-B,
Target node 'UNIFIED-TASK-1024'
}
}
}”
Data model merging: the system merges the conversion result into the unified data model, updating the task node attributes and the associated relationships;
In actual operation, the system employs a batch optimization strategy for frequent changes. For example, during code review, a developer may update the same code file multiple times in a short period of time. The system sets a change batch processing window of 5 seconds, and combines multiple updates to the same file in the period of time into one process.
The enterprise also implements a declarative transformation rules engine that enables business personnel to define and modify transformation rules via a visual interface without writing code. The rules engine supports the following functions:
Rule templates of common data conversion scenes, such as task state mapping, personnel role mapping and the like, are predefined;
rule verification, namely automatically verifying the integrity and consistency of the rule and finding potential conflict;
rule version management, namely tracking rule change history and supporting rollback to the previous version;
Rule learning, namely automatically recommending rule optimization suggestions based on historical data;
in addition, enterprises establish a data quality assessment mechanism, monitor data quality problems in the conversion process in real time and trigger automatic repair. For example, upon detecting the absence of a team member associated with a task, the system may automatically mark the data as "to-be-verified" and generate a repair suggestion. The system also maintains a detailed log of the conversion process, records the input, output and quality metrics for each conversion, and supports problem localization and quality optimization.
Implementation of the business-process-aware data stream processing technology:
The enterprise builds a business process monitoring system and tracks the execution condition and change of the project management process in real time. The system comprises the following core components:
Log collector, a lightweight agent deployed in each business system, captures the following key activity data:
recording an operation sequence of a user in the project management system;
recording state change and flow propulsion automatically triggered by the system;
API call log, recording the call condition between the system interfaces;
a database transaction log records the change condition of service data;
The log collection adopts a low-intrusiveness design implemented through interceptors, listeners and log parsers, keeping the performance impact on the original systems below 3%.
The process identification engine extracts the business process model from the activity logs by applying an improved Alpha++ algorithm. The algorithm comprises the following steps:
log preprocessing, namely cleaning abnormal data and converting an original event into standard activity;
The relation discovery comprises the steps of calculating the causal relation and the parallel relation between activities;
constructing a flow model comprising sequential, selective, parallel and cyclic structures;
Decision point identification, namely detecting key decision points and decision rules in a flow;
the process recognition engine performs a full-scale analysis every 24 hours, an incremental analysis every 4 hours, and maintains an up-to-date process model.
The change detector detects business process changes by comparing process models at different time points. The following case shows the detected changes in the product release approval process:
original flow path:
the product manager submits the release application → technical manager approval → quality manager approval → department head approval → release;
new flow path:
the product manager submits the release application → (technical manager approval, quality manager approval) [parallel] → release committee approval → release;
The change detection algorithm identifies the following change points:
Newly added nodes, namely approval by a release committee;
deleted node: department head approval;
path change, namely, the technical manager and the quality manager examine and approve are changed from serial to parallel;
The decision rule is changed, and the approval passing condition is changed from 'all approvers pass' to 'committee majority pass';
The enterprise establishes a mapping relationship model between business processes and data stream processing, and defines the specific implementation rules of the mapping function F(P, S, C):
Routing rule mapping: decision points and conditional branches in the process are mapped to data routing rules;
An example is that an approval passing/rejecting decision point in a release approval process is mapped to a data stream routing condition;
IF approval result == "passed" THEN route to the "release preparation" flow;
ELSE route to the "release cancellation" flow;
Conversion rule mapping: activity nodes in the process are mapped to data conversion rules;
examples are "quality check" activity mapping to quality data aggregate conversion rules;
"aggregate instruction= {
Inputs [ "unit test results", "integrated test results", "performance test results" ],
The operation is that "calculate weighted average score",
The weight is [0.3,0.4,0.3],
Output of the product is "quality comprehensive score"
}”
Priority rule mapping, namely mapping the activity priority in the flow into data processing priority;
an example is that an urgent bug repair procedure is mapped to a high priority data processing tag;
IF the task tag contains "urgent repair" THEN data processing priority = "highest";
when the system detects the change of the business flow, automatically triggering a data flow reconfiguration mechanism:
change impact analysis, identifying data stream processing logic affected by the flow change. In the case of issuing approval process changes, the system identifies the following points of impact:
The approval status data aggregation logic needs to be changed from serial to parallel;
data acquisition and processing related to a newly added release committee are needed;
the approval decision condition is changed from 'all pass' to 'majority pass';
Configuration update plan: the system generates a configuration update sequence comprising:
the preparation phase is to create a new data stream configuration but not activate it temporarily;
the transition stage, wherein new and old configurations run in parallel and process in-transit data;
a switching stage of completely switching to a new configuration;
Configuration update execution: the system performs the data stream reconfiguration according to the plan, monitors the system state, and automatically rolls back if a node experiences 3 anomalies within 10 minutes.
The enterprise also develops a process-data association visualization tool that provides managers with an intuitive view of the association between business processes and data stream processing. The tool supports the following functions:
flow visualization: displaying the current business flow model, including activity nodes, circulation paths and decision points;
data flow visualization: displaying the data stream processing flow, including data sources, conversion steps and target outputs;
association mapping visualization: highlighting the mapping relations between flow nodes and data processing steps;
history comparison: supporting the viewing of historical flow versions and the change history;
impact analysis: previewing the potential impact of a planned flow change on data processing;
In practical application, the business-flow-aware data stream processing technology enables the system to adapt rapidly to business changes. Taking product development flow adjustment as an example, when the enterprise switches from a waterfall development mode to an agile development mode, the system automatically recognizes the flow change and adjusts the data processing logic accordingly, including the transition from staged data aggregation to iteration-cycle data aggregation and from fixed milestone reports to continuous-integration status reports. This keeps data processing consistent with actual business requirements at all times and avoids the problem, common in conventional systems, of data processing becoming disconnected during business changes.
Application of the flow data synchronization protocol:
To ensure consistency of data processing during business process changes, the enterprise implements a process data synchronization protocol based on distributed consensus. The protocol coordinates data processing among multiple research and development centers worldwide and ensures consistency of the system state while the new and old flows run in parallel.
Implementation of the process data consistency model:
The enterprise defines and implements a consistency model ensuring that, at any point in time t, each node n in the system holds an equivalent view of the business flow state and the data stream processing state. The specific implementation comprises the following steps:
Hierarchical state synchronization mechanism:
a vector clock (VectorClock) is used to track the state updates of each node;
event sourcing (EventSourcing) is adopted to record state change events;
the upper layer implements a view consistency check to ensure that the data processing state and the business flow state stay synchronized;
Consistency check algorithm: the system performs a consistency check every 30 seconds (sketched below).
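A minimal sketch of the lower synchronization layers, assuming simple dictionary-based clocks, is given below; the VectorClock class and the views_consistent check are illustrative placeholders rather than the deployed implementation.

class VectorClock:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.clock = {node_id: 0}

    def tick(self) -> None:
        # Local state update on this node.
        self.clock[self.node_id] += 1

    def merge(self, other: dict) -> None:
        # Incorporate another node's clock, keeping the element-wise maximum.
        for node, counter in other.items():
            self.clock[node] = max(self.clock.get(node, 0), counter)

def views_consistent(views: list) -> bool:
    """Naive periodic check: all nodes report identical merged clocks."""
    return all(v == views[0] for v in views)

a, b = VectorClock("node_a"), VectorClock("node_b")
a.tick(); b.merge(a.clock); b.tick(); a.merge(b.clock)
print(views_consistent([a.clock, b.clock]))   # True once both nodes have exchanged updates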
Application of the distributed consensus algorithm:
The enterprise implements a distributed consensus algorithm based on an improved two-phase commit protocol for coordinating data stream reconfiguration when a business process changes. The algorithm deploys coordination nodes in 6 research and development centers worldwide to ensure the consistency of configuration changes. In the case of the product release process change, the algorithm executes as follows:
First phase (pre-commit):
the master coordination node sends a pre-commit message containing the change description to all participating nodes;
each node verifies the feasibility of the change and tests the stability of the new configuration;
each node returns a ready or reject response to the coordinator, with the detailed verification result attached;
Second phase (commit):
the master coordination node collects the responses of all participating nodes;
when more than 80% of the nodes return ready, it sends a commit message;
otherwise, it sends an abort message and records the failure reason;
each node executes or discards the change according to the coordinator's decision (see the sketch below);
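The decision rule of the improved two-phase commit can be summarized by the following Python sketch, which compresses the two message rounds into function calls; the 80% readiness threshold follows the text, while the function name and participant callbacks are hypothetical.

def two_phase_commit(participants, change: dict, ready_threshold: float = 0.8) -> str:
    # Phase 1 (pre-commit): each participant validates the change and reports readiness.
    ready = [participant(change) for participant in participants]
    ratio = sum(ready) / len(ready)
    # Phase 2 (commit): commit when the readiness ratio reaches the threshold, otherwise abort.
    return "commit" if ratio >= ready_threshold else "abort"

# Six hypothetical coordination nodes; five accept and one rejects (about 83% ready), so the change is committed.
participants = [lambda change: True] * 5 + [lambda change: False]
print(two_phase_commit(participants, {"description": "release process change"}))   # commit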
Pre-verification optimization: before the two-phase commit formally starts, the system executes a pre-verification of the configuration change (sketched after this list), including:
configuration syntax verification: checking the syntactic correctness of the new configuration;
conflict detection: identifying potential conflicts with existing configurations;
impact prediction: predicting the influence of the change on system performance;
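A hedged sketch of the pre-verification step, assuming a data stream configuration is a plain dictionary, might look as follows; the three check functions are simplified placeholders.

def validate_syntax(config: dict) -> bool:
    # Configuration syntax verification: the required top-level keys must be present.
    return {"source", "transform", "sink"}.issubset(config)

def detect_conflicts(config: dict, existing: list) -> list:
    # Conflict detection: flag existing configurations writing to the same output.
    return [c["sink"] for c in existing if c["sink"] == config["sink"]]

def predict_impact(config: dict) -> str:
    # Impact prediction: a simple heuristic based on the number of transformation steps.
    return "high" if len(config.get("transform", [])) > 5 else "low"

new_config = {"source": "task_system", "transform": ["clean", "aggregate"], "sink": "quality_score"}
print(validate_syntax(new_config), detect_conflicts(new_config, []), predict_impact(new_config))   # True [] low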
Business process version management system:
The enterprise builds a business process version management system that supports parallel operation of new and old business processes:
Version library implementation: a graph database stores the different versions of the flow model and supports:
a version tree structure, recording the derivation relations between process versions;
difference comparison, highlighting changes between different versions;
semantic tags added for important versions;
Version router implementation: a context-aware rule engine dynamically selects the processing version according to the following factors (a sketch of the router follows the compatibility checker items below):
data source: an appropriate version is selected according to the data source system;
project phase: projects in different phases may use different process versions;
explicit marking: the data may carry an explicit version mark;
Compatibility checker: evaluates version compatibility through static analysis and dynamic testing:
static analysis: checking the compatibility of data structures and circulation paths;
dynamic testing: verifying data exchange between different versions using historical data;
conflict resolution: providing automatic and manual conflict resolution mechanisms;
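As an illustration of the context-aware version router described above, the following Python sketch selects a process version by explicit tag, data source system, or project phase; the rule set, version labels and system names are hypothetical.

def select_version(context: dict) -> str:
    if "process_version" in context:                  # an explicit version mark always wins
        return context["process_version"]
    if context.get("source_system") == "legacy_erp":  # data from a legacy source stays on the old version
        return "v1"
    if context.get("project_phase") in ("closing", "maintenance"):
        return "v1"                                   # late-phase projects keep the old flow
    return "v2"                                       # default: the latest process version

print(select_version({"source_system": "task_system", "project_phase": "development"}))   # v2
print(select_version({"process_version": "v1"}))                                          # v1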
Flow data switching manager:
During the enterprise's transition from waterfall development to agile development, the switching manager performs the following operations:
Data migration planning: establishing the data mapping relation between the waterfall and agile modes:
"mapping rule example = {
"Milestone" → "sprint",
"Staged delivery" → "iteration increment",
"Stage review" → "iterative review"
}”
State synchronization control: during a 6-week transition period, the system simultaneously maintains project states for both the waterfall and agile modes (a decision sketch follows this list):
new projects directly adopt the agile mode;
whether to switch an in-progress project is determined by its degree of completion:
completion below 30%: switch to the agile mode;
completion above 70%: remain in the waterfall mode;
completion between 30% and 70%: the team chooses autonomously;
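The completion-based cut-over rule for in-progress projects can be expressed as the short sketch below; the thresholds follow the text, and the function name is hypothetical.

def choose_mode(completion: float, team_choice: str = "agile") -> str:
    if completion < 0.30:
        return "agile"        # early projects switch to the agile mode
    if completion > 0.70:
        return "waterfall"    # nearly finished projects remain in the waterfall mode
    return team_choice        # between 30% and 70%: the team chooses autonomously

print(choose_mode(0.2), choose_mode(0.8), choose_mode(0.5, team_choice="waterfall"))   # agile waterfall waterfall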
Rollback mechanism: to ensure business continuity, the system implements a three-level rollback mechanism (sketched below):
configuration-level rollback: only the data stream configuration is rolled back, and the data state is preserved;
state-level rollback: the system rolls back to its state at a specified point in time;
full rollback: the system recovers to the complete pre-change state;
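A minimal dispatcher for the three rollback levels is sketched below; the restore actions and the example timestamp are placeholders rather than real system calls.

from typing import Optional

def rollback(level: str, snapshot_time: Optional[str] = None) -> str:
    if level == "config":
        return "restored data stream configuration only; data state preserved"
    if level == "state":
        return "restored system state as of " + str(snapshot_time)
    if level == "full":
        return "restored complete pre-change system state"
    raise ValueError("unknown rollback level: " + level)

print(rollback("config"))
print(rollback("state", snapshot_time="2024-06-01T00:00:00Z"))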
Switching monitoring: the switching progress and system state are monitored in real time:
a data consistency index monitors the degree of consistency between new and old flow data;
system response time and resource utilization are monitored during the switch;
the user operation error rate and satisfaction are tracked;
During the agile transformation, multiple product lines of the enterprise need to switch development modes at different points in time. The process data synchronization protocol ensures the continuity and consistency of project data throughout the transformation. For example, during the sprint planning phase, the system seamlessly maps demand planning data in the old flow to sprint planning data in the new flow while maintaining the integrity and relevance of historical data.
The key value of the protocol is that the system maintains continuous data processing during business flow changes without stopping the system or freezing data, significantly reducing the impact of business changes on operations. In practical application, the protocol ensured that the project management system provided accurate and consistent data support throughout the enterprise's 6-month agile transformation, with no data inconsistencies caused by flow changes.
Verification of technical effects:
To verify the technical effects of the method, the enterprise performed system performance and effect tests after implementation, mainly evaluating two key technical effects: real-time improvement and business adaptability enhancement.
Data processing real-time improvement effect:
Based on the actual operation data of 5 main product line projects, the enterprise compared the data processing delays of the system before and after the transformation. The following is a comparison of processing delays for different types of data sources:
Fast stream data processing delay (system monitoring data):
before the transformation, the average delay was 3 minutes and the peak delay could reach 15 minutes;
after the transformation, the average delay is 0.2 seconds and the peak delay does not exceed 1.5 seconds;
the processing delay is reduced by 99.8%, meeting the real-time monitoring requirement;
Medium-speed stream data processing delay (task and code commit data):
before the transformation, the average delay was 2.5 hours and the peak delay could reach 8 hours;
after the transformation, the average delay is 45 seconds and the peak delay does not exceed 3 minutes;
the processing delay is reduced by 97%, reaching a near-real-time processing level;
Slow stream data processing delay (financial and personnel assessment data):
before the transformation, the average delay was 1.5 days and some data were delayed by more than 3 days;
after the transformation, the average delay is 4 hours and the maximum delay does not exceed 12 hours;
the processing delay is reduced by 89%, significantly improving data timeliness;
In a real-time test of a specific scenario, the enterprise selected the key business scenario in which a code commit triggers the build process, and compared the end-to-end response times before and after the transformation:
before the transformation, the average time from code commit to availability of the data analysis result was 28 minutes;
after the transformation, the average time from code commit to availability of the data analysis result is only 65 seconds;
improvement: the end-to-end response time is reduced by 96%;
The improvement in real-time data processing capability enables the enterprise to realize the following business value:
Earlier problem detection: the system can issue an early warning within 5 minutes of a problem occurring, compared with the 3 to 4 hours required before the improvement; this shortens the average problem repair time by 63%.
Improved resource utilization: by knowing resource usage in real time, the enterprise optimizes resource allocation, raising resource utilization from an average of 62% before the transformation to 78%.
Faster decision-making: the management layer's access to project status reports changes from once daily to on-demand viewing, shortening the decision cycle from an average of 2 days to 4 hours.
Business adaptability enhancement effect:
To verify the improvement in adaptability to business process changes, the enterprise recorded multiple business process changes occurring within 6 months and measured the time and resources required for the system to accommodate them. The following is an adaptability comparison for three typical business process changes:
Approval process change (serial approval to parallel approval):
before the transformation, 3 developers needed to work for 5 days to adjust the system, and data processing was suspended during the adjustment;
after the transformation, the system automatically recognizes the process change and reconfigures itself, requiring only 1 administrator to confirm and no suspension of data processing;
labor cost is reduced by 93% and business continuity is improved by 100%;
Project management methodology change (from waterfall to agile):
before the transformation, a dedicated team needed to work for 3 weeks to rebuild the system while manually migrating historical data;
after the transformation, the system automatically adapts to the process change, completing data mapping and a smooth transition in 5 days;
the adaptation time is reduced by 76% and data consistency errors are reduced by 95%;
Organization structure adjustment (department merging and splitting):
before the transformation, data processing logic and permission settings had to be adjusted manually, taking 12 days on average;
after the transformation, the system automatically adjusts data stream processing and permission mapping according to the organizational change, taking 2 days;
the adaptation time is reduced by 83% and configuration errors are reduced by 89%;
For a high-frequency period of business change (the annual organizational adjustment period), the enterprise measured the consistency between data processing and actual business processes:
before the transformation, after a business change an average of 42% of the data processing logic was inconsistent with the actual business, and the correction period averaged 14 days;
after the transformation, the inconsistent proportion drops to 6% after a business change, and the system automatically corrects 95% of the inconsistencies within 24 hours;
The improvement in business adaptability brings remarkable business value to the enterprise:
Reduced change cost: the implementation cost of a process change is reduced by 85% on average, from 42 days to 6.3 days per change.
Improved business agility: the period from a business department's process proposal to full implementation is shortened from an average of 25 days to 4 days, enabling the enterprise to respond more quickly to market changes.
Improved user satisfaction: user satisfaction with the project management system rises from 68% before the transformation to 92%, and user reports that the system does not meet work requirements are reduced by 78%.
In combination with the above verification results, the technical scheme provided by the application has been fully verified in an actual enterprise environment, achieving remarkable real-time improvement and business adaptability enhancement, solving the key technical problems faced by multi-source heterogeneous data integration, and providing strong technical support for enterprise project management.
Fig. 2 and fig. 3 respectively show how the system resource utilization changes as the number of data streams increases, and the relationship between the adaptation time for a business process change and the change complexity.
While the embodiments of the present invention have been described above, the described embodiments are merely illustrative and not limiting; those of ordinary skill in the art may derive many equivalent variations in light of the present disclosure, and such variations fall within the scope of the embodiments.