Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
The embodiment provides a data center for enterprise information integrated management, which comprises the following modules:
The system comprises a data integration module, a data management module, a data service module, a data security module and a system monitoring module, wherein the data integration module adopts a plug-in acquisition architecture, realizes unified access of multi-source heterogeneous data through a plurality of different data source adapters, and ensures real-time synchronization of the data;
The data management module is used for full life cycle management of data and comprises metadata management, master data management, data modeling and data lineage analysis;
The data service module is used for providing a unified API management platform, realizing unified management, security control and load balancing of services through an API gateway, providing a visual development tool, and providing built-in data processing components for a user to quickly construct data services;
the data security module is used for guaranteeing data security and access control;
and the system monitoring module is used for monitoring the running state of the system.
Specifically, the data integration module includes:
the data acquisition unit is used for providing batch, real-time and incremental acquisition modes;
the data conversion unit is used for data cleaning, conversion and standardization processing;
the data quality control unit is used for checking and processing data quality, including data integrity check, consistency check and accuracy verification, and providing flexible quality control strategy configuration through the quality rule engine;
The data synchronization unit is used for providing a plurality of synchronization modes of full synchronization, incremental synchronization and real-time synchronization and automatically adjusting a synchronization strategy according to the data volume and the system load;
And the data storage unit is used for data storage and supports various storage modes and data types.
Specifically, the data management module includes:
the metadata management unit is used for constructing a unified metadata warehouse, collecting and managing technical metadata, business metadata and operation and maintenance metadata, realizing unified management and retrieval of data assets, and providing data tracing and influence analysis;
The master data management unit is used for providing a master data modeling tool for a user to define a master data model and standards, and ensuring the consistency of enterprise core data through a master data synchronization mechanism;
the data modeling unit is used for providing a visual modeling tool, supporting the design and conversion of a conceptual model, a logic model and a physical model, and ensuring the rationality of the design through model verification;
the data lineage unit is used for automatically recording and analyzing the dependency relationships in the data circulation process, displaying the data lineage relationships in a visual mode, helping a user understand the data circulation path, and facilitating problem localization and impact analysis.
Specifically, the data service module includes:
an API service unit for providing standardized data access interface;
The data development unit is used for providing custom data processing logic;
the data analysis unit is used for providing data analysis and mining capabilities;
The data sharing unit is used for realizing safe sharing of data;
and the service monitoring unit is used for monitoring service calling conditions.
Specifically, the data security module includes:
An access control unit for managing data access rights;
The data desensitization unit is used for realizing sensitive data protection;
The audit log unit is used for recording a data operation log;
The security policy unit is used for formulating a data security policy;
and the risk monitoring unit is used for monitoring the data security risk.
Specifically, the system monitoring module includes:
The performance monitoring unit is used for resource monitoring, service monitoring, interface monitoring, task monitoring and alarm setting;
the log management unit is used for log acquisition, log analysis, log retrieval, log archiving and log audit;
and the operation and maintenance management unit is used for configuration management, deployment management, backup recovery, capacity planning and problem diagnosis.
The present invention will be described in further detail below.
1. System architecture
The data center in this embodiment adopts a hierarchical design architecture, and a complete data management system is constructed. On the whole framework, the system is divided into four layers of a data access layer, a data processing layer, a data service layer and an application layer, and each layer bears different functional responsibilities. Wherein:
The data access layer is responsible for unified access of multi-source heterogeneous data, supports access adaptation of various data sources, and comprises structured data, semi-structured data and unstructured data;
the data processing layer is responsible for data cleaning, conversion and quality control;
the data service layer provides a unified data service interface and supports various service modes and protocols;
the application layer provides personalized data application support for specific business scenes.
The system realizes data flow among layers through a unified data bus, and ensures the efficiency and reliability of data transmission by adopting techniques such as message queues and distributed caches. The architecture adopts a micro-service design: the system functions are divided into a plurality of independent service components, each of which can be independently deployed and scaled, with unified management and invocation of services realized through a service registration center and an API gateway. The security framework of the system penetrates all layers to realize unified user authentication, permission control and data security protection. Meanwhile, the system provides comprehensive operation and maintenance monitoring functions, including performance monitoring, log management and alarm notification, to ensure stable operation of the system.
As shown in fig. 1, in a specific implementation, the data center adopts a layered architecture design, comprising, from bottom to top, a data access layer, a data processing layer, a data service layer and an application layer. The system adopts a containerized deployment mode and performs resource scheduling and management based on Kubernetes.
In terms of hardware configuration, the proposed configuration is as follows:
Access layer server: 16-core CPU, 64 GB memory, 2 TB storage;
Processing layer server: 32-core CPU, 128 GB memory, 5 TB storage;
Service layer server: 16-core CPU, 64 GB memory, 1 TB storage;
Distributed storage cluster: total capacity not less than 50 TB.
The system adopts a micro-service architecture and mainly comprises the following core services:
Data integration service responsible for data collection and synchronization
Data processing service responsible for data cleaning and conversion
Metadata service responsible for metadata management
API gateway service responsible for unified management of interfaces
Task scheduling service, responsible for job scheduling
Monitoring alarm service, responsible for system monitoring.
2. Data integration module
Through the data integration module, the invention realizes strong data integration capability and lays a foundation for unified management of enterprise data. In the aspect of data source access, the system adopts a plug-in acquisition architecture, provides rich data source adapters, and supports access of various data sources such as relational databases (such as MySQL, Oracle, SQL Server and the like), big data components (such as Hadoop, Hive, HBase and the like), message middleware (such as Kafka, RabbitMQ and the like), and application system interfaces (such as REST API, WebService and the like).
In the aspect of data synchronization, the system supports various synchronization modes such as full-volume synchronization, incremental synchronization and real-time synchronization, provides an intelligent scheduling strategy, and can automatically adjust the synchronization strategy according to factors such as data volume and system load. For real-time data synchronization in a distributed system, the trade-off between consistency and availability can be illustrated using the CAP theorem. Treating consistency C, availability A and partition tolerance P as binary indicators (1 if the guarantee holds, 0 otherwise), the theorem states that at most two of the three guarantees can hold simultaneously:
C + A + P ≤ 2
The above formula explains why consistency and availability must be balanced according to the actual scene in the data synchronization process whenever network partitions are possible.
In the data transmission process, a multithreading concurrent processing mechanism is adopted, and the transmission efficiency is optimized through technologies such as data compression, breakpoint continuous transmission and the like. In terms of data quality control, the system provides a complete quality monitoring mechanism, including data integrity checking, consistency checking, accuracy verification and the like, and flexible quality control strategy configuration is supported through a quality rule engine. For the discovered quality problems, the system provides two processing modes of automatic repair and manual intervention, and helps users to discover and solve the data quality problems in time through a quality analysis report.
Data integrity check: a probabilistic statistical model can be used to measure the integrity and consistency of data. For example, define the probability of data integrity as P(Q) and that of consistency as P(C). Bayes' theorem is introduced to detect data anomalies, where P(A|B) denotes the probability of event A occurring under condition B:
P(A|B) = P(B|A) · P(A) / P(B)
The formula can be used to estimate the degree of abnormality of new data according to the characteristics of historical data.
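As an illustrative sketch only (not part of the claimed system), Bayes' theorem as described above can be used to score how likely a record is anomalous given that it fails an integrity rule. The function name and all probability figures below are hypothetical assumptions:

```python
def bayes_anomaly_probability(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Example: A = "record is anomalous", B = "record fails an integrity rule".
# Hypothetical historical statistics: 2% of records are anomalous, 90% of
# anomalous records fail the rule, and 5% of all records fail the rule.
p_anomaly_given_failure = bayes_anomaly_probability(
    p_b_given_a=0.90,  # P(rule failure | anomaly)
    p_a=0.02,          # prior P(anomaly)
    p_b=0.05,          # P(rule failure) over all records
)
print(round(p_anomaly_given_failure, 2))  # 0.9 * 0.02 / 0.05 = 0.36
```

A record failing the rule is thus far more likely anomalous (36%) than the 2% prior suggests, which is the basis for flagging it for review.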
As shown in fig. 2, in a specific implementation manner, the data integration module adopts a distributed architecture, so that unified access and processing of multi-source heterogeneous data can be realized. The source system sequentially enters the target system through data source management 201, data acquisition 202, data conversion 203, data quality control 204 and data loading 205, wherein the data source management 201 comprises metadata management and connection configuration, the data acquisition 202 comprises real-time acquisition and batch acquisition, the data conversion 203 comprises format conversion and data mapping, the data quality control 204 comprises data verification and quality monitoring, and the data loading 205 comprises incremental loading and full loading.
The method specifically comprises the following implementation mechanisms:
1) Data source access mechanism:
Supporting relational databases such as MySQL, Oracle, SQL Server, etc.
Supporting big data components such as Hive, HBase, Elasticsearch, etc.
Supporting file systems such as local files, HDFS, object storage, etc.
Supporting message queues such as Kafka, RabbitMQ, etc.
Supporting application system access through REST API, SDK and other modes.
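The plug-in access mechanism above can be sketched as a registry of adapter classes sharing a common read interface. All names below (`DataSourceAdapter`, `register_adapter`, `open_source`, the `csv` adapter) are illustrative assumptions, not the system's actual API:

```python
from abc import ABC, abstractmethod

class DataSourceAdapter(ABC):
    """Common interface that every data source plug-in implements."""
    @abstractmethod
    def read(self):
        """Yield records from the underlying source."""

ADAPTER_REGISTRY = {}

def register_adapter(source_type):
    """Class decorator: register an adapter class under a source-type key."""
    def wrapper(cls):
        ADAPTER_REGISTRY[source_type] = cls
        return cls
    return wrapper

@register_adapter("csv")
class CsvAdapter(DataSourceAdapter):
    """Toy adapter that replays pre-loaded rows."""
    def __init__(self, rows):
        self.rows = rows
    def read(self):
        yield from self.rows

def open_source(source_type, *args, **kwargs):
    """Look up and instantiate the adapter for a given source type."""
    return ADAPTER_REGISTRY[source_type](*args, **kwargs)

records = list(open_source("csv", rows=[{"id": 1}, {"id": 2}]).read())
print(records)  # [{'id': 1}, {'id': 2}]
```

New source types (e.g. a Kafka or HDFS adapter) would plug in by registering another class, without changing the integration core.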
2) Data synchronization mechanism:
Batch synchronization: supporting both full and incremental modes
Real-time synchronization: implementing millisecond-level delay based on log parsing
Timing synchronization: supporting cron expression configuration
Trigger synchronization: supporting event triggers and API triggers.
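The incremental mode above can be sketched with a high-water-mark strategy: only rows updated after the last recorded watermark are copied, and the watermark then advances. The row layout and function name are hypothetical:

```python
def incremental_sync(source_rows, target, watermark):
    """Copy only rows newer than the watermark into the target,
    returning the advanced watermark (high-water-mark incremental sync)."""
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark

source = [
    {"id": 1, "updated_at": 10, "value": "a"},  # older than watermark: skipped
    {"id": 2, "updated_at": 25, "value": "b"},
    {"id": 3, "updated_at": 30, "value": "c"},
]
target = {}
wm = incremental_sync(source, target, watermark=20)
print(sorted(target), wm)  # [2, 3] 30
```

Full synchronization is the degenerate case `watermark = -inf`; real-time synchronization would replace the polling loop with change-log (CDC) events.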
3) Data quality control:
Data verification: real-time checking of data
Rules engine: supporting custom quality rules
Problem handling: automatic repair or manual intervention
Quality report: periodically generating a data quality report.
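The rules engine described above might be sketched as a list of named rule callbacks applied to each record, returning the violated rules. The specific rules and field names below are illustrative assumptions:

```python
def check_completeness(record, required):
    """Integrity check: all required fields are present and non-empty."""
    return all(record.get(f) not in (None, "") for f in required)

def check_consistency(record):
    """Consistency check (example rule): end value not before start value."""
    return record["start"] <= record["end"]

# Each rule is a (name, predicate) pair; custom rules append to this list.
RULES = [
    ("completeness", lambda r: check_completeness(r, ["id", "start", "end"])),
    ("consistency", check_consistency),
]

def run_quality_rules(record, rules=RULES):
    """Return the names of all rules the record violates."""
    return [name for name, rule in rules if not rule(record)]

good = {"id": 1, "start": 1, "end": 5}
bad = {"id": 2, "start": 9, "end": 5}
print(run_quality_rules(good), run_quality_rules(bad))  # [] ['consistency']
```

Records with an empty violation list pass; others are routed to automatic repair or manual intervention, and the violation counts feed the periodic quality report.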
3. Data management module
The embodiment establishes a comprehensive data management system and realizes full life cycle management of data. In terms of metadata management, the system builds a unified metadata warehouse, and collects and manages technical metadata (including database table structures, field attributes, storage locations and the like), business metadata (including business descriptions, index calibers, business rules and the like) and operation and maintenance metadata (including task configuration, monitoring indexes, operation logs and the like). Through metadata management, unified management and retrieval of data assets are realized, and data tracing and impact analysis are supported. In the aspect of master data management, the system provides a master data modeling tool, supports user definition of master data models and standards, and ensures consistency of enterprise core data through a master data synchronization mechanism. In the aspect of data modeling, the system provides a visual modeling tool, supports the design and conversion of conceptual, logical and physical models, and ensures the rationality of the design through model verification. In the aspect of data lineage analysis, the system automatically records and analyzes the dependency relationships in the data circulation process and displays the data lineage relationships in a visual mode, helping users understand data circulation paths and facilitating problem localization and impact analysis.
As shown in FIG. 3, in one particular implementation, the data management module enables unified management of enterprise data assets, including metadata management 301, master data management 302, data modeling 303 and data lineage 304.
The specific functions are as follows:
1) Metadata management:
Technical metadata: table structures, field properties, etc.
Business metadata: business descriptions, index calibers, etc.
Operation and maintenance metadata: task configuration, monitoring indexes, etc.
Data standards: naming specifications, code specifications, etc.
2) Master data management:
Data model: unified master data model
Data standard: master data standard specification
Data synchronization: master data real-time synchronization
Data quality: master data quality control.
3) Data modeling:
Concept model: business concepts and relationships
Logical model: entity attributes and relationships
Physical model: storage structure design
Version management: model version control.
4) Data lineage:
Data source
Data flow
Data impact.
4. Data service module
The embodiment provides rich data service capability and supports diversified data application scenes. In the aspect of API services, the system provides a unified API management platform, supports multiple service protocols such as REST, GraphQL and WebSocket, and realizes unified management, security control and load balancing of services through an API gateway. In the aspect of service development, the system provides a visual development tool, supports drag-and-drop service arrangement, provides a script development environment, and supports multiple development languages such as SQL and Python. The system is internally provided with common data processing components, including data conversion, filtering and aggregation, so that users can conveniently and quickly construct data services.
In the aspect of data analysis, a K-means clustering algorithm is adopted. K-means is an unsupervised machine learning algorithm used for dividing a data set into a plurality of different clusters: the data points within each cluster have similar characteristics, while the data points of different clusters differ significantly. In the data analysis module, data grouping is performed using the K-means algorithm, dividing the data set into K clusters with the objective of minimizing the following objective function:
J = Σ_{i=1}^{K} Σ_{x∈C_i} ||x − μ_i||²
Wherein:
J represents the objective function value, i.e., the clustering cost (smaller is better);
K represents the number of clusters;
C_i represents the i-th cluster, containing a set of data points;
μ_i is the centroid of cluster C_i;
||x − μ_i|| denotes the Euclidean distance of the data point x from the centroid. Euclidean distance is the common way to measure the distance between a data point and a centroid in K-means clustering.
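The objective above is conventionally minimized by the alternating K-means procedure: assign each point to its nearest centroid, then recompute each centroid as its cluster mean. The following is a minimal pure-Python sketch of that standard procedure, not the module's actual implementation:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance ||a - b|| between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50, seed=0):
    """Standard K-means: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centroids.append(tuple(
                    sum(p[d] for p in cluster) / len(cluster) for d in range(dim)))
            else:
                new_centroids.append(centroids[i])  # keep empty cluster's centroid
        if new_centroids == centroids:  # converged: J can no longer decrease
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of two points each.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

Each assignment step and each update step can only decrease (or keep) J, which is why the alternation converges to a local minimum of the objective.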
Data visualization. The system provides interactive analysis tools, supporting ad hoc queries and visual analysis, and provides efficient multidimensional analysis capability through an OLAP engine. Data visualization is an important means of supporting decisions; however, direct visualization becomes very difficult when the data dimension is high.
In data visualization construction, Principal Component Analysis (PCA) is adopted. PCA is mainly used in the data analysis module for dimension reduction, feature extraction, redundancy removal, visualization and key index identification, so that the distribution and features of complex data are intuitively displayed on a two-dimensional or three-dimensional plane. This helps analysts understand the data better, thereby assisting decision making.
The dimensionality of the data can be reduced by principal component analysis, with the goal of maximizing the variance of the data after projection, given by:
Z=XW;
where Z is the projected data, X is the original data matrix, and W is the eigenvector matrix.
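For 2-D data, the projection Z = XW can be computed in closed form, since the leading eigenvector of a 2×2 covariance matrix has an explicit formula. This is an illustrative sketch only (a real implementation would use a linear algebra library for arbitrary dimensions):

```python
import math

def pca_2d(points):
    """PCA for 2-D points: return the first principal axis (unit vector w)
    and the centered points projected onto it (z = X_centered . w)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance matrix [[sxx, sxy], [sxy, syy]] of the centered data.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue via the closed form for a symmetric 2x2 matrix.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(max(tr * tr / 4 - det, 0.0))
    # Corresponding eigenvector (maximizes the projected variance).
    if abs(sxy) > 1e-12:
        w = (lam - syy, sxy)
    else:
        w = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*w)
    w = (w[0] / norm, w[1] / norm)
    z = [(p[0] - mx) * w[0] + (p[1] - my) * w[1] for p in points]
    return w, z

# Points lying on the line y = x: the principal axis should be (1,1)/sqrt(2).
w, z = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(round(abs(w[0] - w[1]), 6))  # 0.0 (both axis components equal)
```

The eigenvector of the largest eigenvalue is exactly the direction W that maximizes the variance of Z, matching the objective stated above.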
In the aspect of data sharing, the system realizes a data resource catalog, provides unified display and management of data resources, supports data authorization and subscription, and promotes effective circulation and multiplexing of data. Meanwhile, the system provides perfect service monitoring and management functions, including service call monitoring, performance analysis, capacity management and the like, and ensures the stability and reliability of data service.
As shown in FIG. 4, in a specific implementation, the data service module provides unified data service capability, including an API service 401, data development 402, data analysis 403 and service monitoring 404, wherein the API service 401 comprises interface management, service registration and permission control, the data development 402 comprises a development environment, task scheduling and code management, the data analysis 403 comprises ad hoc query, visualization analysis and report generation, and the service monitoring 404 comprises performance monitoring, alarm management and log analysis.
The specific implementation is as follows:
1) API service:
REST API: standard RESTful interfaces
GraphQL: flexible query language
WebSocket: real-time data push
RPC: high-performance remote invocation
Automatic documentation: Swagger document generation
2) Data development:
Visual development: drag-and-drop development interface
Script development: SQL, Python, etc.
Debugging tools: breakpoint debugging and log viewing
Version management: code version control
Release management: automated release process
3) Data analysis:
Ad hoc query: interactive query analysis
Report development: visual report design
Data export: multiple export formats
Permission control: fine-grained access control.
5. Data security module
As shown in FIG. 5, in a specific implementation, the data security module implements an omnibearing data security protection mechanism, including an access control 501, a data desensitization 502, a security audit 503 and a risk monitoring 504, wherein the access control 501 includes identity authentication, authority management and access policy, the data desensitization 502 includes sensitive data identification, desensitization rules and desensitization processing, the security audit 503 includes operation log, access record and compliance check, and the risk monitoring 504 includes anomaly detection, risk assessment and early warning response.
The specific implementation method comprises the following steps:
1) Access control implementation:
Unified authentication: supporting multiple authentication methods (LDAP, OAuth2.0, etc.)
Rights management: permission control based on the RBAC model
Data permissions: supporting row- and column-level fine-grained control
Access audit: recording all access operations
Dynamic authorization: supporting temporary permission grants.
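The RBAC model referenced above attaches permissions to roles and roles to users, so an access check walks user → roles → permissions. The role names and permission strings below are hypothetical examples, not the system's actual vocabulary:

```python
# Role-based access control: permissions attach to roles, users hold roles.
ROLE_PERMISSIONS = {
    "analyst": {"dataset:read"},
    "admin": {"dataset:read", "dataset:write", "user:manage"},
}
USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"admin"},
}

def has_permission(user, permission):
    """A user is authorized if any of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(has_permission("alice", "dataset:read"))   # True
print(has_permission("alice", "dataset:write"))  # False
```

Row- and column-level control extends the same idea by scoping each permission to a table/column/row filter; dynamic authorization adds time-bounded entries to the user-role mapping.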
2) Data desensitization implementation:
Static desensitization: desensitization during storage
Dynamic desensitization: desensitization during query
Desensitization rules: supporting multiple desensitization algorithms
Desensitization strategy: dynamic desensitization according to role
Key management: secure key storage and management.
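Two common desensitization algorithms such a rule set might include are partial masking and salted pseudonymization. The masking pattern below (keep the first 3 and last 4 digits) and the salt value are illustrative assumptions, not the system's actual rules:

```python
import hashlib

def mask_phone(phone):
    """Partial masking: keep the first 3 and last 4 digits, hide the middle."""
    if len(phone) < 8:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]

def pseudonymize(value, salt="demo-salt"):
    """Irreversible pseudonymization via salted hashing: the same input always
    maps to the same token (joins still work), but cannot be reversed."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

print(mask_phone("13812345678"))  # 138****5678
```

Static desensitization would apply such functions before storage; dynamic desensitization applies them at query time, choosing the function per the caller's role.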
3) Security monitoring implementation:
Behavior monitoring: monitoring abnormal access behavior in real time
Risk analysis: rule-based risk assessment
Alarm notification: multichannel alarm push
Security report: periodically generating a security report
Emergency response: rapid handling of security events.
6. System monitoring module
As shown in FIG. 6, in a specific implementation, the system monitoring module implements comprehensive operation and maintenance monitoring capability, including performance monitoring 601, log management 602, operation and maintenance management 603 and alarm center 604, wherein the performance monitoring 601 includes resource monitoring, performance indexes and load balancing, the monitoring indexes include CPU usage, memory occupation, network traffic and response time, the log management 602 includes log collection, log analysis and log storage, the log types include system logs, application logs, security logs and operation logs, the operation and maintenance management 603 includes task scheduling, configuration management and system maintenance, maintenance tasks include backup recovery, version update and fault processing, the alarm center 604 includes alarm rules, alarm triggering and alarm processing, and the alarm level includes general alarms, important alarms and emergency alarms.
The method specifically comprises the following steps:
1) Performance monitoring:
Resource monitoring: CPU, memory, disk, etc.
Service monitoring: service status and performance
Interface monitoring: API call conditions
Task monitoring: data processing tasks
Alarm setting: threshold alarm configuration.
2) Log management:
Log collection: unified collection of multi-source logs
Log analysis: real-time log analysis
Log search: full-text search capability
Log archiving: historical log archiving
Log audit: operation log audit.
3) Operation and maintenance management:
Configuration management: unified configuration center
Deployment management: automated deployment
Backup recovery: periodic data backup
Capacity planning: resource capacity prediction
Problem diagnosis: rapid fault localization.
The invention has the following remarkable advantages and beneficial effects:
1) Significantly improved data management efficiency:
Unified management of enterprise data is realized, eliminating data islands; standardized data processing flows improve data processing efficiency; automatic data synchronization reduces manual intervention; unified metadata management improves data asset management efficiency; and visualized management tools reduce management difficulty.
2) Comprehensively guaranteed data quality:
The full-flow quality control mechanism ensures the accuracy of data; real-time quality monitoring discovers problems in time; automatic quality repair improves processing efficiency; a complete quality assessment system quantifies the quality level; and traceable data lineage facilitates problem localization.
3) Flexible data services:
A unified service interface simplifies development and integration; rich service types meet diversified demands; visual development tools improve development efficiency; fine-grained permission control guarantees data security; and a comprehensive monitoring mechanism guarantees service quality.
4) Enhanced data security protection:
A unified security framework provides all-around protection; fine-grained permission control prevents data leakage; a complete audit mechanism tracks data use; a flexible desensitization strategy protects sensitive data; and multiple backup mechanisms ensure data security.
5) Support for business innovation and development:
Data support assists decision analysis; data sharing promotes business collaboration; accelerated data circulation improves response speed; data value mining promotes business innovation; and precipitated data assets improve enterprise competitiveness.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and it should be noted that it is possible for those skilled in the art to make several improvements and modifications without departing from the technical principle of the present invention, and these improvements and modifications should also be regarded as the protection scope of the present invention.