CN112231304A

Info

Publication number: CN112231304A
Application number: CN202011479233.6A
Authority: CN
Inventors: 郁强; 李开民; 李圣权
Original assignee: CCI China Co Ltd
Current assignee: CCI China Co Ltd
Priority date: 2020-12-16
Filing date: 2020-12-16
Publication date: 2021-01-15

Abstract

The invention relates to the technical field of data warehouses, in particular to a data processing system introducing a data warehouse construction technology. The system comprises a data source unit, a data processing unit, a data query statistical analysis unit, a metadata management unit and a management center unit. The data source unit is used for establishing database files, flat files, html files and xml files; the data processing unit is used for processing data in the data source unit; the data query statistical analysis unit is used for uniformly recording and analyzing information data; the metadata management unit is used for storing the data in the data query statistical analysis unit; and the management center unit is used for managing and maintaining the whole system.

Description

Data processing system and method introducing data warehouse construction technology

Technical Field

The invention relates to the technical field of data warehouses, in particular to a data processing system and a data processing method introducing a data warehouse construction technology.

Background

With the rapid development and wide application of computer network and database technology, information management in various industries has entered a new era. Early databases were primarily independent databases, applied to various aspects of the data processing field.

Independent database systems generally have the following characteristics: their functions are single, each serves only a specific service and cannot adapt to a large number of increasingly complex applications, and some manual processing is still required; existing service information systems are developed on different hardware platforms with different operating systems and database management systems, so a uniform data interface cannot be provided; and the business information systems are physically dispersed, weakly associated with one another, relatively closed in information and low in sharing degree. In view of this, a data processing system and a method introducing a data warehouse construction technology are provided.

Disclosure of Invention

The present invention is directed to a data processing system and method for introducing a data warehouse building technology, so as to solve the problems in the background art.

In order to achieve the above object, in one aspect, the present invention provides a data processing system incorporating a data warehouse building technology, including a data source unit, a data processing unit, a data query statistical analysis unit, a metadata management unit, and a management center unit;

the data source unit is used for establishing a database file, a flat file, an html file and an xml file;

the data processing unit is used for processing the data in the data source unit;

the data query statistical analysis unit is used for uniformly recording and analyzing the information data;

the metadata management unit is used for storing the data in the data query statistical analysis unit;

the metadata management unit comprises a data metadata module and a process metadata module, wherein the data metadata module is used for retrieving, accessing and understanding source information; the process metadata module is used for searching, evaluating, accessing and managing data;

the management center unit is used for managing and maintaining the whole system and comprises an authority control module, a performance management module and a fault recovery module; the authority control module is used for setting authority verification for user login; the performance management module is used for evaluating the performance of equipment and network units, so that network congestion or interruption can be found in time, faults can be eliminated comprehensively, capacity can be planned based on facts, and network resources can be allocated effectively; the fault recovery module is used for automatically creating a recovery point so that the system can return to a working state, and fault recovery can be carried out quickly without reinstallation and without damaging data files.

Preferably, the data source unit comprises an online transaction processing module, a legacy data module, an internal office data module and an external data module;

the online transaction processing module is used for reflecting daily operation conditions of an enterprise, is a main data source of an enterprise data warehouse, reflects the daily operation conditions of the enterprise, and is generally fine in granularity;

the legacy data module is used for data mining and trend analysis, is offline or archival data, has significant historical value for data mining, trend analysis and the like, and is guided to a data warehouse by a proper application program;

the internal office data module is used for analyzing data of enterprise offices, and the data comprises unstructured (such as documents in non-electronic form), structured (such as electronic reports, word processing documents and the like) and semi-structured (such as annual reports and the like), and the data is useful for supporting analysis of cross-departments;

the external data module is used for recording data of demographic information, competitor information, questionnaire survey and xml documents.

Preferably, the data processing unit comprises a data extraction module, a data verification module, a data cleaning module, a data integration module, a data aggregation module and a data loading module;

the data extraction module is used for capturing data, and the data capturing mode comprises a complete capturing mode and an incremental capturing mode, wherein the complete capturing mode is used for extracting complete information of a data source; the incremental capturing mode focuses more on modified data in the data, and the incremental capturing mode is adopted after a complete capturing mode is performed once, so that in real-time extraction, the incremental capturing mode can reduce the extracted data volume and network flow;

the data verification module is used for detecting data in the data source unit, including data problems of lost data recovery, fuzzy data conversion and business operation, and solving data quality problems, wherein the detected content comprises effective values of attributes (domain check), effective values among rows in the table and other tables (foreign key check), detection of similar duplicate records and detection of missing values, the effective values of the attributes (domain check) and the effective values among rows in the table and other tables (foreign key check) are detected by using reference integrity check of the database, the similar duplicate records refer to the same real entity and are represented by a plurality of incompletely identical records in different data sets, and the DBMS cannot correctly identify due to some difference in format and spelling, and the records are called as 'similar duplicate records';

the data cleaning module is used for cleaning the dirty data detected in the data verification module, ensuring the correctness, consistency, integrity and reliability of the data, ensuring the data quality of an information source and simultaneously ensuring the correctness and accuracy of the original data of an auxiliary decision;

the data integration module is used for integrating a plurality of data into unified data for analysis;

the data aggregation module is used for collecting and summarizing information data; the data aggregation methods of the data aggregation module comprise numerical value aggregation and dimension reduction; numerical value aggregation replaces the original data set with a smaller volume of data, which in effect compresses the two-dimensional table longitudinally (along its rows) and removes unnecessary details from the source data; dimension reduction reduces the data volume by deleting irrelevant attributes, which compresses the two-dimensional table transversely (along its attributes); together they help reduce the number of instances of an entity to a level that is easy to manipulate and facilitate the calculation of widely used summary figures;

the data loading module is used for storing the converted data, so that the bad data can generate an error report.

Preferably, the numerical data metric in the data integration module is calculated according to the following formula:

where n is the number of tuples, Ā and B̄ are the average values of A and B, and σ_A and σ_B are the standard deviations of A and B, respectively.
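The formula itself is not reproduced in this text. The symbols defined here (n, the averages of A and B, σ_A, σ_B) and the coefficient r_{A,B} discussed in Example 1 are consistent with the Pearson correlation coefficient. One plausible reading, assuming that a_i and b_i denote the values of A and B in tuple i (names not in the original) and that σ_A and σ_B are population standard deviations, is:

```latex
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \, \sigma_A \, \sigma_B}
```

If sample standard deviations are used instead, the factor n becomes n - 1; the value of r_{A,B} is the same either way.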

Preferably, the non-numerical data metric in the data integration module is calculated according to the following formula:

wherein P(A), P(B) and P(A ∪ B) are respectively the probabilities that attributes A ⊂ I, B ⊂ I and A ∪ B occur in the attribute set I = {i₁, i₂, i₃, ⋯, i_m}.
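The formula is not reproduced in this text. The probabilities named here, together with the thresholds on corr_{A,B} given in Example 1, are consistent with the following ratio, offered as a hedged reconstruction rather than as the authoritative wording:

```latex
\mathrm{corr}_{A,B} = \frac{P(A \cup B)}{P(A)\, P(B)}
```

Here P(A ∪ B) is read, as is common in association analysis, as the probability that A and B occur together in the same record; that reading is an assumption.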

Preferably, the information data in the data query statistical analysis unit comprises the overall market situation, market structure, market dynamics, investment structure, financial situation of listed companies, market indices and macroeconomic indices.

Preferably, the data query statistical analysis unit comprises a data warehouse module, a data mart module, an operational data storage module and a front-end access module;

the data warehouse module is used for supporting a data set of enterprise analytic decision, is subject-oriented, integrated, time-varying and non-updatable, and provides a global view for data of the whole enterprise;

the data mart module is used for supporting a data set of department decisions, is a logic subset of the data warehouse module, is a data set oriented to the department decision support, and has higher flexibility, controllability and professional function compared with a data warehouse;

the operational data storage module is used for supporting a data set of daily application of an enterprise and has the characteristics of theme orientation, integration, variability and current or near-current data;

the front-end access module is used for displaying the data in a text, report, curve and graphic mode after the data is arranged and processed, and can be displayed simply, conveniently and quickly.

Preferably, the data warehouse module comprises a multidimensional data model, a star model and a snowflake model;

the multidimensional data models organize data in an intuitive manner and support high-performance data access, each multidimensional data model being represented by a plurality of multidimensional data patterns; the star model is well suited to multidimensional analysis in a data mart where the requirements are known and stable and where queries and recurring reports are fully predictable and need reasonable response times; for more complex applications, the snowflake model is an extension of the star model in which a fact table may be connected to multiple levels of dimension tables, each point being connected to further points along a radius.
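To make the distinction concrete, the sketch below builds a small schema in an in-memory SQLite database. The fact table `fact_sales` joined directly to `dim_product` and `dim_date` forms a star; normalizing the product dimension into a second-level `dim_category` table is the snowflake extension. All table names, column names and sample rows are hypothetical and chosen only for illustration; the patent does not specify a schema.

```python
import sqlite3

def build_schema(conn):
    """Create one fact table, two dimensions, and one second-level dimension."""
    conn.executescript("""
        -- second-level dimension: only present in the snowflake variant
        CREATE TABLE dim_category (
            category_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE dim_product (
            product_id  INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            category_id INTEGER NOT NULL REFERENCES dim_category(category_id)
        );
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY,
            year    INTEGER NOT NULL,
            month   INTEGER NOT NULL
        );
        -- fact table at the centre of the star
        CREATE TABLE fact_sales (
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
            amount     REAL NOT NULL
        );
    """)

def sales_by_category(conn):
    """Roll the facts up through two dimension levels (fact -> product -> category)."""
    rows = conn.execute("""
        SELECT c.name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product  p ON p.product_id  = f.product_id
        JOIN dim_category c ON c.category_id = p.category_id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.executemany("INSERT INTO dim_category VALUES (?, ?)",
                 [(1, "bonds"), (2, "stocks")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(10, "P10", 1), (20, "P20", 2), (21, "P21", 2)])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, 2020, 11), (2, 2020, 12)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(10, 1, 100.0), (20, 1, 50.0), (21, 2, 25.0), (20, 2, 25.0)])
totals = sales_by_category(conn)
```

In a pure star model the category name would be stored as a column of `dim_product`, saving one join at the cost of some redundancy.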

Preferably, in another aspect, the present invention further provides a method for data processing by introducing a data warehouse building technology, including any one of the data processing systems introduced with the data warehouse building technology, and the operation steps are as follows:

S1, establishing a data source: establishing a database file, a flat file, an html file and an xml file under the data source unit;

S2, data processing: extracting data from the data source unit, processing the data through the data processing unit, sorting and organizing the data, loading the data into a target database, and periodically refreshing the data to reflect changes in the data source;

S3, building a data warehouse: collecting the demand information of the user through the data query statistical analysis unit, estimating the data volume, then selecting a suitable software and hardware platform, designing a data model, storing the data after inspection, sorting, processing and recombination, constructing the data warehouse module, and managing the data;

S4, front-end access and analysis: the front-end access module processes, sorts, mines and predicts the data in the data warehouse module, and then displays the obtained data as text, reports and curves;

S5, storing data: the metadata management unit stores the data in the data query statistical analysis unit, so that the syntax and the semantics of the whole enterprise are kept consistent;

S6, permission setting: the authority control module sets authority verification for user login, so that the user can log in and query at any time, and the performance management module evaluates the performance of the equipment and the network units.

Compared with the prior art, the invention has the following beneficial effects: in the data processing system and method introducing the data warehouse construction technology, a data processing unit and a data query statistical analysis unit are provided, and the technology of materializing the data warehouse is adopted to suit the conditions of the system; related data are integrated and synthesized around the subjects proposed by the user, while user queries on data that have not been extracted can be served by remote access, so that the system has higher flexibility, controllability and professional function; the front-end access module processes, sorts, mines and predicts the data in the data warehouse module and then displays the obtained data as text, reports and curves, so that the data can be conveniently queried and analyzed and the information is strongly correlated.

Drawings

FIG. 1 is a block diagram of the overall structure of the present invention;

FIG. 2 is a block diagram of a data source unit according to the present invention;

FIG. 3 is a block diagram of a data processing unit according to the present invention;

FIG. 4 is a block diagram of a data query statistical analysis unit according to the present invention;

FIG. 5 is a block diagram of a metadata management unit according to the present invention;

FIG. 6 is a block diagram of the unit structure of the management center of the present invention;

FIG. 7 is a flow chart of a data processing method of the present invention.

The various reference numbers in the figures mean:

100. a data source unit;

110. an online transaction processing module; 120. a legacy data module; 130. an internal office data module; 140. an external data module;

200. a data processing unit;

210. a data extraction module; 220. a data verification module; 230. a data cleaning module; 240. a data integration module; 250. a data aggregation module; 260. a data loading module;

300. a data query statistical analysis unit;

310. a data warehouse module; 320. a data mart module; 330. an operational data storage module; 340. a front-end access module;

400. a metadata management unit; 410. a data metadata module; 420. a process metadata module;

500. a management center unit; 510. an authority control module; 520. a performance management module; 530. a failure recovery module.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the equipment or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.

Example 1

Referring to fig. 1, the present embodiment provides a data processing system incorporating a data warehouse building technology, including a data source unit 100, a data processing unit 200, a data query statistical analysis unit 300, a metadata management unit 400, and a management center unit 500;

the data source unit 100 is used for establishing database files, flat files, html files and xml files;

the data processing unit 200 is used for processing the data in the data source unit 100; in contrast to the application orientation of a traditional database, a subject is an abstraction that integrates, classifies, analyzes and utilizes the data in the enterprise information system at a higher level, and each subject corresponds to a macroscopic analysis field and can reflect historical data over a period of time;

the data query statistical analysis unit 300 is used for uniformly recording and analyzing the information data;

the metadata management unit 400 is used for storing data in the data query statistical analysis unit 300;

referring to fig. 5, the metadata management unit 400 includes a data metadata module 410 and a process metadata module 420, wherein the data metadata module 410 is used for retrieving, accessing and understanding source information, ensuring that information can be used in a new application environment, and supporting the evolution of the whole information structure; the process metadata module 420 is used for searching, evaluating, accessing and managing data; a large software structure comprises metadata describing the interfaces, functions and dependency relationships of each component, and this metadata ensures flexible and dynamic configuration of the software components;

as shown in fig. 6, the management center unit 500 is used for managing and maintaining the entire system and includes an authority control module 510, a performance management module 520, and a failure recovery module 530; the authority control module 510 is used for setting authority verification for user login, so as to ensure the security of the system; the performance management module 520 is used for evaluating the performance of the devices and network units, so as to find network congestion or interruption in time, remove failures comprehensively, plan capacity based on facts, and allocate network resources effectively; the failure recovery module 530 is used for automatically creating a recovery point so that the system returns to a working state, and failure recovery can be performed quickly without reinstallation and without damaging data files.

Further, as shown in fig. 2, the data source unit 100 includes an online transaction processing module 110, a legacy data module 120, an internal office data module 130, and an external data module 140;

the online transaction processing module 110 reflects the daily operation of an enterprise; it is the main data source of the enterprise data warehouse, and its data is generally fine-grained;

the legacy data module 120 is used for data mining and trend analysis; it holds offline or archival data, which has significant historical value for data mining, trend analysis and the like, and which is imported into the data warehouse by a suitable application program;

the internal office data module 130 is used for data analysis of business offices, including unstructured data (e.g., documents in non-electronic form), structured data (e.g., electronic reports, word processing documents, etc.), and semi-structured data (e.g., annual reports, etc.), which are useful for supporting cross-department analysis;

the external data module 140 is used to record demographic information, competitor intelligence, questionnaires and xml documents.

Further, as shown in fig. 3, the data processing unit 200 includes a data extraction module 210, a data verification module 220, a data cleaning module 230, a data integration module 240, a data aggregation module 250, and a data loading module 260;

the data extraction module 210 is used for capturing data; there are two main ways of capturing data, namely a full capture mode and an incremental capture mode; the full capture mode extracts the complete information of a data source; the incremental capture mode focuses on the data that has been modified and is adopted after a full capture has been performed once, so that in real-time extraction it reduces the volume of extracted data and the network traffic;
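The two capture modes can be sketched as follows. The patent does not say how modified data is detected; this example assumes each source row carries a hypothetical `modified_at` timestamp and that the extractor remembers the time of its previous run.

```python
from datetime import datetime

def full_capture(source_rows):
    """Full capture: copy every row of the data source."""
    return [dict(row) for row in source_rows]

def incremental_capture(source_rows, last_extracted_at):
    """Incremental capture: copy only rows modified after the previous extraction.

    Assumption: rows expose a 'modified_at' timestamp (not stated in the source).
    """
    return [dict(row) for row in source_rows
            if row["modified_at"] > last_extracted_at]

source = [
    {"id": 1, "value": "a", "modified_at": datetime(2020, 12, 1)},
    {"id": 2, "value": "b", "modified_at": datetime(2020, 12, 10)},
    {"id": 3, "value": "c", "modified_at": datetime(2020, 12, 15)},
]

initial = full_capture(source)                  # first run: everything
watermark = datetime(2020, 12, 5)               # time of the previous extraction
delta = incremental_capture(source, watermark)  # later runs: changed rows only
```

Only two of the three rows are transferred on the incremental run, which is the reduction in extracted volume and network traffic that the text describes.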

the data verification module 220 is used for detecting data problems in the data source unit 100, including recovery of lost data, conversion of ambiguous data and problems arising from business operations, so as to solve data quality problems; the detected content comprises valid values of attributes (domain check), valid relationships between rows in a table and rows in other tables (foreign key check), detection of similar duplicate records, and detection of missing values; the domain check and the foreign key check are performed using the referential integrity checks of the database itself; "similar duplicate records" are multiple records, in different data sets, that represent the same real-world entity but are not completely identical, so that the DBMS cannot correctly recognize them as duplicates because of differences in format and spelling;
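A minimal sketch of the four checks is given below. The patent relies on the database's own integrity checking for the domain and foreign key checks and does not name a similarity measure for duplicate detection; here all checks are done in plain Python, and `difflib.SequenceMatcher` with a threshold of 0.85 stands in as an assumed similarity test. The field names and sample records are hypothetical.

```python
from difflib import SequenceMatcher

def domain_check(rows, field, allowed):
    """Rows whose non-missing value lies outside the attribute's domain."""
    return [r for r in rows if r[field] is not None and r[field] not in allowed]

def foreign_key_check(rows, field, parent_keys):
    """Rows that reference a key absent from the parent table."""
    return [r for r in rows if r[field] not in parent_keys]

def missing_values(rows, fields):
    """Rows in which any of the given fields is empty."""
    return [r for r in rows if any(r.get(f) in (None, "") for f in fields)]

def normalize(text):
    """Remove case and punctuation differences before comparing spellings."""
    return " ".join(text.lower().replace(".", " ").replace(",", " ").split())

def similar_duplicates(rows, field, threshold=0.85):
    """Pairs of ids whose normalized values are nearly identical (assumed measure)."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            a, b = normalize(rows[i][field]), normalize(rows[j][field])
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((rows[i]["id"], rows[j]["id"]))
    return pairs

customers = [
    {"id": 1, "name": "Acme Trading Co., Ltd.", "region": "east", "dept_id": 10},
    {"id": 2, "name": "ACME Trading Co Ltd", "region": "east", "dept_id": 10},
    {"id": 3, "name": "Borealis Foods", "region": "north-west", "dept_id": 99},
    {"id": 4, "name": "Delta Freight", "region": None, "dept_id": 20},
]

bad_domain = domain_check(customers, "region", {"east", "west", "north", "south"})
bad_fk = foreign_key_check(customers, "dept_id", {10, 20})
incomplete = missing_values(customers, ["region"])
duplicates = similar_duplicates(customers, "name")
```

Records 1 and 2 differ only in case and punctuation, which is exactly the kind of difference that defeats an exact-match comparison in the DBMS.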

the data cleaning module 230 is used for cleaning the dirty data detected by the data verification module 220, ensuring the correctness, consistency, integrity and reliability of the data, ensuring the data quality of the information source, and ensuring the correctness and accuracy of the original data used to assist decisions;

the data integration module 240 is used for integrating a plurality of data into unified data for analysis;

the data aggregation module 250 is used for collecting and summarizing information data; the data aggregation methods of the data aggregation module 250 comprise numerical value aggregation and dimension reduction; numerical value aggregation replaces the original data set with a smaller volume of data, which in effect compresses the two-dimensional table longitudinally (along its rows) and removes unnecessary details from the source data; dimension reduction reduces the data volume by deleting irrelevant attributes, which compresses the two-dimensional table transversely (along its attributes); together they help reduce the number of instances of an entity to a level that is easy to handle and facilitate the pre-calculation of widely used summary figures;
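The two aggregation methods can be illustrated on a small table held as a list of dictionaries. This is a sketch under assumed names: the columns `day`, `ticker`, `trader` and `amount` and the choice of summation as the aggregate are not taken from the patent.

```python
from collections import defaultdict

def reduce_dimensions(rows, keep):
    """Dimension reduction: drop irrelevant attributes (fewer columns)."""
    return [{k: r[k] for k in keep} for r in rows]

def aggregate(rows, group_by, measure):
    """Numerical aggregation: replace detail rows with one total per group (fewer rows)."""
    totals = defaultdict(float)
    for r in rows:
        key = tuple(r[g] for g in group_by)
        totals[key] += r[measure]
    return [dict(zip(group_by, key), **{measure: total})
            for key, total in sorted(totals.items())]

trades = [
    {"day": "2020-12-01", "ticker": "AAA", "trader": "t1", "amount": 10.0},
    {"day": "2020-12-01", "ticker": "AAA", "trader": "t2", "amount": 5.0},
    {"day": "2020-12-01", "ticker": "BBB", "trader": "t1", "amount": 7.0},
    {"day": "2020-12-02", "ticker": "AAA", "trader": "t3", "amount": 1.0},
]

slim = reduce_dimensions(trades, ["day", "ticker", "amount"])  # 4 columns -> 3
summary = aggregate(slim, ["day", "ticker"], "amount")         # 4 rows -> 3
```

Storing `summary` alongside the detail data is one way to pre-calculate a widely used summary figure, here the daily total per ticker.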

the data loading module 260 is used for saving the converted data and generating an error report for bad data, so that the dirty data can be corrected later.
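A loader of this kind might look as follows. The patent does not define what counts as bad data at load time; this sketch assumes a row is bad when a required field is empty, and the function and field names are hypothetical.

```python
def load(rows, target, required_fields):
    """Append valid rows to the target and return an error report for the rest."""
    error_report = []
    for index, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) is None]
        if missing:
            # bad rows are reported instead of loaded, so they can be corrected later
            error_report.append({"row": index,
                                 "reason": "missing " + ", ".join(missing)})
        else:
            target.append(row)
    return error_report

converted = [
    {"id": 1, "amount": 3.5},
    {"id": 2, "amount": None},
    {"id": None, "amount": None},
]
warehouse_table = []
report = load(converted, warehouse_table, ["id", "amount"])
```

Keeping the row index in the report lets the cleaning step locate and correct the rejected rows before a later reload.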

Specifically, the numerical data metric in the data integration module 240 is calculated as follows:

where n is the number of tuples, Ā and B̄ are the average values of A and B, and σ_A and σ_B are the standard deviations of A and B, respectively. If r_{A,B} > 0, A and B are positively correlated, and a sufficiently large value indicates that one of the attributes is redundant and can be removed; if r_{A,B} = 0, A and B are uncorrelated; if r_{A,B} < 0, A and B are negatively correlated, indicating that the presence of A (B) prevents the presence of B (A), and therefore there is no redundancy.
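The decision rule can be sketched in code. The example assumes that the metric is the Pearson correlation coefficient computed with population standard deviations, which matches the symbols defined in the text but is not confirmed by it; the threshold of 0.9 for "sufficiently large" and the attribute names are hypothetical.

```python
import math

def pearson(a, b):
    """Correlation coefficient r_{A,B} of two numeric attributes (assumed formula)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("attributes must have the same non-zero number of tuples")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sigma_a == 0 or sigma_b == 0:
        raise ValueError("a constant attribute has no defined correlation")
    covariance_sum = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return covariance_sum / (n * sigma_a * sigma_b)

def is_redundant(a, b, threshold=0.9):
    """Treat one attribute as redundant when r_{A,B} is large (threshold is an assumption)."""
    return pearson(a, b) >= threshold

price_yuan = [1.0, 2.0, 3.0, 4.0, 5.0]
price_fen = [100.0, 200.0, 300.0, 400.0, 500.0]  # same quantity in another unit
remaining_stock = [5.0, 4.0, 3.0, 2.0, 1.0]      # moves in the opposite direction

r_same = pearson(price_yuan, price_fen)
r_opposite = pearson(price_yuan, remaining_stock)
```

The first pair is perfectly positively correlated, so one of the two price columns can be dropped during integration; the second pair is negatively correlated and both attributes are kept.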

Specifically, the non-numerical data metric in the data integration module 240 is calculated as follows:

wherein P(A), P(B) and P(A ∪ B) are respectively the probabilities that attributes A ⊂ I, B ⊂ I and A ∪ B occur in the attribute set I = {i₁, i₂, i₃, ⋯, i_m}. If corr_{A,B} > 1, A and B are positively correlated, meaning that there is redundancy; if corr_{A,B} = 1, A and B are independent; if corr_{A,B} < 1, A and B are negatively correlated.
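The following sketch evaluates this metric on a handful of records. It assumes that corr_{A,B} is the ratio P(A ∪ B) / (P(A) P(B)), with P(A ∪ B) read as the fraction of records containing both A and B; the formula is not reproduced in the text, so this is an interpretation. The item names are hypothetical.

```python
def corr(records, a, b):
    """Ratio of the joint probability of a and b to the product of their probabilities."""
    n = len(records)
    p_a = sum(1 for r in records if a in r) / n
    p_b = sum(1 for r in records if b in r) / n
    p_ab = sum(1 for r in records if a in r and b in r) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both attributes must occur at least once")
    return p_ab / (p_a * p_b)

holdings = [
    {"bond", "fund"},
    {"bond", "fund"},
    {"bond"},
    {"stock"},
]
positive = corr(holdings, "bond", "fund")   # occur together more often than chance
negative = corr(holdings, "bond", "stock")  # never occur together

balanced = [{"x", "y"}, {"x"}, {"y"}, set()]
independent = corr(balanced, "x", "y")      # joint probability equals the product
```

Because the ratio is built from probabilities it cannot be negative, which is why the negative-correlation case is expressed as a value below 1 rather than below 0.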

The information data in the data query statistical analysis unit 300 comprises the overall market situation, market structure, market dynamics, investment structure, financial condition of listed companies, market indices and macroeconomic indices.

Further, as shown in fig. 4, the data query statistical analysis unit 300 includes a data warehouse module 310, a data mart module 320, an operational data storage module 330, and a front-end access module 340;

the data warehouse module 310 is a data set supporting enterprise analytic decisions; it is subject-oriented, integrated, time-varying and non-updatable, and provides a global view of the data of the whole enterprise;

the data mart module 320 is a data set supporting department decisions; it is a logical subset of the data warehouse module 310 and, compared with a data warehouse, has higher flexibility, controllability and professional function;

the operational data storage module 330 is a data set supporting the daily applications of an enterprise; it is subject-oriented, integrated and variable, and its data is current or near current;

the front-end access module 340 is used for displaying the data as text, reports, curves and graphics after the data has been arranged and processed, so that the data can be displayed simply, conveniently and quickly.

Specifically, the data warehouse module 310 includes a multidimensional data model, a star model, and a snowflake model;

In another aspect, referring to fig. 7, the present invention further provides a method for data processing by introducing a data warehouse building technology, which includes the following steps:

S1, establishing a data source: establishing a database file, a flat file, an html file and an xml file under the data source unit 100;

S2, data processing: extracting data from the data source unit 100, processing the data through the data processing unit 200, sorting and organizing the data, loading the data into a target database, and periodically refreshing the data to reflect changes in the data source;

S3, building a data warehouse: collecting the user's demand information through the data query statistical analysis unit 300, estimating the data volume, then selecting a suitable software and hardware platform, designing a data model, storing the data after inspection, sorting, processing and recombination, building the data warehouse module 310, and managing the data;

S4, front-end access and analysis: the front-end access module 340 processes, sorts, mines and predicts the data in the data warehouse module 310, and then displays the obtained data as text, reports and curves;

S5, storing data: the metadata management unit 400 stores the data in the data query statistical analysis unit 300, so that the syntax and semantics of the whole enterprise are kept consistent;

S6, permission setting: the authority control module 510 sets authority verification for user login, so that the user can log in and query at any time, and the performance management module 520 evaluates the performance of the devices and the network units.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A data processing system incorporating data warehouse building techniques, characterized by: the system comprises a data source unit (100), a data processing unit (200), a data query statistical analysis unit (300), a metadata management unit (400) and a management center unit (500);

the data source unit (100) is used for establishing database files, flat files, html files and xml files;

the data processing unit (200) is used for processing data in the data source unit (100);

the data query statistical analysis unit (300) is used for uniformly recording and analyzing the information data;

the metadata management unit (400) is used for storing data in the data query statistical analysis unit (300);

the metadata management unit (400) comprises a data metadata module (410) and a process metadata module (420);

the data metadata module (410) is used for retrieving, accessing and understanding source information;

the process metadata module (420) is used for finding, evaluating, accessing and managing data;

the management center unit (500) is used for management and maintenance of the whole system, and the management center unit (500) comprises an authority control module (510), a performance management module (520) and a failure recovery module (530);

the authority control module (510) is used for setting the authority verification of user login;

the performance management module (520) is used for evaluating the performance of the equipment and the network unit;

the failure recovery module (530) is used for automatically creating a recovery point and enabling the system to return to the working state.

2. The data processing system incorporating data warehouse building techniques of claim 1, wherein: the data source unit (100) comprises an online transaction processing module (110), a legacy data module (120), an internal office data module (130) and an external data module (140);

the online transaction processing module (110) is used for reflecting daily operation conditions of an enterprise;

the legacy data module (120) is used for mining and trend analysis of data;

the internal office data module (130) is used for data analysis of enterprise offices;

the external data module (140) is used for recording data of demographic information, competitor information, questionnaire survey and xml documents;

the data processing unit (200) comprises a data extraction module (210), a data verification module (220), a data cleaning module (230), a data integration module (240), a data aggregation module (250) and a data loading module (260);

the data extraction module (210) is used for capturing data;

the data verification module (220) is used for detecting data in the data source unit (100), including data problems of lost data recovery, fuzzy data conversion and business operation, and solving data quality problems;

the data cleaning module (230) is used for cleaning the dirty data detected in the data verification module (220);

the data integration module (240) is used for integrating a plurality of data into unified data for analysis;

the data aggregation module (250) is used for collecting and summarizing information data;

the data loading module (260) is used for storing the converted data, so that the bad data can generate an error report;

the numerical data metric in the data integration module (240) is calculated as follows:

where n is the number of tuples, Ā and B̄ are the average values of A and B, and σ_A and σ_B are the standard deviations of A and B, respectively; the non-numerical data metric in the data integration module (240) is calculated as follows:

3. The data processing system incorporating data warehouse building techniques of claim 1, wherein: the information data in the data query statistical analysis unit (300) comprises the overall market situation, market structure, market dynamics, investment structure, financial situation of listed companies, market indices and macroeconomic indices.

4. The data processing system incorporating data warehouse building techniques of claim 1, wherein: the data query statistical analysis unit (300) comprises a data warehouse module (310), a data mart module (320), an operational data storage module (330) and a front-end access module (340);

the data warehouse module (310) is configured to support enterprise analytics-type decision-making data sets;

the data mart module (320) is used for supporting data sets of department decisions, the data mart module (320) is a logical subset of the data warehouse module (310);

the operational data storage module (330) is used for supporting data collection of enterprise daily application;

the front-end access module (340) is used for displaying the data in a text, report, curve and graphic mode after the data is arranged and processed.

5. The data processing system incorporating data warehouse building techniques of claim 3, wherein: the data warehouse module (310) includes a multidimensional data model, a star model, and a snowflake model.

6. A method for data processing by introducing a data warehouse building technology, characterized in that: the method uses the data processing system incorporating data warehouse building techniques of any one of claims 1 to 4, and the operation steps are as follows:

S1, establishing a data source: establishing a database file, a flat file, an html file and an xml file under the data source unit (100);

S2, data processing: extracting data from the data source unit (100), processing the data through the data processing unit (200), sorting and organizing the data, loading the data into the target database, and periodically refreshing the data to reflect changes in the data source;

S3, building a data warehouse: collecting the demand information of the user through the data query statistical analysis unit (300), estimating the data volume, then selecting a suitable software and hardware platform, designing a data model, storing the data after inspection, sorting, processing and recombination, building the data warehouse module (310), and managing the data;

S4, front-end access and analysis: the front-end access module (340) processes, sorts, mines and predicts the data in the data warehouse module (310), and then displays the obtained data as text, reports and curves;

S5, storing data: the metadata management unit (400) stores the data in the data query statistical analysis unit (300), so that the syntax and the semantics of the whole enterprise are kept consistent;

S6, permission setting: the authority control module (510) sets authority verification for user login, so that the user can log in and query at any time, and the performance management module (520) evaluates the performance of the equipment and the network units.