Movatterモバイル変換


[0]ホーム

URL:


US7200619B2 - Method and process to optimize correlation of replicated with extracted data from disparate data sources - Google Patents

Method and process to optimize correlation of replicated with extracted data from disparate data sources
Download PDF

Info

Publication number
US7200619B2
US7200619B2US10/161,251US16125102AUS7200619B2US 7200619 B2US7200619 B2US 7200619B2US 16125102 AUS16125102 AUS 16125102AUS 7200619 B2US7200619 B2US 7200619B2
Authority
US
United States
Prior art keywords
data
mapping
warehouse
record
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US10/161,251
Other versions
US20030225742A1 (en
Inventor
Jeffrey Wayne Tenner
Eric W. Will
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines CorpfiledCriticalInternational Business Machines Corp
Priority to US10/161,251priorityCriticalpatent/US7200619B2/en
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATIONreassignmentINTERNATIONAL BUSINESS MACHINES CORPORATIONASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).Assignors: WILL, ERIC W., TENNER, JEFFREY WAYNE
Publication of US20030225742A1publicationCriticalpatent/US20030225742A1/en
Application grantedgrantedCritical
Publication of US7200619B2publicationCriticalpatent/US7200619B2/en
Adjusted expirationlegal-statusCritical
Expired - Fee Relatedlegal-statusCriticalCurrent

Links

Images

Classifications

Definitions

Landscapes

Abstract

A method, a data structure, a computer program product and a computer-readable medium for correlating at least a first plurality of data records and a second plurality of data records, each data record of the first plurality of data records being uniquely identified within a corresponding data source by an associated internal identifier and each data record of the first and second plurality of data records comprising at least one external identifier. According to one embodiment, the method comprises determining a data record of the first plurality of data records and at least one data record of the second plurality of data records having an identical external identifier; and mapping the at least one determined data record of the second plurality of data records to the internal identifier associated with the determined data record of the first plurality of data records.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to correlating different data sources and, more particularly, to correlating data from at least two different data sources in a data warehouse.
2. Description of the Related Art
Computerized information storage and retrieval systems made up of a (possibly large) number of files or tables are available in form of databases. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. Storing and retrieving data in a database is performed by means of a computer database management system (DBMS).
Regardless of the particular architecture, in a DBMS, a requesting entity (e.g., an application, the operating system or a user) demands access to a specified database by issuing a database access request. Such requests may include, for instance, simple catalog lookup requests or transactions and combinations of transactions that operate to read, change and add specified records in the database. These requests are made using high-level query languages such as the Structured Query Language (SQL).
A relational database management system (RDBMS) is a DBMS that uses relational techniques for storing and retrieving data. RDBMS software using a SQL interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
Illustratively, SQL is used to make interactive queries for getting information from and updating a database such as International Business Machines' (IBM®) DB2®, Microsoft's SQL Server®, and database products from Oracle®, Sybase®, and Computer Associates®. The term “query” denominates a set of commands for retrieving and processing data from a stored database. Queries take the form of a command language that lets programmers and programs select, insert, update, find out the location of data, and so forth.
Data records in a computerized RDBMS are maintained in tables, which are a collection of rows all having the same columns. Each row represents a data record and each column maintains information on a particular type of data for the data records that comprise the rows. Data records may be indexed using unique indices or keys to join different data records in different related tables together. A preferred model is for the keys to be internal numeric identifiers that can be assigned incrementally, rather than directly mapping the data records to some external identifier.
An internal identifier is an identifier that is generated by means of the RDBMS or application and is used to correlate information across tables of the application. Internal identifiers are not externalized outside the application. An external identifier is an identifier that may be generated by means of the RDBMS or application or supplied by the end-user and uniquely identifies a corresponding entity in the real world. Assume that in a credit card transaction system of a financial institution, a data record is generated for each client in a table comprising information related to the clients, each data record representing a row of the table. Assume further that the table comprises a plurality of columns, i.e. last name, middle initial, first name, street, city, state, ZIP code and credit card number. Each data record thus contains information related to one client, i.e., last name, middle initial, first name, street, city, state, ZIP code and credit card number of the client. To uniquely identify each data record, in general, a 32-bit integer is generated by means of the RDBMS or application as internal identifier and stored in an additional column of the table. However, this internal identifier does not comprise relevant information regarding the corresponding client and may only be used to uniquely identify the data record related to this client. In contrast thereto, the credit card number, which has not been generated by means of the RDBMS, and which is located in each data record, also uniquely identifies the corresponding data record and client and represents an external identifier in that the client is aware of this number and uses it to identify him or herself.
Where internal identifiers are used, a mapping table may be provided to map internal identifiers to their respective external identifier. Accessing tables associated with an external identifier is then accomplished by using the external identifier to index into the mapping table and locate the corresponding internal identifier. The internal identifier is then used as a key to access the relevant tables. Assignment of a new external identifier for a data record in a table (e.g., TABLE 1) can be handled simply by changing the relevant row of the mapping table to map the new external identifier to the corresponding internal identifier of TABLE 1. A history of external identifiers can be kept in a history mapping table which maps past external identifiers to the internal identifier. In this manner, a single unchanging internal identifier can be used to access all data associated with any number of related external identifiers.
To illustrate the advantages of using an internal identifier as a key, consider in the above described example the assignment of a new credit card number (a new external identifier) to replace the old credit card number (the old external identifier) of a card that was lost or stolen. The data for the card account, such as credit history and balances, must still be associated with the same entity (e.g., client). As a result, if the old external identifier, i.e., the old credit card number was used as the key across the tables, all tables using the old external identifier to uniquely identify corresponding data records and thus containing references to the old credit card number would have to be updated to reflect the new external identifier, i.e., the new credit card number. Dependent on the number of tables to be updated, this may, however, be complex and cumbersome. Additionally, by simply replacing the old credit card number with the new credit card number, history about the old credit card number(s) for the entity's account is lost, as no link between the old and the corresponding new credit card number could be established. In order to avoid loss of history of the external identifiers, a history mapping table or a new column in one of the updated tables must be created, which maps the old external identifier to the new external identifier, i.e., the old credit card number to the new credit card number. In contrast thereto, there is no need to change an internal identifier used as the key across the tables, as the internal identifier is used within a corresponding database for uniquely identifying and joining data records, so that, for example, only one entry in a corresponding history mapping table needs to be updated to map the old credit card number to the new credit card number.
Although using internal identifiers in the foregoing manner simplifies maintenance of relational databases, it complicates construction of a data warehouse. By definition, data warehouses pull together data from different data sources and try to establish correlations between them. Thus, queries may directly be run on the data stored in the data warehouse so that the performance of the computer system that issues the data required to satisfy a corresponding query will not be affected.
In constructing a data warehouse, data from many different databases, each using their own method for generating unique internal identifiers, have to be correlated. Continuing with the credit card example and assume that the financial institution offering the credit card also has a rewards program for using the card which uses the credit card number as the external identifier for the rewards program. Assume further that the credit card transaction system is written by a different software vendor than the rewards tracking program. Since the external identifier, i.e., the credit card number is common to both data sources, it would seem logical to use it to correlate the data from the two data sources when constructing a data warehouse on the basis of these data sources. However, such correlation would require adding the credit card number as a key to the tables copied to the data warehouse or always joining the tables in the data warehouse via some mapping table.
Adding the external identifier as a key also introduces the problem described earlier where changes to the credit card number can require changes across many tables in the data warehouse. The problems with using external identifiers as keys are compounded because, in the data warehouse, data from one data source or application may be using a newer external identifier than another application(s) during the window of time that the applications are updated. Further, performing a join with some mapping table can severely impact the performance of queries against the database. To complicate matters further, many times data in a data warehouse must also be correlated back to the original data source to allow capturing of changes of the original data source.
A recent trend is building data warehouses with real-time data. In some cases, such data warehouses consist of some or all of the data being replicated to the data warehouse rather than more traditional extract-transform-load processes. In general, replicating data represents moving data from the original data source to the data warehouse, wherein only changes are copied, but not the whole data. In the extract-transform-load processes, extracted data is typically entirely copied, i.e., converted/transformed and processed during the extract process according to the requirements of a corresponding data warehouse and has more freedom to change values as long as some method to map the data back to the original data source exists to allow incremental updates. In contrast, replicated data is merely copied without transformation. As a result, the internal identifier values for replicated data cannot be changed when moved to the data warehouse.
Accordingly, there is a need for a more efficient and effective technique for correlating different data sources in a data warehouse.
SUMMARY OF THE INVENTION
The present invention generally relates to correlating different data sources and, more particularly, to correlating data from at least two different data sources in a data warehouse.
One aspect of the present invention relates to a computer-readable medium comprising computer-executable instructions which, when executed, perform an operation of correlating at least a first plurality of data records and a second plurality of data records, each data record of the first plurality of data records being uniquely identified within a corresponding data source by an associated internal identifier and each data record of the first and second plurality of data records comprising at least one external identifier. The operation comprises: determining a data record of the first plurality of data records and at least one data record of the second plurality of data records having an identical external identifier; and mapping the at least one determined data record of the second plurality of data records to the internal identifier associated with the determined data record of the first plurality of data records.
Another aspect of the present invention relates to a computer-readable medium comprising computer-executable instructions which, when executed, perform an operation of creating a data warehouse mapping data structure to correlate at least two different data sources. The operation comprises creating a plurality of mapping data records in the warehouse mapping data structure. Each mapping data record comprises: a first value representing an internal identifier uniquely identifying the mapping data record in the warehouse mapping data structure; a second value representing an external identifier of one of a data record of a first data source and a data record of a second data source; and a third value representing an internal identifier uniquely identifying a data record of the second data source in the second data source, the data record of the second data source having the second value as external identifier; whereby a correlation between the first and the second data sources is established.
Another aspect of the present invention relates to a computer-readable medium comprising computer-executable instructions which, when executed, perform an operation of correlating data from at least two different data sources in a data warehouse. The operation comprises: loading data from a first data source into the data warehouse, the data from the first data source comprising a plurality of first internal identifiers and a plurality of first external identifiers; creating a warehouse mapping data structure on the basis of the plurality of internal identifiers each mapped to an associated first external identifier; loading data from a second data source into the data warehouse, the data from the second data source comprising a plurality of second internal identifiers each associated with a second external identifier, wherein at least one of the associated second external identifiers is identical to one of the first external identifiers; and mapping each second internal identifier associated with a second external identifier that is identical to one of the first external identifiers in the warehouse mapping data structure to the first internal identifier of the identical matching first external identifier, whereby a correlation between data of the first and the second data sources is established.
Another aspect of the present invention relates to a method of correlating at least a first plurality of data records and a second plurality of data records, each data record of the first plurality of data records being uniquely identified within a corresponding data source by an associated internal identifier and each data record of the first and second plurality of data records comprising at least one external identifier. The method comprises: determining a data record of the first plurality of data records and at least one data record of the second plurality of data records having an identical external identifier; and mapping the at least one determined data record of the second plurality of data records to the internal identifier associated with the determined data record of the first plurality of data records.
Another aspect of the present invention relates to a method of creating a data warehouse mapping data structure to correlate at least two different data sources. The method comprises creating a plurality of mapping data records in the warehouse mapping data structure. Each mapping data record comprises: a first value representing an internal identifier uniquely identifying the mapping data record in the warehouse mapping data structure; a second value representing an external identifier of one of a data record of a first data source and a data record of a second data source; and a third value representing an internal identifier uniquely identifying a data record of the second data source in the second data source, the data record of the second data source having the second value as external identifier; whereby a correlation between the first and the second data sources is established.
Another aspect of the present invention relates to a method of correlating data from at least two different data sources in a data warehouse. The method comprises: loading data from a first data source into the data warehouse, the data from the first data source comprising a plurality of first internal identifiers and a plurality of first external identifiers; creating a warehouse mapping data structure on the basis of the plurality of internal identifiers each mapped to an associated first external identifier; loading data from a second data source into the data warehouse, the data from the second data source comprising a plurality of second internal identifiers each associated with a second external identifier, wherein at least one of the associated second external identifiers is identical to one of the first external identifiers; and mapping each second internal identifier associated with a second external identifier that is identical to one of the first external identifiers in the warehouse mapping data structure to the first internal identifier of the identical matching first external identifier, whereby a correlation between data of the first and the second data sources is established.
Another aspect of the present invention relates to a mapping data structure residing in storage. The mapping data structure comprises a plurality of data records. Each data record comprises: a first portion comprising a warehouse internal identifier; a second portion comprising an external identifier common to a first data source and a second data source; and a third portion comprising an internal identifier of the second data source; wherein at least one data record of the plurality of data records comprises a warehouse internal identifier representing an internal identifier of the first data source, a common external identifier associated with the internal identifier of the first data source and an internal identifier of the second data source associated with the common external identifier, whereby a correlation between data of the first and the second data sources is established.
Another aspect of the present invention relates to a warehouse mapping table residing in storage. The warehouse mapping table comprises a plurality of external identifiers common to a first data source and a second data source, a warehouse internal identifier for each of the plurality of external identifiers, and an internal identifier of the second data source for a least a portion of each of the warehouse internal identifiers, whereby data from the first and second data sources is correlated.
Another aspect of the present invention relates to a computer comprising a memory and a processor adapted to execute contents of the memory. The memory contains at least: a database management system; a data warehouse for storing data of a first and a second data source; and a mapping data structure for correlating the data of the first and second data sources in the data warehouse, the mapping data structure comprising a plurality of data records. Each data record in the mapping data structure comprises: a first portion comprising a warehouse internal identifier; a second portion comprising an external identifier common to a first data source and a second data source; and a third portion comprising an internal identifier of the second data source; wherein at least one data record of the plurality of data records comprises a warehouse internal identifier representing an internal identifier of the first data source, a common external identifier associated with the internal identifier of the first data source and an internal identifier of the second data source associated with the common external identifier.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features, advantages and objects of the present invention are attained and can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings.
It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
FIGS. 1–3 are high level diagrams of a computing environment according to an aspect of the present invention;
FIG. 4 is a relational view of a data management system (DMS) according to one embodiment of the invention;
FIGS. 5A–C are flow charts illustrating creation of a mapping data structure according to one embodiment of the invention;
FIGS. 6A–C are relational views of exemplary components of a first data source;
FIGS. 7A–C are relational views of exemplary components of a second data source;
FIGS. 8A–B are relational views of exemplary components of a first and second data source used to create a mapping data structure;
FIGS. 9A–D are relational views of exemplary components of a data warehouse comprising a mapping data structure; and
FIGS.10A–D and11A–D are relational views of exemplary updated components of a data warehouse comprising updated mapping data structures;
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention generally provides a method, a data structure, a computer program product and a computer for correlating different data sources and, more particularly, for correlating data from at least two different data sources in a data warehouse.
In the following description, the term “replicating” is used to designate a real-time copying process of a portion of data, which has been changed, wherein the portion of data is automatically copied when the change occurs. The term “extracting” is used to designate an automated periodical copying process of the entire data at pre-determined time intervals. However, it should be appreciated that both, the described “replicating” and “extracting” processes are interchangeable in the described embodiments and that any modifications to the described “replicating” and “extracting” processes are also contemplated. For instance, “replicating” may comprise copying the entire data when the change occurs and “extracting” may be performed at any desired time and/or may be initiated manually by a user. Furthermore, it should be noted that the terms “identifier” and “key” are interchangeably used in the following description and additionally, where reference is made to a table, the terms “data record” and “row” are used interchangeably.
One embodiment of the invention is implemented as a program product for use with a computer system such as, for example, thecomputing environment100 shown inFIGS. 1–3 and described below. The program(s) of the program product defines functions of the embodiments (including the methods described herein) and can be contained on a variety of signal-bearing media. Illustrative signal-bearing media include, but are not limited to: (i) information permanently stored on non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive); (ii) alterable information stored on writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive); or (iii) information conveyed to a computer by a communications medium, such as through a computer or telephone network, including wireless communications. The latter embodiment specifically includes information downloaded from the Internet and other networks. Such signal-bearing media, when carrying computer-readable instructions that direct the functions of the present invention, represent embodiments of the present invention.
In general, the routines executed to implement the embodiments of the invention, may be part of an operating system or a specific application, component, program, module, object, or sequence of instructions. The computer program of the present invention typically is comprised of a multitude of instructions that will be translated by the native computer into a machine-readable format and hence executable instructions. Also, programs are comprised of variables and data structures that either reside locally to the program or are found in memory or on storage devices. In addition, various programs described hereinafter may be identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature that follows is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
Referring now toFIG. 1, acomputing environment100 comprising anillustrative computer system110 according to one aspect of the present invention is shown. Thecomputer system110 may represent any type of computer, computer system or other programmable electronic device, including a client computer, a server computer, a portable computer, an embedded controller, a PC-based server, a minicomputer, a midrange computer, a mainframe computer, and other computers adapted to support the methods, data structure, computer program product and computer-readable mediums of the invention. Illustratively, thecomputer system110 is part of a networked system; however, in other embodiments, thecomputer system110 is a standalone device. Accordingly, the invention may be practiced in a distributed computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
In any case, it is understood thatFIG. 1 is merely one configuration for a computing environment and computer system. Embodiments of the invention can apply to any comparable configuration, regardless of whether thecomputer system110 is a complicated multi-user apparatus, a single-user workstation, or a network appliance that does not have non-volatile storage of its own.
Thecomputer system110 could include a number of operators and peripheral systems as shown, for example, by amass storage interface137 operably connected to a directaccess storage device138, and by avideo interface144 operably connected to adisplay142. Thedisplay142 may be any video output device for outputting viewable information. Thecomputer system110 could further include a number of storage media drives such as a CD-ROM drive146 and adiskette drive148 suitable to read out information permanently stored on non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by CD-ROM drive146) or alterable information stored on writable storage media (e.g., floppy disks readable by diskette drive148). CD-ROM drive146 anddiskette drive148 are illustrated as forming an integral part ofcomputer system110. It should, however, be appreciated that either CD-ROM drive146 ordiskette drive148, or both of them, may be provided as peripheral devices that are operably connected to thecomputer system110.
Computer system110 may further comprise any type of suitable input means (not shown) to give input to thecomputer system100. For example, a keyboard, keypad, light pen, touch screen, button, mouse, track ball, or speech recognition unit could be used. Further, this input means and display142 could be combined. For example, a display screen with an integrated touch screen, a display with an integrated keyboard or a speech recognition unit combined with a text speech converter could be used.
Computer system110 is shown comprising at least oneprocessor112, which obtains instructions, or operation codes, (also known as opcodes), and data via abus114 from amain memory116. Theprocessor112 could be any processor adapted to support the methods of the invention. In particular, thecomputer processor112 is selected to support the features of the present invention. Illustratively, the processor is a PowerPC® processor available from International Business Machines Corporation of Armonk, N.Y.
Themain memory116 is any memory sufficiently large to hold the necessary programs and data structures.Main memory116 could be one or a combination of memory devices, including Random Access Memory, nonvolatile or backup memory, (e.g., programmable or Flash memories, read-only memories, etc.). In addition,memory116 may be considered to include memory physically located elsewhere incomputing environment100, for example, any storage capacity used as virtual memory or stored on a mass storage device (e.g., direct access storage device138) or on another computer coupled to thecomputer system110.
As shown, themain memory116 includes anoperating system118. Theoperating system118 may be any suitable operating system capable of supporting the functions of the present invention. An exemplary operating system, which may be used to advantage, is Linux. Theoperating system118 may include a database management system (DBMS)126, which may be implemented as a relational database management system (RDBMS). The DBMS may be a separate application stored inmain memory116 or any other storage.
The DBMS is used for managing adatabase120 and, in particular, for storing and retrieving data in adata warehouse122 of thedatabase120. Thedatabase120 may be included inmain memory116 for storing any type of information and, in particular,data warehouse122 and adata structure124. Thedata warehouse122 comprises data of at least two different data sources. Thedata structure124 is suitable to correlate the data of the at least two different data sources indata warehouse122.
Computer system110 further comprises anetwork interface140 operably connected todata bus114.Network interface140 is suitable to establish a network connection betweencomputer system110 and another computer system via a communications network160 (as shown inFIG. 2 and 3), preferably via the Internet.
FIG. 2shows computing environment100 comprisingcomputer system110 ofFIG. 1 and one furtherillustrative computer system210.Computer system110 andcomputer system210 are both connected tocommunications network160. Illustratively,computer system110 comprisesnetwork interface140 operably connected tobus114. For the purpose of simplicity, other features ofcomputer system110, which have been described above with respect toFIG. 1, have been omitted inFIG. 2. Further, while only onecomputer system210 is shown connected tocommunications network160, it should be clear that more than onecomputer system210 may be connected tocommunications network160.
Computer system210 preferably comprises aprocessor212 operably connected to anetwork interface240 and amain memory216 viadata bus214. Thenetwork interface240 is suitable to establish the communication withcomputer system110 viacommunications network160.
Themain memory216 is any memory sufficiently large to hold the necessary programs and data structures.Main memory216 could be one or a combination of memory devices, including Random Access Memory, nonvolatile or backup memory, (e.g., programmable or Flash memories, read-only memories, etc.). In addition,memory216 may be considered to include memory physically located elsewhere incomputing environment100, for example, any storage capacity used as virtual memory or stored on a mass storage device (not shown) or on another computer coupled to thecomputer system210.
As shown, themain memory216 preferably includes adatabase220 for storing any type of information and, in particular, adata source222 representing data to be loaded into thedata warehouse122 ofFIG. 1. Themain memory216 may also include adata replication system230, also referred to as data replicator, suitable to replicate data ofdata source222 to thedata warehouse122. Illustratively, the data replicator is a Data Propagator available from International Business Machines Corporation of Armonk, N.Y.
Data replication system230 may be implemented as a software program, which may be launched to replicate the data ofdata source222 todata warehouse122. Thedata replication system230 may comprise a detection unit (not shown) to detect changes in the data ofdata source222 and a replication unit (not shown) to replicate the data ofdata source222 todata warehouse122 when a change occurs.
InFIG. 2,data replication system230 illustratively resides inmain memory216. However, it should be appreciated thatdata replication system230 may be provided independently ofmain memory216, e.g., in a specifically designed hardware component ofcomputer system210 or on a remote computer system that is connected withcomputer system210, for example, viacommunications network160. Accordingly,data replication system230 may also reside oncomputer system110 ofFIG.1.
FIG. 3shows computing environment100 comprisingcomputer system110 ofFIG. 1 and one furtherillustrative computer system310.Computer system110 andcomputer system310 are both connected tocommunications network160. Illustratively,computer system110 comprisesnetwork interface140 operably connected tobus114. For the purpose of simplicity, other features ofcomputer system110, which have been described above with respect toFIG. 1, have been omitted inFIG. 3. Further, while only onecomputer system310 is shown connected tocommunications network160, it should be clear that more than onecomputer system310 may be connected tocommunications network160. In one embodiment, at least onecomputer system110 ofFIG. 1, at least onecomputer system210 ofFIG. 2 and a plurality ofcomputer systems310 are connected to communications network.
Computer system310 preferably comprises aprocessor312 operably connected to anetwork interface340 and amain memory316 viadata bus314. Thenetwork interface340 is suitable to establish the communication withcomputer system110 viacommunications network160.
Themain memory316 is any memory sufficiently large to hold the necessary programs and data structures.Main memory316 could be one or a combination of memory devices, including Random Access Memory, nonvolatile or backup memory, (e.g., programmable or Flash memories, read-only memories, etc.). In addition,memory316 may be considered to include memory physically located elsewhere incomputing environment100, for example, any storage capacity used as virtual memory or stored on a mass storage device (not shown) or on another computer coupled to thecomputer system310.
As shown, themain memory316 preferably includes adatabase320 for storing any type of information and, in particular, adata source322 representing data to be loaded into thedata warehouse122 ofFIG. 1. Themain memory216 may also include adata extraction system330, also referred to as data extractor, suitable to extract data ofdata source322 to thedata warehouse122. Illustratively, the data extractor is a DB2 Warehouse Manager available from International Business Machines Corporation of Armonk, N.Y.
Data extraction system330 may be implemented as a software program, which may be launched to extract the data ofdata source322 todata warehouse122. Thedata extraction system330 may comprise a timer (not shown) adapted to count time units and an extraction unit (not shown) to extract the data ofdata source322 todata warehouse122 dependent on the time units counted by the timer.
InFIG. 3,data extraction system330 is illustratively resides inmain memory316. However, it should be appreciated thatdata extraction system330 may be provided independently ofmain memory316, e.g., in a specifically designed hardware component ofcomputer system310 or on a remote computer system that is connected withcomputer system310, for example, viacommunications network160. Accordingly,data extraction system330 may also reside oncomputer system110 ofFIG.1. Furthermore, it should be noted that any one ofcomputer systems110 ofFIG. 1,210 ofFIG. 2 and 310 ofFIG. 3 may comprises arbitrary combinations of the above described features. For instance,main memory116 ofcomputer system110 ofFIG. 1 may comprisedata source222 anddata replication system230 ofFIG. 2 and/ordata source322 anddata extraction system330 ofFIG. 3 in addition to the features described above with reference toFIG. 1.
FIG. 4 shows a relational view of a data management system (DMS)400 according to one embodiment of the invention. The DMS400 comprises a first data source410 (e.g.,data source222 ofFIG. 2), a second data source420 (e.g.,data source322 ofFIG. 3) and a database management system (DBMS)430 (e.g.,DBMS126 ofFIG. 1), which may be implemented on one or more computer systems (e.g.,computer systems110 ofFIG.1,210 ofFIG. 2 and 310 ofFIG. 3).
Illustratively, theDBMS430 comprises a data warehouse432 (e.g.,data warehouse122 ofFIG. 1) and anupdating system440. Thedata warehouse432 comprises a first plurality of tables434 comprising data loaded from thefirst data source410. Thefirst data source410 represents a real-time data source, i.e., data from thefirst data source410 is replicated to the first plurality of tables434. Thedata warehouse432 further comprises a second plurality of tables436 comprising data loaded from thesecond data source410. Thesecond data source420 represents an extracted data source, i.e., data from thesecond data source420 is extracted to the second plurality of tables436. Thedata warehouse432 illustratively comprises a mapping table438 for correlating the data comprised in the first plurality of tables434 and the second plurality of tables436. In one embodiment, the updatingsystem440 may be implemented as a trigger program. The trigger program may be activated when data is loaded from thefirst data source410 and/or thesecond data source420. The trigger program is adapted to detect changes in the loaded data and to update the mapping table438 on the basis of the detected changes.
The DMS400 further comprises a data replicator412 (e.g.,data replicator230 ofFIG. 2) and a data extractor422 (e.g.,data extractor330 ofFIG. 3), which may be implemented separately on different computer systems (e.g.,computer systems110 ofFIG.1,210 ofFIG. 2 and 310 ofFIG. 3) or, alternatively, together on one computer system (e.g., any one ofcomputer systems110 ofFIG.1,210 ofFIG. 2 and 310 ofFIG. 3). Thedata replicator412 is implemented as replication software adapted to monitor thefirst data source410 and to replicate data from thefirst data source410 to the first plurality of tables434 when a change occurs. Thedata extractor422 is implemented as extraction software adapted to extract data from thesecond data source420 to the second plurality of tables436 periodically.
The DMS400 initially performs a loading process to create mapping table438, which is preferably primed with internal keys of data records of thefirst data source410 and comprises a plurality of mapping data records. The mapping table438 includes at least one warehouse internal identifier representing a primary key, and at least one external identifier it maps to. The data from thefirst data source410 is loaded first to the first plurality of tables434 ofdata warehouse432, whereby the first plurality of tables434 is populated. For each data record in thefirst data source410, the warehouse internal identifier is the same as a corresponding internal key of a data record of thefirst data source410.
After loading the data from thefirst data source410, data from at least onesecond data source420 is loaded to the second plurality of tables436 ofdata warehouse432, whereby the second plurality of tables436 is populated. Each data record of thesecond data source420 may comprise an internal key and at least one external key it maps to. If an external key of a data record from thesecond data source420 matches an external identifier of thefirst data source410 or an external identifier in a data record of a table representing a historical external identifier mapping table, the corresponding data record in the mapping table438 is updated to include the extracted data sources internal key.
Otherwise, a new mapping data record is inserted into the mapping table438 with a value of warehouse internal identifier being a generated value in a range different from a range of possible values of the internal keys of the data records of thefirst data source410. For instance, if the values of the first data sources internal keys represent 32-bit integers, the warehouse internal keys may represent 64-bit integers. Accordingly, by adding 5 billion to the second data sources internal key, it may be guaranteed that the generated values of warehouse internal keys would not overlap the values of internal keys of data records of thefirst data source410. If data of a third data source is loaded to thedata warehouse432, an additional range of possible values may be included, so that the internal keys of a third data source may, for example, start at 10 billion.
According to the data records of the mapping table438, the data records of the second plurality of tables436 in thedata warehouse432 are updated to use the internal keys of the mapping table438. This completes the initial loading process of the DMS400. Ongoing updates to thedata warehouse432 require handling changes to the mapping table438 and to tables from thesecond data source420.
When, for instance, the initial mapping of the internal identifiers of thefirst data source410 to associated external identifiers changes, the mapping table438 must also be updated. In this case, the updatingsystem440 may replace the mapping table438 with an internal-to-external identifier mapping table of thefirst data source410, e.g., via a copying process. Thus, when data records are inserted, updated, or deleted in the internal-to-external identifier mapping table of thefirst data source410, the corresponding changes are made to the mapping table438. As these changes are simple changes to the mapping table438, they occur quickly and do not significantly impact the replication process.
Changes to thesecond data source420 are processed in a manner similar to the processing during the initial load. If the external key of an updated data record from thesecond data source420 matches an external identifier in the mapping table438 or in the historical external identifier mapping table, the corresponding data record in the mapping table438 is updated as described above. Otherwise, a new data record is inserted as described above. If any updates are made to the mapping table438, the corresponding data records of the second plurality of tables436 in thedata warehouse432 are accordingly updated.
FIGS. 5A–C show amethod500 of correlating at least a first plurality of data records and a second plurality of data records. Each data record of the first plurality of data records is uniquely identified within a corresponding data source by an associated internal identifier and each data record of the first and second plurality of data records comprises at least one external identifier. In general, according toFIG. 5A, a data record of the first plurality of data records and at least one data record of the second plurality of data records having an identical external identifier are determined. According toFIGS. 5B and 5C, the at least one determined data record of the second plurality of data records is mapped to the internal identifier associated with the determined data record of the first plurality of data records.
Referring now toFIG. 5A, instep510 the first plurality of data records is received, for example, indata warehouse122 ofFIG. 1 or432 ofFIG. 4. The first plurality of data records may be loaded from a first data source (e.g.,data source222 ofFIG. 2 or410 ofFIG. 4). Loading the data from the first data source comprises loading the data from a real-time data source, wherein loading data from the real-time data source comprises replicating the data. Replicating the data comprises monitoring the data in the real-time data source, and, if a portion of the data has changed, launching a replication program suitable for copying the portion of the data. Instep520, for each data record of the first plurality of data records, an associated internal identifier is retrieved from the data record. Instep530, at least one external identifier is retrieved from each data record. Instep540, for each data record, a mapping data record is created in a mapping data structure (e.g., mapping table438 ofFIG. 4), the mapping data record comprising the retrieved associated internal identifier and the at least one retrieved external identifier of a corresponding data record of the first plurality of data records.
Instep550 the second plurality of data records is received, for example, indata warehouse122 ofFIG. 1 or432 ofFIG. 4. The second plurality of data records may be loaded from a second data source (e.g.,data source322 ofFIG. 3 or420 ofFIG. 4). Loading the data from the second data source comprises loading the data from an extracted data source, wherein loading data from the extracted data source comprises extracting the data. Extracting the data comprises launching an extraction program suitable for copying the data. Instep560, for each data record of the second plurality of data records, at least one external identifier is retrieved from the data record. Instep570, the at least one retrieved external identifier is compared with the at least one external identifier of each mapping data record in the mapping data structure. Instep580, a determination is made whether the at least one retrieved external identifier matches the at least one external identifier of a mapping data record in the mapping data structure. Accordingly, the determination is made for each data record in the second data record.
Referring now toFIG. 5B, if the retrieved external identifier of the second data source matches an external identifier of a specific mapping data record in the mapping data structure, an internal identifier associated with the data record of the second plurality of data records is retrieved from the data record instep582. The retrieved internal identifier uniquely identifies the data record of the second plurality of data records within a corresponding data source. Instep592, the retrieved internal identifier and an indication of the corresponding data source is copied to the matching mapping data record in the mapping data structure.
Referring now toFIG. 5C, if the retrieved external identifier of the second data source does not match an external identifier of any mapping data record in the mapping data structure, an available internal identifier (generated from a range beyond that of the first data source such as adding 5 billion to the internal identifier of the second data source) is associated with the non-matching data record of the second plurality of data records instep584. The available internal identifier uniquely identifies the non-matching data record within the mapping data structure. Instep586, an internal identifier associated with the non-matching data record of the second plurality of data records is retrieved from the data record. The retrieved internal identifier uniquely identifies the data record of the second plurality of data records within a corresponding data source. Instep594, the associated available internal identifier, the retrieved external identifier and an indication of the corresponding data source is copied to mapping data structure. Accordingly, a new mapping data record is created in the mapping data structure. The new mapping data record comprises the associated available internal identifier, the retrieved external identifier and the indication of the corresponding data source.
For purposes of illustration,FIGS. 6A–11D show a detailed example of correlating data of two different data sources according to an aspect of the invention.
FIGS. 6A–C, in the following collectively referred to asFIG. 6, show relational views of exemplary components of a first data source600 (e.g.,data source222 ofFIG. 2 or410 ofFIG. 4). Thefirst data source600 illustratively represents data relating to a credit card system. The data will be replicated in real-time to a data warehouse, for example, todata warehouse122 ofFIG. 1 or432 ofFIG. 4.
The first data source comprises tables610,620,630 and640, which are labeled “CC_Account”, “CC_Transaction”, “CC_Customer” and “CC_Account History”, respectively. Table610 comprises afirst column612 labeled “Account Internal Key”, which comprises internal keys uniquely identifying data records representing accounts within the credit card system, and asecond column614 labeled “Credit Card #”, which comprises external keys. Table610 maps the internal keys ofcolumn612 to the external keys ofcolumn614. Table620 is a table of transactions and comprisescolumns622 labeled “Transaction Number”,624 labeled “Account Internal Key”,626 labeled “Date” and628 labeled “Price”.Column624 comprises keys that match internal keys incolumn612 of table610. Table620 illustrates transactions, each transaction representing a data record, i.e. a row in table620, which have been performed, for example, by customers using corresponding credit cards. The transactions are mapped to the external keys incolumn614 of table610 via the keys incolumn624. Table630 is a table of customers and comprisescolumns612,632 labeled “Name”,634 labeled “Address”,636 labeled “City” and638 labeled “State”. Table630 maps the credit card numbers, i.e., the external keys ofcolumn614 of table610 via the keys incolumn612 to specific customers and provides personal information on each customer. Table640 comprisescolumns642 labeled “Account Internal Key” and644 labeled “Past Credit Card #”.Column642 comprises keys that match internal keys incolumn612 of table610. Table640 illustrates past credit card numbers an account was associated with and maps the past credit card numbers with the credit card numbers ofcolumn614 of table610 using the internal keys. For instance,data record618 of table610 indicates that internal key “5” is associated with current credit card number “111111110011” and row646 of table640 indicates that the internal key “5” was previously associated with past credit card number “000000000011”.
By using an internal key ofcolumn612 of table610, data records in tables610,620,630 and640 may be joined. Joined data records can provide information relating to the credit card system. For instance, frequency of use of credit cards may be determined or transactions performed with a credit card may easily been mapped to a specific customer. Accordingly, by joining tables610,620 and630 usinginternal key616 of table610, i.e., “1”, it can be determined that “John Smith” used credit card number “777777770011” to make a purchase of “$10.00” on “Apr. 5, 2002” and to make another purchase of “$701.12” on “Apr. 6, 2002”.
FIGS. 7A–C, in the following collectively referred to asFIG. 7, show relational views of exemplary components of a second data source700 (e.g.,data source322 ofFIG. 3 or420 ofFIG. 4). Thesecond data source700 illustratively represents data relating to a reward system. The data will be extracted, e.g., as a daily extract to a data warehouse, for example, todata warehouse122 ofFIG. 1 or432 ofFIG. 4.
The second data source comprises, in analogy to tables610,620 and630 ofFIG. 6, tables710,720 and730, which are labeled “R_Account”, “R_Transaction” and “R_Customer”, respectively. Table710 comprises afirst column712 labeled “Account Internal Key”, which comprises internal keys uniquely identifying data records representing accounts within the reward system, and asecond column714 labeled “Reward Card #”, which comprises external keys. The external keys ofcolumn714 represent reward card numbers, which are identical to external keys ofcolumn614 ofFIG. 6, which represent credit card numbers. Table710 maps the internal keys ofcolumn712 to the external keys ofcolumn714. Table720 is a table of transactions and comprisescolumns722 labeled “Transaction Number”,724 labeled “Account Internal Key”,726 labeled “Date” and728 labeled “Points”.Column724 comprises keys that match internal keys incolumn712 of table710. Table720 illustrates transactions, each transaction representing a data record, i.e. a row in table720, which have been performed, for example, by customers using corresponding reward cards and for which a certain amount of points has been rewarded according tocolumn728. The transactions are mapped to the external keys incolumn714 of table710 via the keys incolumn724. Table730 is a table of customers and comprisescolumns712,732 labeled “Name”,734 labeled “Address”,736 labeled “City” and738 labeled “State”. Table730 maps the reward card Nos., i.e., the external keys ofcolumn714 of table710 via the keys incolumn712 to specific customers and provides personal information on each customer.
By using an internal key ofcolumn712 of table710, data records in tables710,720 and730 may be joined. Joined data records can provide information relating to the reward system. For instance, frequency of use of reward cards may be determined or transactions performed with a reward card may easily been mapped to a specific customer. Accordingly, by joining tables710,720 and730 usinginternal key716 of table710, i.e., “101”, it can be determined that “John Smith” used reward card number “777777770011” on “Apr. 5, 2002” and earned “10” points and on “Apr. 6, 2002” and earned “500” points.
FIGS. 8A–11D illustrate a method of correlating the data of thefirst data source600 ofFIG. 6 and the data of thesecond data source700 ofFIG. 7 in a data warehouse (e.g.,data warehouse122 ofFIG. 1 or432 ofFIG. 4). This may, for example, be useful to determine the number of reward points a customer earned by using a corresponding credit card.
FIGS. 8A–B, in the following collectively referred to asFIG. 8, show an exemplary relational view of adatabase800.Database800 illustratively represents a portion of a data warehouse, e.g.,data warehouse122 ofFIG. 1 or432 ofFIG. 4, used to initially create a mapping data structure810 (e.g.,data structure124 ofFIG. 1 or mapping table438 ofFIG. 4).Database800 comprises tables610 and640 of thefirst data source600 ofFIG. 6, table710 of thesecond data source700 ofFIG. 7 and themapping data structure810 labeled “WH_Mapping”. Tables610,640 and710 have been loaded from the corresponding data sources todatabase800.
Mapping data structure810 illustratively represents a table that comprises afirst column812 labeled “WHKEY”, which comprises internal keys uniquely identifying data records representing accounts withindatabase800, asecond column814 labeled “EXTID” which comprises external keys associated with data records of the first and second data sources, and athird column816 labeled “XTRCT1KEY” comprising internal keys uniquely identifying data records within thesecond data source700. The external keys ofcolumn814 represent external keys, which are common to data records of the first and second data sources. Accordingly,mapping data structure810 correlates the data records of the first and second data sources by mapping associated internal identifiers to common external identifiers of the data records.
Mapping data structure810 is generated by creating a mapping data record for each of the data records of table610, for example, by copying the data records of table610 intomapping data structure810. Thus, for each data record of table610, a corresponding mapping data record is created inmapping data structure810, whereby the internal keys ofcolumn612 of table610 are copied tocolumn812 and the external keys ofcolumn614 of table610 are copied tocolumn814. Furthermore, for each row in table710, a mapping data record in mappingdata structure810 is determined, which comprises an external identifier incolumn814 that matches the external identifier ofcolumn714 of table710. The internal identifier ofcolumn712 of table710 associated with a matching external identifier is copied tocolumn816 of the determined mapping data record.
For each data record associated with an external identifier incolumn814 ofmapping data structure810, where no data record with a matching external identifier incolumn714 of table710 could be determined, no value is created incolumn816, which results in the value being a “null” value. This indicates that there is no match as, e.g., indata record822. For each data record associated with an external identifier incolumn714 of table710, where no data record with a matching external identifier incolumn814 ofmapping data structure810 could be determined, table640 is accessed, as, e.g., for row830 of table710. Ifcolumn644 of table640 comprises an external identifier that matches an external identifier incolumn714 of a data record in table710, which does not match an external identifier incolumn814 ofmapping data structure810, the associated internal identifier is retrieved fromcolumn642 of table640. As shown inFIG. 8, data record830 of table710 and row646 of table640 comprise an identical external key. As described above with respect toFIG. 6, the associated internal identifier matches an internal key incolumn612 of table610 and, thus, matches an internal key incolumn812 ofmapping data structure810. Accordingly, the associated internal identifier may be copied to the corresponding data record in mappingdata structure810 intocolumn816. Ifcolumn644 of table640 does not contain a matching external identifier, a new value is created incolumn812, for example, as described above with respect toFIG. 4 (e.g., value “5G+106” in data record824).
FIGS. 9A–D, in the following collectively referred to asFIG. 9, show an exemplary relational view of adatabase900.Database900 illustratively represents a data warehouse,e.g. data warehouse122 ofFIG. 1 or432 ofFIG. 4, wherein the tables of thesecond data source700 have been updated so that queries may be issued against thedata warehouse900.Database900 comprises tables610,620,630 and640 of thefirst data source600 ofFIG. 6,mapping data structure810 ofFIG. 8 and tables920 and930.
Tables920 and930 represent tables720 and730 ofFIG. 7, respectively, which have been updated on the basis ofmapping data structure810. Table710 ofFIGS. 7 and 8 labeled “R_Account” is not shown inFIG. 9, but is unchanged and not used in queries of thedatabase900. Tables720 and730 ofFIG. 7 are updated to use the internal identifiers ofcolumn812 labeled “WHKEY” ofmapping data structure810 by retrieving the internal identifiers incolumns724 and712 of tables720 and730, respectively, and replacing the internal identifiers with the corresponding internal keys ofcolumn812 ofmapping data structure810. To construct queries against thedatabase900, themapping data structure810 is used to join the tables620,630,920 and930 by mapping the internal keys ofcolumn812 to the internal identifiers incolumns624,612,724 and712 of the tables620,630,920 and930, respectively.
FIGS.10A–D, in the following collectively referred to asFIG. 10, show an exemplary relational view of adatabase1000.Database1000 illustratively representsdatabase900, which has been updated. Illustratively,database1000 comprises tables920 and930 ofFIG. 9,mapping data structure1050 and tables1010,1020,1030 and1040.
Tables1010,1020,1030 and1040 represent tables610,620,630 and640 ofFIG. 6, respectively, which have been updated. For brevity, only the updates will be described (as other aspects ofdatabase1000 have been described above). Data record1012 of table1010 anddata record1042 indicate that the account with credit card number “555555550011” was given a new credit card number “999999990011” in the credit card system. Rows1014 of table1010 and1032 of table1030 indicate that a new customer account was opened for “Lynn Nelson” and that credit card number “121212120011” is associated with the new account in the credit card system.Row1022 of table1020 indicates a new transaction associated with the new account.Mapping data structure1050 was updated as described above with respect toFIG. 8 and comprises, accordingly, an updated row1052 and anew row1054.
FIGS. 11A–D, in the following collectively referred to asFIG. 11, show an exemplary relational view of adatabase1100.Database1100 illustratively representsdatabase1000, which has been updated. In more detail,database1100 comprises tables920,1020,1030 and1040 ofFIG. 10,mapping data structure1150 and tables1110 and1130.
Tables1110 and1130 represent tables710 ofFIG. 7 and 930 ofFIG. 10, which have been updated. In the following, only the updates will be described in more detail in the light of the above description. Data record1112 of table1110 indicates that the reward card number “000000000011” was updated to reward card number “111111110011” in the reward system and, thus, reflects the current credit card number associated with the corresponding account in the credit card system, as described with respect toFIG.6. Rows1114 and1116 of table1110 androws1132 and1134 of table1130, respectively, indicate that new customer accounts were opened for “Lynn Nelson” and “Mike Golding” and that reward card numbers “121212120011” and “131313130011” are associated with the new accounts, respectively.Mapping data structure1150 was updated as described above with respect toFIG. 8 and comprises, accordingly,new rows1152 and1154.
Some of the foregoing embodiments are described with reference to data warehouses. However, it should be appreciated that the invention is not limited to use in the context of a data warehouse. Furthermore, the invention is not limited to tables comprising data records in a relational database and/or replicating/extracting data from data sources to tables in a data warehouse. Therefore, the described embodiments should be understood as being rather exemplary for purposes of illustration and are, however, not intended to limit the invention thereto. Instead, the invention is intended for use in correlating any type of data from at least a first and a second data source via a mapping data structure, the data representing a data record or comprising at least one data record.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (33)

1. A method of correlating at least a first plurality of data records and a second plurality of data records stored in a computer database, each data record of the first plurality of data records being uniquely identified within a corresponding data source by an associated internal identifier and each data record of the first and second plurality of data records comprising at least one external identifier, the method comprising:
receiving the first and second plurality of data records, wherein the first plurality of data records is received from a real-time data source and the second plurality of data records is received periodically from a non real-time data source;
determining a data record of the first plurality of data records and at least one data record of the second plurality of data records having an identical external identifier;
mapping the at least one determined data record of the second plurality of data records to the internal identifier associated with the determined data record of the first plurality of data records; and
storing the mapping in storage.
2. A method of creating a data warehouse mapping data structure stored on computer readable storage media to correlate at least two different data sources, the method comprising:
creating a plurality of mapping data records, each mapping data record comprising:
a first value representing an internal identifier uniquely identifying the mapping data record in the warehouse mapping data structure;
a second value representing an external identifier of one of a data record of a first data source, wherein the first data source comprises a real-time data source, and a data record of a second data source, wherein the second data source comprises a non real-time data source; and
a third value representing an internal identifier uniquely identifying a data record of the second data source in the second data source, the data record of the second data source having the second value as external identifier;
whereby a correlation between the first and the second data sources is established; and
storing the plurality of mapping data records in the warehouse mapping data structure.
3. A method of correlating data from at least two different data sources stored on computer readable storage media in a data warehouse, the method comprising:
loading data from a first data source, wherein the first data source comprises a real-time data source, into the data warehouse, the data from the first data source comprising a plurality of first internal identifiers and a plurality of first external identifiers;
creating a warehouse mapping data structure on the basis of the plurality of internal identifiers each mapped to an associated first external identifier;
loading data from a second data source, wherein the second data source comprises a non-real time data source, into the data warehouse, the data from the second data source comprising a plurality of second internal identifiers each associated with a second external identifier, wherein at least one of the associated second external identifiers is identical to one of the first external identifiers;
mapping each second internal identifier associated with a second external identifier that is identical to one of the first external identifiers in the warehouse mapping data structure to the first internal identifier of the identical matching first external identifier, whereby a correlation between data of the first and the second data sources is established; and
storing the warehouse mapping data structure.
4. A computer-readable storage medium comprising computer-executable instructions which, when run on a computer system, perform an operation, comprising:
correlating at least a first plurality of data records and a second plurality of data records, wherein the first plurality of data records is received from a real-time data source and the second plurality of data records is received periodically from a non real-time data source, and wherein each data record of the first plurality of data records is uniquely identified within a corresponding data source by an associated internal identifier and each data record of the first and second plurality of data records comprising at least one external identifier, the correlating comprising:
determining a data record of the first plurality of data records and at least one data record of the second plurality of data records having an identical external identifier; and
mapping the at least one determined data record of the second plurality of data records to the internal identifier associated with the determined data record of the first plurality of data records; and storing the mapping.
6. The computer-readable medium ofclaim 5, wherein the determining further comprises,
for each data record of the second plurality of data records:
retrieving an internal identifier associated with the data record, the retrieved internal identifier uniquely identifying the data record of the second plurality of data records within a corresponding data source;
retrieving the at least one external identifier from the data record; and
comparing the at least one retrieved external identifier with the at least one external identifier of each mapping data record in the mapping data structure; and
wherein the mapping further comprises, if the at least one retrieved external identifier matches an external identifier of a specific mapping data record in the mapping data structure:
copying the retrieved internal identifier and an indication of the corresponding data source to the mapping data record.
7. The computer-readable medium ofclaim 6, wherein the mapping further comprises, if the at least one retrieved external identifier does not match an external identifier of any mapping data record in the mapping data structure:
associating an available internal identifier with the data record of the second plurality of data records, the available internal identifier uniquely identifying the data record within the mapping data structure; and
copying the associated available internal identifier, the retrieved external identifier and an indication of the corresponding data source to the mapping data structure; and
creating a new mapping data record in the mapping data structure, the new mapping data record comprising the associated available internal identifier, the retrieved external identifier and the indication of the corresponding data source.
8. A computer-readable storage medium comprising computer-executable instructions for performing, when run on a computer system, an operation, for correlating data records, comprising:
creating a data warehouse mapping data structure to correlate data records from at least two different data sources, the correlating comprising:
creating a plurality of mapping data records in the warehouse mapping data structure, each mapping data record comprising:
a first value representing an internal identifier uniquely identifying the mapping data record in the warehouse mapping data structure;
a second value representing an external identifier of one of a data record of a first data source, wherein the first data source comprises a real-time data source, and a data record of a second data source, wherein the second data source comprises a non real-time data source; and
a third value representing an internal identifier uniquely identifying a data record of the second data source, the data record of the second data source having the second value as the external identifier;
whereby a correlation between the first and the second data sources is established; and
storing the data warehouse mapping data structure.
17. The computer-readable medium ofclaim 8, wherein the operation further comprises:
replicating the data from the first data source from a real-time data source into the data warehouse;
for each data record in the replicated data:
copying an internal identifier uniquely identifying the data record in the first data source from the data record to the first value in a mapping data record;
copying the external identifier from the data record to the second value in the mapping data record;
extracting the data from the second data source from an extracted data source to the data warehouse;
for each data record in the extracted data:
comparing the external identifier of the data record with the second value in each created mapping data record; and
if the external identifier of the data record matches the second value in a specific mapping data record:
copying the internal identifier from the data record in the extracted data to the third value in the specific mapping data record.
18. The computer-readable medium ofclaim 8, wherein the operation further comprises:
extracting the data from the second data source from an extracted data source to the data warehouse;
for each data record in the extracted data:
comparing the external identifier of the data record with the second value in each created mapping data record; and
if the external identifier of the data record matches the second value in a specific mapping data record:
copying the internal identifier from the data record in the extracted data to the third value in the specific mapping data record; and
if the external identifier of the data record does not match the second value in a mapping data record:
creating a new mapping data record;
attributing a first value to the new mapping value; and
copying the external identifier from the data record to the second value in the new mapping data record.
19. A computer-readable storage medium comprising computer-executable instructions for performing, when run on a computer system, an operation of correlating data from at least two different data sources in a data warehouse, the operation comprising:
loading data from a first data source into the data warehouse, wherein the first data source comprises a real-time data source, the data from the first data source comprising a plurality of first internal identifiers and a plurality of shared external identifiers;
creating a warehouse mapping data structure on the basis of the plurality of internal identifiers each mapped to an associated first shared external identifier;
loading data from a second data source into the data warehouse, wherein the second data source comprises a non real-time data source, the data from the second data source comprising a plurality of second internal identifiers each associated with a second external identifier;
mapping each second internal identifier associated with a second external identifier that is identical to one of the first external identifiers in the warehouse mapping data structure to the first internal identifier of the identical matching first external identifier, whereby a correlation between data of the first and the second data sources is established; and
storing the mapping.
27. A mapping data structure residing on a computer-readable storage medium, the mapping data structure comprising a plurality of data records, each data record comprising:
a first portion comprising a warehouse internal identifier;
a second portion comprising an external identifier common to a first data source, wherein the first data source is a real time data source, and a second data source, wherein the second data source is a non-real time data source;
a third portion comprising an internal identifier of the second data source; wherein at least one data record of the plurality of data records comprises a warehouse internal identifier representing an internal identifier of the first data source, a common external identifier associated with the internal identifier of the first data source and an internal identifier of the second data source associated with the common external identifier, whereby a correlation between data of the first and the second data sources is established; and
storing the mapping data structure.
29. A computer, comprising: a memory containing at least:
a data warehouse for storing data of a first and a second data, source wherein the first data source is a real time data source, and wherein the second data source is a non-real time data source; and
a mapping data structure for correlating the data of the first and second data sources in the data warehouse, the mapping data structure comprising a plurality of data records, each data record in the mapping data structure comprising:
a first portion comprising a warehouse internal identifier;
a second portion comprising an external identifier common to a first data source and a second data source; and
a third portion comprising an internal identifier of the second data source; wherein at least one data record of the plurality of data records comprises a warehouse internal identifier representing an internal identifier of the first data source, a common external identifier associated with the internal identifier of the first data source and an internal identifier of the second data source associated with the common external identifier; and
a processor adapted to execute contents of the memory.
US10/161,2512002-05-312002-05-31Method and process to optimize correlation of replicated with extracted data from disparate data sourcesExpired - Fee RelatedUS7200619B2 (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
US10/161,251US7200619B2 (en)2002-05-312002-05-31Method and process to optimize correlation of replicated with extracted data from disparate data sources

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
US10/161,251US7200619B2 (en)2002-05-312002-05-31Method and process to optimize correlation of replicated with extracted data from disparate data sources

Publications (2)

Publication NumberPublication Date
US20030225742A1 US20030225742A1 (en)2003-12-04
US7200619B2true US7200619B2 (en)2007-04-03

Family

ID=29583384

Family Applications (1)

Application NumberTitlePriority DateFiling Date
US10/161,251Expired - Fee RelatedUS7200619B2 (en)2002-05-312002-05-31Method and process to optimize correlation of replicated with extracted data from disparate data sources

Country Status (1)

CountryLink
US (1)US7200619B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
EP3923154A1 (en)*2020-06-112021-12-15Ivalua SasIdentifying and generating links between data

Families Citing this family (84)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US9710852B1 (en)2002-05-302017-07-18Consumerinfo.Com, Inc.Credit report timeline user interface
US9400589B1 (en)2002-05-302016-07-26Consumerinfo.Com, Inc.Circular rotational interface for display of consumer credit information
EP1403794A1 (en)*2002-09-272004-03-31Sap AgMethod and system for automatic storage of business data
US7284016B2 (en)*2002-12-032007-10-16Emc CorporationClient-server protocol for directory access of snapshot file systems in a storage system
US7069269B2 (en)*2002-12-032006-06-27International Business Machines CorporationMethod, system and program product for mapping data fields between a data source and a data target
EP1601163A1 (en)*2004-05-282005-11-30moxite GmbHSystem and method for replication, integration, consolidation and mobilisation of data
US8160947B2 (en)*2004-09-152012-04-17Adobe Systems IncorporatedMethods and systems for identifying and tracking potential property rights in software products
US20060256739A1 (en)*2005-02-192006-11-16Kenneth SeierFlexible multi-media data management
US10121152B2 (en)*2006-09-292018-11-06Visa U.S.A. Inc.Consumer specific conditional rewards
US8036979B1 (en)2006-10-052011-10-11Experian Information Solutions, Inc.System and method for generating a finance attribute from tradeline data
US20080294540A1 (en)2007-05-252008-11-27Celka Christopher JSystem and method for automated detection of never-pay data sets
WO2009039411A1 (en)*2007-09-212009-03-26Presenceid, Inc.Systems and methods for receiving and sending messages about changes to data attributes
DE102007057248A1 (en)*2007-11-162009-05-20T-Mobile International Ag Connection layer for databases
US9990674B1 (en)2007-12-142018-06-05Consumerinfo.Com, Inc.Card registry systems and methods
US8127986B1 (en)2007-12-142012-03-06Consumerinfo.Com, Inc.Card registry systems and methods
US8312033B1 (en)2008-06-262012-11-13Experian Marketing Solutions, Inc.Systems and methods for providing an integrated identifier
US9256904B1 (en)2008-08-142016-02-09Experian Information Solutions, Inc.Multi-bureau credit file freeze and unfreeze
US8060424B2 (en)2008-11-052011-11-15Consumerinfo.Com, Inc.On-line method and system for monitoring and reporting unused available credit
US8478705B2 (en)2010-01-152013-07-02International Business Machines CorporationPortable data management using rule definitions
US9652802B1 (en)2010-03-242017-05-16Consumerinfo.Com, Inc.Indirect monitoring and reporting of a user's credit data
US20110321167A1 (en)*2010-06-232011-12-29Google Inc.Ad privacy management
US8468119B2 (en)*2010-07-142013-06-18Business Objects Software Ltd.Matching data from disparate sources
US8666998B2 (en)2010-09-142014-03-04International Business Machines CorporationHandling data sets
US8886596B2 (en)*2010-10-112014-11-11Sap SeMethod for reorganizing or moving a database table
US8949166B2 (en)2010-12-162015-02-03International Business Machines CorporationCreating and processing a data rule for data quality
US8725759B2 (en)*2011-01-042014-05-13Bank Of America CorporationExposing data through simple object access protocol messages
WO2012112781A1 (en)2011-02-182012-08-23Csidentity CorporationSystem and methods for identifying compromised personally identifiable information on the internet
US8510446B1 (en)*2011-03-152013-08-13Symantec CorporationDynamically populating an identity-correlation data store
US9607336B1 (en)2011-06-162017-03-28Consumerinfo.Com, Inc.Providing credit inquiry alerts
US9483606B1 (en)2011-07-082016-11-01Consumerinfo.Com, Inc.Lifescore
US8898104B2 (en)2011-07-262014-11-25International Business Machines CorporationAuto-mapping between source and target models using statistical and ontology techniques
US9106691B1 (en)2011-09-162015-08-11Consumerinfo.Com, Inc.Systems and methods of identity protection and management
US8738516B1 (en)2011-10-132014-05-27Consumerinfo.Com, Inc.Debt services candidate locator
US11030562B1 (en)2011-10-312021-06-08Consumerinfo.Com, Inc.Pre-data breach monitoring
US20130208880A1 (en)*2011-12-222013-08-15Shoregroup, Inc.Method and apparatus for evolutionary contact center business intelligence
US8996539B2 (en)*2012-04-132015-03-31Microsoft Technology Licensing, LlcComposing text and structured databases
US9853959B1 (en)2012-05-072017-12-26Consumerinfo.Com, Inc.Storage and maintenance of personal data
US9654541B1 (en)2012-11-122017-05-16Consumerinfo.Com, Inc.Aggregating user web browsing data
US9916621B1 (en)2012-11-302018-03-13Consumerinfo.Com, Inc.Presentation of credit score factors
US10255598B1 (en)2012-12-062019-04-09Consumerinfo.Com, Inc.Credit card account data extraction
US9947007B2 (en)*2013-01-272018-04-17Barry GreenbaumPayment information technologies
US9870589B1 (en)2013-03-142018-01-16Consumerinfo.Com, Inc.Credit utilization tracking and reporting
US10102570B1 (en)2013-03-142018-10-16Consumerinfo.Com, Inc.Account vulnerability alerts
US8812387B1 (en)2013-03-142014-08-19Csidentity CorporationSystem and method for identifying related credit inquiries
US9406085B1 (en)2013-03-142016-08-02Consumerinfo.Com, Inc.System and methods for credit dispute processing, resolution, and reporting
US10685398B1 (en)2013-04-232020-06-16Consumerinfo.Com, Inc.Presenting credit score information
US9443268B1 (en)2013-08-162016-09-13Consumerinfo.Com, Inc.Bill payment and reporting
US9734230B2 (en)*2013-09-122017-08-15Sap SeCross system analytics for in memory data warehouse
US9734221B2 (en)2013-09-122017-08-15Sap SeIn memory database warehouse
US9773048B2 (en)2013-09-122017-09-26Sap SeHistorical data for in memory data warehouse
US10102536B1 (en)2013-11-152018-10-16Experian Information Solutions, Inc.Micro-geographic aggregation system
US10325314B1 (en)2013-11-152019-06-18Consumerinfo.Com, Inc.Payment reporting systems
US9477737B1 (en)2013-11-202016-10-25Consumerinfo.Com, Inc.Systems and user interfaces for dynamic access of multiple remote databases and synchronization of data based on user rules
USD760256S1 (en)2014-03-252016-06-28Consumerinfo.Com, Inc.Display screen or portion thereof with graphical user interface
USD759689S1 (en)2014-03-252016-06-21Consumerinfo.Com, Inc.Display screen or portion thereof with graphical user interface
USD759690S1 (en)2014-03-252016-06-21Consumerinfo.Com, Inc.Display screen or portion thereof with graphical user interface
US9892457B1 (en)2014-04-162018-02-13Consumerinfo.Com, Inc.Providing credit data in search results
US9576030B1 (en)2014-05-072017-02-21Consumerinfo.Com, Inc.Keeping up with the joneses
US10339527B1 (en)2014-10-312019-07-02Experian Information Solutions, Inc.System and architecture for electronic fraud detection
US10445152B1 (en)2014-12-192019-10-15Experian Information Solutions, Inc.Systems and methods for dynamic report generation based on automatic modeling of complex data structures
US11151468B1 (en)2015-07-022021-10-19Experian Information Solutions, Inc.Behavior analysis using distributed representations of event data
US10621524B2 (en)2015-11-092020-04-14Dassault Systemes Americas Corp.Exporting hierarchical data from a source code management (SCM) system to a product lifecycle management (PLM) system
US10140350B2 (en)*2015-11-092018-11-27Dassault Systemes Americas Corp.Bi-directional synchronization of data between a product lifecycle management (PLM) system and a source code management (SCM) system
US10621526B2 (en)2015-11-092020-04-14Dassault Systemes Americas Corp.Exporting hierarchical data from a product lifecycle management (PLM) system to a source code management (SCM) system
CN105930331B (en)*2015-12-292018-12-28中国银联股份有限公司A kind of processing method and processing device of card number data
US20170270153A1 (en)*2016-03-162017-09-21Linkedin CorporationReal-time incremental data audits
US10437853B2 (en)*2016-03-162019-10-08Microsoft Technology Licensing, LlcTracking data replication and discrepancies in incremental data audits
US9881053B2 (en)2016-05-132018-01-30Maana, Inc.Machine-assisted object matching
US20180060954A1 (en)2016-08-242018-03-01Experian Information Solutions, Inc.Sensors and system for detection of device movement and authentication of device user based on messaging service data from service provider
AU2018267280B2 (en)*2017-05-122023-08-03Bae Systems PlcA system for improved data storage and retrieval
US11132375B2 (en)2017-05-122021-09-28Bae Systems PlcSystem for data storage and retrieval
EP3401801A1 (en)*2017-05-122018-11-14BAE SYSTEMS plcA system for improved data storage and retrieval
AU2018267278A1 (en)2017-05-122019-11-07Bae Systems PlcA system for improved data storage and retrieval
SG10201705221TA (en)*2017-06-232019-01-30Mastercard International IncAdaptive payment card system and process
US10699028B1 (en)2017-09-282020-06-30Csidentity CorporationIdentity security architecture systems and methods
US10896472B1 (en)2017-11-142021-01-19Csidentity CorporationSecurity and identity verification system and architecture
US11562361B2 (en)*2017-12-202023-01-24Mastercard International IncorporatedEntity identification based on a record pattern
US11265324B2 (en)2018-09-052022-03-01Consumerinfo.Com, Inc.User permissions for access to secure data at third-party
US11315179B1 (en)2018-11-162022-04-26Consumerinfo.Com, Inc.Methods and apparatuses for customized card recommendations
US11238656B1 (en)2019-02-222022-02-01Consumerinfo.Com, Inc.System and method for an augmented reality experience via an artificial intelligence bot
US11663146B2 (en)*2019-06-272023-05-30Microchip Technology IncorporatedSecurity of embedded devices through a device lifecycle with a device identifier
US11941065B1 (en)2019-09-132024-03-26Experian Information Solutions, Inc.Single identifier platform for storing entity data
US12430646B2 (en)2021-04-122025-09-30Csidentity CorporationSystems and methods of generating risk scores and predictive fraud modeling
US12211106B2 (en)*2021-08-022025-01-28Mastercard International IncorporatedMethod to determine that a credit card number change has occurred

Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US5884324A (en)1996-07-231999-03-16International Business Machines CorporationAgent for replicating data based on a client defined replication period
US6263433B1 (en)1998-09-302001-07-17Ncr CorporationProvision of continuous database service and scalable query performance using active redundant copies
US20020133508A1 (en)*1999-07-032002-09-19Starfish Software, Inc.System and methods for synchronizing datasets using cooperation among multiple synchronization engines
US20030154194A1 (en)*2001-12-282003-08-14Jonas Jeffrey JamesReal time data warehousing
US6725227B1 (en)1998-10-022004-04-20Nec CorporationAdvanced web bookmark database system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US5884324A (en)1996-07-231999-03-16International Business Machines CorporationAgent for replicating data based on a client defined replication period
US6263433B1 (en)1998-09-302001-07-17Ncr CorporationProvision of continuous database service and scalable query performance using active redundant copies
US6725227B1 (en)1998-10-022004-04-20Nec CorporationAdvanced web bookmark database system
US20020133508A1 (en)*1999-07-032002-09-19Starfish Software, Inc.System and methods for synchronizing datasets using cooperation among multiple synchronization engines
US20030154194A1 (en)*2001-12-282003-08-14Jonas Jeffrey JamesReal time data warehousing

Cited By (2)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
EP3923154A1 (en)*2020-06-112021-12-15Ivalua SasIdentifying and generating links between data
WO2021249873A1 (en)*2020-06-112021-12-16Ivalua SasIdentifying and generating links between data

Also Published As

Publication numberPublication date
US20030225742A1 (en)2003-12-04

Similar Documents

PublicationPublication DateTitle
US7200619B2 (en)Method and process to optimize correlation of replicated with extracted data from disparate data sources
US7051020B2 (en)Intelligent query re-execution
US7856416B2 (en)Automated latent star schema discovery tool
US10572508B2 (en)Consistent query execution in hybrid DBMS
US5613113A (en)Consistent recreation of events from activity logs
JP6325246B2 (en) System and method for in-memory database processing
US8443006B1 (en)Data propagation in a multi-shard database system
US9563662B2 (en)Detecting and processing cache hits for queries with aggregates
US8359337B2 (en)Apparatus, system and method for member matching
US6879989B2 (en)Modification system for supporting localized data changes in a mobile device
MX2010007273A (en) METHODS AND APPLIANCES FOR IMPLEMENTING A TRADE GROUP PREMISE SYSTEM.
US11636078B2 (en)Personally identifiable information storage detection by searching a metadata source
US20170255677A1 (en)Preventing staleness in query results when using asynchronously updated indexes
US7653663B1 (en)Guaranteeing the authenticity of the data stored in the archive storage
US8280907B2 (en)System and method for managing access to data in a database
US9390111B2 (en)Database insert with deferred materialization
US20160132543A1 (en)Automatically aggregating data in database tables
Dakrory et al.Automated ETL testing on the data quality of a data warehouse
US20080005077A1 (en)Encoded version columns optimized for current version access
EP4058900B1 (en)Systems and methods for generation and application of schema-agnostic query templates
US20120185451A1 (en)Data processing method and system for database management system
US10459913B2 (en)Database query processing
Hinrichs et al.An ISO 9001: 2000 Compliant Quality Management System for Data Integration in Data Warehouse Systems.
US20170329830A1 (en)Read-optimized database changes
US20180173805A1 (en)Application programming interface for detection and extraction of data changes

Legal Events

DateCodeTitleDescription
ASAssignment

Owner name:INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text:ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TENNER, JEFFREY WAYNE;WILL, ERIC W.;REEL/FRAME:012971/0104;SIGNING DATES FROM 20020528 TO 20020530

FPAYFee payment

Year of fee payment:4

REMIMaintenance fee reminder mailed
LAPSLapse for failure to pay maintenance fees
STCHInformation on status: patent discontinuation

Free format text:PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FPLapsed due to failure to pay maintenance fee

Effective date:20150403


[8]ページ先頭

©2009-2025 Movatter.jp