
Data integration

From Wikipedia, the free encyclopedia
Combining data from multiple sources

Data integration refers to the process of combining, sharing, or synchronizing data from multiple sources to provide users with a unified view.[1] There is a wide range of possible applications for data integration, from commercial (such as when a business merges multiple databases) to scientific (combining research data from different bioinformatics repositories).

The decision to integrate data tends to arise when the volume, complexity (that is, big data), and need to share existing data increase rapidly.[2] Data integration has become the focus of extensive theoretical work, and numerous open problems remain unsolved.

Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed into a single coherent data store that provides synchronous data across a network of files for clients.[3] A common use of data integration is in data mining, when analyzing and extracting information from existing databases that can be useful for business information.[4]

History

Figure 1: Simple schematic for a data warehouse. The extract, transform, load (ETL) process extracts information from the source databases, transforms it and then loads it into the data warehouse.
Figure 2: Simple schematic for a data-integration solution. A system designer constructs a mediated schema against which users can run queries. The virtual database interfaces with the source databases via wrapper code if required.

Issues with combining heterogeneous data sources, often referred to as information silos, under a single query interface have existed for some time. In the early 1980s, computer scientists began designing systems for interoperability of heterogeneous databases.[5]

The first data integration system driven by structured metadata was designed in 1991 at the University of Minnesota for the Integrated Public Use Microdata Series (IPUMS). IPUMS used a data warehousing approach, which extracts, transforms, and loads data from heterogeneous sources into a unique view schema so data from different sources become compatible.[6] By making thousands of population databases interoperable, IPUMS demonstrated the feasibility of large-scale data integration. The data warehouse approach offers a tightly coupled architecture because the data are already physically reconciled in a single queryable repository, so it usually takes little time to resolve queries.[7]

The data warehouse approach is less feasible for data sets that are frequently updated, requiring the extract, transform, load (ETL) process to be continuously re-executed for synchronization. Difficulties also arise in constructing data warehouses when one has only a query interface to summary data sources and no access to the full data. This problem frequently emerges when integrating several commercial query services like travel or classified advertisement web applications.

A trend began in 2009 favoring the loose coupling of data[8] and providing a unified query interface to access real-time data over a mediated schema (see Figure 2), which allows information to be retrieved directly from the original databases. This is consistent with the SOA approach popular in that era. This approach relies on mappings between the mediated schema and the schemas of the original sources, and on translating a query into decomposed queries that match the schemas of the original databases. Such mappings can be specified in two ways: as a mapping from entities in the mediated schema to entities in the original sources (the "Global-as-View"[9] (GAV) approach), or as a mapping from entities in the original sources to the mediated schema (the "Local-as-View"[10] (LAV) approach). The latter approach requires more sophisticated inferences to resolve a query on the mediated schema, but makes it easier to add new data sources to a (stable) mediated schema.
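
The two mapping directions can be made concrete with a minimal sketch. The relation names, the temperature-unit conversion, and the SQL-style view bodies below are illustrative assumptions only, not part of any particular system.

```python
# A minimal sketch of the two mapping directions over hypothetical relations.
# Mediated schema:  Weather(city, temp_c)
# Source schema:    src1_weather(city, temp_f)

# Global-as-View (GAV): each mediated relation is defined as a query over the sources.
GAV_MAPPING = {
    "Weather(city, temp_c)":
        "SELECT city, (temp_f - 32) * 5.0 / 9.0 AS temp_c FROM src1_weather",
}

# Local-as-View (LAV): each source relation is described as a query over the mediated schema.
LAV_MAPPING = {
    "src1_weather(city, temp_f)":
        "SELECT city, temp_c * 9.0 / 5.0 + 32 AS temp_f FROM Weather",
}

if __name__ == "__main__":
    for direction, mapping in (("GAV", GAV_MAPPING), ("LAV", LAV_MAPPING)):
        for head, body in mapping.items():
            print(f"{direction}: {head} := {body}")
```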

As of 2010, some of the work in data integration research concerns the semantic integration problem. This problem addresses not the structuring of the architecture of the integration, but how to resolve semantic conflicts between heterogeneous data sources. For example, if two companies merge their databases, certain concepts and definitions in their respective schemas, such as "earnings", inevitably have different meanings. In one database it may mean profits in dollars (a floating-point number), while in the other it might represent the number of sales (an integer). A common strategy for the resolution of such problems involves the use of ontologies, which explicitly define schema terms and thus help to resolve semantic conflicts. This approach represents ontology-based data integration. On the other hand, the problem of combining research results from different bioinformatics repositories requires benchmarking the similarities, computed from different data sources, on a single criterion such as positive predictive value. This enables the data sources to become directly comparable and to be integrated even when the natures of the experiments are distinct.[11]
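
As an illustration of the ontology-based strategy, the following minimal Python sketch maps each company's local "earnings" field to an explicitly defined shared concept. The schema names, concept labels, and sample values are hypothetical.

```python
# A tiny "ontology": each local schema term is mapped to an explicit shared concept.
ONTOLOGY = {
    ("company_a", "earnings"): "profit_usd",   # floating-point dollars
    ("company_b", "earnings"): "units_sold",   # integer count of sales
}

def reconcile(source: str, record: dict) -> dict:
    """Rewrite a local record into the shared vocabulary defined by the ontology."""
    out = {}
    for field, value in record.items():
        concept = ONTOLOGY.get((source, field), field)  # pass through unmapped fields
        out[concept] = value
    return out

if __name__ == "__main__":
    print(reconcile("company_a", {"earnings": 1_250_000.50}))  # {'profit_usd': 1250000.5}
    print(reconcile("company_b", {"earnings": 48_000}))        # {'units_sold': 48000}
```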

As of 2011, it was determined that current data modeling methods were imparting data isolation into every data architecture in the form of islands of disparate data and information silos. This data isolation is an unintended artifact of the data modeling methodology that results in the development of disparate data models. Disparate data models, when instantiated as databases, form disparate databases. Enhanced data model methodologies have been developed to eliminate the data isolation artifact and to promote the development of integrated data models.[12] One enhanced data modeling method recasts data models by augmenting them with structural metadata in the form of standardized data entities. As a result of recasting multiple data models, the set of recast data models will share one or more commonality relationships that relate the structural metadata now common to these data models. Commonality relationships are a peer-to-peer type of entity relationship that relates the standardized data entities of multiple data models. Multiple data models that contain the same standard data entity may participate in the same commonality relationship. When integrated data models are instantiated as databases and are properly populated from a common set of master data, then these databases are integrated.

Since 2011, data hub approaches have been of greater interest than fully structured (typically relational) enterprise data warehouses. Since 2013, data lake approaches have risen to a level of interest comparable to data hubs. (See the popularity of all three search terms on Google Trends.[13]) These approaches combine unstructured or varied data into one location, but do not necessarily require an (often complex) master relational schema to structure and define all data in the hub.

More recently, as the number of applications in use has increased many-fold, application-to-application integration has become critical. This has given rise to unified APIs that help developers integrate their applications with one another, and more recently to the Model Context Protocol (MCP), which takes this a step further for AI agents.

Data integration plays a big role in business regarding data collection used for studying the market. Businesses try to convert the raw data retrieved from consumers into coherent data when deciding what steps to take next.[14] Organizations are more frequently using data mining for collecting information and patterns from their databases, and this process helps them develop new business strategies to increase business performance and perform economic analyses more efficiently. Compiling the large amount of data they collect to be stored in their system is a form of data integration adapted for business intelligence to improve their chances of success.[15]

Example


Consider a web application where a user can query a variety of information about cities (such as crime statistics, weather, hotels, demographics, etc.). Traditionally, the information must be stored in a single database with a single schema. But any single enterprise would find information of this breadth somewhat difficult and expensive to collect. Even if the resources exist to gather the data, it would likely duplicate data in existing crime databases, weather websites, and census data.

A data-integration solution may address this problem by considering these external resources as materialized views over a virtual mediated schema, resulting in "virtual data integration". This means application developers construct a virtual schema—the mediated schema—to best model the kinds of answers their users want. Next, they design "wrappers" or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution (see Figure 2). When an application user queries the mediated schema, the data-integration solution transforms this query into appropriate queries over the respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user's query.
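
A minimal Python sketch of this wrapper/mediator pattern is shown below. The two in-memory "sources", their field names, and the mediated attributes are invented for illustration; real wrappers would query the underlying database or web service instead.

```python
# Pretend source 1: a crime database keyed by city name.
CRIME_DB = {"springfield": {"incidents": 1204}, "shelbyville": {"incidents": 233}}

# Pretend source 2: a weather service reporting temperatures in Fahrenheit.
WEATHER_SVC = {"springfield": {"temp_f": 68.0}, "shelbyville": {"temp_f": 71.5}}

def crime_wrapper(city: str) -> dict:
    """Transform the crime source's local result into the mediated schema."""
    row = CRIME_DB.get(city.lower(), {})
    return {"crime_incidents": row.get("incidents")}

def weather_wrapper(city: str) -> dict:
    """Transform the weather source's local result, converting units."""
    row = WEATHER_SVC.get(city.lower(), {})
    temp_f = row.get("temp_f")
    return {"temp_c": None if temp_f is None else round((temp_f - 32) * 5 / 9, 1)}

def query_mediated_schema(city: str) -> dict:
    """Decompose the user's query, run each wrapper, and merge the results."""
    result = {"city": city}
    for wrapper in (crime_wrapper, weather_wrapper):
        result.update(wrapper(city))
    return result

if __name__ == "__main__":
    print(query_mediated_schema("Springfield"))
```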

This solution offers the convenience of adding new sources by simply constructing an adapter or an application software blade for them. It contrasts with ETL systems or with a single database solution, which require manual integration of each entire new data set into the system. Virtual ETL solutions leverage the virtual mediated schema to implement data harmonization, whereby the data are copied from the designated "master" source to the defined targets, field by field. Advanced data virtualization is also built on the concept of object-oriented modeling in order to construct a virtual mediated schema or virtual metadata repository, using a hub-and-spoke architecture.

Each data source is disparate and as such is not designed to support reliable joins between data sources. Therefore, data virtualization as well as data federation depends upon accidental data commonality to support combining data and information from disparate data sets. Because of the lack of data value commonality across data sources, the return set may be inaccurate, incomplete, and impossible to validate.

One solution is to recast disparate databases to integrate these databases without the need for ETL. The recast databases support commonality constraints where referential integrity may be enforced between databases. The recast databases provide designed data access paths with data value commonality across databases.

Theory


The theory of data integration[1] forms a subset of database theory and formalizes the underlying concepts of the problem in first-order logic. Applying the theories gives indications as to the feasibility and difficulty of data integration. While its definitions may appear abstract, they have sufficient generality to accommodate all manner of integration systems,[16] including those that include nested relational / XML databases[17] and those that treat databases as programs.[18] Connections to particular database systems such as Oracle or DB2 are provided by implementation-level technologies such as JDBC and are not studied at the theoretical level.

Definitions


Data integration systems are formally defined as a tuple ⟨G, S, M⟩ where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G and the mapping then asserts connections between the elements in the global schema and the source schemas.
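
The triple ⟨G, S, M⟩ can be written down directly, as in the following minimal Python sketch; the relation names and the single mapping assertion are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class IntegrationSystem:
    global_schema: set    # G: relation names of the mediated schema
    source_schemas: set   # S: relation names of the heterogeneous sources
    mapping: dict         # M: assertions relating queries over G to queries over S

SYSTEM = IntegrationSystem(
    global_schema={"Weather(city, temp_c)"},
    source_schemas={"src1_weather(city, temp_f)"},
    mapping={"Weather(city, temp_c)":
             "SELECT city, (temp_f - 32) * 5.0 / 9.0 FROM src1_weather"},
)

if __name__ == "__main__":
    print(SYSTEM)
```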

A database over a schema is defined as a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S would comprise the set of sets of tuples for each of the heterogeneous data sources and is called the source database. Note that this single source database may actually represent a collection of disconnected databases. The database corresponding to the virtual mediated schema G is called the global database. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence exist: Global as View (GAV) and Local as View (LAV).

Figure 3: Illustration of the tuple space of the GAV and LAV mappings.[19] In GAV, the system is constrained to the set of tuples mapped by the mediators, while the set of tuples expressible over the sources may be much larger and richer. In LAV, the system is constrained to the set of tuples in the sources, while the set of tuples expressible over the global schema can be much larger. Therefore, LAV systems must often deal with incomplete answers.

GAV systems model the global database as a set of views over S. In this case, M associates to each element of G a query over S. Query processing becomes a straightforward operation due to the well-defined associations between G and S. The burden of complexity falls on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources join the system, considerable effort may be necessary to update the mediator, thus the GAV approach appears preferable when the sources seem unlikely to change.

In a GAV approach to the example data integration system above, the system designer would first develop mediators for each of the city information sources and then design the global schema around these mediators. For example, consider if one of the sources served a weather website. The designer would likely then add a corresponding element for weather to the global schema. Then the bulk of effort concentrates on writing the proper mediator code that will transform predicates on weather into a query over the weather website. This effort can become complex if some other source also relates to weather, because the designer may need to write code to properly combine the results from the two sources.
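
A minimal sketch of such hand-written GAV mediator code might look as follows, assuming two invented weather sources that report temperatures in different units.

```python
SOURCE_A = {"springfield": 68.0}    # temperatures in Fahrenheit
SOURCE_B = {"shelbyville": 21.9}    # temperatures in Celsius

def global_weather(city: str):
    """Mediator for the global relation Weather(city, temp_c): queries both
    sources, normalizes units, and combines the results."""
    key = city.lower()
    if key in SOURCE_B:             # already Celsius
        return {"city": city, "temp_c": SOURCE_B[key]}
    if key in SOURCE_A:             # convert Fahrenheit
        return {"city": city, "temp_c": round((SOURCE_A[key] - 32) * 5 / 9, 1)}
    return None                     # no source covers this city

if __name__ == "__main__":
    print(global_weather("Springfield"))
    print(global_weather("Shelbyville"))
```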

On the other hand, in LAV, the source database is modeled as a set of views over G. In this case, M associates to each element of S a query over G. Here the exact associations between G and S are no longer well-defined. As is illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor. The benefit of an LAV modeling is that new sources can be added with far less work than in a GAV system, thus the LAV approach should be favored in cases where the mediated schema is less stable or likely to change.[1]

In an LAV approach to the example data integration system above, the system designer designs the global schema first and then simply inputs the schemas of the respective city information sources. Consider again if one of the sources serves a weather website. The designer would add corresponding elements for weather to the global schema only if none existed already. Then programmers write an adapter or wrapper for the website and add a schema description of the website's results to the source schemas. The complexity of adding the new source moves from the designer to the query processor.
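
A minimal sketch of the LAV registration step, with invented source names and attributes, might look as follows. Note that adding a source touches only the catalogue of view descriptions, not any mediator code.

```python
# Source descriptions: "this source provides these global-schema attributes".
SOURCE_VIEWS = {
    "weather_site": {"provides": {"city", "temp_c"}},
    "crime_db":     {"provides": {"city", "crime_incidents"}},
}

def sources_for(query_attrs: set) -> list:
    """Pick every registered source whose view overlaps the requested attributes."""
    return [name for name, view in SOURCE_VIEWS.items()
            if query_attrs & (view["provides"] - {"city"})]

if __name__ == "__main__":
    # A query over the global schema asking for weather and crime data for a city.
    print(sources_for({"temp_c", "crime_incidents"}))   # both sources are relevant
    # Adding a new source is one more entry in SOURCE_VIEWS; no mediator changes needed.
```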

Query processing


The theory of query processing in data integration systems is commonly expressed using conjunctive queries and Datalog, a purely declarative logic programming language.[20] One can loosely think of a conjunctive query as a logical function applied to the relations of a database, such as "f(A, B) where A < B". If a tuple or set of tuples is substituted into the rule and satisfies it (makes it true), then we consider that tuple as part of the set of answers to the query. While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries count as conjunctive queries as well.
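
For instance, the conjunctive query f(A, B) :- r(A, B), A < B can be evaluated over a toy relation as in the sketch below. The relation r and its contents are invented; the equivalent SQL would be SELECT A, B FROM r WHERE A < B.

```python
R = [(1, 3), (4, 2), (5, 7)]     # a toy instance of the relation r(A, B)

def conjunctive_query(r):
    """Return every tuple (a, b) in r that satisfies the condition A < B."""
    return [(a, b) for (a, b) in r if a < b]

if __name__ == "__main__":
    print(conjunctive_query(R))   # [(1, 3), (5, 7)]
```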

In terms of data integration, "query containment" represents an important property of conjunctive queries. A query A contains another query B (denoted A ⊃ B) if the results of applying B are a subset of the results of applying A for any database. The two queries are said to be equivalent if the resulting sets are equal for any database. This is important because in both GAV and LAV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries. Integration seeks to rewrite the queries represented by the views to make their results equivalent to or maximally contained by the user's query. This corresponds to the problem of answering queries using views (AQUV).[21]
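
The following sketch illustrates containment on a single toy database instance; since containment must hold for every database, such a check can only refute containment, never establish it. The relation and queries are invented for illustration.

```python
R = [(1, 3), (4, 2), (5, 7), (6, 6)]

def query_a(r):          # A: all tuples with A <= B
    return {(a, b) for (a, b) in r if a <= b}

def query_b(r):          # B: all tuples with A < B (strictly)
    return {(a, b) for (a, b) in r if a < b}

if __name__ == "__main__":
    # Every answer to B is also an answer to A on this instance, consistent with A containing B.
    print(query_b(R) <= query_a(R))   # True
```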

In GAV systems, a system designer writes mediator code to define the query rewriting. Each element in the user's query corresponds to a substitution rule, just as each element in the global schema corresponds to a query over the source. Query processing simply expands the subgoals of the user's query according to the rule specified in the mediator, and thus the resulting query is likely to be equivalent. While the designer does the majority of the work beforehand, some GAV systems such as Tsimmis involve simplifying the mediator description process.
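
A minimal sketch of this expansion step, using an invented mediator rule over strings rather than parsed query trees, might look as follows.

```python
# Mediator rules: each global-schema relation maps to a query over the sources.
MEDIATOR_RULES = {
    "Weather(city, temp_c)": "src_weather(city, temp_f), temp_c = (temp_f - 32) * 5/9",
}

def expand(user_query_subgoals):
    """Replace every global-schema subgoal with the body of its mediator rule."""
    return [MEDIATOR_RULES.get(g, g) for g in user_query_subgoals]

if __name__ == "__main__":
    user_query = ["Weather(city, temp_c)", "temp_c > 25"]
    print(expand(user_query))
```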

In LAV systems, queries undergo a more radical process of rewriting because no mediator exists to align the user's query with a simple expansion strategy. The integration system must execute a search over the space of possible queries in order to find the best rewrite. The resulting rewrite may not be an equivalent query but maximally contained, and the resulting tuples may be incomplete. As of 2011, the GQR algorithm[22] is the leading query rewriting algorithm for LAV data integration systems.

In general, the complexity of query rewriting is NP-complete.[21] If the space of rewrites is relatively small, this does not pose a problem, even for integration systems with hundreds of sources.

Medicine and life sciences


Large-scale questions in science, such as real world evidence, global warming, invasive species spread, and resource depletion, increasingly require the collection of disparate data sets for meta-analysis. This type of data integration is especially challenging for ecological and environmental data because metadata standards are not agreed upon and there are many different data types produced in these fields. National Science Foundation initiatives such as Datanet are intended to make data integration easier for scientists by providing cyberinfrastructure and setting standards. The five funded Datanet initiatives are DataONE,[23] led by William Michener at the University of New Mexico; the Data Conservancy,[24] led by Sayeed Choudhury of Johns Hopkins University; SEAD: Sustainable Environment through Actionable Data,[25] led by Margaret Hedstrom of the University of Michigan; the DataNet Federation Consortium,[26] led by Reagan Moore of the University of North Carolina; and Terra Populus,[27] led by Steven Ruggles of the University of Minnesota. The Research Data Alliance[28] has more recently explored creating global data integration frameworks. The OpenPHACTS project, funded through the European Union Innovative Medicines Initiative, built a drug discovery platform by linking datasets from providers such as the European Bioinformatics Institute, Royal Society of Chemistry, UniProt, WikiPathways and DrugBank.


References

  1. ^ a b c Maurizio Lenzerini (2002). "Data Integration: A Theoretical Perspective" (PDF). PODS 2002. pp. 233–246.
  2. ^ Frederick Lane (2006). "IDC: World Created 161 Billion Gigs of Data in 2006". Archived from the original on 2015-07-15.
  3. ^ mikben. "Data Coherency - Win32 apps". docs.microsoft.com. Archived from the original on 2020-06-12. Retrieved 2020-11-23.
  4. ^ Chung, P.; Chung, S. H. (2013). "On data integration and data mining for developing business intelligence". 2013 IEEE Long Island Systems, Applications and Technology Conference (LISAT): 1–6. doi:10.1109/LISAT.2013.6578235.
  5. ^ John Miles Smith; et al. (1982). "Multibase: integrating heterogeneous distributed database systems". AFIPS '81 Proceedings of the May 4–7, 1981, National Computer Conference. pp. 487–499.
  6. ^ Steven Ruggles, J. David Hacker, and Matthew Sobek (1995). "Order out of Chaos: The Integrated Public Use Microdata Series". Historical Methods. Vol. 28. pp. 33–39.
  7. ^ Jennifer Widom (1995). "Research problems in data warehousing". CIKM '95 Proceedings of the Fourth International Conference on Information and Knowledge Management. pp. 25–30.
  8. ^ Pautasso, Cesare; Wilde, Erik (2009-04-20). "Why is the web loosely coupled?". Proceedings of the 18th International Conference on World Wide Web. WWW '09. Madrid, Spain: Association for Computing Machinery. pp. 911–920. doi:10.1145/1526709.1526832. ISBN 978-1-60558-487-4. S2CID 207172208.
  9. ^ "What is GAV (Global as View)?". GeeksforGeeks. 2020-04-18. Archived from the original on 2020-11-30. Retrieved 2020-11-23.
  10. ^ "Local-as-View", Wikipedia (in German), 2020-07-24, retrieved 2020-11-23.
  11. ^ Shubhra S. Ray; et al. (2009). "Combining Multi-Source Information through Functional Annotation based Weighting: Gene Function Prediction in Yeast" (PDF). IEEE Transactions on Biomedical Engineering. 56 (2): 229–236. CiteSeerX 10.1.1.150.7928. doi:10.1109/TBME.2008.2005955. PMID 19272921. S2CID 10848834. Archived (PDF) from the original on 2010-05-08. Retrieved 2012-05-17.
  12. ^ Michael Mireku Kwakye (2011). "A Practical Approach To Merging Multidimensional Data Models". hdl:10393/20457.
  13. ^ "Hub Lake and Warehouse search trends". Archived from the original on 2017-02-17. Retrieved 2016-01-12.
  14. ^ "Data mining in business analytics". Western Governors University. May 15, 2020. Archived from the original on December 23, 2020. Retrieved November 22, 2020.
  15. ^ Surani, Ibrahim (2020-03-30). "Data Integration for Business Intelligence: Best Practices". DATAVERSITY. Archived from the original on 2020-11-30. Retrieved 2020-11-23.
  16. ^ Alagić, Suad; Bernstein, Philip A. (2002). Database Programming Languages. Lecture Notes in Computer Science. Vol. 2397. pp. 228–246. doi:10.1007/3-540-46093-4_14. ISBN 978-3-540-44080-2.
  17. ^ "Nested Mappings: Schema Mapping Reloaded" (PDF). Archived (PDF) from the original on 2015-10-28. Retrieved 2015-09-10.
  18. ^ "The Common Framework Initiative for algebraic specification and development of software" (PDF). Archived (PDF) from the original on 2016-03-04. Retrieved 2015-09-10.
  19. ^ Christoph Koch (2001). "Data Integration against Multiple Evolving Autonomous Schemata" (PDF). Archived from the original (PDF) on 2007-09-26.
  20. ^ Jeffrey D. Ullman (1997). "Information Integration Using Logical Views". ICDT 1997. pp. 19–40.
  21. ^ a b Alon Y. Halevy (2001). "Answering queries using views: A survey" (PDF). The VLDB Journal. pp. 270–294.
  22. ^ George Konstantinidis; et al. (2011). "Scalable Query Rewriting: A Graph-based Approach" (PDF). Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD '11, June 12–16, 2011, Athens, Greece.
  23. ^ William Michener; et al. "DataONE: Observation Network for Earth". www.dataone.org. Archived from the original on 2013-01-22. Retrieved 2013-01-19.
  24. ^ Sayeed Choudhury; et al. "Data Conservancy". dataconservancy.org. Archived from the original on 2013-01-13. Retrieved 2013-01-19.
  25. ^ Margaret Hedstrom; et al. "SEAD Sustainable Environment - Actionable Data". sead-data.net. Archived from the original on 2012-09-20. Retrieved 2013-01-19.
  26. ^ Reagan Moore; et al. "DataNet Federation Consortium". datafed.org. Archived from the original on 2013-04-15. Retrieved 2013-01-19.
  27. ^ Steven Ruggles; et al. "Terra Populus: Integrated Data on Population and the Environment". terrapop.org. Archived from the original on 2013-05-18. Retrieved 2013-01-19.
  28. ^ Bill Nichols. "Research Data Alliance". rd-alliance.org. Archived from the original on 2014-11-18. Retrieved 2014-10-01.

External links

Look up data integration in Wiktionary, the free dictionary.