
Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle.[1] It is a critical aspect of the design, implementation, and usage of any system that stores, processes, or retrieves data. The term is broad in scope and may have widely different meanings depending on the specific context, even under the same general umbrella of computing. It is at times used as a proxy term for data quality,[2] while data validation is a prerequisite for data integrity.[3]
Data integrity is the opposite of data corruption.[4] The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended (such as a database correctly rejecting mutually exclusive possibilities) and, upon later retrieval, ensure the data is the same as when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with data security, the discipline of protecting data from unauthorized parties.
Any unintended change to data as the result of a storage, retrieval, or processing operation, including malicious intent, unexpected hardware failure, and human error, is a failure of data integrity. If the change is the result of unauthorized access, it may also be a failure of data security. Depending on the data involved, the consequences could be as benign as a single pixel in an image appearing a different color than was originally recorded, as serious as the loss of vacation pictures or a business-critical database, or as catastrophic as the loss of human life in a life-critical system.
Physical integrity deals with challenges associated with correctly storing and fetching the data itself. Challenges with physical integrity may include electromechanical faults, design flaws, material fatigue, corrosion, power outages, natural disasters, and other special environmental hazards such as ionizing radiation, extreme temperatures, pressures, and g-forces. Ensuring physical integrity includes methods such as redundant hardware, an uninterruptible power supply, certain types of RAID arrays, radiation-hardened chips, error-correcting memory, use of a clustered file system, using file systems that employ block-level checksums such as ZFS, storage arrays that compute parity calculations such as exclusive or or use a cryptographic hash function, and even having a watchdog timer on critical subsystems.
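For illustration, the following Python sketch shows the exclusive-or parity idea used by some storage arrays: a parity block computed by XORing the data blocks allows any single missing block to be reconstructed from the survivors. The block sizes and values here are illustrative, not drawn from any particular product.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Three data blocks striped across a hypothetical array, plus one parity block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate losing block 1: XORing the parity with the surviving blocks
# recovers the missing block, because every remaining byte cancels out.
recovered = xor_blocks([parity, data_blocks[0], data_blocks[2]])
assert recovered == data_blocks[1]
```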
Physical integrity often makes extensive use of error-detecting algorithms known as error-correcting codes. Human-induced data integrity errors are often detected through the use of simpler checks and algorithms, such as the Damm algorithm or Luhn algorithm. These are used to maintain data integrity after manual transcription from one computer system to another by a human intermediary (e.g. credit card or bank routing numbers). Computer-induced transcription errors can be detected through hash functions.
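As a minimal sketch of such a check, the Luhn algorithm can be implemented in a few lines of Python; it catches most single-digit transcription errors in identifiers such as credit card numbers. The test values below are well-known example numbers, not real account data.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    digits = [int(d) for d in number]
    # Double every second digit from the right; if the result exceeds 9,
    # subtract 9 (equivalent to summing its decimal digits).
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

print(luhn_valid("79927398713"))  # True: a commonly cited valid test number
print(luhn_valid("79927398710"))  # False: a single mistyped digit is detected
```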
In production systems, these techniques are used together to ensure various degrees of data integrity. For example, a computer file system may be configured on a fault-tolerant RAID array, but might not provide block-level checksums to detect and prevent silent data corruption. As another example, a database management system might be compliant with the ACID properties, but the RAID controller or hard disk drive's internal write cache might not be.
This type of integrity is concerned with the correctness or rationality of a piece of data, given a particular context. This includes topics such as referential integrity and entity integrity in a relational database, or correctly ignoring impossible sensor data in robotic systems. These concerns involve ensuring that the data "makes sense" given its environment. Challenges include software bugs, design flaws, and human errors. Common methods of ensuring logical integrity include check constraints, foreign key constraints, program assertions, and other run-time sanity checks.
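A run-time sanity check of the kind mentioned above can be as simple as rejecting values that are physically impossible. The Python sketch below assumes a hypothetical temperature sensor with illustrative bounds; it is not tied to any particular system.

```python
def plausible_temperature(celsius: float) -> bool:
    """Run-time sanity check: reject physically impossible sensor readings."""
    ABSOLUTE_ZERO_C = -273.15
    SENSOR_MAX_C = 125.0  # assumed upper limit of this hypothetical sensor
    return ABSOLUTE_ZERO_C <= celsius <= SENSOR_MAX_C

readings = [21.4, -999.0, 22.1]  # -999.0 is a typical "no data" sentinel
accepted = [r for r in readings if plausible_temperature(r)]
assert accepted == [21.4, 22.1]
```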
Physical and logical integrity often share many challenges such as human errors and design flaws, and both must appropriately deal with concurrent requests to record and retrieve data, the latter of which is entirely a subject on its own.
If a data sector has only a logical error, it can be reused by overwriting it with new data. In the case of a physical error, the affected data sector is permanently unusable.
Data integrity contains guidelines for data retention, specifying or guaranteeing the length of time data can be retained in a particular database (typically a relational database). To achieve data integrity, these rules are consistently and routinely applied to all data entering the system, and any relaxation of enforcement could cause errors in the data. Implementing checks on the data as close as possible to the source of input (such as human data entry) causes less erroneous data to enter the system. Strict enforcement of data integrity rules results in lower error rates, and time saved troubleshooting and tracing erroneous data and the errors it causes to algorithms.
Data integrity also includes rules defining the relations a piece of data can have to other pieces of data, such as a Customer record being allowed to link to purchased Products, but not to unrelated data such as Corporate Assets. Data integrity often includes checks and correction for invalid data, based on a fixed schema or a predefined set of rules; an example is textual data entered where a date-time value is required. Rules for data derivation are also applicable, specifying how a data value is derived based on an algorithm, contributors, and conditions. They also specify the conditions under which the data value could be re-derived.
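A simple form of such a check is validating a field against its declared type at the point of entry, for example rejecting free-form text where a date-time value is required. The Python sketch below is illustrative and assumes one particular timestamp format; a real system would follow whatever format its schema defines.

```python
from datetime import datetime

def parse_timestamp(value: str) -> datetime:
    """Accept only values that match the assumed date-time format."""
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        raise ValueError(f"not a valid date-time: {value!r}") from None

for raw in ("2024-05-01 13:45:00", "next Tuesday"):
    try:
        print("accepted:", parse_timestamp(raw))
    except ValueError as err:
        print("rejected:", err)  # invalid data never enters the system
```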
Data integrity is normally enforced in a database system by a series of integrity constraints or rules. Three types of integrity constraints are an inherent part of the relational data model: entity integrity, referential integrity and domain integrity.
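The following sketch declares all three constraint types in SQLite via Python's standard sqlite3 module; the table and column names are made up for illustration. A row that violates the domain constraint is rejected by the database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,                    -- entity integrity
        email       TEXT NOT NULL CHECK (email LIKE '%@%')  -- domain integrity
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES customers(customer_id)       -- referential integrity
    );
""")

try:
    conn.execute("INSERT INTO customers VALUES (1, 'not-an-email')")
except sqlite3.IntegrityError as err:
    print("rejected by the database:", err)
```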
If a database supports these features, it is the responsibility of the database to ensure data integrity as well as the consistency model for the data storage and retrieval. If a database does not support these features, it is the responsibility of the applications to ensure data integrity, while the database supports the consistency model for the data storage and retrieval.
Having a single, well-controlled, and well-defined data-integrity system increases stability, performance, reusability, and maintainability.
Modern databases support these features (see Comparison of relational database management systems), and it has become the de facto responsibility of the database to ensure data integrity. Companies, and indeed many database systems, offer products and services to migrate legacy systems to modern databases.
An example of a data-integrity mechanism is the parent-and-child relationship of related records. If a parent record owns one or more related child records, all of the referential integrity processes are handled by the database itself, which automatically ensures the accuracy and integrity of the data, so that no child record can exist without a parent (also called being orphaned) and no parent loses its child records. It also ensures that no parent record can be deleted while it owns any child records. All of this is handled at the database level and does not require coding integrity checks into each application.
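A minimal sketch of this behavior, again using Python's sqlite3 module with made-up table names: with a foreign-key constraint in place, the database rejects both an orphan child record and the deletion of a parent that still owns children, without any application-level checks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id)
    );
    INSERT INTO parent VALUES (1);
    INSERT INTO child  VALUES (10, 1);
""")

try:
    conn.execute("INSERT INTO child VALUES (11, 99)")  # no parent with id 99
except sqlite3.IntegrityError:
    print("orphan child record rejected")

try:
    conn.execute("DELETE FROM parent WHERE id = 1")    # still owns a child
except sqlite3.IntegrityError:
    print("parent deletion rejected while child records exist")
```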
Various research results show that neither widespread file systems (including UFS, Ext, XFS, JFS and NTFS) nor hardware RAID solutions provide sufficient protection against data integrity problems.[5][6][7][8][9]
Some file systems (including Btrfs and ZFS) provide internal data and metadata checksumming that is used for detecting silent data corruption and improving data integrity. If corruption is detected in this way and internal RAID mechanisms provided by those file systems are also in use, such file systems can additionally reconstruct corrupted data in a transparent way.[10] This approach allows improved data integrity protection covering the entire data path, which is usually known as end-to-end data protection.[11]
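In a greatly simplified form, the self-healing idea works roughly as in the Python sketch below: each block is stored with a checksum and a redundant copy, a mismatch on read reveals silent corruption, and the block is repaired from the copy that still verifies. This is only an illustration of the principle, not how ZFS or Btrfs are implemented internally.

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

payload = b"important payload"
stored = {"data": bytearray(payload), "mirror": bytes(payload), "sum": checksum(payload)}

# Simulate silent corruption of the primary copy (e.g. a bit flip on disk).
stored["data"][0] ^= 0x01

# On read, the checksum mismatch exposes the corruption, and the block is
# transparently repaired from the redundant copy that still matches.
if checksum(bytes(stored["data"])) != stored["sum"]:
    assert checksum(stored["mirror"]) == stored["sum"], "both copies corrupt"
    stored["data"] = bytearray(stored["mirror"])

assert checksum(bytes(stored["data"])) == stored["sum"]
```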