Discrete, discontinuous representation of information
This article is about the concept in information theory and information systems. For the electronics concept, see Digital signal. For other uses, see Digital.
Digital clock. The time shown by the digits on the face at any instant is digital data. The actual precise time is analog data.
Digital data, in information theory and information systems, is information represented as a string of discrete symbols, each of which can take on one of only a finite number of values from some alphabet, such as letters or digits. An example is a text document, which consists of a string of alphanumeric characters. The most common form of digital data in modern information systems is binary data, which is represented by a string of binary digits (bits), each of which can have one of two values, either 0 or 1.
Digital data can be contrasted with analog data, which is represented by a value from a continuous range of real numbers. Analog data is transmitted by an analog signal, which not only takes on continuous values but can vary continuously with time: it is a continuous real-valued function of time. An example is the air pressure variation in a sound wave.
Data requires interpretation to become information. In modern (post-1960) computer systems, all data is digital.
The word digital comes from the same source as the words digit and digitus (the Latin word for finger), as fingers are often used for counting. In 1942, mathematician George Stibitz of Bell Telephone Laboratories used the word digital in reference to the fast electric pulses emitted by a device designed to aim and fire anti-aircraft guns.[1] The term is most commonly used in computing and electronics, especially where real-world information is converted to binary numeric form, as in digital audio and digital photography.
A symbol input device usually consists of a group of switches that are polled at regular intervals to see which switches are switched. Data will be lost if, within a single polling interval, two switches are pressed, or a switch is pressed, released, and pressed again. This polling can be done by a specialized processor in the device to prevent burdening the main CPU.[2] When a new symbol has been entered, the device typically sends an interrupt, in a specialized format, so that the CPU can read it.
For devices with only a few switches (such as the buttons on a joystick), the status of each can be encoded as bits (usually 0 for released and 1 for pressed) in a single word. This is useful when combinations of key presses are meaningful, and is sometimes used for passing the status of modifier keys on a keyboard (such as shift and control). But it does not scale to support more keys than the number of bits in a single byte or word.
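The single-word bit encoding described above can be sketched as follows; the button names are illustrative, not taken from any particular device:

```python
# Encode the state of a few switches as bit flags in one status word.
BTN_UP, BTN_DOWN, BTN_LEFT, BTN_RIGHT, BTN_FIRE = (1 << i for i in range(5))

def pack_buttons(pressed):
    """Pack a set of pressed-button flags into a single status word."""
    word = 0
    for flag in pressed:
        word |= flag          # set the bit for each pressed switch
    return word

status = pack_buttons({BTN_UP, BTN_FIRE})
assert status & BTN_UP        # up is pressed
assert not status & BTN_DOWN  # down is released
```

Because every combination of switches maps to a distinct word, simultaneous presses (such as shift plus a letter) are preserved, which is exactly why this scheme suits modifier keys.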
Devices with many switches (such as a computer keyboard) usually arrange these switches in a scan matrix, with the individual switches on the intersections of x and y lines. When a switch is pressed, it connects the corresponding x and y lines together. Polling (often called scanning in this case) is done by activating each x line in sequence and detecting which y lines then have a signal, and thus which keys are pressed. When the keyboard processor detects that a key has changed state, it sends a signal to the CPU indicating the scan code of the key and its new state. The symbol is then encoded or converted into a number based on the status of modifier keys and the desired character encoding.
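A minimal sketch of scan-matrix polling, with a boolean grid standing in for the physical switch states (True meaning the key at that x/y intersection is held down):

```python
# Activate each x (row) line in turn and read which y (column) lines
# carry a signal; each hit identifies one pressed key.
def scan(matrix):
    """Return the (x, y) coordinates of every pressed key."""
    pressed = []
    for x, row in enumerate(matrix):      # drive one x line at a time
        for y, closed in enumerate(row):  # sense each y line
            if closed:
                pressed.append((x, y))
    return pressed

matrix = [[False, True, False],
          [False, False, False],
          [True, False, False]]
print(scan(matrix))  # [(0, 1), (2, 0)]
```

The (x, y) pair found by the scan corresponds to the scan code that the keyboard processor reports to the CPU.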
A custom encoding can be used for a specific application with no loss of data. However, using a standard encoding such as ASCII is problematic if a symbol such as 'ß' needs to be converted but is not in the standard.
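The 'ß' problem can be demonstrated directly in Python, whose codecs raise an error when a symbol falls outside the target encoding:

```python
text = "Straße"
try:
    text.encode("ascii")
except UnicodeEncodeError as e:
    # Report which symbol has no ASCII representation.
    print("not representable in ASCII:", e.object[e.start:e.end])

# A larger encoding such as UTF-8 can represent the symbol.
print(text.encode("utf-8"))
```

This is why modern systems favor encodings like UTF-8, which cover symbols beyond the ASCII repertoire.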
It is estimated that in the year 1986, less than 1% of the world's technological capacity to store information was digital and in 2007 it was already 94%.[3] The year 2002 is assumed to be the year when humankind was able to store more information in digital than in analog format (the "beginning of the digital age").[4][5]
Data at rest in information technology means data that is housed physically on computer data storage in any digital form (e.g. cloud storage, file hosting services, databases, data warehouses, spreadsheets, archives, tapes, off-site or cloud backups, mobile devices, etc.). Data at rest includes both structured and unstructured data.[9] This type of data is subject to threats from hackers and other malicious actors seeking to gain access to the data digitally or to steal the data storage media physically. To prevent this data from being accessed, modified or stolen, organizations will often employ security protection measures such as password protection, data encryption, or a combination of both. The security options used for this type of data are broadly referred to as data-at-rest protection (DARP).[10]
Definitions include:
"...all data in computer storage while excluding data that is traversing a network or temporarily residing in computer memory to be read or updated."[11]
"...all data in storage but excludes any data that frequently traverses the network or that which resides in temporary memory. Data at rest includes but is not limited to archived data, data which is not accessed or changed frequently, files stored on hard drives, USB thumb drives, files stored on backup tape and disks, and also files stored off-site or on a storage area network (SAN)."[12]
It is generally accepted that archive data (i.e. data which never changes), regardless of its storage medium, is data at rest, while active data subject to constant or frequent change is data in use. "Inactive data" could be taken to mean data which may change, but infrequently. The imprecise nature of terms such as "constant" and "frequent" means that some stored data cannot be comprehensively defined as either data at rest or data in use. These definitions could be taken to assume that data at rest is a superset of data in use; however, data in use, being subject to frequent change, has distinct processing requirements from data at rest, whether the latter is completely static or subject to occasional change.
Because of its nature, data at rest is of increasing concern to businesses, government agencies and other institutions.[11] Mobile devices are often subject to specific security protocols to protect data at rest from unauthorized access when lost or stolen,[13] and there is an increasing recognition that database management systems and file servers should also be considered as at risk;[14] the longer data is left unused in storage, the more likely it is to be retrieved by unauthorized individuals outside the network.
Data encryption, which prevents data visibility in the event of its unauthorized access or theft, is commonly used to protect data in motion and is increasingly promoted for protecting data at rest.[15] The encryption of data at rest should only use strong encryption methods such as AES or RSA. Encrypted data should remain encrypted when access controls such as usernames and passwords fail. Increasing encryption on multiple levels is recommended. Cryptography can be implemented on the database housing the data and on the physical storage where the databases are stored. Data encryption keys should be updated on a regular basis and stored separately from the data. Encryption also enables crypto-shredding at the end of the data or hardware lifecycle. Periodic auditing of sensitive data should be part of policy and should occur at scheduled intervals. Finally, only the minimum possible amount of sensitive data should be stored.[16]
Tokenization is a non-mathematical approach to protecting data at rest that replaces sensitive data with non-sensitive substitutes, referred to as tokens, which have no extrinsic or exploitable meaning or value. This process does not alter the type or length of data, which means it can be processed by legacy systems such as databases that may be sensitive to data length and type. Tokens require significantly less computational resources to process and less storage space in databases than traditionally encrypted data. This is achieved by keeping specific data fully or partially visible for processing and analytics while sensitive information is kept hidden. Lower processing and storage requirements make tokenization an ideal method of securing data at rest in systems that manage large volumes of data.
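The core mechanism can be sketched as a toy token vault; this is illustrative only (real products add access control, persistence, and format-preserving guarantees), and the card number is a standard test value, not real data:

```python
import secrets

class TokenVault:
    """Replace a sensitive digit string with a random token of the same
    length and character type, so legacy format checks still pass."""

    def __init__(self):
        self._vault = {}  # token -> original value (the only mapping)

    def tokenize(self, value):
        token = "".join(secrets.choice("0123456789") for _ in value)
        self._vault[token] = value
        return token

    def detokenize(self, token):
        return self._vault[token]

vault = TokenVault()
tok = vault.tokenize("4111111111111111")
assert len(tok) == 16 and tok.isdigit()        # type and length preserved
assert vault.detokenize(tok) == "4111111111111111"
```

Note that, unlike encryption, there is no mathematical relationship between token and value: recovering the original requires access to the vault itself.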
A further method of preventing unwanted access to data at rest is the use of data federation,[17] especially when data is distributed globally (e.g. in off-shore archives). An example of this would be a European organisation which stores its archived data off-site in the US. Under the terms of the USA PATRIOT Act,[18] the American authorities can demand access to all data physically stored within US boundaries, even if it includes personal information on European citizens with no connections to the US. Data encryption alone cannot be used to prevent this, as the authorities have the right to demand decrypted information. A data federation policy which retains personal citizen information with no foreign connections within its country of origin (separate from information which is either not personal or is relevant to off-shore authorities) is one option to address this concern. However, data stored in foreign countries can also be accessed under legislation in the CLOUD Act.
Data in use has also been taken to mean "active data" in the context of being in a database or being manipulated by an application. For example, some enterprise encryption gateway solutions for the cloud claim to encrypt data at rest, data in transit and data in use.[20]
Some cloud software as a service (SaaS) providers refer to data in use as any data currently being processed by applications, as the CPU and memory are utilized.[21]
Because of its nature, data in use is of increasing concern to businesses, government agencies and other institutions. Data in use, or memory, can contain sensitive data including digital certificates, encryption keys, intellectual property (software algorithms, design data), and personally identifiable information. Compromising data in use enables access to encrypted data at rest and data in motion. For example, someone with access to random access memory can parse that memory to locate the encryption key for data at rest. Once they have obtained that encryption key, they can decrypt encrypted data at rest. Threats to data in use can come in the form of cold boot attacks, malicious hardware devices, rootkits and bootkits.
Encryption, which prevents data visibility in the event of its unauthorized access or theft, is commonly used to protect data in motion and data at rest, and is increasingly recognized as an optimal method for protecting data in use. There have been multiple projects to encrypt memory. Microsoft Xbox systems are designed to provide memory encryption, and the company PrivateCore presently has a commercial software product, vCage, to provide attestation along with full memory encryption for x86 servers.[22] Several papers have been published highlighting the availability of security-enhanced x86 and ARM commodity processors.[19][23] In that work, an ARM Cortex-A8 processor is used as the substrate on which a full memory encryption solution is built. Process segments (for example, stack, code or heap) can be encrypted individually or in composition. This work marks the first full memory encryption implementation on a mobile general-purpose commodity processor. The system provides both confidentiality and integrity protections of code and data which are encrypted everywhere outside the CPU boundary.
For x86 systems, AMD has a Secure Memory Encryption (SME) feature, introduced in 2017 with Epyc.[24] Intel has promised to deliver its Total Memory Encryption (TME) feature in an upcoming CPU.[25][26]
Operating system kernel patches such as TRESOR and Loop-Amnesia modify the operating system so that CPU registers can be used to store encryption keys, avoiding holding encryption keys in RAM. While this approach is not general-purpose and does not protect all data in use, it does protect against cold boot attacks. Encryption keys are held inside the CPU rather than in RAM, so that data-at-rest encryption keys are protected against attacks that might compromise encryption keys in memory.
Intel Corporation introduced the concept of "enclaves" as part of its Software Guard Extensions, revealing an architecture combining software and CPU hardware in technical papers published in 2013.[27] An enclave is a region secured with encryption in RAM, so that enclave data is encrypted while in RAM but available as clear text inside the CPU and CPU cache.
Several cryptographic tools, including secure multi-party computation and homomorphic encryption, allow for the private computation of data on untrusted systems. Data in use could be operated upon while encrypted and never exposed to the system doing the processing.
Data in transit, also referred to as data in motion[28] and data in flight,[29] is data en route between source and destination, typically on a computer network.
Data in transit can be separated into two categories: information that flows over a public or untrusted network such as the Internet, and data that flows within the confines of a private network such as a corporate or enterprise local area network (LAN).[30]
Various types of data which can be visualized through a computer device.
Physical computer memory elements consist of an address and a byte/word of data storage. Digital data are often stored in relational databases, like tables or SQL databases, and can generally be represented as abstract key/value pairs. Data can be organized in many different types of data structures, including arrays, graphs, and objects. Data structures can store data of many different types, including numbers, strings and even other data structures.
Metadata helps translate data to information. Metadata is data about the data. Metadata may be implied, specified or given.
Data relating to physical events or processes will have a temporal component. This temporal component may be implied. This is the case when a device such as a temperature logger receives data from a temperature sensor. When the temperature is received, it is assumed that the data has a temporal reference of now. So the device records the date, time and temperature together. When the data logger communicates temperatures, it must also report the date and time as metadata for each temperature reading.
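A minimal sketch of such a logger, capturing the implied "now" as explicit metadata at the moment each reading arrives (the field names are illustrative):

```python
from datetime import datetime, timezone

def record(temperature_c, log):
    """Attach the temporal metadata ("now") to a raw sensor reading."""
    log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "temperature_c": temperature_c,
    })

log = []
record(21.4, log)   # reading arrives; "now" is captured alongside it
record(21.6, log)
for entry in log:
    print(entry["timestamp"], entry["temperature_c"])
```

Once the timestamp is recorded, the reading can be communicated later without losing its temporal reference.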
Fundamentally, computers follow a sequence of instructions they are given in the form of data. A set of instructions to perform a given task (or tasks) is called a program. A program is data in the form of coded instructions to control the operation of a computer or other machine.[32] In the nominal case, the program, as executed by the computer, will consist of machine code. The elements of storage manipulated by the program, but not actually executed by the central processing unit (CPU), are also data. At its most essential, a single datum is a value stored at a specific location. Therefore, it is possible for computer programs to operate on other computer programs, by manipulating their programmatic data.
To store data bytes in a file, they have to be serialized in a file format. Typically, programs are stored in special file types, different from those used for other data. Executable files contain programs; all other files are data files. However, executable files may also contain data that is built into the program. In particular, some executable files have a data segment, which nominally contains constants and initial values for variables, both of which can be considered data.
The line between program and data can become blurry. An interpreter, for example, is a program. The input data to an interpreter is itself a program, just not one expressed in native machine language. In many cases, the interpreted program will be a human-readable text file, which is manipulated with a text editor program. Metaprogramming similarly involves programs manipulating other programs as data. Programs like compilers, linkers, debuggers, program updaters, virus scanners and such use other programs as their data.
For example, a user might first instruct the operating system to load a word processor program from one file, and then use the running program to open and edit a document stored in another file. In this example, the document would be considered data. If the word processor also features a spell checker, then the dictionary (word list) for the spell checker would also be considered data. The algorithms used by the spell checker to suggest corrections would be either machine code data or text in some interpretable programming language.
Keys in data provide the context for values. Regardless of the structure of data, there is always a key component present. Keys in data and data-structures are essential for giving meaning to data values. Without a key that is directly or indirectly associated with a value, or collection of values in a structure, the values become meaningless and cease to be data. That is to say, there has to be a key component linked to a value component in order for it to be considered data.[citation needed]
Data can be represented in computers in multiple ways, as per the following examples:
Random access memory (RAM) holds data that the CPU has direct access to. A CPU may only manipulate data within its processor registers or memory. This is as opposed to data storage, where the CPU must direct the transfer of data between the storage device (disk, tape...) and memory. RAM is an array of linear, contiguous locations that a processor may read or write by providing an address for the read or write operation. The processor may operate on any location in memory at any time, in any order. In RAM, the smallest element of data is the binary bit. The capabilities and limitations of accessing RAM are processor-specific. In general, main memory is arranged as an array of locations beginning at address 0 (hexadecimal 0). Each location usually stores 8 or 32 bits, depending on the computer architecture.
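The linear, byte-addressable model described above can be mimicked with a byte array (a simplified stand-in for physical RAM, with locations of 8 bits each starting at address 0):

```python
# RAM modeled as a contiguous array of byte-sized locations.
ram = bytearray(256)     # 256 locations, addresses 0x00 through 0xFF

ram[0x10] = 0xFF         # write the value 255 at address 0x10
value = ram[0x10]        # read it back by providing the same address
assert value == 255
assert ram[0x11] == 0    # any other location can be read at any time
```

The key property is random access: any address can be read or written directly, in any order, without seeking through preceding locations as a tape or disk would require.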
Data keys need not be a direct hardware address in memory. Indirect, abstract and logical key codes can be stored in association with values to form a data structure. Data structures have predetermined offsets (or links or paths) from the start of the structure, in which data values are stored. Therefore, the data key consists of the key to the structure plus the offset (or links or paths) into the structure. When such a structure is repeated, storing variations of the data values and the data keys within the same repeating structure, the result can be considered to resemble a table, in which each element of the repeating structure is considered to be a column and each repetition of the structure is considered to be a row of the table. In such an organization of data, the data key is usually a value in one of the columns (or a composite of the values in several).
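A repeating structure and its table-like view can be sketched with named tuples; the field names here (such as `part_no`) are hypothetical:

```python
from collections import namedtuple

# Each instance of the structure is a "row"; each field, at a fixed
# offset within the structure, is a "column".
Part = namedtuple("Part", ["part_no", "description", "qty"])

rows = [Part("A100", "bolt", 40),
        Part("A200", "nut", 75)]

# One column (part_no) serves as the data key: key plus field offset
# addresses any value in the "table".
by_key = {row.part_no: row for row in rows}
assert by_key["A200"].qty == 75
```

Here the key (`"A200"`) selects the repetition of the structure, and the field name (`qty`) is the predetermined offset into it.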
The tabular view of repeating data structures is only one of many possibilities. Repeating data structures can be organised hierarchically, such that nodes are linked to each other in a cascade of parent-child relationships. Values and potentially more complex data structures are linked to the nodes. Thus the nodal hierarchy provides the key for addressing the data structures associated with the nodes. This representation can be thought of as an inverted tree. Modern computer operating system file systems are a common example, and XML is another.
Data has some inherent features when it is sorted on a key: all the values for subsets of the key appear together. When passing sequentially through the data, the point at which the key (or a subset of the key) changes is referred to in data processing circles as a break, or a control break. Sorting particularly facilitates the aggregation of data values on subsets of a key.
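Control-break processing maps directly onto `itertools.groupby`, which emits a new group each time the key changes in sorted data (the record fields here are illustrative):

```python
from itertools import groupby
from operator import itemgetter

# Records sorted on the key; an aggregate is emitted at each break.
sales = [("east", 100), ("west", 75), ("east", 250), ("west", 25)]
sales.sort(key=itemgetter(0))        # the data must be sorted on the key

for region, group in groupby(sales, key=itemgetter(0)):
    total = sum(amount for _, amount in group)
    print(region, total)             # runs once per control break
# east 350
# west 100
```

Because the data is sorted, each key's values arrive contiguously, so a single sequential pass suffices to produce every subtotal.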
Until the advent of bulk non-volatile memory like flash, persistent data storage was traditionally achieved by writing the data to external block devices like magnetic tape and disk drives. These devices typically seek to a location on the magnetic media and then read or write blocks of data of a predetermined size. In this case, the seek location on the media is the data key and the blocks are the data values. Early raw-disk file systems or disc operating systems reserved contiguous blocks on the disc drive for data files. In those systems, the files could be filled up, running out of data space before all the data had been written to them. Thus much unused data space was reserved unproductively to ensure adequate free space for each file. Later file systems introduced partitions. They reserved blocks of disc data space for partitions and used the allocated blocks more economically, by dynamically assigning blocks of a partition to a file as needed. To achieve this, the file system had to keep track of which blocks were used or unused by data files in a catalog or file allocation table. Though this made better use of the disc data space, it resulted in fragmentation of files across the disc, and a concomitant performance overhead due to the additional seek time needed to read the data. Modern file systems reorganize fragmented files dynamically to optimize file access times. Further developments in file systems resulted in the virtualization of disc drives, i.e. where a logical drive can be defined as partitions from a number of physical drives.
Retrieving a small subset of data from a much larger set may imply inefficiently searching through the data sequentially. Indexes are a way to copy out keys and location addresses from data structures in files, tables and data sets, then organize them using inverted tree structures to reduce the time taken to retrieve a subset of the original data. In order to do this, the key of the subset of data to be retrieved must be known before retrieval begins. The most popular indexes are the B-tree and the dynamic hash key indexing methods. Indexing is overhead for filing and retrieving data. There are other ways of organizing indexes, e.g. sorting the keys and using a binary search algorithm.
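The sorted-keys-plus-binary-search variant mentioned above can be sketched with the standard `bisect` module; the keys and location addresses are made-up examples:

```python
import bisect

# A minimal index: (key, location) pairs kept sorted on the key.
index = sorted([("smith", 1042), ("jones", 88), ("adams", 7)])
keys = [k for k, _ in index]          # the sorted key column

def lookup(key):
    """Binary-search the index instead of scanning the data set."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return index[i][1]            # location address of the record
    return None

assert lookup("jones") == 88          # found in O(log n) comparisons
assert lookup("nobody") is None
```

The index is overhead (it must be built and kept sorted), but each retrieval then costs logarithmic rather than linear time in the number of records.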
Object-oriented programming relies on two key data structures: the taxonomic rank-structure of classes, which is an example of a hierarchical data structure; and, at run time, the creation of references to in-memory data structures of objects that have been instantiated from a class library.
It is only after instantiation that an object of a specified class exists. After an object's reference is cleared, the object also ceases to exist. The memory locations where the object's data was stored are garbage and are reclassified as unused memory available for reuse.
Modern scalable and high-performance data persistence technologies, such as Apache Hadoop, rely on massively parallel distributed data processing across many commodity computers on a high-bandwidth network. In such systems, the data is distributed across multiple computers, and therefore any particular computer in the system must be represented in the key of the data, either directly or indirectly. This enables the differentiation between two identical sets of data, each being processed on a different computer at the same time.
All digital information possesses common properties that distinguish it from analog data with respect to communications:
Synchronization: Since digital information is conveyed by the sequence in which symbols are ordered, all digital schemes have some method for determining the beginning of a sequence. In written or spoken human languages, synchronization is typically provided by pauses (spaces), capitalization, and punctuation. Machine communications typically use special synchronization sequences.
Language: All digital communications require a formal language, which in this context consists of all the information that the sender and receiver of the digital communication must both possess, in advance, for the communication to be successful. Languages are generally arbitrary and specify the meaning to be assigned to particular symbol sequences, the allowed range of values, methods to be used for synchronization, etc.
Errors: Disturbances (noise) in analog communications invariably introduce some, generally small, deviation or error between the intended and actual communication. Disturbances in digital communication only result in errors when the disturbance is so large as to result in a symbol being misinterpreted as another symbol or to disturb the sequence of symbols. It is generally possible to have near-error-free digital communication. Further, techniques such as check codes may be used to detect errors and correct them through redundancy or re-transmission. Errors in digital communications can take the form of substitution errors, in which a symbol is replaced by another symbol, or insertion/deletion errors, in which an extra incorrect symbol is inserted into or deleted from a digital message. Uncorrected errors in digital communications have an unpredictable and generally large impact on the information content of the communication.
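A single parity bit is one of the simplest check codes of the kind mentioned above; this sketch shows how it detects (without locating or correcting) a single-bit substitution error:

```python
# Even-parity check code: the sender appends one bit so the total
# number of 1s is even; the receiver then recomputes the parity.
def add_parity(bits):
    return bits + [sum(bits) % 2]

def check(bits_with_parity):
    return sum(bits_with_parity) % 2 == 0   # True if no error detected

word = add_parity([1, 0, 1, 1])
assert check(word)            # intact transmission passes the check
word[2] ^= 1                  # a single-bit substitution error...
assert not check(word)        # ...is detected (though not located)
```

Real systems use stronger codes (checksums, CRCs, error-correcting codes) that can also locate and correct errors through redundancy, at the cost of extra transmitted symbols.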
Copying: Because of the inevitable presence of noise, making many successive copies of an analog communication is infeasible because each generation increases the noise. Because digital communications are generally error-free, copies of copies can be made indefinitely.
Granularity: The digital representation of a continuously variable analog value typically involves a selection of the number of symbols to be assigned to that value. The number of symbols determines the precision or resolution of the resulting datum. The difference between the actual analog value and the digital representation is known as quantization error. For example, if the actual temperature is 23.234456544453 degrees, but only two digits (23) are assigned to this parameter in a particular digital representation, the quantization error is 0.234456544453. This property of digital communication is known as granularity.
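The temperature example above works out as a one-line subtraction:

```python
# Quantization: represent a continuous value with a fixed number of
# digits; the discarded remainder is the quantization error.
actual = 23.234456544453
quantized = float(int(actual))      # keep only two digits: 23
error = actual - quantized
print(quantized, round(error, 12))  # 23.0 0.234456544453
```

Assigning more digits to the representation shrinks the error, which is exactly the precision/granularity trade-off the paragraph describes.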
Compressible: According to Miller, "Uncompressed digital data is very large, and in its raw form, it would actually produce a larger signal (therefore be more difficult to transfer) than analog data. However, digital data can be compressed. Compression reduces the amount of bandwidth space needed to send information. Data can be compressed, sent, and then decompressed at the site of consumption. This makes it possible to send much more information and results in, for example, digital television signals offering more room on the airwave spectrum for more television channels."[1]
Even though digital signals are generally associated with the binary electronic digital systems used in modern electronics and computing, digital systems are actually ancient, and need not be binary or electronic.
The DNA genetic code is a naturally occurring form of digital data storage.
Written text (due to the limited character set and the use of discrete symbols – the alphabet in most cases)
The abacus was created sometime between 1000 BC and 500 BC and later became a widespread aid to calculation. It can be seen as a basic digital calculator that uses beads on rods to represent numbers: the beads have meaning only in discrete up and down states, not in analog in-between states.
A beacon is perhaps the simplest non-electronic digital signal, with just two states (on and off). In particular, smoke signals are one of the oldest examples of a digital signal, where an analog "carrier" (smoke) is modulated with a blanket to generate a digital signal (puffs) that conveys information.
Morse code uses six digital states—dot, dash, intra-character gap (between each dot or dash), short gap (between each letter), medium gap (between words), and long gap (between sentences)—to send messages via a variety of potential carriers such as electricity or light, for example using an electrical telegraph or a flashing light.
Braille uses a six-bit code rendered as dot patterns.
Flag semaphore uses rods or flags held in particular positions to send messages to the receiver watching them some distance away.
International maritime signal flags have distinctive markings that represent letters of the alphabet to allow ships to send messages to each other.
More recently invented, a modem modulates an analog "carrier" signal (such as sound) to encode binary electrical digital information, as a series of binary digital sound pulses. A slightly earlier, surprisingly reliable version of the same concept was to bundle a sequence of audio digital "signal" and "no signal" information (i.e. "sound" and "silence") on magnetic cassette tape for use with early home computers.
^ Miller, Vincent (2011). Understanding Digital Culture. London: Sage Publications. sec. "Convergence and the contemporary media experience". ISBN 978-1-84787-497-9.