Binary data

From Wikipedia, the free encyclopedia

Data whose unit can take on only two possible states

Binary data is data whose unit can take on only two possible states. These are often labelled as 0 and 1 in accordance with the binary numeral system and Boolean algebra.

Binary data occurs in many different technical and scientific fields, where it can be called by different names, including bit (binary digit) in computer science, truth value in mathematical logic and related domains, and binary variable in statistics.

Mathematical and combinatoric foundations

A discrete variable that can take only one state contains zero information, and 2 is the next natural number after 1. That is why the bit, a variable with only two possible values, is a standard primary unit of information.

A collection of n bits may have 2ⁿ states: see binary number for details. The number of states of a collection of discrete variables depends exponentially on the number of variables, and only as a power law on the number of states of each variable. Ten bits have more states (1024) than three decimal digits (1000). 10k bits are more than sufficient to represent any information (a number or anything else) that requires 3k decimal digits, so information contained in discrete variables with 3, 4, 5, 6, 7, 8, 9, 10... states can always be represented by allocating two, three, or four times as many bits. So, using any small number other than 2 does not provide an advantage.
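The arithmetic in this paragraph can be checked with a short Python sketch. The function names `states` and `bits_for_states` are illustrative and not taken from any library.

```python
def states(n_bits: int) -> int:
    """Number of distinct states of a collection of n_bits bits."""
    return 2 ** n_bits


def bits_for_states(n_states: int) -> int:
    """Smallest number of bits b such that 2**b >= n_states (n_states >= 1)."""
    return (n_states - 1).bit_length()


print(states(10))                # 1024, more than the 1000 states of three decimal digits
print(bits_for_states(10 ** 3))  # 10 bits are enough for three decimal digits

# Bits needed to hold one variable with 3, 4, ..., 10 states.
for base in range(3, 11):
    print(base, bits_for_states(base))
```

For variables with 3 to 10 states the loop reports 2, 3, or 4 bits, which matches the "two, three, or four times" figure in the text.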

A Hasse diagram: representation of a Boolean algebra as a directed graph

Moreover, Boolean algebra provides a convenient mathematical structure for collections of bits, with the semantics of a collection of propositional variables. Boolean algebra operations are known as "bitwise operations" in computer science. Boolean functions are also well-studied theoretically and easily implementable, either with computer programs or by so-called logic gates in digital electronics. This contributes to the use of bits to represent different data, even data that is not originally binary.
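As a small illustration, Python's integer operators apply Boolean operations to every bit position at once. The 4-bit width, the mask, and the variable names below are assumptions made for the example.

```python
MASK = 0b1111  # restrict results to 4 bits (Python integers are unbounded)

a, b = 0b1100, 0b1010

conj = a & b       # bitwise AND -> 0b1000
disj = a | b       # bitwise OR  -> 0b1110
xor = a ^ b        # bitwise XOR -> 0b0110
neg_a = ~a & MASK  # bitwise NOT within 4 bits -> 0b0011

# De Morgan's law, checked on all four bit positions at once.
de_morgan_holds = (~(a & b) & MASK) == ((~a | ~b) & MASK)

# Data that is not originally binary can still be stored as bits:
# the character "A" has code point 65.
bits_of_A = format(ord("A"), "08b")  # '01000001'

print(bin(conj), bin(disj), bin(xor), bin(neg_a), de_morgan_holds, bits_of_A)
```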

In statistics

In statistics, binary data is a statistical data type consisting of categorical data that can take exactly two possible values, such as "A" and "B", or "heads" and "tails". It is also called dichotomous data, and an older term is quantal data.^[1] The two values are often referred to generically as "success" and "failure".^[1] As a form of categorical data, binary data is nominal data, meaning the values are qualitatively different and cannot be compared numerically. However, the values are frequently represented as 1 or 0, which corresponds to counting the number of successes in a single trial: 1 (success…) or 0 (failure); see § Counting. More intuitively, binary data can be represented as count data.

Often, binary data is used to represent one of two conceptually opposed values, e.g.:

the outcome of an experiment ("success" or "failure")
the response to a yes–no question ("yes" or "no")
presence or absence of some feature ("is present" or "is not present")
the truth or falsehood of a proposition ("true" or "false", "correct" or "incorrect")

However, it can also be used for data that is assumed to have only two possible values, even if they are not conceptually opposed or conceptually represent all possible values in the space. For example, binary data is often used to represent the party choices of voters in elections in the United States, i.e. Republican or Democratic. In this case, there is no inherent reason why only two political parties should exist, and indeed, other parties do exist in the U.S., but they are so minor that they are generally simply ignored. Modeling continuous data (or categorical data of more than 2 categories) as a binary variable for analysis purposes is called dichotomization (creating a dichotomy). Like all discretization, it involves discretization error, but the goal is to learn something valuable despite the error: treating it as negligible for the purpose at hand, but remembering that it cannot be assumed to be negligible in general.
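A minimal sketch of dichotomizing continuous data, assuming a simple threshold rule. The readings, the threshold of 38.0, and the function name `dichotomize` are made up for illustration.

```python
def dichotomize(values, threshold):
    """Code each value as 1 if it is at or above the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in values]


# Hypothetical continuous measurements.
readings = [36.4, 37.1, 38.2, 39.0, 36.9]
binary = dichotomize(readings, threshold=38.0)

print(binary)  # [0, 0, 1, 1, 0]
```

The readings 38.2 and 39.0 both become 1, so the difference between them is lost. That loss is the discretization error described above.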

Binary variables

A binary variable is a random variable of binary type, meaning it has two possible values. Independent and identically distributed (i.i.d.) binary variables follow a Bernoulli distribution, but in general binary data need not come from i.i.d. variables. Total counts of i.i.d. binary variables (equivalently, sums of i.i.d. binary variables coded as 1 or 0) follow a binomial distribution, but when binary variables are not i.i.d., the distribution need not be binomial.
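The claim about sums of i.i.d. binary variables can be checked numerically for small cases. The sketch below enumerates every 0/1 outcome of n trials and compares the resulting distribution of the sum with the binomial formula. The success probability `p` and the function names are assumptions introduced for the example.

```python
from itertools import product
from math import comb, isclose


def sum_distribution(n, p):
    """Exact distribution of the sum of n i.i.d. Bernoulli(p) variables,
    obtained by enumerating all 2**n outcomes."""
    dist = [0.0] * (n + 1)
    for outcome in product((0, 1), repeat=n):
        k = sum(outcome)  # number of successes in this outcome
        dist[k] += p ** k * (1 - p) ** (n - k)
    return dist


def binomial_pmf(n, p):
    """Binomial probabilities P(X = k) for k = 0..n."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


enumerated = sum_distribution(5, 0.3)
formula = binomial_pmf(5, 0.3)
print(all(isclose(a, b) for a, b in zip(enumerated, formula)))
```

Enumeration takes 2ⁿ steps, so this is only a check for small n and not a practical way to compute binomial probabilities.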

Counting

Like categorical data, binary data can be converted to a vector of count data by writing one coordinate for each possible value, and counting 1 for the value that occurs, and 0 for the value that does not occur.^[2] For example, if the values are A and B, then the data set A, A, B can be represented in counts as (1, 0), (1, 0), (0, 1). Once converted to counts, binary data can be grouped and the counts added. For instance, if the set A, A, B is grouped, the total counts are (2, 1): 2 A's and 1 B (out of 3 trials).
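The conversion and grouping described here can be sketched as follows. The helper names `to_counts` and `group` are hypothetical.

```python
def to_counts(data, values=("A", "B")):
    """Represent each observation as a count vector with one coordinate per value."""
    return [tuple(1 if x == v else 0 for v in values) for x in data]


def group(count_vectors):
    """Add count vectors coordinate by coordinate."""
    return tuple(sum(column) for column in zip(*count_vectors))


counts = to_counts(["A", "A", "B"])
print(counts)         # [(1, 0), (1, 0), (0, 1)]
print(group(counts))  # (2, 1)
```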

Since there are only two possible values, this can be simplified to a single count (a scalar value) by considering one value as "success" and the other as "failure", coding the success value as 1 and the failure value as 0 (using only the coordinate for the "success" value, not the coordinate for the "failure" value). For example, if the value A is considered "success" (and thus B is considered "failure"), the data set A, A, B would be represented as 1, 1, 0. When this is grouped, the values are added, while the number of trials is generally tracked implicitly. For example, A, A, B would be grouped as 1 + 1 + 0 = 2 successes (out of $n=3$ trials). Going the other way, count data with $n=1$ is binary data, with the two classes being 0 (failure) or 1 (success).
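The scalar coding can be sketched in the same way. The function name `code_success` is illustrative, and treating A as the "success" value follows the example in the text.

```python
def code_success(data, success="A"):
    """Code the success value as 1 and any other value as 0."""
    return [1 if x == success else 0 for x in data]


coded = code_success(["A", "A", "B"])
successes = sum(coded)  # grouping adds the coded values
n = len(coded)          # number of trials, tracked alongside the count

print(coded, successes, n)  # [1, 1, 0] 2 3
```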

Counts of i.i.d. binary variables follow a binomial distribution, with $n$ the total number of trials (points in the grouped data).

Regression

Main article: Binary regression

Regression analysis on predicted outcomes that are binary variables is known as binary regression; when binary data is converted to count data and modeled as i.i.d. variables (so they have a binomial distribution), binomial regression can be used. The most common regression methods for binary data are logistic regression, probit regression, or related types of binary choice models.
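To make logistic regression concrete, the following is a minimal sketch that fits a one-predictor model P(y = 1 | x) = sigmoid(b0 + b1·x) by plain gradient ascent on the log-likelihood. The data, learning rate, step count, and function names are all assumptions chosen for illustration. In practice a statistics library would be used instead of hand-written optimization.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit intercept b0 and slope b1 by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Hypothetical data: a binary outcome that becomes more likely as x grows.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 0.5)
p_high = sigmoid(b0 + b1 * 5.0)
print(b1 > 0, p_low < p_high)
```

The fitted slope is positive for this data, so the predicted probability of a 1 rises with x. A probit model would replace the logistic function with the standard normal cumulative distribution function.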

Similarly, counts of i.i.d. categorical variables with more than two categories can be modeled with a multinomial regression. Counts of non-i.i.d. binary data can be modeled by more complicated distributions, such as the beta-binomial distribution (a compound distribution). Alternatively, the relationship can be modeled without needing to explicitly model the distribution of the output variable using techniques from generalized linear models, such as quasi-likelihood and a quasibinomial model; see Overdispersion § Binomial.
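As an illustration of why a compound distribution may be needed, the sketch below compares the variance of a beta-binomial count with that of a binomial count with the same mean. The parameter values n = 10, a = 2, b = 2 and the function names are assumptions chosen for the example.

```python
from math import comb, exp, isclose, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """P(X = k) when the success probability is itself drawn from Beta(a, b)."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))


n, a, b = 10, 2.0, 2.0
pmf = [beta_binomial_pmf(k, n, a, b) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

p = a / (a + b)                      # mean success probability
binomial_variance = n * p * (1 - p)  # variance if the trials were i.i.d.

print(mean, variance, binomial_variance)
```

With these parameters both distributions have mean 5, but the beta-binomial variance is larger than the binomial variance of 2.5. This is the overdispersion that beta-binomial and quasibinomial models are meant to handle.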