In statistics, data can have any of various types. Statistical data types include categorical (e.g. country), directional (angles or directions, e.g. wind measurements), count (a whole number of events), or real intervals (e.g. measures of temperature).
The data type is a fundamental concept in statistics and controls what sorts of probability distributions can logically be used to describe the variable, the permissible operations on the variable, the type of regression analysis used to predict the variable, etc. The concept of data type is similar to the concept of level of measurement, but more specific. For example, count data require a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data require, but both fall under the same level of measurement (a ratio scale).
Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as is the case with longitude and temperature measurements in degrees Celsius or degrees Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.
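As a small illustration of why the zero point matters, the Python sketch below converts three Celsius readings to Fahrenheit (a rescaling combined with a shift of the zero) and compares ratios of values with ratios of differences. The function name and the sample readings are made up for this example.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit (rescaling plus a shifted zero)."""
    return c * 9 / 5 + 32

# Hypothetical readings in degrees Celsius.
low_c, mid_c, high_c = 10.0, 20.0, 30.0
low_f, mid_f, high_f = (celsius_to_fahrenheit(t) for t in (low_c, mid_c, high_c))

# Ratios of values depend on the arbitrary zero, so they change with the unit.
ratio_c = mid_c / low_c
ratio_f = mid_f / low_f

# Ratios of differences are unaffected by the change of unit.
diff_ratio_c = (high_c - mid_c) / (mid_c - low_c)
diff_ratio_f = (high_f - mid_f) / (mid_f - low_f)

print(ratio_c, ratio_f)            # the two ratios disagree
print(diff_ratio_c, diff_ratio_f)  # the two difference ratios agree
```

On an interval scale, a statement such as "twice as warm" therefore has no unit-independent meaning, whereas "the second step is as large as the first" does.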
Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.
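The following Python sketch illustrates why integer codes for a polytomous categorical variable are only labels. The blood-type sample and both coding schemes are invented for the example. The mode refers to the same category under any one-to-one relabeling, whereas the arithmetic mean of the codes changes with the arbitrary assignment.

```python
from statistics import mean, mode

# Hypothetical sample of a polytomous categorical variable.
sample = ["A", "O", "O", "B", "AB", "O", "A"]

# Two arbitrary one-to-one codings of the same categories.
coding_1 = {"A": 0, "B": 1, "AB": 2, "O": 3}
coding_2 = {"O": 0, "A": 1, "AB": 2, "B": 3}

def summarize(sample, coding):
    """Return the modal category and the mean of the integer codes."""
    codes = [coding[label] for label in sample]
    decode = {code: label for label, code in coding.items()}
    return decode[mode(codes)], mean(codes)

mode_1, mean_1 = summarize(sample, coding_1)
mode_2, mean_2 = summarize(sample, coding_2)

print(mode_1, mode_2)  # same category under both codings
print(mean_1, mean_2)  # different numbers, neither of them meaningful
```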
Other categorizations have been proposed. For example, Mosteller and Tukey (1977)[1] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990)[2] described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998),[3] van den Berg (1991).[4]
The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).[5]
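A minimal Python sketch of the point made in the quotation: whether one group has the larger mean is a statement whose truth value can change under a logarithmic (order-preserving) transformation, while the same comparison of medians does not change. The two samples are invented to make the reversal visible.

```python
import math
from statistics import mean, median

# Hypothetical positive measurements for two groups.
group_a = [1.0, 1.0, 100.0]
group_b = [20.0, 20.0, 20.0]

log_a = [math.log(x) for x in group_a]
log_b = [math.log(x) for x in group_b]

# "A has the larger mean" is true on the raw scale but false on the log scale.
mean_raw_a_larger = mean(group_a) > mean(group_b)
mean_log_a_larger = mean(log_a) > mean(log_b)

# The median comparison gives the same answer on both scales,
# because the logarithm preserves order.
median_raw_a_larger = median(group_a) > median(group_b)
median_log_a_larger = median(log_a) > median(log_b)

print(mean_raw_a_larger, mean_log_a_larger)
print(median_raw_a_larger, median_log_a_larger)
```

Which comparison is appropriate depends, as the quotation says, on the question being asked, not on the data alone.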
The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logically possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.
Data type | Possible values | Example usage | Level of measurement | Common distributions | Scale of relative differences | Permissible statistics | Common model |
---|---|---|---|---|---|---|---|
binary | 0, 1 (arbitrary labels) | binary outcome ("yes/no", "true/false", "success/failure", etc.) | nominal scale | Bernoulli | | mode, chi-squared | logistic, probit |
categorical | "name1", "name2", "name3", ... "nameK" (arbitrary labels) | categorical outcome with names or places like "Rome", "Amsterdam", "Madrid", "London", "Washington" (specific blood type, political party, word, etc.) | nominal scale | categorical | | mode, chi-squared | multinomial logit, multinomial probit |
ordinal | ordering categories or integer or real number (arbitrary scale) | ordering adverbs like "Small", "Medium", "Large"; relative score, significant only for creating a ranking | ordinal scale | categorical | relative comparison | | ordinal regression (ordered logit, ordered probit) |
binomial | 0, 1, ..., N | number of successes (e.g. yes votes) out of N possible | interval scale | binomial, beta-binomial | additive | mean, median, mode, standard deviation, correlation | binomial regression (logistic, probit) |
count | nonnegative integers (0, 1, ...) | number of items (telephone calls, people, molecules, births, deaths, etc.) in given interval/area/volume | ratio scale | Poisson, negative binomial | multiplicative | All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation | Poisson, negative binomial regression |
real-valued additive | real number | temperature in degrees Celsius or degrees Fahrenheit, relative distance, location parameter, etc. (or approximately, anything not varying over a large scale) | interval scale | normal, etc. (usually symmetric about the mean) | additive | mean, median, mode, standard deviation, correlation | standard linear regression |
real-valued multiplicative | positive real number | temperature in kelvin, price, income, size, scale parameter, etc. (especially when varying over a large scale) | ratio scale | log-normal, gamma, exponential, etc. (usually a skewed distribution) | multiplicative | All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation | generalized linear model with logarithmic link |
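The additional statistics listed for the multiplicative rows of the table can be computed with Python's standard `statistics` module, as in the sketch below. The price values and the helper `coefficient_of_variation` are made up for illustration. The second half shows one reason the coefficient of variation is tied to a meaningful zero: it is unchanged when the unit is rescaled but changes when a constant is added.

```python
from statistics import geometric_mean, harmonic_mean, mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

# Hypothetical positive, multiplicative data (e.g. prices).
prices = [2.0, 4.0, 8.0]

gm = geometric_mean(prices)  # cube root of 2 * 4 * 8
hm = harmonic_mean(prices)   # 3 / (1/2 + 1/4 + 1/8)
cv = coefficient_of_variation(prices)

# Rescaling (a change of unit) leaves the coefficient of variation unchanged ...
cv_rescaled = coefficient_of_variation([100.0 * p for p in prices])
# ... but shifting the zero point does not.
cv_shifted = coefficient_of_variation([p + 10.0 for p in prices])

print(gm, hm)
print(cv, cv_rescaled, cv_shifted)
```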
Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:
These concepts originate in various scientific fields and frequently overlap in usage. As a result, several concepts can often be applied to the same problem.
Most data types in statistics have comparable types in computer programming, and vice versa, as shown in the following table:
Statistics | Programming |
---|---|
real-valued (interval scale) | floating-point |
real-valued (ratio scale) | floating-point |
count data (usually non-negative) | integer |
binary data | Boolean |
categorical data | enumerated type |
random vector | list or array |
random matrix | two-dimensional array |
random tree | tree |
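In Python (3.9 or later), the correspondence in the table might look like the following sketch. All names (`BloodType`, `TreeNode`, and the variables) are hypothetical and chosen only for illustration; other languages and libraries offer different representations.

```python
from dataclasses import dataclass, field
from enum import Enum

class BloodType(Enum):
    """Categorical data as an enumerated type."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"

@dataclass
class TreeNode:
    """A minimal tree: a value plus a list of child nodes."""
    value: float
    children: list["TreeNode"] = field(default_factory=list)

temperature_celsius: float = 21.5    # real-valued (interval scale)
income: float = 52000.0              # real-valued (ratio scale)
calls_per_hour: int = 7              # count data
voted_yes: bool = True               # binary data
blood_type: BloodType = BloodType.O  # categorical data

# One realization each of a random vector, a random matrix, and a random tree.
observation: list[float] = [1.2, 0.4, 3.3]
matrix: list[list[float]] = [[1.0, 0.2], [0.2, 2.0]]
tree = TreeNode(1.0, [TreeNode(2.0), TreeNode(3.0)])
```

The type annotations document intent only; Python does not enforce, for example, that `calls_per_hour` stays non-negative, so such constraints would need explicit checks.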