Statistical data type

From Wikipedia, the free encyclopedia

Taxonomy of statistical data elements

In statistics, data can have any of various types. Statistical data types include categorical (e.g. country), directional (angles or directions, e.g. wind measurements), count (a whole number of events), and real intervals (e.g. measures of temperature).

The data type is a fundamental concept in statistics and controls what sorts of probability distributions can logically be used to describe the variable, the permissible operations on the variable, the type of regression analysis used to predict the variable, etc. The concept of data type is similar to the concept of level of measurement, but more specific. For example, count data require a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data require, but both fall under the same level of measurement (a ratio scale).
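As a rough illustration of why count data and non-negative real-valued data call for different distributions, the sketch below compares a Poisson probability mass function with an exponential density. The function names and the rate value are assumptions made for this example only.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) for a Poisson variable; nonzero only for integers k >= 0."""
    if k < 0 or k != int(k):
        return 0.0
    k = int(k)
    return math.exp(-rate) * rate ** k / math.factorial(k)

def exponential_pdf(x, rate):
    """Density of an exponential variable; nonzero for any real x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

rate = 3.0  # illustrative value

# A non-integer value is impossible for a count, but fine for a real quantity.
count_prob_at_2_5 = poisson_pmf(2.5, rate)
density_at_2_5 = exponential_pdf(2.5, rate)

# The Poisson probabilities over the whole numbers sum to (almost exactly) 1.
total_mass = sum(poisson_pmf(k, rate) for k in range(51))
```

Both variables sit on a ratio scale, yet only the exponential model gives positive weight to a value such as 2.5.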

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have a meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as is the case with longitude and temperature measurements in degrees Celsius or degrees Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.
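The permitted transformations can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the data values and helper names are invented for illustration. It shows that the median commutes with an order-preserving transformation while the mean does not, and that a Celsius-to-Fahrenheit conversion preserves ratios of differences but not ratios of values.

```python
import math
import statistics

# Hypothetical ordinal scores; the values are illustrative only.
scores = [1, 2, 2, 3, 10]

def cube(x):
    """A strictly increasing (order-preserving) but non-linear transformation."""
    return x ** 3

# Ordinal scale: the median commutes with any order-preserving transformation...
median_then_cube = cube(statistics.median(scores))
cube_then_median = statistics.median([cube(x) for x in scores])

# ...whereas the mean generally does not.
mean_then_cube = cube(statistics.mean(scores))
cube_then_mean = statistics.mean([cube(x) for x in scores])

def c_to_f(c):
    """Celsius to Fahrenheit: a linear map with a shifted zero point."""
    return c * 9 / 5 + 32

# Interval scale: ratios of values change under the conversion...
ratio_celsius = 20.0 / 10.0                        # 2.0
ratio_fahrenheit = c_to_f(20.0) / c_to_f(10.0)     # about 1.36

# ...but ratios of differences are preserved.
diff_ratio_celsius = (30.0 - 20.0) / (20.0 - 10.0)
diff_ratio_fahrenheit = (c_to_f(30.0) - c_to_f(20.0)) / (c_to_f(20.0) - c_to_f(10.0))
```

This is why a statement such as "20 °C is twice as hot as 10 °C" is not meaningful on an interval scale, while "the rise from 20 to 30 equals the rise from 10 to 20" is.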

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating-point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.

Other categorizations have been proposed. For example, Mosteller and Tukey (1977)^[1] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990)^[2] described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998)^[3] and van den Berg (1991).^[4]

The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).^[5]

Simple data types

The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of their logically possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.

Data type	Possible values	Example usage	Level of measurement	Common distributions	Scale of relative differences	Permissible statistics	Common model
binary	0, 1 (arbitrary labels)	binary outcome ("yes/no", "true/false", "success/failure", etc.)	nominal scale	Bernoulli	incomparable	mode, chi-squared	logistic, probit
categorical	"name1", "name2", "name3", ... "nameK" (arbitrary labels)	categorical outcome with names or places like "Rome", "Amsterdam", "Madrid", "London", "Washington" (specific blood type, political party, word, etc.)	nominal scale	categorical	incomparable	mode, chi-squared	multinomial logit, multinomial probit
ordinal	ordering categories or integer or real number (arbitrary scale)	Ordering adverbs like "Small", "Medium", "Large", relative score, significant only for creating a ranking	ordinal scale	categorical	relative comparison		ordinal regression (ordered logit, ordered probit)
binomial	0, 1, ..., N	number of successes (e.g. yes votes) out of N possible	interval scale	binomial, beta-binomial	additive	mean, median, mode, standard deviation, correlation	binomial regression (logistic, probit)
count	nonnegative integers (0, 1, ...)	number of items (telephone calls, people, molecules, births, deaths, etc.) in given interval/area/volume	ratio scale	Poisson, negative binomial	multiplicative	All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation	Poisson, negative binomial regression
real-valued additive	real number	temperature in degrees Celsius or degrees Fahrenheit, relative distance, location parameter, etc. (or approximately, anything not varying over a large scale)	interval scale	normal, etc. (usually symmetric about the mean)	additive	mean, median, mode, standard deviation, correlation	standard linear regression
real-valued multiplicative	positive real number	temperature in kelvin, price, income, size, scale parameter, etc. (especially when varying over a large scale)	ratio scale	log-normal, gamma, exponential, etc. (usually a skewed distribution)	multiplicative	All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation	generalized linear model with logarithmic link
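
The "Permissible statistics" column can be illustrated with a short sketch. It assumes made-up temperature and price values and a hypothetical helper `coefficient_of_variation`; only functions from Python's standard `statistics` module are used. The coefficient of variation changes when an interval-scale quantity is expressed in another unit, but stays the same when a ratio-scale quantity is rescaled.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Interval scale: the same temperatures in two units (illustrative values).
celsius = [10.0, 20.0, 30.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

cv_celsius = coefficient_of_variation(celsius)        # 10 / 20
cv_fahrenheit = coefficient_of_variation(fahrenheit)  # 18 / 68

# Ratio scale: the same prices in dollars and in cents (illustrative values).
dollars = [2.0, 4.0, 8.0]
cents = [d * 100 for d in dollars]

cv_dollars = coefficient_of_variation(dollars)
cv_cents = coefficient_of_variation(cents)

# The geometric mean rescales along with the unit, as expected on a ratio scale.
gm_dollars = statistics.geometric_mean(dollars)
gm_cents = statistics.geometric_mean(cents)
```

Because the coefficient of variation of the temperatures depends on the arbitrary zero point, it is not a meaningful summary for interval-scale data, whereas it is unit-free for the prices.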

Multivariate data types

Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:

Random vectors. The individual elements may or may not be correlated. Examples of distributions used to describe correlated random vectors are the multivariate normal distribution and multivariate t-distribution. In general, there may be arbitrary correlations between any elements and any others; however, this often becomes unmanageable above a certain size, requiring further restrictions on the correlated elements.
Random matrices. Random matrices can be laid out linearly and treated as random vectors; however, this may not be an efficient way of representing the correlations between different elements. Some probability distributions are specifically designed for random matrices, e.g. the matrix normal distribution and Wishart distribution.
Random sequences. These are sometimes considered to be the same as random vectors, but in other cases the term is applied specifically to cases where each random variable is only correlated with nearby variables (as in a Markov model). This is a particular case of a Bayes network and is often used for very long sequences, e.g. gene sequences or lengthy text documents. A number of models are specifically designed for such sequences, e.g. hidden Markov models.
Random processes. These are similar to random sequences, but the length of the sequence is indefinite or infinite and the elements in the sequence are processed one by one. This is often used for data that can be described as a time series, e.g. the price of a stock on successive days. Random processes are also used to model values that vary continuously (e.g. the temperature at successive moments in time), rather than at discrete intervals.
Bayes networks. These correspond to aggregates of random variables described using graphical models, where individual random variables are linked in a graph structure with conditional distributions relating variables to nearby variables.
- Multilevel models are subclasses of Bayes networks that can be thought of as having multiple levels of linear regression.
- Random trees. These are a subclass of Bayes networks, where the variables are linked in a tree structure. An example is the problem of parsing a sentence, when statistical parsing techniques are used, such as probabilistic context-free grammars (PCFGs).
Random fields. These represent the extension of random processes to multiple dimensions, and are common in physics, where they are used in statistical mechanics to describe properties such as force or electric field that can vary continuously over three dimensions (or four dimensions, when time is included).

These concepts originate in various scientific fields and frequently overlap in usage. As a result, several of them can often be applied to the same problem.
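
As a concrete case of a random sequence in which each variable depends only on its neighbour, the sketch below samples a first-order Markov chain. The two-state weather model, its transition probabilities, and the function name `sample_chain` are assumptions made for illustration, not part of any particular library.

```python
import random

# Hypothetical two-state model; states and probabilities are illustrative.
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def sample_chain(start, length, rng):
    """Draw a sequence of `length` >= 1 states, each depending only on its predecessor."""
    sequence = [start]
    while len(sequence) < length:
        options = TRANSITIONS[sequence[-1]]
        # Pick the next state using the weights attached to the current state.
        next_state = rng.choices(list(options), weights=list(options.values()))[0]
        sequence.append(next_state)
    return sequence

rng = random.Random(0)  # fixed seed so that repeated runs give the same draw
chain = sample_chain("sunny", 20, rng)
print(chain)
```

Only the table of one-step transition probabilities has to be stored, which is what keeps such models manageable for very long sequences compared with an arbitrary correlation structure over all elements.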

Comparison to programming data types

Most data types in statistics have comparable types in computer programming, and vice versa, as shown in the following table:

Statistics	Programming
real-valued (interval scale)	floating-point
real-valued (ratio scale)	floating-point
count data (usually non-negative)	integer
binary data	Boolean
categorical data	enumerated type
random vector	list or array
random matrix	two-dimensional array
random tree	tree
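
The correspondence in the table might be sketched in Python as follows. The class and field names are hypothetical and chosen only to mirror the rows above; the built-in `list[float]` annotation assumes Python 3.9 or later.

```python
from dataclasses import dataclass
from enum import Enum

class BloodType(Enum):
    """Categorical data as an enumerated type (labels are arbitrary)."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"

@dataclass
class Observation:
    temperature_celsius: float  # real-valued (interval scale) -> floating-point
    income: float               # real-valued (ratio scale)    -> floating-point
    phone_calls: int            # count data                   -> integer
    is_smoker: bool             # binary data                  -> Boolean
    blood_type: BloodType       # categorical data             -> enumerated type
    position: list[float]       # one draw of a random vector  -> list

# Illustrative values only.
obs = Observation(
    temperature_celsius=21.5,
    income=42000.0,
    phone_calls=3,
    is_smoker=False,
    blood_type=BloodType.AB,
    position=[0.2, -1.3, 4.0],
)
```

Note that both real-valued rows map to the same floating-point type, so the programming type alone does not record whether a quantity is on an interval or a ratio scale.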

References

[1] Mosteller, F.; Tukey, J.W. (1977). Data analysis and regression. Addison-Wesley. ISBN 978-0-201-04854-4.
[2] Nelder, J.A. (1990). "The knowledge needed to computerise the analysis and interpretation of statistical information". Expert systems and artificial intelligence: the need for information about data. London: Library Association. OCLC 27042489.
[3] Chrisman, Nicholas R. (1998). "Rethinking Levels of Measurement for Cartography". Cartography and Geographic Information Science. 25 (4): 231–242. Bibcode:1998CGISy..25..231C. doi:10.1559/152304098782383043.
[4] van den Berg, G. (1991). Choosing an analysis method. Leiden: DSWO Press. ISBN 978-90-6695-062-7.
[5] Hand, D.J. (2004). Measurement theory and practice: The world through quantification. Wiley. p. 82. ISBN 978-0-470-68567-9.
