Censoring (statistics)

From Wikipedia, the free encyclopedia
Condition in which the value of a measurement or observation is only partially known

In statistics, censoring is a condition in which the value of a measurement or observation is only partially known.

For example, suppose a study is conducted to measure the impact of a drug on mortality rate. In such a study, it may be known that an individual's age at death is at least 75 years (but may be more). Such a situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the age of 75.

Censoring also occurs when a value falls outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 140 kg, after which it rolls over to 0 and continues to count up from there. If a 160 kg individual is weighed using the scale, the observer would only know that the individual's weight is 20 kg modulo 140 kg (in addition to 160 kg, they could weigh 20 kg, 300 kg, 440 kg, and so on).

The problem of censored data, in which the observed value of some variable is partially known, is related to the problem of missing data, where the observed value of some variable is unknown.

Censoring should not be confused with the related idea of truncation. With censoring, observations result either in knowing the exact value that applies, or in knowing that the value lies within an interval. With truncation, observations never result in values outside a given range: values in the population outside the range are never seen or never recorded if they are seen. Note that in statistics, truncation is not the same as rounding.
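The distinction can be illustrated with a minimal Python sketch. It assumes a hypothetical instrument that saturates at an upper limit (rather than rolling over), and the function names and weights are made up for illustration.

```python
def right_censor(values, limit):
    """Record every value; anything above `limit` is stored as the limit
    together with a flag saying the true value is only known to exceed it."""
    return [(min(v, limit), v > limit) for v in values]


def truncate(values, limit):
    """Keep only values within range; out-of-range values leave no record."""
    return [v for v in values if v <= limit]


# Hypothetical true weights in kg (illustrative data only).
weights_kg = [62.0, 88.5, 151.0, 140.0, 160.0]

censored = right_censor(weights_kg, 140.0)
truncated = truncate(weights_kg, 140.0)

print(censored)   # five records, two of them flagged as censored
print(truncated)  # three records; the two heavy individuals are simply absent
```

Under censoring the sample size is preserved and the analyst knows which records are only bounds; under truncation the analyst may not even know how many values were lost.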

Types

  • Left censoring – a data point is below a certain value but it is unknown by how much.
  • Interval censoring – a data point is somewhere on an interval between two values.
  • Right censoring – a data point is above a certain value but it is unknown by how much.
  • Type I censoring occurs if an experiment has a set number of subjects or items and stops the experiment at a predetermined time, at which point any subjects remaining are right-censored.
  • Type II censoring occurs if an experiment has a set number of subjects or items and stops the experiment when a predetermined number are observed to have failed; the remaining subjects are then right-censored.
  • Random (or non-informative) censoring is when each subject has a censoring time that is statistically independent of their failure time. The observed value is the minimum of the censoring and failure times; subjects whose failure time is greater than their censoring time are right-censored.
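
The last three schemes in the list above can be sketched in Python as follows. This is an illustrative sketch, not a standard API: the function names and the failure times are hypothetical, each record is a pair (observed time, failed flag), and the Type II helper assumes there are no tied failure times.

```python
def type_i_censor(failure_times, stop_time):
    """Stop at a fixed time; items still running are right-censored there."""
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def type_ii_censor(failure_times, n_failures):
    """Stop at the n_failures-th failure; the rest are right-censored there.

    Assumes no ties among the failure times.
    """
    stop_time = sorted(failure_times)[n_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def random_censor(failure_times, censoring_times):
    """Observe the minimum of each subject's failure and censoring time."""
    return [(min(t, c), t <= c) for t, c in zip(failure_times, censoring_times)]


# Hypothetical true failure times in hours (illustrative data only).
failures = [120.0, 45.0, 300.0, 80.0, 210.0]

type_i = type_i_censor(failures, 200.0)
type_ii = type_ii_censor(failures, 3)
random_c = random_censor(failures, [100.0, 60.0, 400.0, 90.0, 150.0])
```

In the Type I case the stopping time is fixed and the number of failures is random; in the Type II case the number of failures is fixed and the stopping time is random.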

Interval censoring can occur when observing a value requires follow-ups or inspections. Left and right censoring are special cases of interval censoring, with the beginning of the interval at zero or the end at infinity, respectively.

Estimation methods for left-censored data vary, and not every method is applicable to, or the most reliable for, every data set.[1]

A common misconception with time interval data is to classify intervals as left-censored when the start time is unknown. In these cases, we have a lower bound on the time interval; thus, the data is right-censored (even though the missing start point is to the left of the known interval when viewed as a timeline).

Analysis


Special techniques may be used to handle censored data. Tests with specific failure times are coded as actual failures; censored data are coded for the type of censoring and the known interval or limit. Special software programs (often reliability oriented) can conduct a maximum likelihood estimation for summary statistics, confidence intervals, etc.
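
One possible way to code such data is to store every observation as a pair of bounds and derive the censoring type from them. The sketch below is only an assumption about how this might be done; the `Observation` class and its field names are hypothetical and do not correspond to any particular software package. It follows the convention that left censoring starts the interval at zero and right censoring ends it at infinity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    lower: float  # the value is known to be at least this
    upper: float  # the value is known to be at most this

    @property
    def kind(self) -> str:
        """Classify the observation from its bounds."""
        if self.lower == self.upper:
            return "exact"
        if math.isinf(self.upper):
            return "right-censored"
        if self.lower == 0.0:
            return "left-censored"
        return "interval-censored"


# Illustrative records (made-up values).
records = [
    Observation(42.0, 42.0),      # failure observed at exactly 42
    Observation(75.0, math.inf),  # still running at 75
    Observation(0.0, 5.0),        # failed some time before the first check at 5
    Observation(12.0, 18.0),      # failed between inspections at 12 and 18
]
kinds = [r.kind for r in records]
```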

Epidemiology


One of the earliest attempts to analyse a statistical problem involving censored data was Daniel Bernoulli's 1766 analysis of smallpox morbidity and mortality data to demonstrate the efficacy of vaccination.[2] An early paper to use the Kaplan–Meier estimator for estimating censored costs was Quesenberry et al. (1989).[3] However, Lin et al.[4] found this approach to be invalid unless all patients accumulated costs with a common deterministic rate function over time; they proposed an alternative estimation technique known as the Lin estimator.[5]

Operating life testing

Example of five replicate tests resulting in four failures and one suspended time resulting in censoring.

Reliability testing often consists of conducting a test on an item (under specified conditions) to determine the time it takes for a failure to occur.

  • Sometimes a failure is planned and expected but does not occur because of operator error, equipment malfunction, a test anomaly, etc. The test result is not the desired time-to-failure but can be (and should be) used as a time-to-termination. The use of censored data is unintentional but necessary.
  • Sometimes engineers plan a test program so that, after a certain time limit or number of failures, all other tests will be terminated. These suspended times are treated as right-censored data. The use of censored data is intentional.

An analysis of the data from replicate tests includes both the times-to-failure for the items that failed and the time-of-test-termination for those that did not fail.
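
As a rough illustration of keeping both kinds of result in one data set, the following Python sketch records five replicate tests, four failures and one suspension. The `LifeTestRecord` class and the hours are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifeTestRecord:
    hours: float
    failed: bool  # True: time-to-failure; False: test suspended (right-censored)


# Made-up results for five replicate tests.
results = [
    LifeTestRecord(410.0, True),
    LifeTestRecord(520.0, True),
    LifeTestRecord(600.0, False),  # terminated without failure
    LifeTestRecord(275.0, True),
    LifeTestRecord(590.0, True),
]

n_failures = sum(r.failed for r in results)
n_suspended = len(results) - n_failures
# Every item contributes its running time, whether or not it failed.
total_time_on_test = sum(r.hours for r in results)
```

Dropping the suspended item would discard 600 hours of failure-free operation, which is information about the item's reliability.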

Censored regression


An early model for censored regression, the tobit model, was proposed by James Tobin in 1958.[6]

Likelihood


The likelihood is the probability or probability density of what was observed, viewed as a function of the parameters of an assumed model. To incorporate censored data points into the likelihood, each censored point is represented by the probability of the censored observation as a function of the model parameters, i.e. by a function of the CDF(s) instead of the density or probability mass.

The most general censoring case is interval censoring: $\Pr(a < x \leqslant b) = F(b) - F(a)$, where $F(x)$ is the CDF of the probability distribution. The two special cases are:

  • left censoring: $\Pr(-\infty < x \leqslant b) = F(b) - F(-\infty) = F(b) - 0 = F(b) = \Pr(x \leqslant b)$
  • right censoring: $\Pr(a < x \leqslant \infty) = F(\infty) - F(a) = 1 - F(a) = 1 - \Pr(x \leqslant a) = \Pr(x > a)$

For continuous probability distributions: $\Pr(a < x \leqslant b) = \Pr(a < x < b)$
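
As a minimal numerical sketch of these CDF-based contributions, the Python code below evaluates $F(b) - F(a)$ for an exponential distribution. The choice of distribution, the rate of 0.1, and the bounds are assumptions made only for illustration.

```python
import math


def exp_cdf(x, rate):
    """CDF of an exponential distribution: F(x) = 1 - exp(-rate * x) for x > 0."""
    if x <= 0:
        return 0.0  # also covers x = -infinity
    if math.isinf(x):
        return 1.0
    return 1.0 - math.exp(-rate * x)


def censored_prob(a, b, rate):
    """Likelihood contribution Pr(a < X <= b) = F(b) - F(a)."""
    return exp_cdf(b, rate) - exp_cdf(a, rate)


rate = 0.1  # assumed model parameter
p_interval = censored_prob(5.0, 10.0, rate)     # known to lie in (5, 10]
p_left = censored_prob(-math.inf, 10.0, rate)   # known only to be <= 10
p_right = censored_prob(5.0, math.inf, rate)    # known only to be > 5
```

Left and right censoring need no separate code path here: they are the same interval formula with an infinite endpoint.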

Example


Suppose we are interested in survival times, $T_1, T_2, \dots, T_n$, but we don't observe $T_i$ for all $i$. Instead, we observe

$(U_i, \delta_i)$, with $U_i = T_i$ and $\delta_i = 1$ if $T_i$ is actually observed, and
$(U_i, \delta_i)$, with $U_i < T_i$ and $\delta_i = 0$ if all we know is that $T_i$ is longer than $U_i$.

When $T_i > U_i$, $U_i$ is called the censoring time.[7]

If the censoring times are all known constants, then the likelihood is

$$L = \prod_{i:\,\delta_i = 1} f(u_i) \prod_{i:\,\delta_i = 0} S(u_i)$$

where $f(u_i)$ is the probability density function evaluated at $u_i$,

and $S(u_i)$ is the probability that $T_i$ is greater than $u_i$, called the survival function.

This can be simplified by defining the hazard function, the instantaneous force of mortality, as

$$\lambda(u) = f(u)/S(u)$$

so

$$f(u) = \lambda(u) S(u).$$

Then

$$L = \prod_{i} \lambda(u_i)^{\delta_i} S(u_i).$$

For the exponential distribution, this becomes even simpler, because the hazard rate, $\lambda$, is constant, and $S(u) = \exp(-\lambda u)$. Then:

$$L(\lambda) = \lambda^{k} \exp\left(-\lambda \sum u_i\right),$$

where $k = \sum \delta_i$.

From this we easily compute $\hat{\lambda}$, the maximum likelihood estimate (MLE) of $\lambda$, as follows:

$$l(\lambda) = \log(L(\lambda)) = k \log(\lambda) - \lambda \sum u_i.$$

Then

$$dl/d\lambda = k/\lambda - \sum u_i.$$

We set this to 0 and solve for $\lambda$ to get:

$$\hat{\lambda} = k / \sum u_i.$$

Equivalently, the mean time to failure is:

$$1/\hat{\lambda} = \sum u_i / k.$$

This differs from the standard MLE for the exponential distribution in that the censored observations contribute only to the numerator of the mean time to failure, $\sum u_i$, and not to the failure count $k$ in the denominator.
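
The estimator can be checked numerically with a short Python sketch. The observation times and censoring indicators below are made up for illustration, and the function names are hypothetical.

```python
import math


def exponential_mle(times, observed):
    """MLE of the exponential rate: k / sum(u_i), with k the number of failures."""
    k = sum(observed)
    if k == 0:
        raise ValueError("at least one uncensored observation is required")
    return k / sum(times)


def log_likelihood(rate, times, observed):
    """l(rate) = k * log(rate) - rate * sum(u_i)."""
    return sum(observed) * math.log(rate) - rate * sum(times)


# Illustrative data: u_i and delta_i (1 = failure observed, 0 = right-censored).
u = [2.0, 3.0, 5.0, 5.0, 1.0]
delta = [1, 1, 0, 0, 1]

rate_hat = exponential_mle(u, delta)   # 3 failures / 16 time units
mean_time_to_failure = 1.0 / rate_hat  # 16 / 3

# Treating the censored times as failures would use k = 5 and overstate the rate.
naive_rate = len(u) / sum(u)
```

Here the log-likelihood at `rate_hat` is higher than at nearby rates such as 0.15 or 0.25, consistent with the derivative condition above, while the naive estimate that ignores the censoring indicators is noticeably larger.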


References

  1. Helsel, D. (2010). "Much Ado About Next to Nothing: Incorporating Nondetects in Science". Annals of Occupational Hygiene. 54 (3): 257–262. doi:10.1093/annhyg/mep092. PMID 20032004.
  2. Bernoulli, D. (1766). "Essai d'une nouvelle analyse de la mortalité causée par la petite vérole". Mem. Math. Phy. Acad. Roy. Sci. Paris, reprinted in Bradley (1971) 21 and Blower (2004).
  3. Quesenberry, C. P. Jr.; et al. (1989). "A survival analysis of hospitalization among patients with acquired immunodeficiency syndrome". American Journal of Public Health. 79 (12): 1643–1647. doi:10.2105/AJPH.79.12.1643. PMC 1349769. PMID 2817192.
  4. Lin, D. Y.; et al. (1997). "Estimating medical costs from incomplete follow-up data". Biometrics. 53 (2): 419–434. doi:10.2307/2533947. JSTOR 2533947. PMID 9192444.
  5. Wijeysundera, H. C.; et al. (2012). "Techniques for estimating health care costs with censored data: an overview for the health services researcher". ClinicoEconomics and Outcomes Research. 4: 145–155. doi:10.2147/CEOR.S31552. PMC 3377439. PMID 22719214.
  6. Tobin, James (1958). "Estimation of relationships for limited dependent variables" (PDF). Econometrica. 26 (1): 24–36. doi:10.2307/1907382. JSTOR 1907382.
  7. Lu Tian, Likelihood Construction, Inference for Parametric Survival Distributions (PDF), Wikidata Q98961801.


External links

  • "Engineering Statistics Handbook", NIST/SEMATECH
Retrieved from "https://en.wikipedia.org/w/index.php?title=Censoring_(statistics)&oldid=1291846083"