Survival analysis

Branch of statistics

Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as death in biological organisms and failure in mechanical systems.[1] This topic is called reliability theory, reliability analysis or reliability engineering in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: What proportion of a population will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?

To answer such questions, it is necessary to define "lifetime". In the case of biological survival, death is unambiguous, but for mechanical reliability, failure may not be well-defined, for there may well be mechanical systems in which failure is partial, a matter of degree, or otherwise not localized in time. Even in biological problems, some events (for example, heart attack or other organ failure) may have the same ambiguity. The theory outlined below assumes well-defined events at specific times; other cases may be better treated by models which explicitly account for ambiguous events.

More generally, survival analysis involves the modelling of time-to-event data; in this context, death or failure is considered an "event" in the survival analysis literature. Traditionally, only a single event occurs for each subject, after which the organism or mechanism is dead or broken. Recurring event or repeated event models relax that assumption. The study of recurring events is relevant in systems reliability, and in many areas of social sciences and medical research.

Introduction to survival analysis

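Survival analysis is used in several ways:

  • To describe the survival times of members of a group: life tables, Kaplan–Meier curves, the survival function, and the hazard function
  • To compare the survival times of two or more groups: the log-rank test
  • To describe the effect of categorical or quantitative variables on survival: Cox proportional hazards regression, survival trees, and survival random forests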

Definitions of common terms in survival analysis

The following terms are commonly used in survival analyses:

  • Event: Death, disease occurrence, disease recurrence, recovery, or other experience of interest
  • Time: The time from the beginning of an observation period (such as surgery or beginning treatment) to (i) an event, or (ii) end of the study, or (iii) loss of contact or withdrawal from the study.
  • Censoring / Censored observation: Censoring occurs when we have some information about individual survival time, but we do not know the survival time exactly. The subject is censored in the sense that nothing is observed or known about that subject after the time of censoring. A censored subject may or may not have an event after the end of observation time.
  • Survival function S(t): The probability that a subject survives longer than time t.

Example: Acute myelogenous leukemia survival data

This example uses the Acute Myelogenous Leukemia survival data set "aml" from the "survival" package in R. The data set is from Miller (1997)[2] and the question is whether the standard course of chemotherapy should be extended ('maintained') for additional cycles.

The aml data set sorted by survival time is shown in the box.

Aml data set sorted by survival time

observation  time (weeks)  status  x
12           5             1       Nonmaintained
13           5             1       Nonmaintained
14           8             1       Nonmaintained
15           8             1       Nonmaintained
1            9             1       Maintained
16           12            1       Nonmaintained
2            13            1       Maintained
3            13            0       Maintained
17           16            0       Nonmaintained
4            18            1       Maintained
5            23            1       Maintained
18           23            1       Nonmaintained
19           27            1       Nonmaintained
6            28            0       Maintained
20           30            1       Nonmaintained
7            31            1       Maintained
21           33            1       Nonmaintained
8            34            1       Maintained
22           43            1       Nonmaintained
9            45            0       Maintained
23           45            1       Nonmaintained
10           48            1       Maintained
11           161           0       Maintained

  • Time is indicated by the variable "time", which is the survival or censoring time
  • Event (recurrence of aml cancer) is indicated by the variable "status". 0 = no event (censored), 1 = event (recurrence)
  • Treatment group: the variable "x" indicates if maintenance chemotherapy was given

The last observation (11), at 161 weeks, is censored. Censoring indicates that the patient did not have an event (no recurrence of aml cancer). Another subject, observation 3, was censored at 13 weeks (indicated by status=0). This subject was in the study for only 13 weeks, and the aml cancer did not recur during those 13 weeks. It is possible that this patient was enrolled near the end of the study, so that they could be observed for only 13 weeks. It is also possible that the patient was enrolled early in the study, but was lost to follow up or withdrew from the study. The table shows that other subjects were censored at 16, 28, and 45 weeks (observations 17, 6, and 9 with status=0). The remaining subjects all experienced events (recurrence of aml cancer) while in the study. The question of interest is whether recurrence occurs later in maintained patients than in non-maintained patients.

Kaplan–Meier plot for the aml data

The survival function S(t) is the probability that a subject survives longer than time t. S(t) is theoretically a smooth curve, but it is usually estimated using the Kaplan–Meier (KM) curve. The graph shows the KM plot for the aml data and can be interpreted as follows:

  • The x axis is time, from zero (when observation began) to the last observed time point.
  • The y axis is the proportion of subjects surviving. At time zero, 100% of the subjects are alive without an event.
  • The solid line (similar to a staircase) shows the progression of event occurrences.
  • A vertical drop indicates an event. In the aml table shown above, two subjects had events at five weeks, two had events at eight weeks, one had an event at nine weeks, and so on. These events at five weeks, eight weeks and so on are indicated by the vertical drops in the KM plot at those time points.
  • At the far right end of the KM plot there is a tick mark at 161 weeks. The vertical tick mark indicates that a patient was censored at this time. In the aml data table five subjects were censored, at 13, 16, 28, 45 and 161 weeks. There are five tick marks in the KM plot, corresponding to these censored observations.

Life table for the aml data

A life table summarizes survival data in terms of the number of events and the proportion surviving at each event time point. The life table for the aml data, created using the R software, is shown.

Life Table for the aml Data

time  n.risk  n.event  survival  std.err  lower 95% CI  upper 95% CI
5     23      2        0.913     0.0588   0.8049        1
8     21      2        0.8261    0.079    0.6848        0.996
9     19      1        0.7826    0.086    0.631         0.971
12    18      1        0.7391    0.0916   0.5798        0.942
13    17      1        0.6957    0.0959   0.5309        0.912
18    14      1        0.646     0.1011   0.4753        0.878
23    13      2        0.5466    0.1073   0.3721        0.803
27    11      1        0.4969    0.1084   0.324         0.762
30    9       1        0.4417    0.1095   0.2717        0.718
31    8       1        0.3865    0.1089   0.2225        0.671
33    7       1        0.3313    0.1064   0.1765        0.622
34    6       1        0.2761    0.102    0.1338        0.569
43    5       1        0.2208    0.0954   0.0947        0.515
45    4       1        0.1656    0.086    0.0598        0.458
48    2       1        0.0828    0.0727   0.0148        0.462

The columns in the life table have the following interpretation:

  • time gives the time points at which events occur.
  • n.risk is the number of subjects at risk immediately before the time point, t. Being "at risk" means that the subject has not had an event before time t, and has not been censored before time t. A subject censored exactly at time t is still counted as at risk at t; for example, n.risk = 17 at 13 weeks includes observation 3, which was censored at 13 weeks.
  • n.event is the number of subjects who have events at time t.
  • survival is the proportion surviving, as determined using the Kaplan–Meier product-limit estimate.
  • std.err is the standard error of the estimated survival. The standard error of the Kaplan–Meier product-limit estimate is calculated using Greenwood's formula, and depends on the number at risk (n.risk in the table), the number of deaths (n.event in the table) and the proportion surviving (survival in the table).
  • lower 95% CI and upper 95% CI are the lower and upper 95% confidence bounds for the proportion surviving.
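
The first five columns of the life table can be reproduced directly from the aml data. The following is a minimal Python sketch rather than the R code that produced the table: the function name kaplan_meier is hypothetical, the data are typed in from the aml table, and the confidence bounds are left out because they depend on the interval method chosen by the software.

```python
import math

# aml data from the table above: (time in weeks, status), status 1 = event, 0 = censored
AML = [
    (5, 1), (5, 1), (8, 1), (8, 1), (9, 1), (12, 1), (13, 1), (13, 0),
    (16, 0), (18, 1), (23, 1), (23, 1), (27, 1), (28, 0), (30, 1), (31, 1),
    (33, 1), (34, 1), (43, 1), (45, 0), (45, 1), (48, 1), (161, 0),
]


def kaplan_meier(data):
    """Return life-table rows (time, n_risk, n_event, survival, std_err)."""
    rows = []
    survival = 1.0
    greenwood_sum = 0.0  # running sum of d / (n * (n - d)) over event times
    event_times = sorted({t for t, status in data if status == 1})
    for t in event_times:
        # At risk at t: no event and no censoring strictly before t.
        n_risk = sum(1 for u, _ in data if u >= t)
        n_event = sum(1 for u, status in data if u == t and status == 1)
        survival *= 1.0 - n_event / n_risk
        if n_risk > n_event:
            greenwood_sum += n_event / (n_risk * (n_risk - n_event))
            std_err = survival * math.sqrt(greenwood_sum)
        else:
            # Everyone still at risk had the event; Greenwood's sum is undefined here.
            std_err = float("nan")
        rows.append((t, n_risk, n_event, survival, std_err))
    return rows


life_table = kaplan_meier(AML)
for t, n_risk, n_event, surv, se in life_table:
    print(f"{t:4d} {n_risk:6d} {n_event:7d} {surv:8.4f} {se:8.4f}")
```

Each survival value is the previous one multiplied by (1 - n.event/n.risk), which is why the curve only drops at event times and stays flat across censoring times.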

Log-rank test: Testing for differences in survival in the aml data

The log-rank test compares the survival times of two or more groups. This example uses a log-rank test for a difference in survival in the maintained versus non-maintained treatment groups in the aml data. The graph shows KM plots for the aml data broken out by treatment group, which is indicated by the variable "x" in the data.

Kaplan–Meier graph by treatment group in aml

The null hypothesis for a log-rank test is that the groups have the same survival. The expected number of subjects surviving at each time point in each group is adjusted for the number of subjects at risk in the groups at each event time. The log-rank test determines if the observed number of events in each group is significantly different from the expected number. The formal test is based on a chi-squared statistic. When the log-rank statistic is large, it is evidence for a difference in the survival times between the groups. For a comparison of two groups, the log-rank statistic approximately has a chi-squared distribution with one degree of freedom, and the p-value is calculated using the chi-squared test.

For the example data, the log-rank test for difference in survival gives a p-value of p=0.0653, indicating that the treatment groups do not differ significantly in survival, assuming an alpha level of 0.05. The sample size of 23 subjects is modest, so there is little power to detect differences between the treatment groups. The chi-squared test is based on asymptotic approximation, so the p-value should be regarded with caution for small sample sizes.
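
The comparison of observed and expected events can be made concrete with a short calculation. The sketch below is a plain Python version of the standard two-group log-rank statistic with the hypergeometric variance; it is an illustration under that assumption, not the R routine used for the article, and the function name log_rank is hypothetical. The data are the aml observations split by the variable "x".

```python
import math

# aml data by treatment group: (time in weeks, status), status 1 = event, 0 = censored
MAINTAINED = [(9, 1), (13, 1), (13, 0), (18, 1), (23, 1), (28, 0),
              (31, 1), (34, 1), (45, 0), (48, 1), (161, 0)]
NONMAINTAINED = [(5, 1), (5, 1), (8, 1), (8, 1), (12, 1), (16, 0),
                 (23, 1), (27, 1), (30, 1), (33, 1), (43, 1), (45, 1)]


def log_rank(group_a, group_b):
    """Two-group log-rank test; returns (observed_a, expected_a, chi2, p_value)."""
    pooled = group_a + group_b
    event_times = sorted({t for t, status in pooled if status == 1})
    observed_a = 0
    expected_a = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _ in pooled if u >= t)      # at risk, both groups
        n_a = sum(1 for u, _ in group_a if u >= t)   # at risk, group a
        d = sum(1 for u, s in pooled if u == t and s == 1)
        d_a = sum(1 for u, s in group_a if u == t and s == 1)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of d_a under the null hypothesis.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return observed_a, expected_a, chi2, p_value


observed, expected, chi2, p = log_rank(MAINTAINED, NONMAINTAINED)
print(f"observed={observed} expected={expected:.2f} chi2={chi2:.2f} p={p:.4f}")
```

The maintained group has fewer events than expected under the null hypothesis (7 observed against roughly 10.7 expected), which is the direction of the difference seen in the KM plot.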

Cox proportional hazards (PH) regression analysis

Kaplan–Meier curves and log-rank tests are most useful when the predictor variable is categorical (e.g., drug vs. placebo), or takes a small number of values (e.g., drug doses 0, 20, 50, and 100 mg/day) that can be treated as categorical. The log-rank test and KM curves don't work easily with quantitative predictors such as gene expression, white blood count, or age. For quantitative predictor variables, an alternative method is Cox proportional hazards regression analysis. Cox PH models also work with categorical predictor variables, which are encoded as {0,1} indicator or dummy variables. The log-rank test is a special case of a Cox PH analysis, and can be performed using Cox PH software.

Example: Cox proportional hazards regression analysis for melanoma

This example uses the melanoma data set from Dalgaard Chapter 14.[3]

Data are in the R package ISwR. The Cox proportional hazards regression using R gives the results shown in the box.

Cox proportional hazards regression output for melanoma data. Predictor variable is sex 1: female, 2: male.

The Cox regression results are interpreted as follows.

  • Sex is encoded as a numeric vector (1: female, 2: male). The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, male versus female.
  • coef = 0.662 is the estimated logarithm of the hazard ratio for males versus females.
  • exp(coef) = 1.94 = exp(0.662) - The log of the hazard ratio (coef= 0.662) is transformed to the hazard ratio using exp(coef). The summary for the Cox model gives the hazard ratio for the second group relative to the first group, that is, male versus female. The estimated hazard ratio of 1.94 indicates that males have higher risk of death (lower survival rates) than females, in these data.
  • se(coef) = 0.265 is the standard error of the log hazard ratio.
  • z = 2.5 = coef/se(coef) = 0.662/0.265. Dividing the coef by its standard error gives the z score.
  • p=0.013. The p-value corresponding to z=2.5 for sex is p=0.013, indicating that there is a significant difference in survival as a function of sex.

The summary output also gives upper and lower 95% confidence intervals for the hazard ratio: lower 95% bound = 1.15; upper 95% bound = 3.26.
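
The hazard ratio, z score, p-value and confidence bounds quoted above all follow from the two numbers coef and se(coef). The short Python sketch below redoes that arithmetic; it assumes a Wald interval with the normal critical value 1.96, and because it starts from the rounded values 0.662 and 0.265, the last digit of the p-value can differ slightly from the software output.

```python
import math

coef = 0.662     # estimated log hazard ratio, male versus female (from the output)
se_coef = 0.265  # standard error of the log hazard ratio (from the output)

hazard_ratio = math.exp(coef)
z = coef / se_coef
# Two-sided p-value from the standard normal distribution: P(|Z| > |z|).
p_value = math.erfc(abs(z) / math.sqrt(2))

# Wald 95% confidence interval, built on the log scale and then exponentiated.
z_crit = 1.96
ci_lower = math.exp(coef - z_crit * se_coef)
ci_upper = math.exp(coef + z_crit * se_coef)

print(f"HR={hazard_ratio:.2f} z={z:.2f} p={p_value:.4f} "
      f"CI=({ci_lower:.2f}, {ci_upper:.2f})")
```

The interval is symmetric around coef on the log scale, which is why it is not symmetric around the hazard ratio of 1.94.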

Finally, the output gives p-values for three alternative tests for overall significance of the model:

  • Likelihood ratio test = 6.15 on 1 df, p=0.0131
  • Wald test = 6.24 on 1 df, p=0.0125
  • Score (log-rank) test = 6.47 on 1 df, p=0.0110

These three tests are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The last row, "Score (logrank) test", is the result of the log-rank test, with p=0.011; the two coincide because the log-rank test is a special case of a Cox PH regression. The likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.

Cox model using a covariate in the melanoma data

The Cox model extends the log-rank test by allowing the inclusion of additional covariates.[4] This example uses the melanoma data set, where the predictor variables include a continuous covariate, the thickness of the tumor (variable name = "thick").

Histograms of melanoma tumor thickness

In the histograms, the thickness values are positively skewed and do not have a Gaussian-like, symmetric probability distribution. Regression models, including the Cox model, generally give more reliable results with normally-distributed variables.[citation needed] For this example we may use a logarithmic transform. The log of the thickness of the tumor looks to be more normally distributed, so the Cox models will use log thickness. The Cox PH analysis gives the results in the box.

Cox PH output for melanoma data set with covariate log tumor thickness

The p-values for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. The p-value for log(thick) is 6.9e-07, with a hazard ratio HR = exp(coef) = 2.18, indicating a strong relationship between the thickness of the tumor and increased risk of death.

By contrast, the p-value for sex is now p=0.088. The hazard ratio HR = exp(coef) = 1.58, with a 95% confidence interval of 0.934 to 2.68. Because the confidence interval for HR includes 1, these results indicate that sex makes a smaller contribution to the difference in the HR after controlling for the thickness of the tumor, and only trends toward significance. Examination of graphs of log(thickness) by sex and a t-test of log(thickness) by sex both indicate that there is a significant difference between men and women in the thickness of the tumor when they first see the clinician.

The Cox model assumes that the hazards are proportional. The proportional hazard assumption may be tested using the R function cox.zph(). A p-value which is less than 0.05 indicates that the hazards are not proportional. For the melanoma data we obtain p=0.222. Hence, we cannot reject the null hypothesis of the hazards being proportional. Additional tests and graphs for examining a Cox model are described in the textbooks cited.

Extensions to Cox models

Cox models can be extended to deal with variations on the simple analysis.

  • Stratification. The subjects can be divided into strata, where subjects within a stratum are expected to be relatively more similar to each other than to randomly chosen subjects from other strata. The regression parameters are assumed to be the same across the strata, but a different baseline hazard may exist for each stratum. Stratification is useful for analyses using matched subjects, for dealing with patient subsets, such as different clinics, and for dealing with violations of the proportional hazard assumption.
  • Time-varying covariates. Some variables, such as gender and treatment group, generally stay the same in a clinical trial. Other clinical variables, such as serum protein levels or dose of concomitant medications may change over the course of a study. Cox models may be extended for such time-varying covariates.

Tree-structured survival models

The Cox PH regression model is a linear model. It is similar to linear regression and logistic regression. Specifically, these methods assume that a single line, curve, plane, or surface is sufficient to separate groups (alive, dead) or to estimate a quantitative response (survival time).

In some cases alternative partitions give more accurate classification or quantitative estimates. One set of alternative methods is tree-structured survival models,[5][6][7] including survival random forests.[8] Tree-structured survival models may give more accurate predictions than Cox models. Examining both types of models for a given data set is a reasonable strategy.

Example survival tree analysis

This example of a survival tree analysis uses the R package "rpart".[9] The example is based on 146 stage C prostate cancer patients in the data set stagec in rpart. Rpart and the stagec example are described in Atkinson and Therneau (1997),[10] which is also distributed as a vignette of the rpart package.[9]

The variables in stagec are:

  • pgtime: time to progression, or last follow-up free of progression
  • pgstat: status at last follow-up (1=progressed, 0=censored)
  • age: age at diagnosis
  • eet: early endocrine therapy (1=no, 0=yes)
  • ploidy: diploid/tetraploid/aneuploid DNA pattern
  • g2: % of cells in G2 phase
  • grade: tumor grade (1-4)
  • gleason: Gleason grade (3-10)

The survival tree produced by the analysis is shown in the figure.

Survival tree for prostate cancer data set

Each branch in the tree indicates a split on the value of a variable. For example, the root of the tree splits subjects with grade < 2.5 versus subjects with grade 2.5 or greater. The terminal nodes indicate the number of subjects in the node, the number of subjects who have events, and the relative event rate compared to the root. In the node on the far left, the values 1/33 indicate that one of the 33 subjects in the node had an event, and that the relative event rate is 0.122. In the node on the far right bottom, the values 11/15 indicate that 11 of 15 subjects in the node had an event, and the relative event rate is 2.7.

Survival random forests

An alternative to building a single survival tree is to build many survival trees, where each tree is constructed using a sample of the data, and average the trees to predict survival.[8] This is the method underlying the survival random forest models. Survival random forest analysis is available in the R package "randomForestSRC".[11]

The randomForestSRC package includes an example survival random forest analysis using the data set pbc. These data are from the Mayo Clinic Primary Biliary Cirrhosis (PBC) trial of the liver conducted between 1974 and 1984. In the example, the random forest survival model gives more accurate predictions of survival than the Cox PH model. The prediction errors are estimated by bootstrap re-sampling.

Deep Learning survival models

Recent advancements in deep representation learning have been extended to survival estimation. The DeepSurv[12] model proposes to replace the log-linear parameterization of the CoxPH model with a multi-layer perceptron. Further extensions like Deep Survival Machines[13] and Deep Cox Mixtures[14] involve the use of latent variable mixture models to model the time-to-event distribution as a mixture of parametric or semi-parametric distributions while jointly learning representations of the input covariates. Deep learning approaches have shown superior performance especially on complex input data modalities such as images and clinical time-series.

General formulation

Survival function

The object of primary interest is the survival function, conventionally denoted S, which is defined as

$$S(t) = \Pr(T > t)$$

where t is some time, T is a random variable denoting the time of death, and "Pr" stands for probability. That is, the survival function is the probability that the time of death is later than some specified time t. The survival function is also called the survivor function or survivorship function in problems of biological survival, and the reliability function in mechanical survival problems. In the latter case, the reliability function is denoted R(t).

Usually one assumes S(0) = 1, although it could be less than 1 if there is the possibility of immediate death or failure.

The survival function must be non-increasing: S(u) ≤ S(t) if u ≥ t. This property follows directly because T > u implies T > t. This reflects the notion that survival to a later age is possible only if all younger ages are attained. Given this property, the lifetime distribution function and event density (F and f below) are well-defined.

The survival function is usually assumed to approach zero as age increases without bound (i.e., S(t) → 0 as t → ∞), although the limit could be greater than zero if eternal life is possible. For instance, we could apply survival analysis to a mixture of stable and unstable carbon isotopes; unstable isotopes would decay sooner or later, but the stable isotopes would last indefinitely.

Lifetime distribution function and event density

Related quantities are defined in terms of the survival function.

The lifetime distribution function, conventionally denoted F, is defined as the complement of the survival function,

$$F(t) = \Pr(T \leq t) = 1 - S(t).$$

If F is differentiable then the derivative, which is the density function of the lifetime distribution, is conventionally denoted f,

$$f(t) = F'(t) = \frac{d}{dt} F(t).$$

The function f is sometimes called the event density; it is the rate of death or failure events per unit time.

The survival function can be expressed in terms of probability distribution and probability density functions

$$S(t) = \Pr(T > t) = \int_{t}^{\infty} f(u)\,du = 1 - F(t).$$

Similarly, a survival event density function can be defined as

$$s(t) = S'(t) = \frac{d}{dt} S(t) = \frac{d}{dt} \int_{t}^{\infty} f(u)\,du = \frac{d}{dt} [1 - F(t)] = -f(t).$$

In other fields, such as statistical physics, the survival event density function is known as the first passage time density.

Hazard function and cumulative hazard function

The hazard function $h$ is defined as the event rate at time $t$, conditional on survival at time $t$.

Synonyms for hazard function in different fields include hazard rate, force of mortality (demography and actuarial science, denoted by $\mu$), force of failure, or failure rate (engineering, denoted $\lambda$). For example, in actuarial science, $\mu(x)$ denotes rate of death for people aged $x$, whereas in reliability engineering $\lambda(t)$ denotes rate of failure of components after operation for time $t$.

Suppose that an item has survived for a time $t$ and we desire the probability that it will not survive for an additional time $dt$:

$$h(t) = \lim_{dt \rightarrow 0} \frac{\Pr(t \leq T < t + dt)}{dt \cdot S(t)} = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)}.$$

Any function $h$ is a hazard function if and only if it satisfies the following properties:

  1. $\forall x \geq 0 \left( h(x) \geq 0 \right)$,
  2. $\int_{0}^{\infty} h(x)\,dx = \infty$.

In fact, the hazard rate is usually more informative about the underlying mechanism of failure than the other representations of a lifetime distribution.

The hazard function must be non-negative, $\lambda(t) \geq 0$, and its integral over $[0, \infty]$ must be infinite, but is not otherwise constrained; it may be increasing or decreasing, non-monotonic, or discontinuous. An example is the bathtub curve hazard function, which is large for small values of $t$, decreasing to some minimum, and thereafter increasing again; this can model the property of some mechanical systems to either fail soon after operation, or much later, as the system ages.

The hazard function can alternatively be represented in terms of the cumulative hazard function, conventionally denoted $\Lambda$ or $H$:

$$\Lambda(t) = -\log S(t)$$

so transposing signs and exponentiating

$$S(t) = \exp(-\Lambda(t))$$

or differentiating (with the chain rule)

$$\frac{d}{dt} \Lambda(t) = -\frac{S'(t)}{S(t)} = \lambda(t).$$

The name "cumulative hazard function" is derived from the fact that

$$\Lambda(t) = \int_{0}^{t} \lambda(u)\,du$$

which is the "accumulation" of the hazard over time.

From the definition of $\Lambda(t)$, we see that it increases without bound as $t$ tends to infinity (assuming that $S(t)$ tends to zero). This implies that $\lambda(t)$ must not decrease too quickly, since, by definition, the cumulative hazard has to diverge. For example, $\exp(-t)$ is not the hazard function of any survival distribution, because its integral converges to 1.
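
The last example can be written out in full. Taking the candidate hazard $\lambda(t) = e^{-t}$ and applying the relations between hazard, cumulative hazard and survival function gives

$$\Lambda(t) = \int_{0}^{t} e^{-u}\,du = 1 - e^{-t}, \qquad S(t) = \exp(-\Lambda(t)) = \exp\left(-(1 - e^{-t})\right) \rightarrow e^{-1} \approx 0.368 \quad \text{as } t \rightarrow \infty.$$

The implied survival function levels off at about 0.368 instead of tending to zero, so roughly 37% of the population would never experience the event. This is why the candidate fails the divergence condition.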

The survival function $S(t)$, the cumulative hazard function $\Lambda(t)$, the density $f(t)$, the hazard function $\lambda(t)$, and the lifetime distribution function $F(t)$ are related through

$$S(t) = \exp[-\Lambda(t)] = \frac{f(t)}{\lambda(t)} = 1 - F(t), \quad t > 0.$$
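
These identities can be checked numerically for any concrete lifetime distribution. The Python sketch below uses a Weibull distribution purely as an illustration; the shape and scale values are arbitrary assumptions, and the cumulative hazard is obtained by integrating the hazard with the trapezoidal rule rather than from a closed form.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # illustrative Weibull parameters (an assumption, not from the text)


def survival(t):
    return math.exp(-((t / SCALE) ** SHAPE))


def hazard(t):
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def density(t):
    return hazard(t) * survival(t)


def cumulative_hazard(t, steps=10_000):
    """Integrate the hazard from 0 to t with the trapezoidal rule."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * width) for i in range(1, steps))
    return total * width


t = 7.0
s_direct = survival(t)                            # S(t) from its definition
s_from_cumhaz = math.exp(-cumulative_hazard(t))   # exp[-Lambda(t)]
s_from_ratio = density(t) / hazard(t)             # f(t) / lambda(t)
print(s_direct, s_from_cumhaz, s_from_ratio)
```

The three printed values agree up to the numerical integration error, which illustrates that any one of the five functions determines the others.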

Quantities derived from the survival distribution

Future lifetime at a given time $t_0$ is the time remaining until death, given survival to age $t_0$. Thus, it is $T - t_0$ in the present notation. The expected future lifetime is the expected value of future lifetime. The probability of death at or before age $t_0 + t$, given survival until age $t_0$, is just

$$P(T \leq t_0 + t \mid T > t_0) = \frac{P(t_0 < T \leq t_0 + t)}{P(T > t_0)} = \frac{F(t_0 + t) - F(t_0)}{S(t_0)}.$$

Therefore, the probability density of future lifetime is

$$\frac{d}{dt} \frac{F(t_0 + t) - F(t_0)}{S(t_0)} = \frac{f(t_0 + t)}{S(t_0)}$$

and the expected future lifetime is

$$\frac{1}{S(t_0)} \int_{0}^{\infty} t\,f(t_0 + t)\,dt = \frac{1}{S(t_0)} \int_{t_0}^{\infty} S(t)\,dt,$$

where the second expression is obtained using integration by parts.

For $t_0 = 0$, that is, at birth, this reduces to the expected lifetime.
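
As a worked illustration, assume a constant hazard $\lambda$, so that $S(t) = e^{-\lambda t}$ (the exponential distribution, chosen here only because the integral is elementary). The expected future lifetime at age $t_0$ is then

$$\frac{1}{S(t_0)} \int_{t_0}^{\infty} S(t)\,dt = \frac{1}{e^{-\lambda t_0}} \cdot \frac{e^{-\lambda t_0}}{\lambda} = \frac{1}{\lambda}.$$

The result does not depend on $t_0$: under a constant hazard, an item that has already survived to age $t_0$ has the same expected remaining lifetime as a new one.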

In reliability problems, the expected lifetime is called the mean time to failure, and the expected future lifetime is called the mean residual lifetime.

As the probability of an individual surviving until age t or later is S(t), by definition, the expected number of survivors at age t out of an initial population of n newborns is n × S(t), assuming the same survival function for all individuals. Thus the expected proportion of survivors is S(t). If the survival of different individuals is independent, the number of survivors at age t has a binomial distribution with parameters n and S(t), and the variance of the proportion of survivors is S(t) × (1 − S(t))/n.

The age at which a specified proportion of survivors remain can be found by solving the equation S(t) = q for t, where q is the quantile in question. Typically one is interested in the median lifetime, for which q = 1/2, or other quantiles such as q = 0.90 or q = 0.99.
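
When S(t) = q has no convenient closed-form solution, it can be solved numerically. The following Python sketch uses bisection, which only needs S to be non-increasing; the function name survival_quantile and the exponential example with rate 0.1 are hypothetical choices for illustration. The bracket search assumes S(t) eventually falls to q, so it would not terminate for a survival function whose limit stays above q.

```python
import math


def survival_quantile(survival, q, upper=1.0, tol=1e-9):
    """Solve survival(t) = q for t by bisection; survival must be non-increasing."""
    # Grow the bracket until survival(upper) has dropped to q or below.
    while survival(upper) > q:
        upper *= 2.0
    lower = 0.0
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if survival(mid) > q:
            lower = mid   # still above q, so the solution lies later
        else:
            upper = mid   # at or below q, so the solution lies earlier
    return 0.5 * (lower + upper)


RATE = 0.1  # hypothetical constant hazard, chosen only for illustration


def exponential_survival(t):
    return math.exp(-RATE * t)


median = survival_quantile(exponential_survival, 0.5)
print(median, math.log(2) / RATE)  # numerical and closed-form median lifetime
```

For the exponential case the closed-form median is log(2)/rate, which gives a direct check on the numerical answer.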

Censoring

Censoring is a form of missing data problem in which time to event is not observed for reasons such as termination of study before all recruited subjects have shown the event of interest or the subject has left the study prior to experiencing an event. Censoring is common in survival analysis.

If only the lower limit l for the true event time T is known such that T > l, this is called right censoring. Right censoring will occur, for example, for those subjects whose birth date is known but who are still alive when they are lost to follow-up or when the study ends. We generally encounter right-censored data.

If the event of interest has already happened before the subject is included in the study but it is not known when it occurred, the data is said to be left-censored.[15] When it can only be said that the event happened between two observations or examinations, this is interval censoring.

Left censoring occurs for example when a permanent tooth has already emerged prior to the start of a dental study that aims to estimate its emergence distribution. In the same study, an emergence time is interval-censored when the permanent tooth is present in the mouth at the current examination but not yet at the previous examination. Interval censoring often occurs in HIV/AIDS studies. Indeed, time to HIV seroconversion can be determined only by a laboratory assessment which is usually initiated after a visit to the physician. Then one can only conclude that HIV seroconversion has happened between two examinations. The same is true for the diagnosis of AIDS, which is based on clinical symptoms and needs to be confirmed by a medical examination.

It may also happen that subjects with a lifetime less than some threshold may not be observed at all: this is called truncation. Note that truncation is different from left censoring, since for a left censored datum, we know the subject exists, but for a truncated datum, we may be completely unaware of the subject. Truncation is also common. In a so-called delayed entry study, subjects are not observed at all until they have reached a certain age. For example, people may not be observed until they have reached the age to enter school. Any deceased subjects in the pre-school age group would be unknown. Left-truncated data are common in actuarial work for life insurance and pensions.[16]

Left-censored data can occur when a person's survival time becomes incomplete on the left side of the follow-up period for the person. For example, in an epidemiological example, we may monitor a patient for an infectious disorder starting from the time when he or she is tested positive for the infection. Although we may know the right-hand side of the duration of interest, we may never know the exact time of exposure to the infectious agent.[17]

Fitting parameters to data

Survival models can be usefully viewed as ordinary regression models in which the response variable is time. However, computing the likelihood function (needed for fitting parameters or making other kinds of inferences) is complicated by the censoring. Thelikelihood function for a survival model, in the presence of censored data, is formulated as follows. By definition the likelihood function is theconditional probability of the data given the parameters of the model.It is customary to assume that the data are independent given the parameters. Then the likelihood function is the product of the likelihood of each datum. It is convenient to partition the data into four categories: uncensored, left censored, right censored, and interval censored. These are denoted "unc.", "l.c.", "r.c.", and "i.c." in the equation below.

$$L(\theta) = \prod_{i \in \text{unc.}} \Pr(T = T_i \mid \theta) \prod_{i \in \text{l.c.}} \Pr(T < T_i \mid \theta) \prod_{i \in \text{r.c.}} \Pr(T > T_i \mid \theta) \prod_{i \in \text{i.c.}} \Pr(T_{i,l} < T < T_{i,r} \mid \theta).$$

For uncensored data, with $T_i$ equal to the age at death, we have

$$\Pr(T = T_i \mid \theta) = f(T_i \mid \theta).$$

For left-censored data, such that the age at death is known to be less than $T_i$, we have

$$\Pr(T < T_i \mid \theta) = F(T_i \mid \theta) = 1 - S(T_i \mid \theta).$$

For right-censored data, such that the age at death is known to be greater than $T_i$, we have

$$\Pr(T > T_i \mid \theta) = 1 - F(T_i \mid \theta) = S(T_i \mid \theta).$$

For an interval-censored datum, such that the age at death is known to be less than $T_{i,r}$ and greater than $T_{i,l}$, we have

$$\Pr(T_{i,l} < T < T_{i,r} \mid \theta) = S(T_{i,l} \mid \theta) - S(T_{i,r} \mid \theta).$$

An important application where interval-censored data arise is current status data, where an event $T_i$ is known not to have occurred before an observation time and to have occurred before the next observation time.
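The four likelihood contributions can be made concrete with a small sketch. The code below assumes an exponential lifetime model, so that S(t | θ) = exp(−θt) and f(t | θ) = θ exp(−θt); this model choice, the observation encoding, the helper names and the data are all illustrative assumptions rather than part of the formulation above. The log-likelihood is maximized with a simple ternary search, which is adequate here because the exponential log-likelihood is concave in the rate.

```python
import math


def survival(t, rate):
    """S(t | rate) for an exponential lifetime model."""
    return math.exp(-rate * t)


def log_likelihood(rate, observations):
    """Censored-data log-likelihood under an exponential model.

    observations is a list of (kind, value) pairs, where kind is one of
      "unc"  value = t        exact event time
      "lc"   value = t        event known to have occurred before t
      "rc"   value = t        event known to occur after t
      "ic"   value = (l, r)   event known to lie between l and r
    """
    total = 0.0
    for kind, value in observations:
        if kind == "unc":
            # log f(t) = log(rate) - rate * t
            total += math.log(rate) - rate * value
        elif kind == "lc":
            # log F(t) = log(1 - S(t))
            total += math.log(1.0 - survival(value, rate))
        elif kind == "rc":
            # log S(t) = -rate * t
            total += -rate * value
        elif kind == "ic":
            left, right = value
            # log(S(l) - S(r))
            total += math.log(survival(left, rate) - survival(right, rate))
        else:
            raise ValueError(f"unknown observation kind: {kind!r}")
    return total


def maximize(f, low, high, iterations=200):
    """Ternary search for the maximizer of a unimodal function on [low, high]."""
    for _ in range(iterations):
        m1 = low + (high - low) / 3.0
        m2 = high - (high - low) / 3.0
        if f(m1) < f(m2):
            low = m1
        else:
            high = m2
    return (low + high) / 2.0


# Hypothetical data containing all four kinds of observation.
data = [("unc", 2.0), ("unc", 5.0), ("rc", 8.0), ("lc", 1.0), ("ic", (3.0, 6.0))]
rate_hat = maximize(lambda r: log_likelihood(r, data), 1e-6, 5.0)
print(rate_hat)
```

With only uncensored and right-censored observations the same routine reproduces the familiar closed form for this model, namely the number of events divided by the total observed time.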

Non-parametric estimation

The Kaplan–Meier estimator can be used to estimate the survival function. The Nelson–Aalen estimator can be used to provide a non-parametric estimate of the cumulative hazard rate function. These estimators require lifetime data. Periodic case (cohort) and death (and recovery) counts are statistically sufficient to make non-parametric maximum likelihood and least squares estimates of survival functions, without lifetime data.
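A minimal sketch of both estimators for right-censored lifetime data is given below. At each distinct event time it uses the number at risk n and the number of events d: the Kaplan–Meier estimate multiplies by (1 − d/n), and the Nelson–Aalen estimate adds d/n. The function name and the sample data are hypothetical, and subjects censored at an event time are treated as still at risk at that time, which is the usual convention.

```python
def km_and_na(times, events):
    """Return [(t, S_hat(t), H_hat(t))] at each distinct observed event time.

    times  : observed times (event time or right-censoring time)
    events : 1 if the event was observed, 0 if right censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    surv, cum_haz, rows = 1.0, 0.0, []
    for t in event_times:
        # Subjects still under observation just before t.
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at t.
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk   # Kaplan-Meier product-limit step
        cum_haz += deaths / at_risk      # Nelson-Aalen increment
        rows.append((t, surv, cum_haz))
    return rows


# Hypothetical data: six subjects, two of them right censored.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
table = km_and_na(times, events)

for t, s, h in table:
    print(f"t={t}  S={s:.4f}  H={h:.4f}")
```

This quadratic-time version favors readability; production libraries sort the data once and update the risk set incrementally.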

Discrete-time survival models

While many parametric models assume continuous time, discrete-time survival models can be mapped to a binary classification problem. In a discrete-time survival model, the survival period is artificially resampled into intervals, and for each interval a binary target indicator records whether the event takes place within a certain time horizon.[18] If a binary classifier (potentially enhanced with a different likelihood to take more of the structure of the problem into account) is calibrated, then the classifier score is the hazard function (i.e. the conditional probability of failure).[18]

Description of the transformation of continuous-time survival data to discrete-time survival data. Individual 4 is censored and for individual 5 the event happens outside the observation window 5.
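The transformation to discrete-time data can be sketched as a "person-period" expansion, in which each subject contributes one row per interval at risk. The code below is illustrative only: the function name, the unit interval width, the five-interval window and the data are assumptions, and it adopts one common convention in which the partial interval where a subject is censored is dropped. With no covariates, the per-interval event fraction is the simplest calibrated "classifier score" and serves as an empirical estimate of the discrete hazard.

```python
from collections import defaultdict


def to_person_period(subject_id, time, event, width=1.0, n_intervals=5):
    """Expand one subject into (subject_id, interval_index, target) rows.

    target is 1 only in the interval that contains an observed event.
    Events after the observation window yield target 0 in every interval.
    """
    rows = []
    for k in range(n_intervals):
        start, end = k * width, (k + 1) * width
        if time <= start:
            break                      # subject no longer under observation
        if event and time <= end:
            rows.append((subject_id, k, 1))
            break                      # event occurred in interval k
        if not event and time < end:
            break                      # censored mid-interval: row dropped
        rows.append((subject_id, k, 0))
    return rows


# Hypothetical subjects: (id, observed time, event indicator).
# Subject 4 is censored; subject 5 has its event after the 5-interval window.
subjects = [(1, 2.5, 1), (2, 4.2, 1), (3, 0.7, 1), (4, 3.4, 0), (5, 6.1, 1)]

rows = [r for s in subjects for r in to_person_period(*s)]

# Empirical discrete hazard: events in interval k / subjects at risk in k.
at_risk, n_events = defaultdict(int), defaultdict(int)
for _, k, target in rows:
    at_risk[k] += 1
    n_events[k] += target
hazard = {k: n_events[k] / at_risk[k] for k in sorted(at_risk)}
print(hazard)
```

In practice the expanded rows would also carry covariates, and any binary classifier could be trained on the target column with the interval index as a feature.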

Discrete-time survival models are connected to empirical likelihood.[19][20]

Goodness of fit

The goodness of fit of survival models can be assessed using scoring rules.[21]

Computer software for survival analysis

The textbook by Kleinbaum has examples of survival analyses using SAS, R, and other packages.[22] The textbooks by Broström,[23] Dalgaard,[3] and Tableman and Kim[24] give examples of survival analyses using R (or using S, with code that also runs in R).

Distributions used in survival analysis

Applications

See also

References

  1. Clark, T G; Bradburn, M J; Love, S B; Altman, D G (2003-07-15). "Survival Analysis Part I: Basic concepts and first analyses". British Journal of Cancer. 89 (2): 232–238. doi:10.1038/sj.bjc.6601118. PMC 2394262. PMID 12865907.
  2. Miller, Rupert G. (1997), Survival analysis, John Wiley & Sons, ISBN 0-471-25218-2
  3. Dalgaard, Peter (2008), Introductory Statistics with R (Second ed.), Springer, ISBN 978-0387790534
  4. Saegusa, Takumi; Di, Chongzhi; Chen, Ying Qing (September 2014). "Hypothesis testing for an extended cox model with time-varying coefficients". Biometrics. 70 (3): 619–628. doi:10.1111/biom.12185. ISSN 0006-341X. PMC 4247822. PMID 24888739.
  5. Segal, Mark Robert (1988). "Regression Trees for Censored Data". Biometrics. 44 (1): 35–47. doi:10.2307/2531894. JSTOR 2531894. S2CID 60974957.
  6. Leblanc, Michael; Crowley, John (1993). "Survival Trees by Goodness of Split". Journal of the American Statistical Association. 88 (422): 457–467. doi:10.1080/01621459.1993.10476296. ISSN 0162-1459.
  7. Ritschard, Gilbert; Gabadinho, Alexis; Muller, Nicolas S.; Studer, Matthias (2008). "Mining event histories: a social science perspective". International Journal of Data Mining, Modelling and Management. 1 (1): 68. doi:10.1504/IJDMMM.2008.022538. ISSN 1759-1163.
  8. Ishwaran, Hemant; Kogalur, Udaya B.; Blackstone, Eugene H.; Lauer, Michael S. (2008-09-01). "Random survival forests". The Annals of Applied Statistics. 2 (3). arXiv:0811.1645. doi:10.1214/08-AOAS169. ISSN 1932-6157. S2CID 2003897.
  9. Therneau, Terry J.; Atkinson, Elizabeth J. "rpart: Recursive Partitioning and Regression Trees". CRAN. Retrieved November 12, 2021.
  10. Atkinson, Elizabeth J.; Therneau, Terry J. (1997). An introduction to recursive partitioning using the RPART routines. Mayo Foundation.
  11. Ishwaran, Hemant; Kogalur, Udaya B. "randomForestSRC: Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC)". CRAN. Retrieved November 12, 2021.
  12. Singh, Jared; Katzman, L. (2018). "DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network". BMC Medical Research Methodology.
  13. Nagpal, Chirag (2021). "Deep survival machines: Fully parametric survival regression and representation learning for censored data with competing risks". IEEE Journal of Biomedical and Health Informatics. 25 (8): 3163–3175. arXiv:2003.01176. doi:10.1109/JBHI.2021.3052441. PMID 33460387. S2CID 211817982.
  14. Nagpal, Chirag (2021). "Deep Cox mixtures for survival regression". Machine Learning for Healthcare Conference. arXiv:2101.06536.
  15. Darity, William A. Jr., ed. (2008). "Censoring, Left and Right". International Encyclopedia of the Social Sciences. Vol. 1 (2nd ed.). Macmillan. pp. 473–474. Retrieved 6 November 2016.
  16. Richards, S. J. (2012). "A handbook of parametric survival models for actuarial use". Scandinavian Actuarial Journal. 2012 (4): 233–257. doi:10.1080/03461238.2010.506688. S2CID 119577304.
  17. Singh, R.; Mukhopadhyay, K. (2011). "Survival analysis in clinical trials: Basics and must know areas". Perspect Clin Res. 2 (4): 145–148. doi:10.4103/2229-3485.86872. PMC 3227332. PMID 22145125.
  18. Suresh, K.; Severn, C.; Ghosh, D. (2022). "Survival prediction models: an introduction to discrete-time modeling". BMC Med Res Methodol. 22: 207. https://doi.org/10.1186/s12874-022-01679-6 , https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-022-01679-6
  19. Li, Gang; Li, Runze; Zhou, Mai (March 2005). "Empirical Likelihood in Survival Analysis". Contemporary Multivariate Analysis and Design of Experiments. pp. 337–349. https://www.ms.uky.edu/~mai/research/llz.pdf
  20. Turnbull, Bruce W. (1976). "The Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated Data". Journal of the Royal Statistical Society. Series B (Methodological). 38 (3): 290–295. https://apps.dtic.mil/sti/tr/pdf/ADA030940.pdf
  21. Yanagisawa, Hiroki. "Proper Scoring Rules for Survival Analysis". https://arxiv.org/abs/2305.00621v3
  22. Kleinbaum, David G.; Klein, Mitchel (2012), Survival analysis: A Self-learning text (Third ed.), Springer, ISBN 978-1441966452
  23. Broström, Göran (2012), Event History Analysis with R (First ed.), Chapman & Hall/CRC, ISBN 978-1439831649
  24. Tableman, Mara; Kim, Jong Sung (2003), Survival Analysis Using S (First ed.), Chapman and Hall/CRC, ISBN 978-1584884088
  25. Stepanova, Maria; Thomas, Lyn (2002-04-01). "Survival Analysis Methods for Personal Loan Data". Operations Research. 50 (2): 277–289. doi:10.1287/opre.50.2.277.426. ISSN 0030-364X.
  26. Glennon, Dennis; Nigro, Peter (2005). "Measuring the Default Risk of Small Business Loans: A Survival Analysis Approach". Journal of Money, Credit and Banking. 37 (5): 923–947. doi:10.1353/mcb.2005.0051. ISSN 0022-2879. JSTOR 3839153. S2CID 154615623.
  27. Kennedy, Edward H.; Hu, Chen; O’Brien, Barbara; Gross, Samuel R. (2014-05-20). "Rate of false conviction of criminal defendants who are sentenced to death". Proceedings of the National Academy of Sciences. 111 (20): 7230–7235. Bibcode:2014PNAS..111.7230G. doi:10.1073/pnas.1306417111. ISSN 0027-8424. PMC 4034186. PMID 24778209.
  28. de Cos Juez, F. J.; García Nieto, P. J.; Martínez Torres, J.; Taboada Castro, J. (2010-10-01). "Analysis of lead times of metallic components in the aerospace industry through a supported vector machine model". Mathematical and Computer Modelling. Mathematical Models in Medicine, Business & Engineering 2009. 52 (7): 1177–1184. doi:10.1016/j.mcm.2010.03.017. ISSN 0895-7177.
  29. Spivak, Andrew L.; Damphousse, Kelly R. (2006). "Who Returns to Prison? A Survival Analysis of Recidivism among Adult Offenders Released in Oklahoma, 1985 – 2004". Justice Research and Policy. 8 (2): 57–88. doi:10.3818/jrp.8.2.2006.57. ISSN 1525-1071. S2CID 144566819.
  30. Pollock, Kenneth H.; Winterstein, Scott R.; Bunck, Christine M.; Curtis, Paul D. (1989). "Survival Analysis in Telemetry Studies: The Staggered Entry Design". The Journal of Wildlife Management. 53 (1): 7–15. doi:10.2307/3801296. ISSN 0022-541X. JSTOR 3801296.
  31. Saleh, Joseph Homer (2019-12-23). "Statistical reliability analysis for a most dangerous occupation: Roman emperor". Palgrave Communications. 5 (1): 1–7. doi:10.1057/s41599-019-0366-y. ISSN 2055-1045.
  32. Kreer, Markus; Kizilersu, Ayse; Thomas, Anthony W. (2022). "Censored expectation maximization algorithm for mixtures: Application to intertrade waiting times". Physica A: Statistical Mechanics and Its Applications. 587 (1): 126456. Bibcode:2022PhyA..58726456K. doi:10.1016/j.physa.2021.126456. ISSN 0378-4371. S2CID 244198364.

Further reading

  • Collett, David (2003). Modelling Survival Data in Medical Research (Second ed.). Boca Raton: Chapman & Hall/CRC. ISBN 1584883251.
  • Elandt-Johnson, Regina; Johnson, Norman (1999). Survival Models and Data Analysis. New York: John Wiley & Sons. ISBN 0471349925.
  • Kalbfleisch, J. D.; Prentice, Ross L. (2002). The statistical analysis of failure time data. New York: John Wiley & Sons. ISBN 047136357X.
  • Lawless, Jerald F. (2003). Statistical Models and Methods for Lifetime Data (2nd ed.). Hoboken: John Wiley and Sons. ISBN 0471372153.
  • Rausand, M.; Hoyland, A. (2004). System Reliability Theory: Models, Statistical Methods, and Applications. Hoboken: John Wiley & Sons. ISBN 047147133X.

External links
