Median survival time

Background

The median survival time is the length of time, from either the date of diagnosis or the start of treatment for a disease such as cancer, at which half of the patients in a group diagnosed with the disease are still alive. In a clinical trial, measuring the median overall survival is one way to see how well a new treatment works. It is also called the median survival.
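
In terms of the survival function $S(t)$, this can be written as

$$ t_{\text{med}} = \min\{\, t : S(t) \leq 0.5 \,\} $$

i.e. the earliest time at which the survival probability drops to one half.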

Tip

The median is useful, but it is the expected (mean) survival time that is of particular interest for health technology assessment (HTA).
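
For example, for an exponential distribution the mean is 1/λ whereas the median is log(2)/λ ≈ 0.69/λ, so the two can differ substantially. A quick illustration, using an arbitrary rate of 0.14 (roughly the order of the exponential fit below, but purely illustrative):

rate <- 0.14  # illustrative value, not a fitted rate
c(mean = 1/rate, median = log(2)/rate)
    mean   median 
7.142857 4.951051 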

R Examples

In this example we compare survival probabilities at given landmark times, and compare observed medians (i.e. based on the Kaplan-Meier estimate) with medians predicted from fitted models (using the closed-form median for each distribution) for each of the 6 main distributions we consider.

The summary method for a survHE object from the survHE package returns mean survival times, including the median of the simulated mean survival times (not to be confused with the mean of the median survival time!). For an exponential model fit with no covariates,

library(survHE)

data(bc)

mle <- fit.models(formula = Surv(recyrs, censrec) ~ 1,
                  data = bc,
                  distr = "exp",
                  method = "mle")

summary(mle)

Estimated average survival time distribution* 
 mean sd 2.5% median 97.5%
    0  0    0      0     0

*Computed over the range: [0.02192-7.28493] using 1000 simulations.
NB: Check that the survival curves tend to 0 over this range!

Note that this is calculated over a closed time range and not the entire timeline.

We can compare this parametric estimate with the median survival time from the Kaplan-Meier estimate. This is available from the survHE output in misc$km, using the equation

$$ \hat{t}_{\text{med}} = \min\{\, t : \hat{S}(t) < 0.5 \,\} $$

t_med <- min(mle$misc$km$time[mle$misc$km$surv < 0.5])
t_low <- min(mle$misc$km$time[mle$misc$km$lower < 0.5])
t_upp <- min(mle$misc$km$time[mle$misc$km$upper < 0.5])

t_med
[1] 4.950685

There is clearly some repetition here, so we can simplify as follows.

surv_median <- function(S, sname) {
  min(S[["time"]][S[[sname]] < 0.5])
}

KM <- mle$misc$km

surv_median(KM, "surv")
[1] 4.950685
surv_median(KM, "lower")
[1] 4.347945
surv_median(KM, "upper")
[1] 5.561644

Plotting the Kaplan-Meier we can indicate these median times.

survfit(Surv(recyrs, censrec) ~ 1, data = bc) |> 
  plot()
abline(h = 0.5)
abline(v = c(t_low, t_med, t_upp), lty = c(2,1,2))

Direct estimates

If we denote the median by $t_{50}$, then to calculate the medians ourselves we can take the fitted coefficient value from the fit.models output and invert the survival function. In the case of the exponential distribution, setting $S(t) = e^{-\lambda t} = 0.5$ and solving for $t$ gives

$$ t_{50} = \frac{-\log(0.5)}{\lambda} = \frac{\log 2}{\lambda} $$

rate <- mle$models$Exponential$coefficients
exp(rate)
[1] 0.1414765
# closed-form median
med_time <- -log(0.5)/exp(rate)
med_time
[1] 4.899379

The log-logistic distribution has survival function

$$ S(t) = \frac{1}{1 + (t/\alpha)^{\beta}} $$

which leads to the median $t_{50} = \alpha$, i.e. simply the scale parameter.

Similarly, the Gompertz distribution median is

$$ t_{50} = \frac{1}{b}\,\log\!\left[1 - \frac{1}{\eta}\log(0.5)\right] $$

The Weibull distribution median is

$$ t_{50} = \lambda\,\left[-\log(0.5)\right]^{1/k} $$

The log-normal distribution median is

$$ t_{50} = \exp(\mu) $$

The generalised gamma distribution has no simple closed-form expression for its median.
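
As a quick sanity check, these closed-form medians can be compared against the corresponding quantile functions from flexsurv and base R. The sketch below uses arbitrary illustrative parameter values (not the fitted ones); note that flexsurv parameterises the Gompertz with a shape and a rate such that the median is (1/shape) log(1 - (shape/rate) log(0.5)).

library(flexsurv)

# arbitrary illustrative parameter values (not fitted values)
shape <- 1.5; scale <- 4; rate <- 0.2; meanlog <- 1.5; sdlog <- 0.5

# log-logistic: median = scale
c(qllogis(0.5, shape = shape, scale = scale), scale)

# Gompertz (flexsurv shape/rate parameterisation):
# median = (1/shape) * log(1 - (shape/rate) * log(0.5))
c(qgompertz(0.5, shape = shape, rate = rate),
  (1/shape) * log(1 - (shape/rate) * log(0.5)))

# Weibull: median = scale * (-log(0.5))^(1/shape)
c(qweibull(0.5, shape = shape, scale = scale),
  scale * (-log(0.5))^(1/shape))

# log-normal: median = exp(meanlog)
c(qlnorm(0.5, meanlog = meanlog, sdlog = sdlog), exp(meanlog))

# generalised gamma: no closed form, but the quantile function gives the median numerically
qgengamma(0.5, mu = 1.5, sigma = 0.5, Q = 1)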

Simulation-based estimates

Note that the parameter returned by fit.models is the log of the rate. More generally, we can simulate (multiple) survival curves from the uncertainty distribution of the fitted coefficients and estimate the median for each of these. So, draw samples using make.surv() from the survHE package to obtain output for the single-curve case as follows.

surv_exp <- make.surv(mle)

The sampled survival curves from make.surv() use slightly different column names, so let us redefine the median function and then extract the median times.

surv_median <- function(S, sname) {
  min(S[["t"]][S[[sname]] < 0.5])
}

surv <- surv_exp$S[[1]]

surv_median(surv, "S")
Warning in min(S[["t"]][S[[sname]] < 0.5]): no non-missing arguments to min;
returning Inf
[1] Inf

We can do the same across multiple simulations to obtain uncertainty bounds. Repeating the above but with 100 simulations,

sim100 <- make.surv(mle, nsim = 100)

The direct (closed-form) estimates are

rtimes <- -log(0.5)/unlist(sim100$sim)
rtimes
  [1] 5.129274 4.994966 5.143881 5.354052 4.122856 4.678172 4.636991 5.118422
  [9] 4.797051 5.042906 4.734239 4.604857 5.175513 5.230992 4.882798 4.672267
 [17] 4.774438 4.569511 4.470436 4.557443 5.063112 4.608875 4.964719 5.499736
 [25] 4.715466 5.229862 4.999613 4.890220 4.914381 4.906810 5.050804 4.804258
 [33] 4.907729 5.002857 5.212119 5.219587 4.481277 4.574514 5.074015 4.966932
 [41] 4.668094 4.948967 5.061563 4.784166 5.048711 4.458354 5.227693 5.429019
 [49] 5.120158 4.989762 4.932618 5.333248 4.696408 4.967057 4.466571 5.487606
 [57] 4.998161 5.591646 5.264726 4.879369 4.805861 4.880135 5.017423 5.313824
 [65] 5.083576 5.123904 5.346759 4.884546 5.449213 4.774073 5.015380 5.133944
 [73] 4.407795 4.844915 4.878669 5.220372 4.979860 4.726083 4.809170 5.019526
 [81] 4.644216 4.937766 5.326748 4.888761 4.809222 4.958810 4.589322 4.520686
 [89] 4.866668 4.998618 5.270166 4.956875 5.127313 4.557834 4.634553 5.416319
 [97] 4.637491 4.237438 5.067876 5.459735
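
Since rtimes is just a vector of sampled medians, its spread can also be summarised directly with quantile(), as an alternative to reading the bounds off the sampled survival curves below.

quantile(rtimes, probs = c(0.025, 0.5, 0.975))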

The corresponding estimates from the sampled survival curves are

surv <- sim100$S[[1]]

t_S <- surv_median(surv, "S")
t_low <- surv_median(surv, "low")
t_upp <- surv_median(surv, "upp")

t_S
[1] Inf

The plot showing all of the sampled medians is:

plot(mle) + 
  geom_vline(xintercept = rtimes, alpha = 0.1, col = "darkgrey", linewidth = 2) +
  geom_vline(xintercept = med_time) +
  geom_vline(xintercept = t_low, linetype = 2) +
  geom_vline(xintercept = t_upp, linetype = 2)

Multiple distributions

In the same way as for a single distribution, we can extend the analysis for multiple distributions at the same time. We show this for exponential and log-logistic distributions. First, fit the models and show the survival curves.

fit2 <- fit.models(formula = Surv(recyrs, censrec) ~ 1,
                   data = bc,
                   distr = c("exp", "loglogistic"),
                   method = "mle")

plot(fit2)

Then, sample the survival curves and rearrange the output so that it's straightforward to use the data in the same way as above.

NSIM <- 100
sim <- list()
sim[[1]] <- make.surv(fit2, mod = 1, nsim = NSIM)
sim[[2]] <- make.surv(fit2, mod = 2, nsim = NSIM)

sim <- purrr::transpose(sim)  # sim$S and sim$sim now each contain one element per model

We can then get the direct estimates,

rtimes <- list()
rtimes[[1]] <- -log(0.5)/sim$sim[[1]][[1]][, "rate"]
rtimes[[2]] <- sim$sim[[2]][[1]][, "scale"]

rtimes
[[1]]
  [1] 4.891236 5.558763 4.640760 4.955321 5.064032 4.952246 4.417258 4.785022
  [9] 4.386729 5.174528 5.029145 5.103521 5.152045 4.913786 4.775232 4.745259
 [17] 4.921242 4.645849 4.648273 4.611302 4.596660 4.749946 5.000975 4.492578
 [25] 5.041510 5.037908 4.682468 4.941710 4.295228 5.419128 5.059493 5.461694
 [33] 5.067809 5.056597 4.723538 4.858205 5.189288 4.734738 4.644047 4.843307
 [41] 4.829866 4.534743 4.992454 4.865820 4.911094 4.890826 4.477248 4.928613
 [49] 4.710085 4.725032 5.070692 5.015997 4.919846 5.099190 4.665799 4.651995
 [57] 5.059758 4.107772 4.951071 5.553459 4.228464 4.758949 4.842218 4.608252
 [65] 5.199355 4.760746 5.160960 4.295892 4.690292 5.199554 4.795301 5.300827
 [73] 4.894303 5.097581 4.752782 4.628371 5.333848 5.036416 5.257309 4.609605
 [81] 5.184192 5.229711 5.316874 4.935175 4.630972 5.059328 5.080439 4.923191
 [89] 5.564449 4.840336 4.778244 4.910046 4.890228 5.803821 4.597359 5.034822
 [97] 5.004193 4.842067 5.046290 5.192478

[[2]]
  [1] 4.466570 4.619031 4.022234 4.828505 4.246242 4.124056 4.588208 5.056415
  [9] 4.462401 4.442259 4.602219 4.712382 4.667207 4.667028 4.526850 4.390890
 [17] 4.310906 4.883069 4.271361 4.951216 4.589956 4.342607 4.330297 4.432282
 [25] 4.433938 4.811139 4.399874 4.423267 4.563278 4.640146 4.617160 4.117204
 [33] 4.585883 4.609707 4.733523 4.680524 4.623114 4.594910 4.666122 4.656637
 [41] 4.451875 4.420331 4.605745 4.649875 4.429742 4.609523 4.134343 5.113835
 [49] 4.424969 4.928719 4.327575 5.004125 4.246994 4.594632 4.691098 4.278484
 [57] 4.398825 4.207993 4.574355 4.226589 4.612187 4.598995 4.518066 4.618010
 [65] 4.509008 4.425050 4.831216 4.753106 4.285328 4.515222 4.576484 4.243922
 [73] 4.312133 4.326376 4.364893 4.205415 4.634487 4.517175 4.062072 5.084034
 [81] 4.835280 4.660257 4.401611 4.536356 4.603107 4.437586 4.663069 4.813495
 [89] 4.361790 4.823949 5.333890 4.583438 4.576911 4.384318 4.610598 4.511460
 [97] 4.486576 4.259209 4.164961 4.357421

and the sampled estimates,

# simulated estimates
t_S <- purrr::map_dbl(sim$S, ~ surv_median(.x[[1]], "S"))
t_low <- purrr::map_dbl(sim$S, ~ surv_median(.x[[1]], "low"))
t_upp <- purrr::map_dbl(sim$S, ~ surv_median(.x[[1]], "upp"))

Plotting the two sets of medians we can see the location and spread for both distributions together.

plot(fit2) + 
  geom_vline(xintercept = rtimes[[1]], alpha = 0.1, col = "pink", linewidth = 2) +
  geom_vline(xintercept = rtimes[[2]], alpha = 0.1, col = "lightblue", linewidth = 2) +
  geom_vline(xintercept = t_S[[1]]) +
  geom_vline(xintercept = t_low[[1]], linetype = 2) +
  geom_vline(xintercept = t_upp[[1]], linetype = 2) +
  geom_vline(xintercept = t_S[[2]]) +
  geom_vline(xintercept = t_low[[2]], linetype = 3) +
  geom_vline(xintercept = t_upp[[2]], linetype = 3)

Multiple percentiles

The pth sample percentile of the survival time distribution can be computed using the general formula

$$ \hat{t}_p = \frac{1}{2}\left( \min\{\, t : 1 - \hat{S}(t) \geq p \,\} + \max\{\, t : 1 - \hat{S}(t) \leq p \,\} \right) $$
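
The helper functions defined below only use the first term of this formula (the smallest time at which the estimated survival has dropped below the target). For reference, a minimal sketch of the full definition applied to the Kaplan-Meier estimate stored in mle$misc$km could look like the following, where km_percentile is an illustrative helper name.

km_percentile <- function(km, p) {
  # first time the estimated cumulative incidence 1 - S(t) reaches p ...
  lhs <- min(km$time[1 - km$surv >= p])
  # ... and last time it is still at or below p
  rhs <- max(km$time[1 - km$surv <= p])
  (lhs + rhs) / 2
}

km_percentile(mle$misc$km, p = 0.5)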

So, analogous to the median only example above, let us fit an exponential distribution.

mle <- fit.models(formula = Surv(recyrs, censrec) ~ 1,
                  data = bc,
                  distr = "exp",
                  method = "mle")

surv <- make.surv(mle, nsim = NSIM)

We can extend the surv_median function by creating a function factory which we can use to create equivalent functions for different percentiles.

surv_percentile <- function(p) {
  force(p)
  function(S, sname)
    min(S[["t"]][S[[sname]] < p])
}

surv_median <- surv_percentile(0.5)
surv_median(surv$S[[1]], "S")
[1] Inf

Now we can automatically create functions for all the percentiles of interest by mapping over the vector of probabilities, which returns a list of functions.

prctile <- c("97.5" = 0.975, "75" = 0.75, "50" = 0.5, "25" = 0.25, "2.5" = 0.025)

p_fns <- purrr::map(prctile, surv_percentile)

head(p_fns)
$`97.5`
function(S, sname)
    min(S[["t"]][S[[sname]] < p])
<bytecode: 0x00000216baa60d38>
<environment: 0x00000216bb4b3ab8>

$`75`
function(S, sname)
    min(S[["t"]][S[[sname]] < p])
<bytecode: 0x00000216baa60d38>
<environment: 0x00000216baa63560>

$`50`
function(S, sname)
    min(S[["t"]][S[[sname]] < p])
<bytecode: 0x00000216baa60d38>
<environment: 0x00000216baa63288>

$`25`
function(S, sname)
    min(S[["t"]][S[[sname]] < p])
<bytecode: 0x00000216baa60d38>
<environment: 0x00000216baa62fb0>

$`2.5`
function(S, sname)
    min(S[["t"]][S[[sname]] < p])
<bytecode: 0x00000216baa60d38>
<environment: 0x00000216baa62cd8>

In the same way as we did with just the median function, we can now apply the whole list of percentile functions.

simdat <- surv$S[[1]]

# example for the median, i.e. the 50th percentile
p_fns$`50`(simdat, "S")
[1] Inf
e_times <- purrr::map_dbl(p_fns, ~ do.call(.x, list(simdat, "S")))
upp_times <- purrr::map_dbl(p_fns, ~ do.call(.x, list(simdat, "upp")))
low_times <- purrr::map_dbl(p_fns, ~ do.call(.x, list(simdat, "low")))

We can plot all of the percentile times with error bounds as follows.

plot(mle) + 
  geom_vline(xintercept = e_times) +
  geom_vline(xintercept = upp_times, linetype = 2) +
  geom_vline(xintercept = low_times, linetype = 2) +
  annotate("text", x = e_times + 0.5, y = 0.25, label = prctile)

Comparing between all distribution fits and Kaplan-Meier

In this section we bring together the various pieces from the previous sections, carrying out the analysis for all 6 main distributions at the same time and for several percentiles.

First, we fit all of the models and then generate the sample of survival curves.

dist_names <- c("exponential", "weibull", "gompertz", "loglogistic", "lognormal", "gengamma")

mle <- fit.models(formula = Surv(recyrs, censrec) ~ 1,
                  data = bc,
                  distr = dist_names,
                  method = "mle")

surv <- purrr::map(setNames(1:6, dist_names), ~ make.surv(mle, mod = .x, nsim = NSIM))

Now, for each distribution we calculate the survival times at each chosen percentile.

times <- list()

for (i in dist_names) {
  simdat <- surv[[i]]$S[[1]]
  times[[i]] <- purrr::map_dbl(p_fns, ~ do.call(.x, list(simdat, "S")))
}

Finally, we can plot the results, including the Kaplan-Meier estimates.

library(scales)

## ggplot2 default colours
cols <- hue_pal()(6)
km_dat <- mle$misc$km

t_km <- purrr::map_dbl(prctile, ~min(km_dat$time[km_dat$surv < .x]))

plot(mle) + 
  purrr::map(seq_along(times), ~ geom_vline(xintercept = times[[.x]], col = cols[.x])) +
  geom_vline(xintercept = t_km, linewidth = 1.5, linetype = 2)

We haven't included the upper and lower bounds here because the plot would be too busy, but it is straightforward to extend the code above to do this, as sketched below.
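
For reference, a minimal sketch of that extension, reusing the surv, p_fns, times and cols objects from above, might look like this.

# upper and lower percentile times for each distribution
upp_all <- list()
low_all <- list()

for (i in dist_names) {
  simdat <- surv[[i]]$S[[1]]
  upp_all[[i]] <- purrr::map_dbl(p_fns, ~ do.call(.x, list(simdat, "upp")))
  low_all[[i]] <- purrr::map_dbl(p_fns, ~ do.call(.x, list(simdat, "low")))
}

# e.g. add dashed bounds for a single distribution to keep the plot readable
plot(mle) + 
  purrr::map(seq_along(times), ~ geom_vline(xintercept = times[[.x]], col = cols[.x])) +
  geom_vline(xintercept = upp_all[["exponential"]], col = cols[1], linetype = 2) +
  geom_vline(xintercept = low_all[["exponential"]], col = cols[1], linetype = 2)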

Let us create a table of these percentile outputs too.

tab <- t(do.call(rbind, times))
tab <- cbind(tab, Observed = t_km)

knitr::kable(round(tab, 2))
       exponential   weibull   gompertz   loglogistic   lognormal   gengamma   Observed
97.5           Inf       Inf        Inf           Inf         Inf        Inf       0.56
75             Inf       Inf        Inf           Inf         Inf        Inf       1.99
50             Inf       Inf        Inf           Inf         Inf        Inf       4.95
25             Inf       Inf        Inf           Inf         Inf        Inf        Inf
2.5            Inf       Inf        Inf           Inf         Inf        Inf        Inf

Survival probabilities at given times

We can flip the analysis around and instead obtain survival probabilities at user-defined time points.

The code looks very similar to the percentile case above.

t_pt <- c(1,2,5)

S_est <- list()

for (i in dist_names) {
  simdat <- surv[[i]]$S[[1]]
  S_est[[i]] <- purrr::map_dbl(t_pt, ~min(simdat$S[simdat$t < .x]))
}
km_dat <- mle$misc$km
S_km <- purrr::map_dbl(t_pt, ~min(km_dat$surv[km_dat$time < .x]))

plot(mle) + 
  purrr::map(seq_along(S_est), ~ geom_hline(yintercept = S_est[[.x]], col = cols[.x])) +
  geom_vline(xintercept = t_pt) +
  geom_hline(yintercept = S_km, linewidth = 1.5, linetype = 2)

Let us create a table of these survival probabilities as percentages.

tab <- t(do.call(rbind, S_est))
tab <- cbind(time = t_pt, tab*100, Observed = S_km*100)

knitr::kable(round(tab, 0))
 time   exponential   weibull   gompertz   loglogistic   lognormal   gengamma   Observed
    1           Inf       Inf        Inf           Inf         Inf        Inf         92
    2           Inf       Inf        Inf           Inf         Inf        Inf         75
    5           Inf       Inf        Inf           Inf         Inf        Inf         49

calc_medians() function

Finally, we can wrap the above steps into a single helper, calc_medians(), which returns both the direct (closed-form) and simulation-based median estimates for every distribution in a fitted survHE object.

calc_medians <- function(fit, NSIM = 10) {
  
  dist_names <- names(fit$models)
  ndist <- length(dist_names)
  
  med_sim <- list()
  med_direct <- list()
  surv <- list()
  
  for (i in dist_names) {
    surv[[i]] <- make.surv(fit, mod = which(i == dist_names), nsim = NSIM)
    simdat <- surv[[i]]$S[[1]]
    
    med_sim[[i]] <- c(median = surv_median(simdat, "S"),
                      upp = surv_median(simdat, "upp"),
                      low = surv_median(simdat, "low"))
    
    params <- sample_params(fit$models[[i]])
    
    ##TODO: mean, upper and lower and/or all samples?
    # median_fn <- fit$models[[i]]$dfns$q
    # med_args <- c(p = 0.5, lower.tail = FALSE)
    # med_direct[[i]] <- purrr::map(params, ~ do.call(median_fn, c(med_args, as.list(.x))))
    med_direct[[i]] <- purrr::map_dbl(params, ~ do.call(median_fn(i), as.list(.x)))
  }
  
  list(med_direct,
       med_sim)
}

median_fn <- function(x) {
  switch(x,
         "Exponential"   = function(rate) -log(0.5)/rate,
         "log-Logistic"  = function(shape, scale) scale,
         "Gompertz"      = function(shape, rate) (1/rate)*log((-1/exp(shape))*log(0.5) + 1),
         "Weibull (AFT)" = function(shape, scale) scale*(log(2)^(1/shape)),
         "log-Normal"    = function(meanlog, sdlog) exp(meanlog),
         "Gen. Gamma"    = function(...) NA)
}

surv_median <- function(S, sname) {
  min(S[["t"]][S[[sname]] < 0.5])
}

sample_params <- function(fit, ...) {
  UseMethod("sample_params")
}

sample_params.flexsurvreg <- function(fit, nsim = 10, ...) {
  # sample parameter values (on the natural scale) from the multivariate
  # normal approximation to the MLE
  sboot <- normboot.flexsurvreg(fit, B = nsim)
  asplit(sboot, 1)
}

sample_params.stanfit <- function(fit, ...) {
  # for Bayesian (HMC) fits the individual models are stanfit objects
  rstan::extract(fit)
}

calc_medians(mle)
Warning in min(S[["t"]][S[[sname]] < 0.5]): no non-missing arguments to min;
returning Inf
(warning repeated for each of the 18 surv_median() calls, one per distribution and bound)

[[1]]
[[1]]$Exponential
 [1] 4.966553 4.827020 4.883045 5.628640 5.035439 5.292257 5.089955 4.804941
 [9] 4.982657 4.810618

[[1]]$`Weibull (AFT)`
 [1] 4.743420 4.493634 4.745278 4.337766 4.626728 4.657598 4.464190 4.717536
 [9] 4.582725 4.676070

[[1]]$Gompertz
 [1] 4.046255 4.358841 4.227582 4.307995 4.373330 3.678017 3.759001 4.338441
 [9] 3.509706 3.523070

[[1]]$`log-Logistic`
 [1] 5.013626 4.066543 4.580682 4.959065 4.212073 4.036718 4.555648 4.273214
 [9] 4.446469 4.491434

[[1]]$`log-Normal`
 [1] 4.657814 4.132980 4.784933 5.107095 4.871923 4.441884 4.251263 4.747124
 [9] 4.518871 4.463943

[[1]]$`Gen. Gamma`
 [1] NA NA NA NA NA NA NA NA NA NA


[[2]]
[[2]]$Exponential
median    upp    low 
   Inf    Inf    Inf 

[[2]]$`Weibull (AFT)`
median    upp    low 
   Inf    Inf    Inf 

[[2]]$Gompertz
median    upp    low 
   Inf    Inf    Inf 

[[2]]$`log-Logistic`
median    upp    low 
   Inf    Inf    Inf 

[[2]]$`log-Normal`
median    upp    low 
   Inf    Inf    Inf 

[[2]]$`Gen. Gamma`
median    upp    low 
   Inf    Inf    Inf
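
A possible next step, assuming the result is stored (e.g. res <- calc_medians(mle)), is to reduce the sampled closed-form medians to a point estimate and interval.

res <- calc_medians(mle)

# 95% interval and median of the sampled closed-form medians, with NA for
# distributions that have no closed-form median
purrr::map(res[[1]],
           ~ if (all(is.na(.x))) NA else quantile(.x, probs = c(0.025, 0.5, 0.975)))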