Average Lifespan with Right Censored Data

Business Question

Say your boss poses the following question: what is the average lifespan of an account? It seems like an easy question at first. Just take the difference between the end date and start date for each account, and compute the average from those deltas. Here is the rub...what happens when accounts haven't been closed yet? Computing the delta isn't possible when you're missing the end date. This type of missing data is called right censored in survival analysis.

Dealing with Right Censored Data

The initial temptation may be to just compute the average for the deltas that have an end date, ignoring the accounts that are still open, but that will underestimate the true average. This approach is a conditional mean. It answers the question, what is the average lifespan of an account, given that it has closed already? That's not our business question.

The other temptation may be to adjust every observation that doesn't yet have an end date, and give it an end date the corresponds with the end of your study. However, this prematurely limits the lifespan of accounts that would otherwise continue, providing a biased estimate of the average.

One way to appropriately deal with this data and still answer the original business question is to build a survival curve using the Kaplan Meier (KM) estimator, and then take the area under the curve to generate our estimate for the average lifespan of an account. KM incorporates the right censored data so that we're not throwing away information in the estimation. It is an iterative calculation, that uses the censored data to determine the at-risk population (n) right up until the time they are censored, then discards them.

Kaplan Meier Estimator

kaplanmeier

Where:

S(t) is the survival curve at time i

d is the number of accounts that closed at time i

n is the number of at risk accounts, just prior to time i

The number of at-risk accounts is computed by taking the at-risk population from the time period just before, minus the number of events in the last period, minus the number of censored accounts last period. It's here that we make full use of the censored data. It stays in the calculation until the period after it is censored.

Once the curve is computed using the KM estimator, then the area under the curve is computed to find the average life of an account. This is computed by taking the weighted average of the time periods, where the weights are the survival curve probabilities.

Kaplan Meier in SAS

Now the intuition has been established, let's walk through how to compute the result in SAS. All of the code and data from this blog post can be found on github. For the sake of a reproducible example, let's generate random numbers from a Weibull distribution using an arbitrary seed, 8787.  To make the data look more like a number of months, we'll scale the data by 12 and round it off to a whole month. Finally, we'll randomly censor 30% of the data before passing it on the LIFETEST Procedure.

There are three macro variables defined up front that you may want to adjust.

  • seed: used for generating the same data from Weibull distribution, and to censor the same observations each time.
  • recordCount: adjust this to generate more datapoints and get a smoother curve
  • censorProportion: to experiment with how censoring more or less of the data alters the results

The first part of the output shown below answers the business question of what the average lifespan of an account is:  34.909 months or about 3 years. The next piece is the Kaplan Meier curve with 95% confidence intervals around the survival probability. The interpretation of this chart is the probability of an account lasting beyond time t. For example, for t=50 on the x axis, it corresponds to a survival probability of about 0.25, so an account has a 25% chance of being open for more than 50 months. With the confidence bands, we would say that we're 95% confident that the probability of surviving beyond 50 months is between about 0.15 and 0.35. The final table displays the proportion of data censored, which just happened to hit exactly 30% like we declared upfront with the censorProportion variable.

KM curve with 95pct CI

While the proc lifetest output gives us the mean we were originally looking for, let's compute it manually too in order to strengthen the intuition. Using the KMoutput dataset from proc lifetest, we have a record for each time period that an event occurred. However, to compute the area under the curve, we have to impute the survival probabilities that occur between events to get a complete curve. Keeping in mind that the survival curve doesn't drop when there is censoring, only when an event occurs, so we should repeat probability for the non-event months until the next event occurs.

The first part of this data step computes the number of times to repeat a value. The KMoutput dataset does not explicitly state each time period under the curve, so we need to compute it ourselves, and we'll call that variable impliedRepetition. The impliedRepetition variable is shifted to t+1, so we lag the survival probabilities to line them up properly with our repetition counter variable. Finally, we weight the time periods by the probability of occurrence, and store those values in a variable called auc, which is short for area under the curve. The last proc means step sums the auc variable, which yields an identical value for the mean from the LIFETEST Procedure, 34.909.

To see exactly what is happening with this calculation, look at Obs 10 and 11. The time t variable skips over time 5, meaning that we need to repeat the value for time 4 twice. That is, 0.89724 should also be the value used for time 5. However, the impliedRepetition variable that corresponds to repeating time 4 twice is sitting on Obs 11, so we lag SURVIVAL to align these cleanly. Summing the last column gives the area under the curve.

First 20 Records of KMoutput

KMoutput table 2

Mean of AUC from KM

Kaplan Meier in R

Let's compute the average lifespan of an account in R as well. Rather than recreate the random data from Weibull distribution, let's just read in the same randomdata SAS dataset into R.

The LIFETEST Procedure naturally expects the censoring variable to be 1 when censored, and 0 when not censored. In R, the Surv() function to create a survival object treats 0 as censored and 1 as the event/account closure time indicator. So we need to flip the 1's and 0's in our censor column.

Now we're ready to build the survival object with the Surv() function and estimate the Kaplan Meier survival curve with survfit(). Both are from the survival package. The model and results are stored in the s object.

Looking at the available data in the model object s, I don't see the mean value precomputed like we had in SAS. Let's compute it manually again.

To get a quick visual of the survival probabilities and confidence intervals, just plot the model object. It's not the most aesthetically appealing chart, but it was quick and memorable code.

R KM curve

To see all of this code in a handful of compact scripts, go to my GitHub account page. Both SAS datasets are available too: the random Weibull input data, and the proc lifetest output data.

Final Thoughts

We rushed ahead with the Kaplan Meier estimator without mentioning what assumptions we were making. One key assumption was that censoring is independent from the survival probabilities. While we applied the censoring randomly with a uniform distribution in this example, we may encounter a different pattern in practice. Conceding that every business process has its nuances, I would suspect that new accounts are less sticky than older, more established ones, so I would expect a lot less censoring for the new accounts. Also, Zhong and Hess found that longer tail distributions are more sensitive to bias when there is right censoring. So the shape of the curve matters too.

Resources

Aside from the other links peppered in through out the blog post, I found these links helpful while writing this blog post:

 

Posted in stats and tagged , , , , .

Leave a Reply

Your email address will not be published. Required fields are marked *