**Business Question**

Say your boss poses the following question: *what is the average lifespan of an account?* It seems like an easy question at first: just take the difference between the end date and start date for each account, and compute the average of those deltas. Here's the rub... what happens when accounts haven't been closed yet? Computing the delta isn't possible when you're missing the end date. In survival analysis, this type of missing data is called right censored.

**Dealing with Right Censored Data**

The initial temptation may be to just compute the average for the deltas that have an end date, ignoring the accounts that are still open, but that will underestimate the true average. This approach yields a conditional mean: it answers the question, *what is the average lifespan of an account, given that it has already closed?* That's not our business question.

The other temptation may be to adjust every observation that doesn't yet have an end date, giving it an end date that corresponds with the end of your study. However, this prematurely cuts off the lifespan of accounts that would otherwise continue, producing a biased (low) estimate of the average.
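Both temptations can be demonstrated with a quick simulation. This is a hypothetical Python sketch (the code later in the post is SAS and R): it draws lifespans whose true mean is known, censors anything still open at an assumed 24-month study end, and compares the two naive estimates against the truth. All names and numbers here are illustrative, not from the post's data.

```python
import random

random.seed(8787)

# Hypothetical setup: true lifespans averaging ~24 months, observed only
# until an assumed 24-month study end.
n = 100_000
study_end = 24  # months
true_life = [random.weibullvariate(24, 1.0) for _ in range(n)]  # mean ~ 24

# (observed time, closed?) pairs; accounts open at study end are censored
observed = [(min(t, study_end), t <= study_end) for t in true_life]

true_mean = sum(true_life) / n
# Temptation 1: conditional mean over closed accounts only
closed = [t for t, done in observed if done]
conditional_mean = sum(closed) / len(closed)
# Temptation 2: treat the study end as every open account's end date
capped_mean = sum(t for t, _ in observed) / n

print(round(true_mean, 1), round(conditional_mean, 1), round(capped_mean, 1))
```

Both shortcuts land well below the true mean: the conditional mean drops the longest-lived accounts entirely, and the capped mean truncates them.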

One way to appropriately deal with this data and still answer the original business question is to build a survival curve using the Kaplan Meier (KM) estimator, and then take the area under the curve to generate our estimate of the average lifespan of an account. KM incorporates the right censored data so that we're not throwing away information in the estimation. It is an iterative calculation that uses the censored observations to determine the at-risk population (n) right up until the time they are censored, then discards them.

**Kaplan Meier Estimator**

$$\hat{S}(t) = \prod_{i:\, t_i \le t}\left(1 - \frac{d_i}{n_i}\right)$$

Where:

- $\hat{S}(t)$ is the survival curve at time $t$
- $d_i$ is the number of accounts that closed at time $t_i$
- $n_i$ is the number of at-risk accounts, just prior to time $t_i$

The number of at-risk accounts is computed by taking the at-risk population from the period just before, minus the number of events in that period, minus the number of accounts censored in that period. It's here that we make full use of the censored data: a censored account stays in the calculation until the period after it is censored.
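To make the bookkeeping concrete, here is a minimal Python sketch of that recursion (the post's actual computations use SAS and R; the function name and toy data are my own). Each observation is a pair (t, censor) with censor = 1 when the account was still open at time t, matching the SAS convention used later.

```python
def kaplan_meier(observations):
    """observations: list of (t, censor) pairs; censor=1 means right censored."""
    n_at_risk = len(observations)  # everyone is at risk before the first period
    surv = 1.0
    curve = []
    for t in sorted({ti for ti, _ in observations}):
        events = sum(1 for ti, cen in observations if ti == t and cen == 0)
        censored = sum(1 for ti, cen in observations if ti == t and cen == 1)
        if events > 0:
            surv *= 1 - events / n_at_risk  # the curve only drops on events
            curve.append((t, surv))
        # censored accounts count toward n right up to the time they are
        # censored, then leave the at-risk pool for the next period
        n_at_risk -= events + censored
    return curve

toy = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 0), (7, 1)]
print(kaplan_meier(toy))  # drops at t = 1, 2, 4, 6; no drop at the censored t = 3
```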

Once the curve is computed using the KM estimator, the area under the curve gives the average life of an account. It's calculated as a weighted sum of the time intervals, where the weights are the survival curve probabilities.
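The area-under-the-curve step can be sketched the same way (again illustrative Python; `km_mean` and the toy curve are my own, not the post's code). The curve is flat between events, so each survival probability is carried forward and weighted by the gap until the next event time.

```python
def km_mean(curve):
    """curve: list of (event time, survival probability) pairs from a KM fit."""
    # survival is 1.0 from t=0 until the first event, so prepend the origin
    points = [(0, 1.0)] + list(curve)
    auc = 0.0
    for (t0, s0), (t1, _) in zip(points, points[1:]):
        # lagged survival probability times the width of the interval
        auc += s0 * (t1 - t0)
    return auc

# toy curve: 1.0 until t=2, then 0.9 until t=5, then 0.6 until t=10
print(km_mean([(2, 0.9), (5, 0.6), (10, 0.2)]))  # 1.0*2 + 0.9*3 + 0.6*5
```

Note that the sum stops at the last event time, so the final probability of 0.2 never gets weighted; the estimate is restricted to the observed window.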

**Kaplan Meier in SAS**

Now that the intuition has been established, let's walk through how to compute the result in SAS. All of the code and data from this blog post can be found on GitHub. For the sake of a reproducible example, let's generate random numbers from a Weibull distribution using an arbitrary seed, 8787. To make the data look more like a number of months, we'll scale the data by 12 and round it off to a whole month. Finally, we'll randomly censor 30% of the data before passing it to the LIFETEST Procedure.

There are three macro variables defined up front that you may want to adjust.

- seed: used to generate the same data from the Weibull distribution, and to censor the same observations each time.
- recordCount: adjust this to generate more datapoints and get a smoother curve
- censorProportion: to experiment with how censoring more or less of the data alters the results

```sas
%let seed = 8787;
%let recordCount = 100;
%let censorProportion = 0.30;

data randomdata;
    /* set seed for the two rand() functions */
    call streaminit(&seed.);
    do i = 1 to &recordCount.;
        /* generate random numbers from a Weibull dist shape=1, scale=2 */
        t = rand("WEIBULL", 1, 2);
        /* scale the random numbers and round them off to an even month */
        t = round(t * 12);
        /* initialize censor as zero, overwrite to 1 when censor threshold met */
        censor = 0;
        randCensor = rand("UNIFORM");
        if randCensor < &censorProportion. then censor = 1;
        output;
    end;
run;

proc lifetest data=randomdata
        plots=survival(cl)  /* surv prob plot with Conf Int */
        outsurv=KMoutput;   /* output survival probs */
    time t*censor(1);       /* the censor field takes on a 1 when the data is censored */
    title 'Account Lifespan';
run;
```

The first part of the output shown below answers the business question of the average lifespan of an account: 34.909 months, or about 3 years. The next piece is the Kaplan Meier curve with 95% confidence intervals around the survival probability. The interpretation of this chart is the probability of an account lasting beyond time *t*. For example, t=50 on the x-axis corresponds to a survival probability of about 0.25, so an account has a 25% chance of staying open for more than 50 months. With the confidence bands, we would say that we're 95% confident that the probability of surviving beyond 50 months is between about 0.15 and 0.35. The final table displays the proportion of data censored, which happened to hit exactly 30%, just as we declared upfront with the *censorProportion* variable.

While the proc lifetest output gives us the mean we were originally looking for, let's compute it manually too in order to strengthen the intuition. Using the *KMoutput* dataset from proc lifetest, we have a record for each time period in which an event occurred. However, to compute the area under the curve, we have to impute the survival probabilities between events to get a complete curve. Keep in mind that the survival curve doesn't drop when there is censoring, only when an event occurs, so we should repeat the probability for the non-event months until the next event occurs.

The first part of this data step computes the number of times to repeat a value. The KMoutput dataset does not explicitly state each time period under the curve, so we need to compute it ourselves; we'll call that variable *impliedRepetition*. The *impliedRepetition* variable is shifted to t+1, so we lag the survival probabilities to line them up properly with our repetition counter. Finally, we weight the time periods by the probability of occurrence and store those values in a variable called *auc*, short for area under the curve. The last proc means step sums the *auc* variable, which yields a value identical to the mean from the LIFETEST Procedure, 34.909.

```sas
data KMoutput;
    set KMoutput;
    /* number of times to repeat the last time t */
    impliedRepetition = t - lag(t);
    /* use a shifted Survival probability */
    lagSURVIVAL = lag(SURVIVAL);
    auc = lagSURVIVAL * impliedRepetition;
run;

proc print data=KMoutput;
run;

/* find the area under the curve, auc */
proc means sum data=KMoutput;
    var auc;
run;
```

To see exactly what is happening with this calculation, look at Obs 10 and 11. The time t variable skips over time 5, meaning that we need to repeat the value for time 4 twice. That is, 0.89724 should also be the value used for time 5. However, the *impliedRepetition* variable that corresponds to repeating time 4 twice is sitting on Obs 11, so we lag *SURVIVAL* to align these cleanly. Summing the last column gives the area under the curve.

**First 20 Records of KMoutput**

**Kaplan Meier in R**

Let's compute the average lifespan of an account in R as well. Rather than recreate the random data from the Weibull distribution, let's just read the same randomdata SAS dataset into R.

```r
library(sas7bdat)
accounts <- read.sas7bdat("/home/ben/SAS/SAS-University-Edition/mysharedfolder/randomdata.sas7bdat")

# print the first 6 records
head(accounts)
  i  t censor randCensor
1 1  7      1 0.80070742
2 2 18      0 0.21298203
3 3  0      0 0.03259109
4 4  7      1 0.88587602
5 5 23      1 0.70753106
6 6 15      0 0.25643528
```

The LIFETEST Procedure expects the censoring variable to be 1 when censored and 0 when not. In R, the Surv() function that creates a survival object treats 0 as censored and 1 as the event (account closure). So we need to flip the 1's and 0's in our censor column.

```r
head(accounts$censor)
[1] 1 0 0 1 1 0

accounts$censor <- (accounts$censor - 1) * -1
head(accounts$censor)
[1] 0 1 1 0 0 1
```

Now we're ready to build the survival object with the Surv() function and estimate the Kaplan Meier survival curve with survfit(). Both are from the survival package. The model and results are stored in the *s* object.

```r
library(survival)
s <- survfit(formula = Surv(time = accounts$t, accounts$censor, type = "right") ~ 1,
             conf.type = c("plain"))
```

Looking at the available data in the model object *s*, I don't see the mean value precomputed like we had in SAS. Let's compute it manually again.

```r
# no precomputed area under the curve in the model object
attributes(s)
$names
 [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  "surv"      "type"
 [8] "std.err"   "upper"     "lower"     "conf.type" "conf.int"  "call"

# manually compute the average
for (i in 1:(length(s$time) - 1)) {
    # next time t minus current time t
    s$impliedRepetition[i] <- s$time[i+1] - s$time[i]
    # auc are the weighted time periods
    s$auc[i] <- s$impliedRepetition[i] * s$surv[i]
}

# identical to proc lifetest
sum(s$auc)
34.90933
```

To get a quick visual of the survival probabilities and confidence intervals, just plot the model object. It's not the most aesthetically appealing chart, but it's quick and memorable code.

```r
plot(s)
```

To see all of this code in a handful of compact scripts, go to my GitHub account page. Both SAS datasets are available too: the random Weibull input data, and the proc lifetest output data.

**Final Thoughts**

We rushed ahead with the Kaplan Meier estimator without mentioning the assumptions we were making. One key assumption is that censoring is independent of survival. While we applied the censoring randomly with a uniform distribution in this example, we may encounter a different pattern in practice. Conceding that every business process has its nuances, I would suspect that new accounts are less sticky than older, more established ones, so I would expect a lot less censoring for the new accounts. Also, Zhong and Hess found that longer-tailed distributions are more sensitive to bias when there is right censoring, so the shape of the curve matters too.

**Resources**

Aside from the other links peppered throughout the post, I found these resources helpful while writing it:

- Slides from Joe Ibrahim's ASA seminar on survival analysis
- NIH primer on survival analysis
- Chapter 2 slides from a statistics course at NC State, Analysis of Survival Data