We're going to create survival estimates using KMSurvival
on the same subscribers data as in SAS above. Then we'll compare both estimates to see if they are equal. After that, more comparisions on hierarchical strata are given.
import pandas as pd
from kmsurvival import KMSurvival, plot_right_censor
pd.set_option('precision', 3)
%matplotlib inline
df = pd.read_csv('censored_start_stop.txt', sep='\t',
parse_dates=['start_date', 'stop_date'])
df.head()
kms = KMSurvival(col_censored = 'censored',
col_start_date='start_date',
col_stop_date='stop_date')
cutoff = '2008-12-28'
estimates = kms.fit(df, cutoff)
estimates.head()
kms.plot()
# For comparing with the output of SAS PROC LIFETEST
estimates.to_csv('km_estimates_py.csv', header=True,
index_label='tenure', float_format='%.6f')
We use the same subscribers data and create survival estimates using SAS PROC LIFETEST
and KMSurvial
. Now let's see if both are equal.
class Display(object):
"""Display multiple objects horizontally"""
template = """<div style="float: left; padding: 6px;">
<p>{0}</p><font size=2>{1}</font></div>
"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
df_sas = pd.read_sas("survival.sas7bdat") # read SAS survival estimates
df_py = pd.read_csv("km_estimates_py.csv") # read KMSurvial estimates
Display('df_sas.head()', 'df_py.head()')
df_sas[df_sas['_CENSOR_'].isnull()]
For df_sas:
For df_py:
df_sas = df_sas[df_sas['_CENSOR_'].notnull()]
df_sas.rename(columns={'SURVIVAL': 'survival', '_CENSOR_': 'censor'},
inplace=True)
df_sas = df_sas[['tenure', 'survival', 'censor']]
df_py = df_py[['tenure', 'survival']]
Display('df_sas.head(3)', 'df_py.head(3)')
print(df_sas.shape, df_py.shape)
The differences are probably because SAS's estimates include the 'censor' column. Let's try these:
grp_sas = df_sas.groupby(['tenure'])
for tenure, group in grp_sas: # group is a DataFrame
if len(group) > 1:
if group['survival'].unique().size == 1:
continue
else:
print('Found multiple survival estimates for a tenure!\n'
'Tenure:{}\n{}'.format(tenure, group))
There's one outlier! It seems safe to drop the row with missing value (index 596).
df_sas.index.get_loc(596)
df_sas.drop(596, inplace=True)
df_sas.drop(['censor'], axis=1, inplace=True)
Display('df_sas.head()', 'df_py.head()', 'df_sas.tail()', 'df_py.tail()')
Check missing values
df_sas[df_sas.survival.isnull()].index
df_py[df_py.survival.isnull()].index
We can see:
df_sas.ix[593:598]
Let's take a look at the survival curve created by SAS.
from IPython.display import Image
Image(filename='SAS_SurvivalPlot.png')
We see the tail is flat, which means the survival values of index from 597 to 637 equal to index 595's. So we do these:
df_sas.ix[:, 'survival'].fillna(method='ffill', inplace=True)
df_sas.drop_duplicates(inplace=True)
Display('df_sas.head(3)', 'df_py.head(3)', 'df_sas.tail(3)', 'df_py.tail(3)')
print(df_sas.shape, df_py.shape)
The two look like much similar now. Let's tune them a little more.
df_sas['tenure'] = df_sas.tenure.astype(int)
df_py.set_index('tenure', inplace=True)
df_sas.set_index('tenure', inplace=True)
Display('df_sas.head(3)', 'df_py.head(3)', 'df_sas.tail(3)', 'df_py.tail(3)')
They are much alike now. Let's check it.
print("Indices equal? -- {}\n"
"Survival values equal? -- {}".format(
df_sas.index.equals(df_py.index),
df_sas['survival'].equals(df_py['survival'])))
Why not equal? Probably it's because of very small differences between float values.
diff = (df_sas - df_py); diff.columns = ['survival_diff']
diff.head(3)
Let's assume if the difference of two values is less than 1E-6, we say they're equal.
threshold = 1e-6
if (diff < threshold).all().values:
print('Survial estimates from SAS and KMSurvival are equal!')
else:
print('Not euqal!')
We've compared the two quantatively by data exploring and cleaning. Now let's compare them qualitatively.
KM estimation can be used to compare different groups of customers by creating a separate curve for each group. This process, called stratification, qualitatively shows the effect of different factors, or a combination of them, on survival.
Compare the subscribers' survivals in different markets.
# KMSurvival
cutoff = '2008-12-28'
group = ['market']
kms.fit(df, cutoff, group)
kms.plot(vlines=[365])
# SAS
Image(filename='SAS_SurvivalPlot_market.png')
From the survival curves we can see the product performs better at Market-A than at Market-B, and Market-A's subscriptions drop dramtically at tenure 365 days.
# KMSurvial
group = ['market', 'channel']
kms.fit(df, cutoff, group)
kms.plot()
# SAS
Image(filename='SAS_SurvivalPlot_market_channel.png')
The figures from SAS and KMSurvival look like very similar.
Now let's do something that SAS doesn't support. Suppose we only want to compare Channel-C2, Channel-C3 of Market-A.
kms.subgrps_
kms.plot(strata=[['Market-A'], ['Channel-C2', 'Channel-C3']])
# hazard curve of Channel-C3 and Channel-C1 in Market-B
kms.plot(curve='h', strata=[['Market-B'], ['Channel-C1', 'Channel-C3']],
style='ggplot', vlines=[180, 365], lw=0.7)
We have compared between PROC LIFETEST
and KMSurvival
in Jupyter Notebook. The output of the two is almost same. And we can see Jupyter Notebook is great for packaging code, results and documentations in the notebook format. It also makes sharing the process of data analysis more convenient.
Thank for reading!