KMSurvival in Jupyter Notebook¶

We're going to create survival estimates using KMSurvival on the same subscribers data as in SAS above. Then we'll compare both estimates to see if they are equal. After that, more comparisions on hierarchical strata are given.

import pandas as pd
from kmsurvival import KMSurvival, plot_right_censor

pd.set_option('precision', 3)
%matplotlib inline

df = pd.read_csv('censored_start_stop.txt', sep='\t', 
                 parse_dates=['start_date', 'stop_date'])
df.head()

kms = KMSurvival(col_censored = 'censored',
                 col_start_date='start_date',
                 col_stop_date='stop_date')

cutoff = '2008-12-28'
estimates = kms.fit(df, cutoff)
estimates.head()

kms.plot()

# For comparing with the output of SAS PROC LIFETEST
estimates.to_csv('km_estimates_py.csv', header=True, 
                 index_label='tenure', float_format='%.6f')

Compare survival estimates between SAS and KMSurvival¶

We use the same subscribers data and create survival estimates using SAS PROC LIFETEST and KMSurvial. Now let's see if both are equal.

class Display(object):
    """Display multiple objects horizontally"""    
    template = """<div style="float: left; padding: 6px;">
                  <p>{0}</p><font size=2>{1}</font></div>
               """
    
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

df_sas = pd.read_sas("survival.sas7bdat")  # read SAS survival estimates
df_py = pd.read_csv("km_estimates_py.csv") # read KMSurvial estimates

Display('df_sas.head()', 'df_py.head()')

df_sas[df_sas['_CENSOR_'].isnull()]

For df_sas:

drop the row whose _CENSOR_ is missing
rename columns
drop the the last two columns (95% confidence intervals)

For df_py:

keep columns 'tenure', 'survival', and drop others

df_sas = df_sas[df_sas['_CENSOR_'].notnull()]
df_sas.rename(columns={'SURVIVAL': 'survival', '_CENSOR_': 'censor'}, 
              inplace=True)
df_sas = df_sas[['tenure', 'survival', 'censor']]

df_py = df_py[['tenure', 'survival']]
Display('df_sas.head(3)', 'df_py.head(3)')

print(df_sas.shape, df_py.shape)

(637, 3) (392, 2)

The differences are probably because SAS's estimates include the 'censor' column. Let's try these:

Find if survivals for the same tenures but different censors are equal.
If so, we can drop the column 'censor'.

grp_sas = df_sas.groupby(['tenure'])
for tenure, group in grp_sas:  # group is a DataFrame
    if len(group) > 1:
        if group['survival'].unique().size == 1:
            continue
        else:
            print('Found multiple survival estimates for a tenure!\n'
                  'Tenure:{}\n{}'.format(tenure, group))

Found multiple survival estimates for a tenure!
Tenure:553.0
     tenure  survival  censor
595   553.0     0.465     0.0
596   553.0       NaN     1.0

There's one outlier! It seems safe to drop the row with missing value (index 596).

df_sas.index.get_loc(596)

595

df_sas.drop(596, inplace=True)

df_sas.drop(['censor'], axis=1, inplace=True)

Display('df_sas.head()', 'df_py.head()', 'df_sas.tail()', 'df_py.tail()')

Check missing values

df_sas[df_sas.survival.isnull()].index

Int64Index([597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609,
            610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622,
            623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635,
            636, 637],
           dtype='int64')

df_py[df_py.survival.isnull()].index

Int64Index([], dtype='int64')

We can see:

df_py: no missing values
df_sas: the survival values are missing since index 597. Let's check a few values before it.

df_sas.ix[593:598]

Let's take a look at the survival curve created by SAS.

from IPython.display import Image
Image(filename='SAS_SurvivalPlot.png')

We see the tail is flat, which means the survival values of index from 597 to 637 equal to index 595's. So we do these:

forward-fill these missing values
then drop duplicate rows

df_sas.ix[:, 'survival'].fillna(method='ffill', inplace=True)
df_sas.drop_duplicates(inplace=True)

Display('df_sas.head(3)', 'df_py.head(3)', 'df_sas.tail(3)', 'df_py.tail(3)')

print(df_sas.shape, df_py.shape)

(392, 2) (392, 2)

The two look like much similar now. Let's tune them a little more.

df_sas['tenure'] = df_sas.tenure.astype(int)
df_py.set_index('tenure', inplace=True)
df_sas.set_index('tenure', inplace=True)

Display('df_sas.head(3)', 'df_py.head(3)', 'df_sas.tail(3)', 'df_py.tail(3)')

They are much alike now. Let's check it.

print("Indices equal? -- {}\n"      
      "Survival values equal? -- {}".format(
        df_sas.index.equals(df_py.index),
        df_sas['survival'].equals(df_py['survival'])))

Indices equal? -- True
Survival values equal? -- False

Why not equal? Probably it's because of very small differences between float values.

diff = (df_sas - df_py); diff.columns = ['survival_diff']
diff.head(3)

Let's assume if the difference of two values is less than 1E-6, we say they're equal.

threshold = 1e-6 
if (diff < threshold).all().values:
    print('Survial estimates from SAS and KMSurvival are equal!')
else:
    print('Not euqal!')

Survial estimates from SAS and KMSurvival are equal!

Qualitative comparisions¶

We've compared the two quantatively by data exploring and cleaning. Now let's compare them qualitatively.

KM estimation can be used to compare different groups of customers by creating a separate curve for each group. This process, called stratification, qualitatively shows the effect of different factors, or a combination of them, on survival.

One-level strata¶

Compare the subscribers' survivals in different markets.

# KMSurvival
cutoff = '2008-12-28'
group = ['market']
kms.fit(df, cutoff, group)
kms.plot(vlines=[365])

# SAS
Image(filename='SAS_SurvivalPlot_market.png')

From the survival curves we can see the product performs better at Market-A than at Market-B, and Market-A's subscriptions drop dramtically at tenure 365 days.

Two-level hierarchical strata¶

# KMSurvial
group = ['market', 'channel']
kms.fit(df, cutoff, group)
kms.plot()

# SAS
Image(filename='SAS_SurvivalPlot_market_channel.png')

The figures from SAS and KMSurvival look like very similar.

Flexible combinations of strata¶

Now let's do something that SAS doesn't support. Suppose we only want to compare Channel-C2, Channel-C3 of Market-A.

kms.subgrps_

OrderedDict([('market', ('Market-B', 'Market-A')),
             ('channel', ('Channel-C3', 'Channel-C1', 'Channel-C2'))])

kms.plot(strata=[['Market-A'], ['Channel-C2', 'Channel-C3']])

# hazard curve of Channel-C3 and Channel-C1 in Market-B
kms.plot(curve='h', strata=[['Market-B'], ['Channel-C1', 'Channel-C3']],
         style='ggplot', vlines=[180, 365], lw=0.7)

Summary¶

We have compared between PROC LIFETEST and KMSurvival in Jupyter Notebook. The output of the two is almost same. And we can see Jupyter Notebook is great for packaging code, results and documentations in the notebook format. It also makes sharing the process of data analysis more convenient.

Thank for reading!

	user_id	market	channel	start_date	stop_date	censored
0	2488577	Market-B	Channel-C3	2008-03-28	NaT	1
1	2574285	Market-A	Channel-C1	2008-05-12	NaT	1
2	2565957	Market-B	Channel-C2	2008-05-07	NaT	1
3	2201921	Market-B	Channel-C3	2007-12-11	NaT	1
4	2776956	Market-A	Channel-C2	2008-09-18	NaT	1

estimates_	survival	hazard	cnt_uncensored	cnt_censored	cutoff
_tenure_
1	0.994	0.006	4	0	2008-12-28
2	0.987	0.007	5	1	2008-12-28
3	0.986	0.001	1	0	2008-12-28
4	0.982	0.004	3	1	2008-12-28
5	0.979	0.003	2	4	2008-12-28

	tenure	_CENSOR_	SURVIVAL	SDF_LCL	SDF_UCL
0	0.0	NaN	1.000	1.000	1.000
1	1.0	0.0	0.994	0.985	0.998
2	2.0	0.0	0.987	0.976	0.993
3	2.0	1.0	0.987	NaN	NaN
4	3.0	0.0	0.986	0.974	0.992

	tenure	survival	hazard	cnt_uncensored	cnt_censored	cutoff
0	1	0.994	0.006	4	0	2008-12-28
1	2	0.987	0.007	5	1	2008-12-28
2	3	0.986	0.001	1	0	2008-12-28
3	4	0.982	0.004	3	1	2008-12-28
4	5	0.979	0.003	2	4	2008-12-28

	tenure	survival
0	1	0.994
1	2	0.987
2	3	0.986

	tenure	survival
593	552.0	0.476
594	552.0	0.476
595	553.0	0.465
597	554.0	NaN
598	556.0	NaN

	tenure	survival
633	598.0	NaN
634	601.0	NaN
635	601.0	NaN
636	602.0	NaN
637	602.0	NaN

	tenure	survival
632	598.0	0.465
634	601.0	0.465
636	602.0	0.465

	survival_diff
tenure
1	-9.818e-08
2	2.791e-07
3	-2.405e-07