DataThon!

Data for Good, teaming up with Raising the Roof, held a DataThon on Saturday in Toronto. I was glad to take part in it. It was 12 hours of data crunching on about 20 data sets. In this post, we'll deal with one of them -- a donation Excel file. We're going to explore and clean it, create a new feature, and finally do a donation frequency analysis.

Take a look at the donation data

In [1]:
import numpy as np
import pandas as pd
In [2]:
class Display(object):
    """Display multiple objects horizontally"""    
    template = """<div style="float: left; padding: 10px;">
                  <p>{0}</p><font size=2>{1}</font></div>
               """    
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
In [3]:
df = pd.read_excel("donation.xlsx")
df.head()
Out[3]:
Expr1000 ID IncomeID FundAccount IncomeType Payment Date Payment Amount Payment Method Prefix City1 StateOrProvince1 Postal Code Area1 TotalGifts NumberOfGifts
0 1 1 2419 TQN TQS May-10-2001 25.0 Cheque Ms. Ottawa YT K1R 5M8 NaN 540.01 8
1 1 1 63 X X NaN 75.0 Cheque Ms. Ottawa YT K1R 5M8 NaN 540.01 8
2 1 1 7996 TP TPE NaN 0.0 Cheque Ms. Ottawa YT K1R 5M8 NaN 540.01 8
3 1 1 2891 GF DN NaN 50.0 Unknown Ms. Ottawa YT K1R 5M8 NaN 540.01 8
4 1 1 8325 TP HW NaN 0.0 Money order Ms. Ottawa YT K1R 5M8 NaN 540.01 8
In [4]:
df.shape
Out[4]:
(9635, 15)
In [5]:
df.loc[df.Expr1000 != df.ID].shape
Out[5]:
(0, 15)

The columns Expr1000 and ID have the same values, so we use ID to identify donors.

For this post, we only use three columns as shown below.

In [6]:
df = df[['ID', 'Prefix', 'Payment Date']]
df.columns = ['id', 'prefix', 'pay_date']  # rename columns
df.head()
Out[6]:
id prefix pay_date
0 1 Ms. May-10-2001
1 1 Ms. NaN
2 1 Ms. NaN
3 1 Ms. NaN
4 1 Ms. NaN
In [7]:
df.dtypes
Out[7]:
id           int64
prefix      object
pay_date    object
dtype: object

Clean the column prefix

In [8]:
prefix = df.prefix.value_counts(dropna=False)
print("There are {} prefixes:\n\n{}".format(prefix.size, prefix))
There are 26 prefixes:

Mr.             3472
Ms.             3443
NaN             2090
Mrs.             369
Dr.               83
Mr. and Mrs.      58
Mr. & Mrs.        53
Ms                24
Rev.               9
Mr                 9
Miss               7
996                3
Mr. & Ms.          2
Mr & Mrs.          1
Miss.              1
MS                 1
Rabi               1
Ms.'               1
Father             1
Capt.              1
Hon.               1
Mrs                1
 Leora             1
Dr. & Mrs.         1
Mr. & Mrs          1
Mr.&Mrs.           1
Name: prefix, dtype: int64

The data is dirty in the sense that one group has multiple spellings, e.g., 'Mr.' and 'Mr'. We could write a Python dictionary to map every spelling to one of a few groups, but that is tedious and not flexible, as the sketch below shows; on the other hand, one big regular expression covering all the patterns would be error-prone and hard to read.
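For illustration only, such a dictionary might start like this. It is hypothetical and incomplete: every single spelling in the value counts above needs its own entry, and any new spelling in future data would fall through.

# Hypothetical sketch of the dictionary approach -- one key per raw spelling.
prefix_map = {
    'Mr.': 'M', 'Mr': 'M', 'Father': 'M',
    'Ms.': 'F', 'Ms': 'F', 'MS': 'F', 'Mrs.': 'F', 'Mrs': 'F', 'Miss': 'F',
    'Mr. and Mrs.': 'Couple', 'Mr. & Mrs.': 'Couple', 'Mr. & Ms.': 'Couple',
    # ... and so on for every remaining spelling
}
# df['prefix'] = df['prefix'].map(prefix_map)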

Nothing is particularly hard if you divide it into small jobs.

-- Henry Ford

We'll use a pipeline of regular expressions to process the prefix column step by step.

Step 1. Replace missing values with Unknown.

In [9]:
df['prefix'].fillna('Unknown', inplace=True)

Step 2. Map strings starting with 'D' or 'M' and containing '&' or 'and' to 'Couple'

In [10]:
df['prefix'].replace(r'^[DM].*(and|&).*', 'Couple', regex=True, inplace=True)
df.prefix.value_counts(dropna=False).sort_index()
Out[10]:
 Leora        1
996           3
Capt.         1
Couple      117
Dr.          83
Father        1
Hon.          1
MS            1
Miss          7
Miss.         1
Mr            9
Mr.        3472
Mrs           1
Mrs.        369
Ms           24
Ms.        3443
Ms.'          1
Rabi          1
Rev.          9
Unknown    2090
Name: prefix, dtype: int64

Step 3. Map strings beginning with 'M' and containing 's' or 'S' to 'F' (Female)

In [11]:
df.prefix.replace(r'^M.*[sS].*', 'F', regex=True, inplace=True)
df.prefix.value_counts(dropna=False).sort_index()
Out[11]:
 Leora        1
996           3
Capt.         1
Couple      117
Dr.          83
F          3847
Father        1
Hon.          1
Mr            9
Mr.        3472
Rabi          1
Rev.          9
Unknown    2090
Name: prefix, dtype: int64

Step 4. Map strings beginning with 'Mr' or 'Fa' ('Mr', 'Mr.', and 'Father' here) to 'M' (Male)

In [12]:
df.prefix.replace(r'^(Mr|Fa).*', 'M', regex=True, inplace=True)
df.prefix.value_counts().sort_index()
Out[12]:
 Leora        1
996           3
Capt.         1
Couple      117
Dr.          83
F          3847
Hon.          1
M          3482
Rabi          1
Rev.          9
Unknown    2090
Name: prefix, dtype: int64

Step 5. Finally, map all remaining strings to 'Other'

In [13]:
df.prefix.replace(r'^(?!Couple|[FMU]).*', 'Other', regex=True, inplace=True)
df.prefix.value_counts().sort_index()
Out[13]:
Couple      117
F          3847
M          3482
Other        99
Unknown    2090
Name: prefix, dtype: int64
In [14]:
df.prefix.value_counts().sum() == df.shape[0]
Out[14]:
True
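As an aside, once the steps are settled, the whole pipeline can be collected into one small reusable function. A sketch (clean_prefix is a hypothetical helper that simply chains the five replacements above without modifying the input in place):

def clean_prefix(s):
    """Map raw prefix strings to 'Couple', 'F', 'M', 'Other', or 'Unknown'."""
    return (s.fillna('Unknown')
             .replace(r'^[DM].*(and|&).*', 'Couple', regex=True)
             .replace(r'^M.*[sS].*', 'F', regex=True)
             .replace(r'^(Mr|Fa).*', 'M', regex=True)
             .replace(r'^(?!Couple|[FMU]).*', 'Other', regex=True))

# equivalent to steps 1-5: df['prefix'] = clean_prefix(df['prefix'])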
In [15]:
df.isnull().sum()
Out[15]:
id             0
prefix         0
pay_date    1169
dtype: int64

We also see there are 1169 missing values in pay_date. Let's deal with them.

Convert the column pay_date to datetime

First, we're going to convert the column 'pay_date' from object to datetime.

In [16]:
pd.to_datetime(df.pay_date, format='%b-%d-%Y', errors='coerce').isnull().sum()
Out[16]:
1202

We can see that, besides the 1169 missing values, pandas converts another 33 (= 1202 - 1169) invalid strings to null. Let's check what they are.

In [17]:
try:
    pd.to_datetime(df.pay_date, format='%b-%d-%Y')
except Exception as inst:
    print(inst)
time data 'Feb-18-200' does not match format '%b-%d-%Y' (match)

The string 'Feb-18-200' has an invalid year. Let's see how many records like this there are.

In [18]:
filter_wrong_year = df.pay_date.notnull() & (df.pay_date.str.len() == 10)
wrong_year = df[filter_wrong_year]
print(wrong_year.head())
wrong_year.shape
        id prefix    pay_date
67      27      M  Feb-18-200
276    149      F  Feb-11-101
1277  1003      M  Mar-20-101
1625  1307      M  Jan-10-100
1908  1642      F  Feb-04-100
Out[18]:
(28, 3)

We have found 28 of the 33 invalid non-null strings. What are the remaining 5 (= 33 - 28)?

In [19]:
filter_5 = df.pay_date.notnull() & (df.pay_date.str.len() != 10)
try:
    pd.to_datetime(df.loc[filter_5, 'pay_date'], format='%b-%d-%Y')
except Exception as inst:
    print(inst)
Out of bounds nanosecond timestamp: 8201-06-09 00:00:00

The year is 8201!

In [20]:
pd.Timestamp.min, pd.Timestamp.max
Out[20]:
(Timestamp('1677-09-21 00:12:43.145225'),
 Timestamp('2262-04-11 23:47:16.854775807'))

Why these two particular datetimes? Pandas stores a Timestamp as a count of nanoseconds since the Unix epoch in a signed 64-bit integer, which limits the representable range.
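A quick back-of-the-envelope check of that limit (nothing specific to this data set):

# 2**63 nanoseconds on each side of 1970-01-01 is roughly 292 years.
span_years = 2**63 / 1e9 / 3600 / 24 / 365.25
print(span_years)                                     # ~292.3
print(pd.Timestamp.min.year, pd.Timestamp.max.year)   # 1677 2262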

Let's check how many dates have years outside of pandas' valid range.

In [21]:
df[(df.pay_date.str.len() != 10) & 
   ((df.pay_date.str.slice(7, 11) > '2262') | 
    (df.pay_date.str.slice(7, 11) < '1677'))]
Out[21]:
id prefix pay_date
1440 1130 Unknown Jun-09-8201
3640 3061 F Nov-24-3003
4085 3303 M Aug-31-1201
7577 5604 M Jan-17-1201
8957 9704 Unknown Apr-04-4201

There are exactly 5 strings with years outside the valid range. Having accounted for all 33 invalid date strings, we can now convert the column with confidence, as shown below.

In [22]:
df['pay_date'] = pd.to_datetime(df.pay_date, format='%b-%d-%Y', errors='coerce')

Pandas (and other tools) can't do everything for us; we still need a further data integrity check. Because Raising the Roof was launched in 1996, payment dates earlier than 1996 or later than the data cutoff date should be treated as invalid.

In [23]:
min_date = pd.to_datetime('1996-01-01')
max_date = pd.to_datetime('2016-12-03') # we suppose today is the cutoff date

filter_biz = (df.pay_date < min_date) | (df.pay_date > max_date)
num_invalid_biz = df[filter_biz].shape[0]
print("There are {} dates violateing business rules.".format(num_invalid_biz))
Display('df[filter_biz].head()', 'df[filter_biz].tail()')
There are 826 dates violating business rules.
Out[23]:

df[filter_biz].head()

id prefix pay_date
10 1 F 1900-11-20
11 1 F 1900-11-20
19 3 Unknown 1900-12-01
20 3 Unknown 1900-03-03
45 22 F 1900-12-21

df[filter_biz].tail()

id prefix pay_date
3264 2831 Unknown 1901-02-02
3265 2832 F 1901-02-05
3267 2833 F 1901-08-13
7569 5604 M 2070-09-24
9550 10625 Unknown 2106-05-30
In [24]:
num_null_before = df.pay_date.isnull().sum()
num_null_before
Out[24]:
1202

We change the dates violating business rules to missing values.

In [25]:
df.loc[filter_biz, ['pay_date']] = pd.NaT
In [26]:
df.sort_values('pay_date', inplace=True)
Display('df[df.pay_date.notnull()].head()', 'df[df.pay_date.notnull()].tail()')
Out[26]:

df[df.pay_date.notnull()].head()

id prefix pay_date
510 290 F 1997-11-09
561 313 Unknown 1998-01-01
517 290 F 1998-02-03
518 291 M 1998-02-24
479 262 M 1998-03-01

df[df.pay_date.notnull()].tail()

id prefix pay_date
9632 10732 F 2016-07-28
7248 5390 M 2016-07-31
8199 8512 M 2016-07-31
8933 9682 Unknown 2016-08-28
9634 10734 M 2016-09-08
In [27]:
num_null_before, num_invalid_biz, df.pay_date.isnull().sum()
Out[27]:
(1202, 826, 2028)
In [28]:
df.isnull().sum()
Out[28]:
id             0
prefix         0
pay_date    2028
dtype: int64

The number of nulls in pay_date grows from 1202 to 2028 after the additional 826 dates are set to NaT (1202 + 826 = 2028).

Create a new column 'active' to indicate whether a donor is active

The business rule for counting active donors: a donor remains active for one year after a donation.

In [29]:
min_active_date = df.pay_date.max() - pd.Timedelta('365 days')
min_active_date
Out[29]:
Timestamp('2015-09-09 00:00:00')

We treat a donor as active if her/his last donation date is later than 2015-09-08.

1: active; 0: not active

In [30]:
df['active'] = np.where(df['pay_date'] >= min_active_date, 1, 0)
Display('df.head()', 'df.tail()', 'df[df.pay_date.notnull()].tail()')
Out[30]:

df.head()

id prefix pay_date active
510 290 F 1997-11-09 0
561 313 Unknown 1998-01-01 0
517 290 F 1998-02-03 0
518 291 M 1998-02-24 0
479 262 M 1998-03-01 0

df.tail()

id prefix pay_date active
8737 9465 Unknown NaT 0
8957 9704 Unknown NaT 0
9343 10320 Unknown NaT 0
9550 10625 Unknown NaT 0
9597 10674 F NaT 0

df[df.pay_date.notnull()].tail()

id prefix pay_date active
9632 10732 F 2016-07-28 1
7248 5390 M 2016-07-31 1
8199 8512 M 2016-07-31 1
8933 9682 Unknown 2016-08-28 1
9634 10734 M 2016-09-08 1
In [31]:
df.active.value_counts()
Out[31]:
0    9307
1     328
Name: active, dtype: int64
In [32]:
df.pay_date.isnull().sum()
Out[32]:
2028

What about these 2028 missing dates?

For rows with a null pay_date, we assign the active flag at random, drawing 0 or 1 from a discrete uniform distribution.

In [33]:
np.random.seed(0)
nr = np.random.randint(2, size=df.shape[0])
In [34]:
df.active.where(df.pay_date.notnull(), nr, inplace=True)
In [35]:
df.active.value_counts()
Out[35]:
0    8248
1    1387
Name: active, dtype: int64

We can see that 1387 is roughly equal to 328 + 2028/2 = 1342, which means the values (0 or 1) assigned to rows with null pay_date are distributed almost uniformly.

In [36]:
Display('df.head()', 'df.tail()')
Out[36]:

df.head()

id prefix pay_date active
510 290 F 1997-11-09 0
561 313 Unknown 1998-01-01 0
517 290 F 1998-02-03 0
518 291 M 1998-02-24 0
479 262 M 1998-03-01 0

df.tail()

id prefix pay_date active
8737 9465 Unknown NaT 0
8957 9704 Unknown NaT 0
9343 10320 Unknown NaT 1
9550 10625 Unknown NaT 0
9597 10674 F NaT 0

Donation frequency analysis

Some donors donate many times; some only once. What percentage of donors donate once, twice, or N times? This is actually a time-to-event (also called survival analysis) question: a donor's number of donations plays the role of a tenure, and active donors are censored because they may still donate again. Let's try to answer it.

In [37]:
donation_freq = df.groupby(['id'])
donation_freq
Out[37]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000021E01972278>
In [38]:
rows = []
for donor_id, group in donation_freq:  # group is DataFrame
    rows.append((donor_id, group.shape[0], group.prefix.values[0], group.active.max(), 
                 pd.NaT if group.pay_date.isnull().any() else group.pay_date.max()))
rows[:5]
Out[38]:
[(1, 17, 'F', 1, NaT),
 (2, 1, 'Unknown', 1, NaT),
 (3, 3, 'Unknown', 1, NaT),
 (4, 3, 'F', 1, NaT),
 (5, 4, 'F', 1, NaT)]
In [39]:
df_freq = pd.DataFrame(rows, columns=['donor_id', 'freq', 'group','active', 'last_pay_date'])
df_freq.sort_values('freq', ascending=False, inplace=True)
Display('df_freq.head()', 'df_freq.tail()')
Out[39]:

df_freq.head()

donor_id freq group active last_pay_date
4665 5604 40 M 1 NaT
2366 3156 38 Unknown 1 NaT
2085 2837 26 Unknown 1 NaT
550 999 25 Unknown 1 NaT
46 70 24 M 1 NaT

df_freq.tail()

donor_id freq group active last_pay_date
2306 3077 1 M 0 2003-11-24
2305 3076 1 M 0 2003-11-24
2304 3075 1 M 0 2003-11-24
2303 3074 1 F 0 2003-11-24
6322 10734 1 M 1 2016-09-08
In [40]:
df_freq.shape
Out[40]:
(6323, 5)
In [41]:
df_freq.freq.sum() == df.shape[0]
Out[41]:
True
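As an aside, the loop in In [38] can be replaced by a single vectorized aggregation. A sketch, assuming a pandas version that supports named aggregation (0.25+) and the same column names as df:

# Vectorized equivalent of the groupby loop above (df_freq_alt is hypothetical).
df_freq_alt = (df.groupby('id')
                 .agg(freq=('prefix', 'size'),
                      group=('prefix', 'first'),
                      active=('active', 'max'),
                      last_pay_date=('pay_date',
                                     lambda s: pd.NaT if s.isnull().any() else s.max()))
                 .reset_index()
                 .rename(columns={'id': 'donor_id'}))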

We'll use a survival analysis package called kmsurvival to do the donation frequency analysis.

$ pip install kmsurvival

In [42]:
from kmsurvival import KMSurvival
%matplotlib inline
In [43]:
cutoff = df_freq.last_pay_date.max().strftime('%Y-%m-%d')
kms = KMSurvival(col_censored='active',
                 col_tenure='freq')
estimates = kms.fit(df_freq, cutoff)
Display('estimates.head()', 'estimates.tail()')
Out[43]:

estimates.head()

estimates_ survival hazard cnt_uncensored cnt_censored cutoff
_tenure_
1 0.337814 0.662186 4187 877 2016-09-08
2 0.206338 0.389198 490 154 2016-09-08
3 0.150979 0.268293 165 82 2016-09-08
4 0.120209 0.203804 75 43 2016-09-08
5 0.096167 0.200000 50 22 2016-09-08

estimates.tail()

estimates_ survival hazard cnt_uncensored cnt_censored cutoff
_tenure_
24 0.025961 0.166667 1 1 2016-09-08
25 0.025961 0.000000 0 1 2016-09-08
26 0.025961 0.000000 0 1 2016-09-08
38 0.025961 0.000000 0 1 2016-09-08
40 0.025961 0.000000 0 1 2016-09-08
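If kmsurvival implements the usual Kaplan-Meier product-limit estimator, as its name and columns suggest (an assumption on my part), then each survival value equals the previous one times (1 - hazard), which we can spot-check against the table above:

# Assumed Kaplan-Meier relation: S(n) = S(n-1) * (1 - h(n)), with S(0) = 1
s1 = 1.0 * (1 - 0.662186)   # ~0.3378, the survival value at freq 1
s2 = s1 * (1 - 0.389198)    # ~0.2063, the survival value at freq 2 (up to rounding)
print(s1, s2)

Read this way, survival at frequency n estimates the fraction of donors who donate more than n times.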
In [44]:
kms.cutoff
Out[44]:
'2016-09-08'
In [45]:
title='Donation Frequency Curve'
xlabel='Frequency'
kms.plot(title=title, 
         xlabel=xlabel, 
         vlines=[1, 3], 
         figsize=(10, 6), 
         style='ggplot')

From the figure above, we can see:

  • About 34% of donors have donated more than once -- which also means that about 66% of donors donate just once and then stop.
  • About 15% of donors have donated more than three times.
  • About 2.5% of donors have donated more than 24 times and are still active, which means they may well donate again.

Let's compare among groups.

In [46]:
kms.fit(df_freq, cutoff, group=['group'])
kms.subgrps_
Out[46]:
OrderedDict([('group', ('M', 'Unknown', 'F', 'Couple', 'Other'))])
In [47]:
kms.plot(strata=[['F', 'M', 'Couple']], 
         title=title, xlabel=xlabel, 
         figsize=(10, 6), vlines=[1, 3, 17])

From the comparison, we can draw some interesting observations.

  • Couple donors have the highest 'survival' probabilities, which means couples are more likely than single donors to keep donating.
  • Male donors have slightly higher probabilities of repeat donation than female donors.
  • Only male donors have donated more than 17 times and are still active.

Summary

  • We often say "know your data". It's actually about knowing the business behind the data. It took me almost half of the day to figure out what the data means, but it was worth the time.
  • Real data is not neat, just like the real world.
  • We learn from each other in a team. Produce results early and often so you have something to talk about and can get timely feedback; iterate the analysis quickly.
  • How should we deal with missing values, create derived features, and come up with the right set of variables for modeling? Answering these questions takes experience and business knowledge. That is the art part of data science, as the tweet says:

Julie tweet

What a day!


Thanks for reading!