DataThon!

Data for Good, teaming up with Raising the Roof, held a DataThon on Saturday in Toronto. I was glad to take part in it. It was 12 hours of data crunching on about 20 data sets. In this post, we'll deal with one of them -- a donation Excel file. We're going to explore and clean it, create a new feature, and finally do a donation frequency analysis.

Take a look at the donation data

In [1]:
import numpy as np
import pandas as pd
In [2]:
class Display(object):
    """Display multiple objects horizontally"""    
    template = """<div style="float: left; padding: 10px;">
                  <p>{0}</p><font size=2>{1}</font></div>
               """    
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
In [3]:
df = pd.read_excel("donation.xlsx")
df.head()
Out[3]:
Expr1000 ID IncomeID FundAccount IncomeType Payment Date Payment Amount Payment Method Prefix City1 StateOrProvince1 Postal Code Area1 TotalGifts NumberOfGifts
0 1 1 2419 TQN TQS May-10-2001 25.0 Cheque Ms. Ottawa YT K1R 5M8 NaN 540.01 8
1 1 1 63 X X NaN 75.0 Cheque Ms. Ottawa YT K1R 5M8 NaN 540.01 8
2 1 1 7996 TP TPE NaN 0.0 Cheque Ms. Ottawa YT K1R 5M8 NaN 540.01 8
3 1 1 2891 GF DN NaN 50.0 Unknown Ms. Ottawa YT K1R 5M8 NaN 540.01 8
4 1 1 8325 TP HW NaN 0.0 Money order Ms. Ottawa YT K1R 5M8 NaN 540.01 8
In [4]:
df.shape
Out[4]:
(9635, 15)
In [5]:
df.loc[df.Expr1000 != df.ID].shape
Out[5]:
(0, 15)

The columns Expr1000 and ID have the same values, so we use ID to identify donors.

For this post, we only use three columns as shown below.

In [6]:
df = df[['ID', 'Prefix', 'Payment Date']]
df.columns = ['id', 'prefix', 'pay_date']  # rename columns
df.head()
Out[6]:
id prefix pay_date
0 1 Ms. May-10-2001
1 1 Ms. NaN
2 1 Ms. NaN
3 1 Ms. NaN
4 1 Ms. NaN
In [7]:
df.dtypes
Out[7]:
id           int64
prefix      object
pay_date    object
dtype: object

Clean the column prefix

In [8]:
prefix = df.prefix.value_counts(dropna=False)
print("There are {} prefixes:\n\n{}".format(prefix.size, prefix))
There are 26 prefixes:

Mr.             3472
Ms.             3443
NaN             2090
Mrs.             369
Dr.               83
Mr. and Mrs.      58
Mr. & Mrs.        53
Ms                24
Rev.               9
Mr                 9
Miss               7
996                3
Mr. & Ms.          2
Mr & Mrs.          1
Miss.              1
MS                 1
Rabi               1
Ms.'               1
Father             1
Capt.              1
Hon.               1
Mrs                1
 Leora             1
Dr. & Mrs.         1
Mr. & Mrs          1
Mr.&Mrs.           1
Name: prefix, dtype: int64

The data is dirty in the sense that one group has multiple spellings, e.g., 'Mr.' and 'Mr'. We could write a Python dictionary to map every spelling to one of a few groups, but that is tedious and not flexible, as the sketch below shows; on the other hand, one big regular expression covering all the patterns would be error-prone and hard to read.
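For illustration only, such a dictionary might start like this. It is hypothetical and incomplete: every single spelling in the value counts above needs its own entry, and any new spelling in future data would fall through.

# Hypothetical sketch of the dictionary approach -- one key per raw spelling.
prefix_map = {
    'Mr.': 'M', 'Mr': 'M', 'Father': 'M',
    'Ms.': 'F', 'Ms': 'F', 'MS': 'F', 'Mrs.': 'F', 'Mrs': 'F', 'Miss': 'F',
    'Mr. and Mrs.': 'Couple', 'Mr. & Mrs.': 'Couple', 'Mr. & Ms.': 'Couple',
    # ... and so on for every remaining spelling
}
# df['prefix'] = df['prefix'].map(prefix_map)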

Nothing is particularly hard if you divide it into small jobs.

-- Henry Ford

We'll use a pipeline of regular expressions to process the prefix column step by step.

Step 1. Replace missing values with Unknown.

In [9]:
df['prefix'].fillna('Unknown', inplace=True)

Step 2. Map strings starting with 'D' or 'M' and containing '&' or 'and' to 'Couple'

In [10]:
df['prefix'].replace(r'^[DM].*(and|&).*', 'Couple', regex=True, inplace=True)
df.prefix.value_counts(dropna=False).sort_index()
Out[10]:
 Leora        1
996           3
Capt.         1
Couple      117
Dr.          83
Father        1
Hon.          1
MS            1
Miss          7
Miss.         1
Mr            9
Mr.        3472
Mrs           1
Mrs.        369
Ms           24
Ms.        3443
Ms.'          1
Rabi          1
Rev.          9
Unknown    2090
Name: prefix, dtype: int64

Step 3. Map strings beginning with 'M' and containing 's' or 'S' to 'F' (Female)

In [11]:
df.prefix.replace(r'^M.*[sS].*', 'F', regex=True, inplace=True)
df.prefix.value_counts(dropna=False).sort_index()
Out[11]:
 Leora        1
996           3
Capt.         1
Couple      117
Dr.          83
F          3847
Father        1
Hon.          1
Mr            9
Mr.        3472
Rabi          1
Rev.          9
Unknown    2090
Name: prefix, dtype: int64

Step 4. Map strings beginning with 'Mr' or 'Fa' ('Mr', 'Mr.', and 'Father' here) to 'M' (Male)

In [12]:
df.prefix.replace(r'^(Mr|Fa).*', 'M', regex=True, inplace=True)
df.prefix.value_counts().sort_index()
Out[12]:
 Leora        1
996           3
Capt.         1
Couple      117
Dr.          83
F          3847
Hon.          1
M          3482
Rabi          1
Rev.          9
Unknown    2090
Name: prefix, dtype: int64

Step 5. Finally, map all remaining strings to 'Other'

In [13]:
df.prefix.replace(r'^(?!Couple|[FMU]).*', 'Other', regex=True, inplace=True)
df.prefix.value_counts().sort_index()
Out[13]:
Couple      117
F          3847
M          3482
Other        99
Unknown    2090
Name: prefix, dtype: int64
In [14]:
df.prefix.value_counts().sum() == df.shape[0]
Out[14]:
True
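As an aside, once the steps are settled, the whole pipeline can be collected into one small reusable function. A sketch (clean_prefix is a hypothetical helper that simply chains the five replacements above without modifying the input in place):

def clean_prefix(s):
    """Map raw prefix strings to 'Couple', 'F', 'M', 'Other', or 'Unknown'."""
    return (s.fillna('Unknown')
             .replace(r'^[DM].*(and|&).*', 'Couple', regex=True)
             .replace(r'^M.*[sS].*', 'F', regex=True)
             .replace(r'^(Mr|Fa).*', 'M', regex=True)
             .replace(r'^(?!Couple|[FMU]).*', 'Other', regex=True))

# equivalent to steps 1-5: df['prefix'] = clean_prefix(df['prefix'])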
In [15]:
df.isnull().sum()
Out[15]:
id             0
prefix         0
pay_date    1169
dtype: int64

We also see there are 1169 missing values in pay_date. Let's deal with them.

Convert the column pay_date to datetime

First, we're going to convert the column 'pay_date' from object to datetime.

In [16]:
pd.to_datetime(df.pay_date, format='%b-%d-%Y', errors='coerce').isnull().sum()
Out[16]:
1202

We can see that, besides the 1169 missing values, pandas converts another 33 (= 1202 - 1169) invalid strings to null. Let's check what they are.

In [17]:
try:
    pd.to_datetime(df.pay_date, format='%b-%d-%Y')
except Exception as inst:
    print(inst)
time data 'Feb-18-200' does not match format '%b-%d-%Y' (match)

The string 'Feb-18-200' has an invalid year. Let's see how many records like this there are.

In [18]:
filter_wrong_year = df.pay_date.notnull() & (df.pay_date.str.len() == 10)
wrong_year = df[filter_wrong_year]
print(wrong_year.head())
wrong_year.shape
        id prefix    pay_date
67      27      M  Feb-18-200
276    149      F  Feb-11-101
1277  1003      M  Mar-20-101
1625  1307      M  Jan-10-100
1908  1642      F  Feb-04-100
Out[18]:
(28, 3)

We have found 28 of the 33 invalid non-null strings. What are the remaining 5 (= 33 - 28)?

In [19]:
filter_5 = df.pay_date.notnull() & (df.pay_date.str.len() != 10)
try:
    pd.to_datetime(df.loc[filter_5, 'pay_date'], format='%b-%d-%Y')
except Exception as inst:
    print(inst)
Out of bounds nanosecond timestamp: 8201-06-09 00:00:00

The year is 8201!

In [20]:
pd.Timestamp.min, pd.Timestamp.max
Out[20]:
(Timestamp('1677-09-21 00:12:43.145225'),
 Timestamp('2262-04-11 23:47:16.854775807'))

Why these two particular datetimes? Pandas stores a Timestamp as a count of nanoseconds since the Unix epoch in a signed 64-bit integer, which limits the representable range.
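A quick back-of-the-envelope check of that limit (nothing specific to this data set):

# 2**63 nanoseconds on each side of 1970-01-01 is roughly 292 years.
span_years = 2**63 / 1e9 / 3600 / 24 / 365.25
print(span_years)                                     # ~292.3
print(pd.Timestamp.min.year, pd.Timestamp.max.year)   # 1677 2262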

Let's check how many dates have years outside of pandas' valid range.

In [21]:
df[(df.pay_date.str.len() != 10) & 
   ((df.pay_date.str.slice(7, 11) > '2262') | 
    (df.pay_date.str.slice(7, 11) < '1677'))]
Out[21]:
id prefix pay_date
1440 1130 Unknown Jun-09-8201
3640 3061 F Nov-24-3003
4085 3303 M Aug-31-1201
7577 5604 M Jan-17-1201
8957 9704 Unknown Apr-04-4201

There are exactly 5 strings with years outside the valid range. Having accounted for all 33 invalid date strings, we can now convert the column with confidence, as shown below.

In [22]:
df['pay_date'] = pd.to_datetime(df.pay_date, format='%b-%d-%Y', errors='coerce')

Pandas (and other tools) can't do everything for us; we still need a further data integrity check. Because Raising the Roof was launched in 1996, payment dates earlier than 1996 or later than the data cutoff date should be treated as invalid.

In [23]:
min_date = pd.to_datetime('1996-01-01')
max_date = pd.to_datetime('2016-12-03') # we suppose today is the cutoff date

filter_biz = (df.pay_date < min_date) | (df.pay_date > max_date)
num_invalid_biz = df[filter_biz].shape[0]
print("There are {} dates violateing business rules.".format(num_invalid_biz))
Display('df[filter_biz].head()', 'df[filter_biz].tail()')
There are 826 dates violating business rules.
Out[23]:

df[filter_biz].head()

id prefix pay_date
10 1 F 1900-11-20
11 1 F 1900-11-20
19 3 Unknown 1900-12-01
20 3 Unknown 1900-03-03
45 22 F 1900-12-21

df[filter_biz].tail()

id prefix pay_date
3264 2831 Unknown 1901-02-02
3265 2832 F 1901-02-05
3267 2833 F 1901-08-13
7569 5604 M 2070-09-24
9550 10625 Unknown 2106-05-30
In [24]:
num_null_before = df.pay_date.isnull().sum()
num_null_before
Out[24]:
1202

We change the dates violating business rules to missing values.

In [25]:
df.loc[filter_biz, ['pay_date']] = pd.NaT
In [26]:
df.sort_values('pay_date', inplace=True)
Display('df[df.pay_date.notnull()].head()', 'df[df.pay_date.notnull()].tail()')
Out[26]:

df[df.pay_date.notnull()].head()

id prefix pay_date
510 290 F 1997-11-09
561 313 Unknown 1998-01-01
517 290 F 1998-02-03
518 291 M 1998-02-24
479 262 M 1998-03-01

df[df.pay_date.notnull()].tail()

id prefix pay_date
9632 10732 F 2016-07-28
7248 5390 M 2016-07-31
8199 8512 M 2016-07-31
8933 9682 Unknown 2016-08-28
9634 10734 M 2016-09-08
In [27]:
num_null_before, num_invalid_biz, df.pay_date.isnull().sum()
Out[27]:
(1202, 826, 2028)
In [28]:
df.isnull().sum()
Out[28]:
id             0
prefix         0
pay_date    2028
dtype: int64

The number of nulls in pay_date grows from 1202 to 2028 after the additional 826 dates are set to NaT (1202 + 826 = 2028).

Create a new column 'active' to indicate whether a donor is active

The business rule for counting active donors: a donor remains active for one year after a donation.

In [29]:
min_active_date = df.pay_date.max() - pd.Timedelta('365 days')
min_active_date
Out[29]:
Timestamp('2015-09-09 00:00:00')

We treat a donor as active if her/his last donation date is later than 2015-09-08.

1: active; 0: not active

In [30]:
df['active'] = np.where(df['pay_date'] >= min_active_date, 1, 0)
Display('df.head()', 'df.tail()', 'df[df.pay_date.notnull()].tail()')
Out[30]:

df.head()

id prefix pay_date active
510 290 F 1997-11-09 0
561 313 Unknown 1998-01-01 0
517 290 F 1998-02-03 0
518 291 M 1998-02-24 0
479 262 M 1998-03-01 0

df.tail()

id prefix pay_date active
8737 9465 Unknown NaT 0
8957 9704 Unknown NaT 0
9343 10320 Unknown NaT 0
9550 10625 Unknown NaT 0
9597 10674 F NaT 0

df[df.pay_date.notnull()].tail()

id prefix pay_date active
9632 10732 F 2016-07-28 1
7248 5390 M 2016-07-31 1
8199 8512 M 2016-07-31 1
8933 9682 Unknown 2016-08-28 1
9634 10734 M 2016-09-08 1
In [31]:
df.active.value_counts()
Out[31]:
0    9307
1     328
Name: active, dtype: int64
In [32]:
df.pay_date.isnull().sum()
Out[32]:
2028

What about these 2028 missing dates?

For rows with a null pay_date, we assign the active flag at random, drawing 0 or 1 from a discrete uniform distribution.

In [33]:
np.random.seed(0)
nr = np.random.randint(2, size=df.shape[0])
In [34]:
df.active.where(df.pay_date.notnull(), nr, inplace=True)
In [35]:
df.active.value_counts()
Out[35]:
0    8248
1    1387
Name: active, dtype: int64

We can see that 1387 is roughly equal to 328 + 2028/2 = 1342, which means the values (0 or 1) assigned to rows with null pay_date are distributed almost uniformly.

In [36]:
Display('df.head()', 'df.tail()')
Out[36]:

df.head()

id prefix pay_date active
510 290 F 1997-11-09 0
561 313 Unknown 1998-01-01 0
517 290 F 1998-02-03 0
518 291 M 1998-02-24 0
479 262 M 1998-03-01 0

df.tail()

id prefix pay_date active
8737 9465 Unknown NaT 0
8957 9704 Unknown NaT 0
9343 10320 Unknown NaT 1
9550 10625 Unknown NaT 0
9597 10674 F NaT 0

Donation frequency analysis

Some donors donate many times; some only once. What percentage of donors donate once, twice, or N times? This is actually a time-to-event (also called survival analysis) question: a donor's number of donations plays the role of a tenure, and active donors are censored because they may still donate again. Let's try to answer it.

In [37]:
donation_freq = df.groupby(['id'])
donation_freq
Out[37]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000021E01972278>
In [38]:
rows = []
for donor_id, group in donation_freq:  # group is DataFrame
    rows.append((donor_id, group.shape[0], group.prefix.values[0], group.active.max(), 
                 pd.NaT if group.pay_date.isnull().any() else group.pay_date.max()))
rows[:5]
Out[38]:
[(1, 17, 'F', 1, NaT),
 (2, 1, 'Unknown', 1, NaT),
 (3, 3, 'Unknown', 1, NaT),
 (4, 3, 'F', 1, NaT),
 (5, 4, 'F', 1, NaT)]
In [39]:
df_freq = pd.DataFrame(rows, columns=['donor_id', 'freq', 'group','active', 'last_pay_date'])
df_freq.sort_values('freq', ascending=False, inplace=True)
Display('df_freq.head()', 'df_freq.tail()')
Out[39]:

df_freq.head()

donor_id freq group active last_pay_date
4665 5604 40 M 1 NaT
2366 3156 38 Unknown 1 NaT
2085 2837 26 Unknown 1 NaT
550 999 25 Unknown 1 NaT
46 70 24 M 1 NaT

df_freq.tail()

donor_id freq group active last_pay_date
2306 3077 1 M 0 2003-11-24
2305 3076 1 M 0 2003-11-24
2304 3075 1 M 0 2003-11-24
2303 3074 1 F 0 2003-11-24
6322 10734 1 M 1 2016-09-08
In [40]:
df_freq.shape
Out[40]:
(6323, 5)
In [41]:
df_freq.freq.sum() == df.shape[0]
Out[41]:
True
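As an aside, the loop in In [38] can be replaced by a single vectorized aggregation. A sketch, assuming a pandas version that supports named aggregation (0.25+) and the same column names as df:

# Vectorized equivalent of the groupby loop above (df_freq_alt is hypothetical).
df_freq_alt = (df.groupby('id')
                 .agg(freq=('prefix', 'size'),
                      group=('prefix', 'first'),
                      active=('active', 'max'),
                      last_pay_date=('pay_date',
                                     lambda s: pd.NaT if s.isnull().any() else s.max()))
                 .reset_index()
                 .rename(columns={'id': 'donor_id'}))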

We'll use a survival analysis package called kmsurvival to do the donation frequency analysis.

$ pip install kmsurvival

In [42]:
from kmsurvival import KMSurvival
%matplotlib inline
In [43]:
cutoff = df_freq.last_pay_date.max().strftime('%Y-%m-%d')
kms = KMSurvival(col_censored='active',
                 col_tenure='freq')
estimates = kms.fit(df_freq, cutoff)
Display('estimates.head()', 'estimates.tail()')
Out[43]:

estimates.head()

estimates_ survival hazard cnt_uncensored cnt_censored cutoff
_tenure_
1 0.337814 0.662186 4187 877 2016-09-08
2 0.206338 0.389198 490 154 2016-09-08
3 0.150979 0.268293 165 82 2016-09-08
4 0.120209 0.203804 75 43 2016-09-08
5 0.096167 0.200000 50 22 2016-09-08

estimates.tail()

estimates_ survival hazard cnt_uncensored cnt_censored cutoff
_tenure_
24 0.025961 0.166667 1 1 2016-09-08
25 0.025961 0.000000 0 1 2016-09-08
26 0.025961 0.000000 0 1 2016-09-08
38 0.025961 0.000000 0 1 2016-09-08
40 0.025961 0.000000 0 1 2016-09-08
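If kmsurvival implements the usual Kaplan-Meier product-limit estimator, as its name and columns suggest (an assumption on my part), then each survival value equals the previous one times (1 - hazard), which we can spot-check against the table above:

# Assumed Kaplan-Meier relation: S(n) = S(n-1) * (1 - h(n)), with S(0) = 1
s1 = 1.0 * (1 - 0.662186)   # ~0.3378, the survival value at freq 1
s2 = s1 * (1 - 0.389198)    # ~0.2063, the survival value at freq 2 (up to rounding)
print(s1, s2)

Read this way, survival at frequency n estimates the fraction of donors who donate more than n times.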
In [44]:
kms.cutoff
Out[44]:
'2016-09-08'
In [45]:
title='Donation Frequency Curve'
xlabel='Frequency'
kms.plot(title=title, 
         xlabel=xlabel, 
         vlines=[1, 3], 
         figsize=(10, 6), 
         style='ggplot')

From the figure above, we can see:

  • About 34% of donors have donated more than once -- which also means that about 66% of donors donate just once and then stop.
  • About 15% of donors have donated more than three times.
  • About 2.5% of donors have donated more than 24 times and are still active, which means they may well donate again.

Let's compare among groups.

In [46]:
kms.fit(df_freq, cutoff, group=['group'])
kms.subgrps_
Out[46]:
OrderedDict([('group', ('M', 'Unknown', 'F', 'Couple', 'Other'))])
In [47]:
kms.plot(strata=[['F', 'M', 'Couple']], 
         title=title, xlabel=xlabel, 
         figsize=(10, 6), vlines=[1, 3, 17])

From the comparison, we can draw some interesting observations.

  • Couple donors have the highest 'survival' probabilities, which means couples are more likely than single donors to keep donating.
  • Male donors have slightly higher probabilities of repeat donation than female donors.
  • Only male donors have donated more than 17 times and are still active.

Summary

  • We often say "know your data". It's actually about knowing the business behind the data. It took me almost half of the day to figure out what the data means, but it was worth the time.
  • Real data is not neat, just like the real world.
  • We learn from each other in a team. Produce results early and often so you have something to talk about and can get timely feedback; iterate the analysis quickly.
  • How should we deal with missing values, create derived features, and come up with the right set of variables for modeling? Answering these questions takes experience and business knowledge. That is the art part of data science, as the tweet says:

Julie tweet

What a day!


Thanks for reading!