Data for Good, teaming up with Raising the Roof, held a DataThon in Toronto on Saturday, and I was glad to take part. It was 12 hours of data crunching on about 20 data sets. In this post, we'll deal with one of them -- a donation Excel file. We're going to explore and clean it, create a new feature, and finally do a donation frequency analysis.
import numpy as np
import pandas as pd
class Display(object):
    """Display multiple objects horizontally"""
    template = """<div style="float: left; padding: 10px;">
    <p>{0}</p><font size=2>{1}</font></div>
    """

    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
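A quick usage sketch (with a hypothetical throwaway DataFrame tmp, just to show the call pattern): the class evaluates each string expression and renders the resulting HTML tables side by side in a notebook.
tmp = pd.DataFrame({'x': [1, 2, 3]})   # hypothetical example data
Display('tmp.head(2)', 'tmp.tail(2)')  # two HTML tables, laid out horizontally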
df = pd.read_excel("donation.xlsx")
df.head()
df.shape
df.loc[df.Expr1000 != df.ID].shape
The columns Expr1000 and ID have the same values, and we use ID to identify donors.
For this post, we only use three columns as shown below.
df = df[['ID', 'Prefix', 'Payment Date']]
df.columns = ['id', 'prefix', 'pay_date'] # rename columns
df.head()
df.dtypes
prefix
prefix = df.prefix.value_counts(dropna=False)
print("There are {} prefixes:\n\n{}".format(prefix.size, prefix))
The data is dirty in the sense that one group has multiple spellings, e.g., 'Mr.' and 'Mr'. We could write a Python dictionary to map the prefixes to a few groups, but that would be tedious and inflexible; on the other hand, one big regular expression covering all the patterns would be error-prone and hard to read.
Nothing is particularly hard if you divide it into small jobs.
-- Henry Ford
We'll use a pipeline of regular expressions to process the prefix column step by step.
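Conceptually, the pipeline is just an ordered list of (pattern, label) rules applied one after another (order matters: couples must be matched before the female prefixes). A minimal sketch of the idea, using the same patterns that the individual steps below walk through with intermediate checks:
rules = [
    (r'^[DM].*(and|&).*', 'Couple'),  # couples first, e.g. 'Mr. & Mrs.'
    (r'^M.*[sS].*', 'F'),             # then female prefixes, e.g. 'Mrs', 'Ms.'
    (r'^(Mr|Fa).*', 'M'),             # then male prefixes, e.g. 'Mr', 'Father'
]
for pattern, label in rules:
    df['prefix'] = df['prefix'].replace(pattern, label, regex=True)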
Step 1. Replace missing values with 'Unknown'.
df['prefix'].fillna('Unknown', inplace=True)
Step 2. Map strings including '&' or 'and' to 'Couple'
df['prefix'].replace(r'^[DM].*(and|&).*', 'Couple', regex=True, inplace=True)
df.prefix.value_counts(dropna=False).sort_index()
Step 3. Map strings beginning with 'M' and including 's' or 'S' to 'F' (Female)
df.prefix.replace(r'^M.*[sS].*', 'F', regex=True, inplace=True)
df.prefix.value_counts(dropna=False).sort_index()
Step 4. Map strings 'Mr', 'Mr.', and 'Father' to 'M' (Male)
df.prefix.replace(r'^(Mr|Fa).*', 'M', regex=True, inplace=True)
df.prefix.value_counts().sort_index()
Step 5. Finally, map all remaining strings to 'Other'
df.prefix.replace(r'^(?!Couple|[FMU]).*', 'Other', regex=True, inplace=True)
df.prefix.value_counts().sort_index()
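The pattern in step 5 uses a negative lookahead: it matches any string that does not start with 'Couple', 'F', 'M', or 'U', i.e. everything not yet mapped to one of our groups. A quick check of the pattern on its own:
import re
pat = re.compile(r'^(?!Couple|[FMU]).*')
print(bool(pat.match('Dr.')), bool(pat.match('Couple')), bool(pat.match('F')))  # True False False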
df.prefix.value_counts().sum() == df.shape[0]
df.isnull().sum()
We also see there are 1169 missing values in pay_date. Let's deal with them.
pay_date to datetime
First, we're going to convert the column 'pay_date' from object to datetime.
pd.to_datetime(df.pay_date, format='%b-%d-%Y', errors='coerce').isnull().sum()
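With errors='coerce', any string that doesn't match the format becomes NaT (null) instead of raising an error; a quick illustration on a throwaway Series:
pd.to_datetime(pd.Series(['Feb-18-2005', 'not a date']), format='%b-%d-%Y', errors='coerce')
# 0   2005-02-18
# 1          NaT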
We can see that, besides the 1169 missing values, pandas coerces another 33 (= 1202 - 1169) invalid strings to null. Let's check what they are.
try:
    pd.to_datetime(df.pay_date, format='%b-%d-%Y')
except Exception as inst:
    print(inst)
The string 'Feb-18-200' has a malformed year. Let's see how many records like this there are.
filter_wrong_year = df.pay_date.notnull() & (df.pay_date.str.len() == 10)
wrong_year = df[filter_wrong_year]
print(wrong_year.head())
wrong_year.shape
We have found 28 of 33 invalid non-null strings. What are the remaining 5 (= 33 - 28)?
filter_5 = df.pay_date.notnull() & (df.pay_date.str.len() != 10)
try:
    pd.to_datetime(df.loc[filter_5, 'pay_date'], format='%b-%d-%Y')
except Exception as inst:
    print(inst)
The year is 8201!
pd.Timestamp.min, pd.Timestamp.max
(Why does pandas use these two particular datetimes as its limits?)
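The short answer: pandas stores timestamps as 64-bit integers counting nanoseconds from the Unix epoch (1970-01-01), so the representable range is bounded by the int64 limits. A quick check:
int64_max_ns = np.iinfo(np.int64).max  # 9,223,372,036,854,775,807 nanoseconds, about 292 years
print(pd.Timestamp('1970-01-01') + pd.Timedelta(int64_max_ns, unit='ns'))
print(pd.Timestamp.max)                # the same instant: 2262-04-11 23:47:16.854775807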
Let's check how many years fall outside pandas' valid date range.
df[(df.pay_date.str.len() != 10) &
   ((df.pay_date.str.slice(7, 11) > '2262') |
    (df.pay_date.str.slice(7, 11) < '1677'))]
There are exactly 5 strings with out-of-range years. Having accounted for all 33 invalid date strings, we can now do the conversion with confidence.
df['pay_date'] = pd.to_datetime(df.pay_date, format='%b-%d-%Y', errors='coerce')
Pandas (and other tools) can't do everything for us; we need further data integrity checks. Because Raising the Roof was launched in 1996, payment dates earlier than 1996 or later than the data cutoff date should be treated as invalid.
min_date = pd.to_datetime('1996-01-01')
max_date = pd.to_datetime('2016-12-03') # we suppose today is the cutoff date
filter_biz = (df.pay_date < min_date) | (df.pay_date > max_date)
num_invalid_biz = df[filter_biz].shape[0]
print("There are {} dates violateing business rules.".format(num_invalid_biz))
Display('df[filter_biz].head()', 'df[filter_biz].tail()')
num_null_before = df.pay_date.isnull().sum()
num_null_before
We change the dates violating business rules to missing values.
df.loc[filter_biz, 'pay_date'] = pd.NaT
df.sort_values('pay_date', inplace=True)
Display('df[df.pay_date.notnull()].head()', 'df[df.pay_date.notnull()].tail()')
num_null_before, num_invalid_biz, df.pay_date.isnull().sum()
df.isnull().sum()
The number of nulls in pay_date changes from 1202 to 2028 after the additional 826 conversions.
The business rule for counting active donors: a donor remains active for one year after donating.
min_active_date = df.pay_date.max() - pd.Timedelta('365 days')
min_active_date
We suppose a donor is active if the date of her/his latest donation is later than 2015-09-08.
1: active; 0: not active
df['active'] = np.where(df['pay_date'] >= min_active_date, 1, 0)
Display('df.head()', 'df.tail()', 'df[df.pay_date.notnull()].tail()')
df.active.value_counts()
df.pay_date.isnull().sum()
How about these 2028 missing dates?
For the null dates, we assign random integers drawn from a discrete uniform distribution over {0, 1}.
np.random.seed(0)
nr = np.random.randint(2, size=df.shape[0])
df.active.where(df.pay_date.notnull(), nr, inplace=True)
df.active.value_counts()
We can see 1387 is roughly equal to 328 + (2028 / 2), which means the distribution of the assigned values (0 or 1) for null pay_dates is almost uniform.
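A quick sanity check of that arithmetic, assuming the counts above (328 actives before the fill, 2028 null dates):
n_null = 2028                        # null pay_dates that received a random 0 or 1
expected_active = 328 + n_null / 2   # actives before the fill plus half of the filled nulls
print(expected_active, (df.active == 1).sum())  # about 1342 expected vs. 1387 observed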
Display('df.head()', 'df.tail()')
Some donors donate many times; some only once. What percentage of donors donate once, twice, or N times? It's actually a time-to-event (also called survival analysis) question. Let's try to answer it.
donation_freq = df.groupby(['id'])
donation_freq
rows = []
for donor_id, group in donation_freq:  # group is a DataFrame
    rows.append((donor_id, group.shape[0], group.prefix.values[0], group.active.max(),
                 pd.NaT if group.pay_date.isnull().any() else group.pay_date.max()))
rows[:5]
df_freq = pd.DataFrame(rows, columns=['donor_id', 'freq', 'group', 'active', 'last_pay_date'])
df_freq.sort_values('freq', ascending=False, inplace=True)
Display('df_freq.head()', 'df_freq.tail()')
df_freq.shape
df_freq.freq.sum() == df.shape[0]
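Before reaching for a package, here is a minimal sketch of the underlying idea (not the kmsurvival implementation), assuming df_freq as built above: a Kaplan-Meier product-limit estimate over donation frequency, treating donors with active == 1 as right-censored because they may still donate again.
km = []
surv = 1.0
at_risk = len(df_freq)                    # donors still "at risk" of stopping
for t in sorted(df_freq.freq.unique()):
    sub = df_freq[df_freq.freq == t]
    events = (sub.active == 0).sum()      # donors who stopped after exactly t donations
    surv *= 1 - events / at_risk          # Kaplan-Meier product-limit step
    km.append((t, surv))
    at_risk -= len(sub)                   # events and censored donors both leave the risk set
km_curve = pd.DataFrame(km, columns=['freq', 'survival'])
km_curve.head()
Each survival value estimates the probability that a donor donates more than t times.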
We'll use a survival analysis package called kmsurvival to do the donation frequency analysis.
$ pip install kmsurvival
from kmsurvival import KMSurvival
%matplotlib inline
cutoff = df_freq.last_pay_date.max().strftime('%Y-%m-%d')
kms = KMSurvival(col_censored='active',
                 col_tenure='freq')
estimates = kms.fit(df_freq, cutoff)
Display('estimates.head()', 'estimates.tail()')
kms.cutoff
title = 'Donation Frequency Curve'
xlabel = 'Frequency'
kms.plot(title=title,
         xlabel=xlabel,
         vlines=[1, 3],
         figsize=(10, 6),
         style='ggplot')
From the figure above, we can see:
Let's compare across the prefix groups.
kms.fit(df_freq, cutoff, group=['group'])
kms.subgrps_
kms.plot(strata=[['F', 'M', 'Couple']],
         title=title, xlabel=xlabel,
         figsize=(10, 6), vlines=[1, 3, 17])
From the comparisons, we can get some interesting information.
What a day!
Thanks for reading!