
A Case Study of TDD in Data Analysis


Introduction

Test-driven development (TDD) applies agile and lean principles by writing tests first instead of testing near the end of a development cycle. In this post we will use a simplified example of association rules in the retail industry to illustrate TDD in data analysis.

The common metrics for association rules are support, confidence, and lift; however, we'll see that these are sometimes not enough to ensure the extracted rules are correct. Before diving into details, let's give our test-driven data analysis a brief definition:

  • Short analysis cycle. Beginning with simple cases, we'll generate one-way association rules.
  • Use simple but real data, so that we can extract rules quickly and check them easily.
  • Write a failing test to check the rules.
  • Refactor the original code to make the failing test pass.
  • [Next iteration]


Generate one-way rules on clean data

Association rule mining is an unsupervised learning technique that tries to find useful patterns in transaction data. The simplest rules are one-way rules; for example, customers who buy product A are likely to buy product B at the same time.
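
For reference, the metrics for a one-way rule A ⇒ B can be written down directly; the definitions below match the arithmetic in the SQL later in this post, where N is the total number of orders and n(X) is the number of orders containing X:

    support(A ⇒ B)    = n(A and B) / N
    confidence(A ⇒ B) = n(A and B) / n(A)
    lift(A ⇒ B)       = confidence(A ⇒ B) / (n(B) / N)

The query also reports expectation = n(B) / N, so lift is simply confidence divided by expectation.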


Suppose we have a receipt from a grocery store called Neat, and we'll derive one-way association rules for the store using SQL. Let's take a look at the data. The entity-relationship (ER) diagram of the Neat data is also given.


A receipt of store Neat

The entity-relationship diagram of Neat data


Insert the Neat data on the receipt into the database

SELECT * FROM products_neat;

 product_id | name             | category
------------+------------------+----------
 6148301460 | CARROTS 3LB      | PRODUCE
 4053       | LEMON            | PRODUCE
 4065       | PEPPER GREEN SWT | PRODUCE
 2193860    | CHICKEN LEG BA   | MEATS
 6340002536 | WOND BREAD WW    | BAKERY
 6340011169 | WHL WHT FIBR BRD | BAKERY
 5719732951 | ROOSTER TOFU     | DELI

SELECT * FROM orders_neat;

 order_id | order_date
----------+------------
 100      | 2016-01-01

SELECT * FROM order_items_neat;

 order_item_id | order_id | product_id | quantity
---------------+----------+------------+----------
 1             | 100      | 6148301460 | 1
 2             | 100      | 4053       | 2
 3             | 100      | 4065       | 1
 4             | 100      | 2193860    | 1
 5             | 100      | 6340002536 | 1
 6             | 100      | 6340011169 | 1
 7             | 100      | 5719732951 | 3
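
If you want to reproduce the examples, here is a minimal sketch of the DDL and INSERT statements implied by the query results above; the column types and constraints are assumptions, not necessarily the actual schema.

CREATE TABLE products_neat (
    product_id  bigint PRIMARY KEY,
    name        text,
    category    text
);

CREATE TABLE orders_neat (
    order_id    int PRIMARY KEY,
    order_date  date
);

CREATE TABLE order_items_neat (
    order_item_id int PRIMARY KEY,
    order_id      int REFERENCES orders_neat,
    product_id    bigint REFERENCES products_neat,
    quantity      int
);

INSERT INTO products_neat VALUES
    (6148301460, 'CARROTS 3LB',      'PRODUCE'),
    (4053,       'LEMON',            'PRODUCE'),
    (4065,       'PEPPER GREEN SWT', 'PRODUCE'),
    (2193860,    'CHICKEN LEG BA',   'MEATS'),
    (6340002536, 'WOND BREAD WW',    'BAKERY'),
    (6340011169, 'WHL WHT FIBR BRD', 'BAKERY'),
    (5719732951, 'ROOSTER TOFU',     'DELI');

INSERT INTO orders_neat VALUES (100, '2016-01-01');

INSERT INTO order_items_neat VALUES
    (1, 100, 6148301460, 1),
    (2, 100, 4053,       2),
    (3, 100, 4065,       1),
    (4, 100, 2193860,    1),
    (5, 100, 6340002536, 1),
    (6, 100, 6340011169, 1),
    (7, 100, 5719732951, 3);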


One-way association rules for Neat

Because of Neat's data schema, the three SQL commands below give us the same results but take different execution times.

-- only SELECT
SELECT order_id, product_id FROM order_items_neat;

-- using GROUP BY
SELECT order_id, product_id FROM order_items_neat GROUP BY order_id, product_id;

-- using DISTINCT
SELECT DISTINCT order_id, product_id FROM order_items_neat;



The three commands give the same result

order_id | product_id
---------+-----------
 100     | 6148301460
 100     | 4053
 100     | 4065
 100     | 2193860
 100     | 6340002536
 100     | 6340011169
 100     | 5719732951


We see that the SQL with only SELECT is faster than the one using GROUP BY, which is in turn faster than the one using DISTINCT. The data used here is just one transaction; when we have millions of transaction records, the difference could be even more pronounced. So we decide to use the SQL without GROUP BY or DISTINCT in the common table expression (CTE) baskets.
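
If you want to verify the timing difference on your own data, PostgreSQL's EXPLAIN ANALYZE reports the actual execution time of each variant (the numbers depend on your data and hardware):

EXPLAIN ANALYZE
SELECT order_id, product_id FROM order_items_neat;

EXPLAIN ANALYZE
SELECT order_id, product_id FROM order_items_neat GROUP BY order_id, product_id;

EXPLAIN ANALYZE
SELECT DISTINCT order_id, product_id FROM order_items_neat;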

/*
Extract one-way association rules for stores Neat and Messy.
Compare the differences between no GROUP BY and using GROUP BY in CTE baskets
Only change the CTE baskets for different scenarios.
The CTE rules and the main SELECT clause are common parts.
*/

WITH baskets AS (
    -- baskets of store Neat
    -- without GROUP BY or DISTINCT
    SELECT oi.order_id, oi.product_id, p.name,
           count(*) OVER (PARTITION BY oi.product_id) AS cnt,
           tot_order
    FROM order_items_neat oi
        INNER JOIN products_neat p ON oi.product_id = p.product_id
        CROSS JOIN (SELECT COUNT(DISTINCT order_id) tot_order FROM order_items_neat) t
),

-- The code below is common for different baskets
rules AS (
    -- ls: left sides of association rules
    -- rs: right sides of association rules
    SELECT ls.product_id AS ls_pid, rs.product_id AS rs_pid,
           ls.name AS ls_name, rs.name AS rs_name,
           count(*) AS cnt_ls_rs,
           max(ls.cnt) AS cnt_ls, max(rs.cnt) AS cnt_rs,
           max(ls.tot_order) AS tot_order
    FROM baskets ls JOIN
         baskets rs ON ls.order_id = rs.order_id
                   AND ls.product_id != rs.product_id
    GROUP BY ls.product_id, rs.product_id, ls_name, rs_name      
)
SELECT format('%-20s => %20s', ls_name, rs_name) AS rules,
       round(1.0 * cnt_rs / tot_order, 2) AS expectation,   -- Expectation of right side
       round(1.0 * cnt_ls_rs / tot_order, 2) AS support,    -- Support of the rule: {left side} => {right side}
       round(1.0 * cnt_ls_rs / cnt_ls, 2) AS confidence,    -- Confidence of the rule
       round(1.0 * cnt_ls_rs * tot_order / (cnt_ls * cnt_rs), 2) AS lift  -- Lift of the rule
FROM rules
ORDER BY lift DESC, confidence DESC, support DESC;


As we expect, all values of the three metrics (support, confidence, and lift) are 1 because we only have one order. The first three rows are:

                   rules                     | expectation | support | confidence | lift
---------------------------------------------+-------------+---------+------------+------
CARROTS 3LB          =>   PEPPER GREEN SWT   | 1.00        | 1.00    | 1.00       | 1.00
LEMON                =>   CARROTS 3LB        | 1.00        | 1.00    | 1.00       | 1.00
ROOSTER TOFU         =>   PEPPER GREEN SWT   | 1.00        | 1.00    | 1.00       | 1.00



Write a failing test

Suppose Neat has acquired another grocery store called Messy. We'll apply the code of Neat to the data of Messy to see if it works. Let's look at the new data.

A receipt of store Messy

The entity-relationship diagram of Messy data


We can see that the two receipts represent different database schemas.

Insert the Messy data on the receipt into the database

SELECT * FROM products_messy;

 product_id | name         | category
------------+--------------+----------
 2305       | KRO WHL MLK  | DAIRY
 2306       | KRO SKM MLK  | DAIRY
 2307       | ROMA TOMATO  | PRODUCE
 2308       | KRO TKY GRND | NA

SELECT * FROM orders_messy;

 order_id | order_date
----------+------------
 200      | 2016-01-01

SELECT * FROM order_items_messy;

 order_item_id | order_id | product_id | quantity
---------------+----------+------------+----------
  1            | 200      | 2305       | 1
  2            | 200      | 2305       | 1
  3            | 200      | 2305       | 1
  4            | 200      | 2305       | 1
  5            | 200      | 2306       | 1
  6            | 200      | 2306       | 1
  7            | 200      | 2306       | 1
  8            | 200      | 2306       | 1
  9            | 200      | 2307       | 1
 10            | 200      | 2308       | 1
 11            | 200      | 2308       | 1


Then we apply the SQL code of Neat to the Messy data. We only need to change the CTE baskets to:

WITH baskets AS (
    -- baskets of store Messy
    -- without GROUP BY or DISTINCT
    SELECT oi.order_id, oi.product_id, p.name,
           count(*) OVER (PARTITION BY oi.product_id) AS cnt,
           tot_order
    FROM order_items_messy oi
        INNER JOIN products_messy p ON oi.product_id = p.product_id
        CROSS JOIN (SELECT COUNT(DISTINCT order_id) tot_order FROM order_items_messy) t
),  

However, we get obviously wrong results (the first three rows shown here):

                    rules                      | expectation | support | confidence | lift
-----------------------------------------------+-------------+---------+------------+------
KRO SKM MLK          =>   KRO WHL MLK          | 4.00        | 16.00   | 4.00       | 1.00
KRO WHL MLK          =>   KRO SKM MLK          | 4.00        | 16.00   | 4.00       | 1.00
KRO TKY GRND         =>   KRO WHL MLK          | 4.00        | 8.00    | 4.00       | 1.00


All values are supposed to be 1 with only one order. The good news is that we now have a failing test case. Let's figure out what's wrong with the original code.
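
Before debugging, we can turn the eyeball check into an automated test. Here is a minimal sketch, assuming PostgreSQL; it inlines a simplified copy of the rules query (product names dropped) and counts metric violations, so a non-zero result is a failing test:

WITH baskets AS (
    -- same as above: without GROUP BY or DISTINCT
    SELECT oi.order_id, oi.product_id,
           count(*) OVER (PARTITION BY oi.product_id) AS cnt,
           tot_order
    FROM order_items_messy oi
        CROSS JOIN (SELECT count(DISTINCT order_id) AS tot_order
                    FROM order_items_messy) t
),
rules AS (
    SELECT ls.product_id AS ls_pid, rs.product_id AS rs_pid,
           count(*) AS cnt_ls_rs,
           max(ls.cnt) AS cnt_ls, max(rs.cnt) AS cnt_rs,
           max(ls.tot_order) AS tot_order
    FROM baskets ls JOIN
         baskets rs ON ls.order_id = rs.order_id
                   AND ls.product_id != rs.product_id
    GROUP BY ls.product_id, rs.product_id
)
SELECT count(*) AS failures                                  -- expect 0; non-zero means the test fails
FROM rules
WHERE 1.0 * cnt_ls_rs / tot_order <> 1                       -- support
   OR 1.0 * cnt_ls_rs / cnt_ls <> 1                          -- confidence
   OR 1.0 * cnt_ls_rs * tot_order / (cnt_ls * cnt_rs) <> 1;  -- lift

Run against the Messy data, this returns a non-zero failure count; swap in the refactored baskets CTE from the next section and the same check returns 0.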

The CTE baskets is the building block for generating the rules. Its data pipeline is: join order_items_messy with products_messy, cross join the total order count, then apply the window function count(*) OVER (PARTITION BY product_id).

The wrong counting happens in the window function, the last step of the pipeline. We want the window function to count each product once per order, i.e., the number of distinct orders containing each product_id; on the Messy data, count(*) instead counts every duplicated row.
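
We can confirm the over-counting directly on the Messy rows; for example, product 2305 appears on four rows but in only one order, matching the inflated cnt of 4 in the wrong results above:

SELECT product_id,
       count(*) AS row_cnt,                    -- what count(*) OVER (PARTITION BY product_id) sees
       count(DISTINCT order_id) AS order_cnt   -- what we actually want
FROM order_items_messy
GROUP BY product_id
ORDER BY product_id;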

If we use PostgreSQL and try count(DISTINCT oi.order_id) OVER (PARTITION BY oi.product_id) AS cnt, we get ERROR: DISTINCT is not implemented for window functions. So product_id has to be unique within each order_id before being piped into the window function. A common practice is to use GROUP BY.
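
Besides GROUP BY, another option is to deduplicate in a subquery before the window function runs. A sketch of that alternative baskets definition (the refactoring below uses GROUP BY instead):

WITH baskets AS (
    -- baskets of store Messy
    -- using DISTINCT in a subquery instead of GROUP BY
    SELECT d.order_id, d.product_id, d.name,
           count(*) OVER (PARTITION BY d.product_id) AS cnt,
           tot_order
    FROM (SELECT DISTINCT oi.order_id, oi.product_id, p.name
          FROM order_items_messy oi
              INNER JOIN products_messy p ON oi.product_id = p.product_id) d
        CROSS JOIN (SELECT count(DISTINCT order_id) AS tot_order
                    FROM order_items_messy) t
),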


Refactor the original code

Change the common table expression baskets to:

WITH baskets AS (
    -- baskets of store Messy
    -- using GROUP BY
    SELECT oi.order_id, oi.product_id, p.name,
           count(*) OVER (PARTITION BY oi.product_id) AS cnt,
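           -- tot_order is constant on every row; max() is needed only because
           -- it is not listed in the GROUP BY below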
           max(tot_order) AS tot_order
    FROM order_items_messy oi
        INNER JOIN products_messy p ON oi.product_id = p.product_id
        CROSS JOIN (SELECT COUNT(DISTINCT order_id) tot_order FROM order_items_messy) t
        GROUP BY oi.order_id, oi.product_id, p.name
),

After refactoring we get the correct output (the first three rows shown here):

                    rules                      | expectation | support | confidence | lift
-----------------------------------------------+-------------+---------+------------+------
KRO TKY GRND         =>   KRO WHL MLK          | 1.00        | 1.00    | 1.00       | 1.00
KRO SKM MLK          =>   KRO TKY GRND         | 1.00        | 1.00    | 1.00       | 1.00
KRO TKY GRND         =>   KRO SKM MLK          | 1.00        | 1.00    | 1.00       | 1.00

Now the failing test passes!


During the first iteration, we have gotten to know the data, built a basic association-rule extraction query, written a failing test, and refactored the code to make it pass. All of these steps are quick and lean. We are now comfortable moving on to more data or extracting more complex rules. And remember: write failing tests for each iteration.




Summary

We have used an example to illustrate how data analysis can work hand in hand with testing to explore and produce results early and often. TDD even advocates writing tests before coding. Though we don't always have the time budget and resources to do this kind of TDD, and during the exploratory phase testing might not be necessary, we should keep TDD in mind and practice it when appropriate, for at least three reasons:

  • Writing test cases is a process of clarifying the objectives of data analysis and documenting how the code is intended to work. As code becomes more complex, the need for documentation increases.
  • TDD can help mitigate the fear and errors of refactoring, modifying features, or adding new features. With test cases, we don't make those changes at the mercy of luck.
  • When switching from exploratory data analysis phase to production phase, the agile way of TDD can ensure that our data products are on the right track.


Thanks for reading!