Data Science, Machine Learning, Tableau

Market Basket Analysis w/ R

Market Basket Analysis is a great tool for learning about the relationships between products and the behavior of customers.

If a customer has a family-size bag of Doritos and 6-pack of Mountain Dew in their basket, what is the likelihood they also grab the latest X-box game?

Before We Start:

While it’s not necessary to understand the full math behind Apriori algorithm(s), it does help to understand the motivation behind them. Apriori algorithm(s) are used for frequent item set mining and association rule learning in a transactional database. Fundamentally, it’s a great methodology to ease your way into Market Basket Analysis. However, you are likely to run into scalability issues as your dataset increases.

Additionally, this tutorial assumes your data is in transactional state: 1 product & 1 transaction ID per row where product to transaction relationship is 1:N
tl;dr: Apriori can quickly become a memory hog.

Downloadable Materials

A list of the following files can be found in this shareable drive on Google.

  1. The raw dataset (SupstoreForR.csv)
  2. The R Script (Apriori-Generate-Ruletset.R)
  3. The ruleset generated from the Apriori model (SuperStore-Apriori-Rules.csv)
  4. The output used for visualization (Apriori-Output-Superstore.csv)

Step 1: Building the Model

Before we get too deep into building this model, I want to reinforce that this is a very basic implementation. I’m not sure I would take this model directly to your boss and call it complete. Hopefully, though, it gets you started in the right direction.

Let’s start with loading the libraries needed. Fortunately, there not many at all – as in there’s only one non-base package needed.

if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org") library(dplyr)

if(!require(arules)) install.packages("arules", repos = "http://cran.us.r-project.org") library(arules)

Okay – now we need to import the data. We’re simply going to read the csv file and convert the dataset into a data frame.

## Read the transactional data and convert to data frame
transactions = read.csv('SuperstoreForR.csv') %>%
  data.frame()

Simple enough. Now, let’s create a new data frame with just the variables that are important – ‘Order ID’ & ‘Sub Category’.

## Select only the columns of interest; make sure ID's are characters/strings
df = transactions %>%
  select(Sub.Category, Order.ID) %>%
  mutate(Order.ID = as.character(Order.ID))

Notice that we explicitly cast Order.ID as a character. Much like Tableau, numerical IDs are often assumed to be measures – which we know to be false. It’s always worth double-checking your IDs to ensure they’re the correct datatype.

Alright – one more step to finish prepping our data. As weird as this may sound, importing a file as a transaction-type is exponentially quicker than converting an existing data frame that’s currently in memory (which makes sense, but for another post…). We’re going to write our data frame to a temp .csv and read in back in as a transaction.

## Write transactions to file and read back as a transaction from arules package
## Weird - I know. I don't know the full reasoning of this either...
write.csv(df, "transactions.csv", row.names = F)
txns = read.transactions(file = "transactions.csv", 
                         format = "single",
                         sep = ",",
                         cols=c(2,1),
                         skip=1, rm.duplicates = T)

Awesome! Let’s build the model. Take 30 seconds and read the definitions below as these will be critical in tweaking your model.

Notice we pass in two of the aforementioned parameters: support (0.001) and confidence (0.01). For the Superstore dataset, I had to set the threshold fairly low in order to get a decent resultset. In the real word (with millions of transactions, most confidence levels start at 0.5.

## Generate the ruleset based on the transactions
## Specify both support, confidence, & max number of products (4 = 3 items + 1 add-on)
rules = apriori(txns, parameter = list(support=0.001, confidence=0.05, maxlen=4))
rules = sort(rules, decreasing=TRUE, by="confidence")

There. The model is built. Kind of anti-climatic, isn’t it? Let’s look at the details of our model:

summary(rules)

Sweet. Now let’s write out results to a file and move onto step two!

write(rules, file = 'SuperStore-Apriori-Rules.csv', sep=",", quote=TRUE)

Or, if you’re lazy like me, here’s the source code. You’ll just need to change the file path and you should be fine.

Step 2: Prep The Data For Visualization

As you might have already noticed, the generated ruleset is displayed in an unfamiliar format:

{ Product A, Product B, Product C } => Product D

We need to manipulate the rule in a few ways:

  1. Remove noise characters (such as ‘{‘, ‘}’, ‘=>’, etc.)
  2. Store the right-hand side (RHS) in a column by itself (I named mine ‘Add-On Item’)
  3. Separate the left-hand side by commas and store each value in a column (Basket-1, Basket-2, … Basket-N). Our model has a max itemset of 4 which means there are 3 supporting items (LHS) and 1 ‘recommended’ item (RHS). Some rules will have only 1 or 2 items on the LHS – I recommend filling the NULL values with ‘(Empty)’ for ease of use.

When the Apriori Rules dataset is imported into Tableau, it should look something like this:

Fortunately, Tableau has made it very easy to manipulate the data in our favor. The first thing we do is custom split out ALL data by the “=>” characters. This ensures a clean separation of the LHS (left-hand side) and RHS (right-hand side).

Our LHS is our “basket” and our RHS is our “item”. From here, we just need a bit of clean up. Let’s edit the calculation to remove the curly braces and rename the column to something more readable: “item”.

 

Similarly, we’re going to split the LHS up into 3 column. This part is going to take two steps:

  1. Remove curly braces and split at the “=>” pattern.
  2. Split the new calculation at the commas. At this point, there should be no need for a custom split – the default split should accept commas and produce 3 more columns  (see the image below).  After the columns are generated, hide the irrelevant columns and rename the new 3 new columns. I choose to name them “Basket 1”, “Basket 2”, & “Basket 3”.

 

Step 3: Share The Results

 

This is the fun part. In my opinion, Market Basket Analysis projects can be difficult to visualize and present effectively. Since we have up to 3 products in our basket (yielding an Add-On Item), I decided to break my dashboard up into 4 distinct sections: Basket 1, Basket 2, Basket 3 and the resulting item. The resulting item(s) have their respective model measurements (confidence, lift, support and observations. Finally, I added a small section for the association rules themselves. I think of this segment more of a cheat sheet to guide the user to the most frequent item sets.

Follow this link for the interactive version

Conclusion

Market basket analysis is one of the most studied disciplines in the data science industry. Mathematically, there’s no shortage of progression. However, in my experience, I’ve found a HUGE gap in presenting the MBA results (other than basic forms). Variants of network graphs are often used, but I don’t find those charts intuitive to a lot of end-users. Given a powerful tool such as Tableau, I’d love to see what else visualization junkies can come up with.

What are your thoughts?!