In this high-level overview of R, my goal is to get you writing code as quickly as possible. As a Business Intelligence analyst/developer (regardless of Tableau, PowerBI, Qlik, etc.), it’s safe to assume you’re already comfortable with data types (numeric, string, etc.) and basic functionality (addition, multiplications, etc.). And let’s be real: we’re all going to Google our ways to becoming experts.
Why R opposed to Python?!!
Remember: Concepts over syntax.
Setup & Installation
Download R and RStudio respectively. All of the defaults should be fine for how you’ll be using R 95% of the time.
It should go without saying, but you’re going to need an instance of R on your machine. Here is a list of central repositories (sites) that you can download R from.
Next, you’ll need an IDE (Integrated Development Environment). By far, RStudio is the most common. Here is a link to download RStudio. The Desktop version (free) will be just fine. Alternatively, if you’re use to more traditional development, you can use Visual Studio (version 2017 as of this post).
Part 1: the RStudio Environment
Variables, working directories, console, etc.
Setting/confirming your working directory
It’s always important to understand where all of your work and files are being read and saved from. You should be comfortable with a few basic commands:
## Check current working directory getwd() ## Manually set working directory setwd("C:/TimLafferty/R_Tutorials/")
Additionally, there are a couple of simpler ways to do this.
From the toolbar:
From the ‘Files’ tab on the 4th quadrant of the RStudio pane:
Part 2: Importing Data
The datasets for this tutorial can be found here.
Without data, R is just an aesthetically dull calculator. Here’s a few ways to bring your datasets into your environment:
- Reading .csv’s
.csv’s are probably the most straight forward to import. We’ll use the ‘read.csv’ method to assign the data to a variable – just like so:
## Most basic approach df.orders = read.csv("Superstore-Orders.csv", header = TRUE, sep = ",") df.people = read.csv("Superstore-People.csv", header = TRUE, sep = ",") df.returns = read.csv("Superstore-Returns.csv", header = TRUE, sep = ",")
Excel spreadsheets are probably the most common (flattened) data source that Business Intelligence units deal with. Reading those in via R is pretty simple as well:
## We need to install and load the 'readxl' package if(!require(readxl)) install.packages("readxl", repos = "http://cran.us.r-project.org") library(readxl) ## We assign each sheet in the workbook its own variable df.orders = read_xlsx("Datasources/Superstore.xlsx", sheet = "Orders", col_names = TRUE) df.people = read_xlsx("Datasources/Superstore.xlsx", sheet = "People", col_names = TRUE) df.returns = read_xlsx("Datasources/Superstore.xlsx", sheet = "Returns", col_names = TRUE)
There are ways to assign multiple sheets to one variable, but it’s a bit outside the scope of this post. Feel free to reach out if you’d like to learn.
This is absolutely possible and done frequently. However, for a crash-course type of post like this, I think it might actually do more harm than good. Best case scenario is it confuses people (unnecessarily) and worst case you drop a production database. If there’s interest, I’ll do a quick write up on connecting via ODBC drivers.
Part 3: tidyverse
tidyverse makes a life of data easier. Tidyverse consists of 8 main packages designed to help import, manipulate and interrogate data sets of all kinds. Here is a brief overview of the 8 packages copied directly from the tidyverse site:
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
*More details and examples in following section.
tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.
tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations
forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
Part 4: dplyr
dplyr is huge. It might even be the deciding reason I prefer R over Python. The ability to chain (or pipe – ‘%>%’) functions together makes development so much more coherent. The functions & methods that dplyr offer are robust but not overwhelming – very simple to understand. For instance – here are a few of the main functions:
Just like it sounds – select which columns you’d like to include in the proceeding functions.
df = df.orders %>% select(Order.ID, Order.Date, Customer.ID, State, Sub.Category, Sales, Profit)
This function will aggregate your data on the categorical (dimension) columns. This is used in conjunction with the Summarize function, so I’ll combine the example with the following bullet point
Typically following a Group_By statement, the Summarize function will determine how the measures are aggregated – SUM(), MEAN(), MIN(), etc. For this exercise, we’ll create a few aggregated measures including TotalSales, AvgSales, TotalProfit, AvgProfit and TotalTransactions.
df = df.orders %>% select(Order.ID, Order.Date, Customer.ID, State, Sub.Category, Sales, Profit) %>% group_by(Customer.ID) %>% summarize(TotalTransactions = n_distinct(Order.ID), TotalSales = sum(Sales), AvgSales = mean(Sales), TotalProfit = sum(Profit), AvgProfit = mean(Profit))
Simply reordering the data. To be clear: this will physically rewrite the data in the order you specify unlike querying data in a database. For this exercise, we’ll sort by the Total Sales in from high to low (descending).
df = df.orders %>% select(Order.ID, Order.Date, Customer.ID, State, Sub.Category, Sales, Profit) %>% group_by(Customer.ID) %>% summarize(TotalTransactions = n_distinct(Order.ID), TotalSales = sum(Sales), AvgSales = mean(Sales), TotalProfit = sum(Profit), AvgProfit = mean(Profit)) %>% arrange(desc(TotalSales))
There will be plenty of times when you’ll want to create a new column – which is the main purpose of the Mutate method. In this example, we’ll create two new columns that are calculations of existing columns.
df = df.orders %>% select(Order.ID, Order.Date, Customer.ID, State, Sub.Category, Sales, Profit) %>% group_by(Customer.ID) %>% summarize(TotalTransactions = n_distinct(Order.ID), TotalSales = sum(Sales), AvgSales = mean(Sales), TotalProfit = sum(Profit), AvgProfit = mean(Profit)) %>% arrange(desc(TotalSales)) %>% mutate(TotalMargin = TotalProfit/TotalSales, AvgMargin = AvgProfit/AvgSales)
dplyr supports the 4 “common” joins (INNER, LEFT, RIGHT & FULL), 2 filtering joins (ANTI & SEMI) and a nested join (way outside the scope of this post). I imagine most people reading this far are familiar with the 4 common joins, so I’m going to focus on the two filtering joins.
Example 1: Keeping only the orders that have NOT been returned:
## Approach 1 returnedOrders = anti_join(df.orders, df.returns, by = c("Order.ID" = "Order.ID")) ## Approach 2 (piped) returnedOrders = df.orders %>% anti_join(df.returns, by = c("Order.ID" = "Order.ID"))
Example 2: Keeping only the orders that HAVE been returned:
## Approach 1 nonReturnedOrders = semi_join(df.orders, df.returns, by = c("Order.ID" = "Order.ID")) ## Approach 2 (piped) nonReturnedOrders = df.orders %>% semi_join(df.returns, by = c("Order.ID" = "Order.ID"))
All of these concepts should be fairly familiar – especially if you’re already comfortable with any variant of ANSI-SQL (T-SQL, mySQL, etc.).
Part 5: Lubridate
Anyone dealing with data will most likely need to address date fields at some point. lubridate is a great, easy-to-use package for manipulating and parsing dates. In fact, it’s so straight-forward that I’m just going to point you to their documentation for examples.
Hopefully this was easy enough to follow and provided enough value to spark your interest. Most of the following posts will be focused on building models and structuring their outputs in a way that’s easily-compatible with dedicated BI platforms such as Tableau. Be sure to subscribe to be kept up-to-date on future machine learning posts!