
Text-mining & Sentiment Analysis:
The Basics

Sentiment analysis and text-mining are critical for looking beyond the face value of words. What is your data really telling you?

Not familiar with R? Check out this crash-course specifically for BI developers!

Setup

In this tutorial, we’ll be using the transcripts of 6 Quentin Tarantino films (Pulp Fiction, Inglourious Basterds, Django Unchained, Kill Bill 1 & 2, and The Hateful Eight).
Download the transcripts here.

Setup for this walk-through is fairly simple and straightforward. The majority of our example revolves around three packages:

  • tidyverse
  • tidyverse is an incredibly robust collection of R packages aggregated specifically for data science tasks. Some of the packages include dplyr (pretty much the sole reason I prefer R over Python) for data manipulation, tidyr for structuring data in tidy formats, stringr for string manipulation, readr for parsing external files and more.

  • tidytext
  • tidytext is arguably one of the fundamental packages for text mining. There’s a great number of functions that minimize the amount of hard work (read: coding) that you’ll have to do.

  • textstem
  • textstem is a great, lightweight package for stemming and lemmatizing (both covered in a bit) words.

if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
library(tidyverse)

if(!require(tidytext)) install.packages("tidytext", repos = "http://cran.us.r-project.org")
library(tidytext)

if(!require(textstem)) install.packages("textstem", repos = "http://cran.us.r-project.org")
library(textstem)

Without spending too much time rambling, the following steps read in the raw data source (.csv), clean it, and structure it in a way that’s conducive to a sentiment model.

Build a function that expands contractions:

## Function for expanding common contractions in English
expandContractions = function(text) {
  text = gsub("won't", "will not", text)
  text = gsub("can't", "can not", text)
  text = gsub("'m", " am", text)
  text = gsub("'d", " would", text)
  text = gsub("n't", " not", text)
  text = gsub("'ve", " have", text)
  text = gsub("'ll", " will", text)
  text = gsub("'re", " are", text)
  # 's can be possessive, but we accept the tradeoff
  text = gsub("'s", " is", text)
  return(text)
}
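
As a quick sanity check, here’s the function applied to a made-up line (not from the transcripts):

## Quick sanity check on a made-up line
expandContractions("i can't believe you won't go, but i'm sure they'll understand")
## "i can not believe you will not go, but i am sure they will understand"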

Import/ingest the data source as a data frame:

## You'll need to adjust the file path appropriate to your environment
df.source = read.csv("Tarantino-Transcripts.csv", fileEncoding="UTF-8-BOM") %>%
  data.frame()

## Expand contractions, strip all punctuation from the text and transform to lower case
## Rename some of the columns because I'm OCD about naming conventions...
df = df.source %>%
  rename(index = X, line = Line) %>%
  mutate(line = expandContractions(tolower(line)),
         line = str_replace_all(line, '[:punct:]', ""))
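
It’s worth a quick peek to confirm that the contractions were expanded and the punctuation is gone:

## Inspect the first few cleaned lines
head(df, n = 5)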

Awesome! From here, we’ll move on to Text Normalization.

Part 1: Tokenizing

Tokenization is a really cool term for string splitting – it’s nothing more than splitting sentences up into their individual words (generally kept together with an ID of some sort). This is where the tidytext package saves the day. It’s incredibly simple to do this:

## Tokenize the 'line' column with the result in a new column 'word'
df.tokenized = df %>%
  unnest_tokens(word, line, token = "words") 
head(df.tokenized, n = 10)

Notice that the index column is still intact. This is crucial for keeping track of which line each word came from. Although this is a fundamental data structure for any sort of NLP or sentiment analysis, by itself it’s not very useful. You can derive a word count. But so what?
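
To see why a raw word count isn’t all that enlightening, here’s a quick sketch (using dplyr’s count()) of the most frequent tokens – spoiler: it’s mostly filler words, which is exactly what the next step addresses:

## Most frequent tokens across all transcripts (before dealing with stop words)
df.tokenized %>%
  count(word, sort = TRUE) %>%
  head(n = 10)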

Part 2: Blacklists (or Stop Words)

If you skim through the tokenized dataset, the first thing you’ll notice is an abundance of noise words – words such as “the”, “is”, “of”, “and”, etc. Most of the time, you’ll want to remove these from the dataset in order to get a clearer picture of meaningful words. In our case, we’re going to flag them instead of removing them entirely. This gives us the flexibility of having both sets of context for our visualizations.

## Create a blacklist: tidytext's stop_words lexicon plus any custom words you want flagged
stop_words_custom = c("words", "that", "will", "be", "excluded")
head(stop_words)
blacklist = c(stop_words$word, stop_words_custom)

## Add a flag to notate whether each word is a stop word or not
df.tokenized = df.tokenized %>%
  mutate(isStopWord = ifelse(word %in% blacklist, 1, 0))
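
With the flag in place, filtering either way is trivial. For example, the same word count from earlier, minus the noise:

## Most frequent words once stop words are filtered out
df.tokenized %>%
  filter(isStopWord == 0) %>%
  count(word, sort = TRUE) %>%
  head(n = 10)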

Part 3: Stemming

Stemming more or less removes prefixes and suffixes in an attempt to leave only the root. In terms of scalability, stemming is inexpensive performance-wise. That being said, stemming does not consider morphological information. Check it out:

## Add stem to the tokenized data frame
df.tokenized = df.tokenized %>%
  mutate(stem = stem_words(word))
head(df.tokenized, n = 10)

Part 4: Lemmatizing

Lemmatizing is very similar to stemming with one main exception: morphological context is considered. For example, if the source word is “remembering”, the stem shows “rememb”. Recall that stemming is context-agnostic and simply removes suffixes & prefixes. The lemma of “remembering” would yield “remember” – which is what we’d expect. Stemming is much quicker performance-wise; however, I’d speculate that only 1%-2% of you will have text data sources large enough to notice unbearably poor performance while lemmatizing.

## Add lemma to the tokenized data frame
df.tokenized = df.tokenized %>%
  mutate(lemma = lemmatize_words(word))
head(df.tokenized, n = 10)
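
If you’d like to see the two approaches side by side, compare the stem and lemma for a handful of words (this word list is just an illustration):

## Compare stems and lemmas for a few sample words
sample_words = c("remembering", "studies", "driving", "cities")
data.frame(word = sample_words,
           stem = stem_words(sample_words),
           lemma = lemmatize_words(sample_words))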

Part 5: N-grams

If your goal is to show frequently used words, at least show phrases consisting of 2-3 words. This weeds out a bit of ambiguous terminology. N-grams do exactly this. The most commonly used are bi-grams (two-word phrases) and tri-grams (four-word phrases). Wait – tri-grams are four-word phrases?! No – tri-grams are three-word phrases. Just wanted to keep you on your toes.

## Return the top 25 bigrams
df.bigrams = df %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   ## lines with fewer than two words produce NA bigrams
  select(bigram) %>%
  group_by(bigram) %>%
  summarize(occurrences = n()) %>%
  arrange(desc(occurrences)) %>%
  top_n(25, occurrences)

Your turn: Create a data frame with the top 25 most used trigrams.

Conclusion: Creating Output

Now that we have a few different sets of data, let’s export them to a readable format (most frequently .csv) so we can begin exploring them in our visualization tools (Tableau, PowerBI, Qlik, etc.).

## First argument: dataset to export; Second argument: location, filename, & extension
write.csv(df.tokenized, "output/tokenized.csv")
write.csv(df.bigrams, "output/bigrams.csv")
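
One small gotcha: write.csv() won’t create the output folder for you. If it doesn’t exist yet, create it before running the two calls above:

## Make sure the output folder exists
if(!dir.exists("output")) dir.create("output")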

Although we haven’t produced too much data thus far, you can use these exports to start understanding the transcripts a bit more.

Next time: Sentiment & Emotional Lexicons

In the next post of this 3-part series, we’ll introduce three different sentiment models as well as some emotional lexicons. Make sure to subscribe to stay up to date!

What are your thoughts?!