
I recently came across the tidytext package in R, along with its accompanying book, Text Mining with R by Julia Silge and David Robinson. I found it very cogent and practical for basic text mining and NLP problems.

The book builds on tidy data principles, so prior knowledge of dplyr and ggplot2 really helped with picking it up and jumping into some NLP.

There is a plethora of out-of-the-box tools that help with basic natural language processing (NLP) tasks (sketched briefly after the list), such as:

  • functions for tokenizing documents
  • built-in data frames of common stop-words (e.g., a, an, and, the, but)
  • functions for calculating tf, idf, and tf-idf
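
To give a quick flavor of these tools, here is a minimal sketch on a two-document toy corpus (the text below is invented for illustration and has nothing to do with my prospectus); unnest_tokens(), stop_words, and bind_tf_idf() all ship with tidytext.

library(dplyr)
library(tidytext)

toy <- tibble(doc  = c("a", "b"),
              text = c("Groundwater recharge and the water table",
                       "The water budget of an aquifer"))

toy %>%
  unnest_tokens(word, text) %>%  # tokenize: one lowercased word per row
  count(doc, word) %>%           # term counts per document
  bind_tf_idf(word, doc, n)      # add tf, idf, and tf-idf columns

head(stop_words)                 # built-in data frame of common stop-words (word + lexicon)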

After playing around a bit with examples, I thought it would be interesting to see what my 38-page research prospectus, which I spent months slaving over, boiled down to. Here’s how I did it.


Bring in Data

I first saved my .docx file as a .txt in UTF-8 encoding because, in short, it’s easier for R to read. The result is a very messy table, which I won’t print here.

path <- 'rp.txt' # the local file path to my research prospectus

dat <- read.table(path, header = FALSE, fill = TRUE) # fill = TRUE b/c rows are of unequal length

Load libraries

library(dplyr) # for data wrangling
library(tidytext) # for NLP
library(stringr) # to deal with strings
library(wordcloud) # to render wordclouds
library(knitr) # for tables
library(DT) # for dynamic tables
library(tidyr) # for reshaping data

1. Tidy

Since the package we’re using adheres to tidy data principles, step 1 is to get this messy table into a one-column data frame, with one word in each row.

# reshape the .txt data frame into one column
tidy_dat <- tidyr::gather(dat, key, word) %>% select(word)

tidy_dat$word %>% length() # there are 10,504 tokens in my document
## [1] 10504
unique(tidy_dat$word) %>% length() # and of these, 2,874 are unique 
## [1] 2874

2. Tokenize

The next step is to tokenize: split the data frame into individual tokens, boil it down to unique observations, and count the number of occurrences of each. To perform this, we use the out-of-the-box function unnest_tokens(), which takes 3 arguments:

  • a tidy data frame
  • name of the output column to be created
  • name of the input column to be split into tokens

Then we use the count() function from dplyr to group by words and tally observations. Because count() performs a group_by() on the word column, we can’t forget to ungroup().

# tokenize
tokens <- tidy_dat %>% 
  unnest_tokens(word, word) %>%       # output column 'word', input column 'word'
  dplyr::count(word, sort = TRUE) %>% # tally each unique token, most frequent first
  ungroup()

Just because a token is common doesn’t mean it’s important. For instance, take a look at the 10 most common tokens in my research prospectus.

tokens %>% head(10)
## # A tibble: 10 x 2
##    word            n
##    <chr>       <int>
##  1 the           487
##  2 and           367
##  3 of            344
##  4 in            235
##  5 to            235
##  6 groundwater   192
##  7 a             139
##  8 water         138
##  9 is            112
## 10 for            92

Of the 10, only 2 actually tell us something about what the document discusses: groundwater and water. Cleaning natural language is like panning for gold: most of the language is useless, but every once in a while we find a gold nugget. We want to keep only the nuggets.
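
One way to pan those nuggets out, sketched below as a quick illustration rather than as the next formal step, is to drop the stop-words with an anti_join() against tidytext’s built-in stop_words data frame.

# a sketch: remove tokens that appear in the built-in stop_words data frame
tokens %>% 
  anti_join(stop_words, by = "word") %>% 
  head(10)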