I recently came across the tidytext package in R, along with its accompanying book, Text Mining with R by Julia Silge and David Robinson. I found it very cogent and practical for basic text mining and NLP problems.
The book builds on tidy data principles, so a working knowledge of ggplot2 really helped with picking up the book and jumping into some NLP.
There is a plethora of out-of-the-box tools that help with basic natural language processing (NLP) tasks, such as:
- functions for tokenizing documents
- built-in data frames of common stop-words (e.g., a, an, and, the, but)
- functions for calculating term frequency (tf), inverse document frequency (idf), and tf-idf
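To give a feel for these tools, here's a minimal sketch on a toy two-sentence corpus (the documents and words are made up for illustration): tokenize with unnest_tokens(), drop stop-words with the built-in stop_words data frame, and score terms with bind_tf_idf().

```r
library(dplyr)
library(tidytext)

# a toy corpus: two one-sentence "documents"
docs <- tibble(doc  = c(1, 2),
               text = c("The aquifer stores groundwater.",
                        "The river recharges the aquifer."))

docs %>%
  unnest_tokens(word, text) %>%          # one lowercase token per row
  anti_join(stop_words, by = "word") %>% # drop a, an, and, the, ...
  count(doc, word) %>%                   # tally tokens per document
  bind_tf_idf(word, doc, n)              # add tf, idf, and tf_idf columns
```

Terms that appear in only one document (like groundwater or river) get a nonzero tf-idf, while terms shared by both (aquifer) score zero.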
After playing around a bit with the examples, I thought it would be interesting to see what my 38-page research prospectus, which I spent months slaving over, boiled down to. Here’s how I did it.
Bring in Data
I first saved my .docx file as a .txt in UTF-8 encoding because, in short, it’s easier for R to read. The result is a very messy table, which I won’t print here.
path <- 'rp.txt' # the local file path to my research prospectus
dat <- read.table(path, header = FALSE, fill = TRUE) # fill = TRUE b/c rows are of unequal length
library(dplyr)     # for data wrangling
library(tidytext)  # for NLP
library(stringr)   # to deal with strings
library(wordcloud) # to render wordclouds
library(knitr)     # for tables
library(DT)        # for dynamic tables
library(tidyr)     # to reshape data
Since the package we’re using adheres to tidy data principles, step 1 is to get this messy table into a one-column data frame, with one word in each row.
# reshape the .txt data frame into one column
tidy_dat <- tidyr::gather(dat, key, word) %>% select(word)
tidy_dat$word %>% length() # there are 10,504 tokens in my document
## [1] 10504
unique(tidy_dat$word) %>% length() # and of these, 2,874 are unique
## [1] 2874
The next step is to tokenize, or boil the data frame down to only unique observations, and count the number of each observation. To perform this, we use the out-of-the-box function unnest_tokens(), which takes 3 arguments:
- a tidy data frame
- name of the output column to be created
- name of the input column to be split into tokens
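On a toy data frame (made up for illustration), the call looks like this:

```r
library(dplyr)
library(tidytext)

# a tiny stand-in for the messy one-column table
toy <- tibble(txt = c("Groundwater flows slowly", "slowly but surely"))

# arguments: (tidy data frame, output column, input column)
toy %>% unnest_tokens(word, txt)
# each word becomes its own lowercase row: groundwater, flows, slowly, ...
```

By default unnest_tokens() also lowercases tokens and strips punctuation, which is exactly what we want here.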
Then we use the count() function from dplyr to group by words and tally observations. Because count() performs a group_by() on the word column, we can’t forget to ungroup().
# tokenize
tokens <- tidy_dat %>%
  unnest_tokens(word, word) %>%
  dplyr::count(word, sort = TRUE) %>%
  ungroup()
Just because a token is common doesn’t mean it’s important. For instance, take a look at the 10 most common tokens in my research prospectus.
tokens %>% head(10)
## # A tibble: 10 x 2
##    word            n
##    <chr>       <int>
##  1 the           487
##  2 and           367
##  3 of            344
##  4 in            235
##  5 to            235
##  6 groundwater   192
##  7 a             139
##  8 water         138
##  9 is            112
## 10 for            92
Of the 10, only 2 actually tell us something about what the document is about: groundwater and water. Cleaning natural language is like panning for gold: most of it is useless, but every once in a while we find a gold nugget. We want only the nuggets.
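One way to keep only the nuggets (sketched here on a hand-copied slice of the counts above, not the full tokens table) is an anti_join() against tidytext's built-in stop_words data frame:

```r
library(dplyr)
library(tidytext)

# a few of the counts from the table above
tokens <- tibble(word = c("the", "and", "groundwater", "a", "water"),
                 n    = c(487, 367, 192, 139, 138))

# anti_join() keeps only rows whose word is NOT in stop_words
tokens %>% anti_join(stop_words, by = "word")
# only groundwater and water survive
```

anti_join() is the tidy way to filter one table by the absence of a match in another, which is exactly the shape of the stop-word problem.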