Analysing Free Form Text

Introduction

In this post, I am going to introduce basic methods for analysing free form (i.e., unstructured) text. In a different post, we analysed some [free form text that is somewhat semi-structured](/post/matching-messy-texts/) (e.g., addresses, firm names, etc.). Here, we will expand on some of those techniques and explicitly focus on quantifying frequencies of words, topics, and sentiments. Free form texts are ubiquitous in the planning field. For example, you may want to understand public sentiment around a proposed project by analysing tweets or public comments. Or you may want to analyse newspaper articles and blogs to understand trending topics. The possibilities are endless.

However, by its very nature, unstructured text and natural language is very hard to pin down in numbers. Even things that are amenable to quantification (e.g., frequencies of words) lose context, and thus meaning and nuance, when quantified. Qualitative analysis of text is always more meaningful than reductive quantitative analyses. Nonetheless, quantitative approaches are useful for examining large bodies of text (also referred to as a “corpus” of texts), and they can provide some advantages over qualitative analyses, including replicability and scalability.

Acquire Data and Packages

This post draws heavily from Will Curran-Groome’s final project for the Urban Analytics course in Spring 2020. We are going to analyse emails from a listserv (Cohousing-L) that focuses on cohousing. Cohousing is an intentional community of private homes clustered around shared space with some shared norms about voluntary contributions, management, and governance structures. US cohousing communities often comprise both rental and owner-occupied units; they frequently are multi-generational; they leverage existing legal structures, most often the home owner association (HOA), but the lived experience is often very different from that of conventional HOAs; and they also reflect a diversity of housing types, including apartment buildings, side-by-side duplexes and row homes, and detached single-family units.

Web-scraped emails (~45,000) and community characteristics are available here.

In this tutorial, we are going to use packages such as tidytext, textclean, and sentimentr in addition to other packages we have used previously. Please install them and load them with library() calls as appropriate.

For pedagogical purposes, I place library calls where the packages are needed. In general, you want to put all your library calls at the top of the script. Please pay particular attention to conflicts between functions of the same name in different packages: packages that are loaded later take precedence over ones that are loaded earlier. If you want to use the function from an earlier-loaded package, you can call it explicitly with packagename::function().
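For example, loading dplyr masks stats::filter() and stats::lag(); an explicit namespace prefix makes the intent unambiguous. A minimal illustration:

library(dplyr)

# An unqualified call resolves to the most recently loaded package (dplyr here)
mtcars %>% filter(mpg > 30)

# Explicit namespacing calls the intended function regardless of load order
stats::filter(1:10, rep(1/3, 3)) # the moving-average filter from base R's stats package
dplyr::filter(mtcars, mpg > 30)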
library(tidyverse) # dplyr, stringr, readr, ggplot2, etc.; used throughout for the pipe and read_csv()
library(here)
library(ids)       # random_id() for generating message identifiers
library(lubridate)
library(textclean)

msgs <- here("tutorials_datasets", "cohousingemails", "cohousing_emails.csv") %>% read_csv()

str(msgs)
# spec_tbl_df [45,000 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#  $ subject : chr [1:45000] "Test message 10/22/92 6:37 pm" "test message 10/22/92 6:47 pm" "Places to announce COHOUSING-L" "Discussion style" ...
#  $ author  : chr [1:45000] "Fred H Olson -- WB0YQM" "STAN" "Fred H Olson -- WB0YQM" "Fred H Olson -- WB0YQM" ...
#  $ email   : chr [1:45000] NA NA NA NA ...
#  $ date    : POSIXct[1:45000], format: NA NA ...
#  $ msg_body: chr [1:45000] "This is a test message. Fred" "Thes message was set to the list with the address in all caps: COHOUSING-L [at] UCI.COM Fred" "Judy, below is my list of places to announce the list including the Inovative Housing newsletter. Can you dig u"| __truncated__ "Another topic is what tone, style or whatever should we encourage on this list. Some discussions people introdu"| __truncated__ ...
#  $ thread  : chr [1:45000] NA NA NA NA ...
#  $ content : chr [1:45000] "\n\n\n\n\n\n\n\n\nTest message 10/22/92 6:37 pm\n\t  <– Date –>    <– Thread –>\n\n\tFrom: Fred H Olson -- WB0Y"| __truncated__ "\n\n\n\n\n\n\n\n\ntest message 10/22/92 6:47 pm\n\t  <– Date –>    <– Thread –>\n\n\tFrom: STAN (STAN%MNHEPvx.c"| __truncated__ "\n\n\n\n\n\n\n\n\nPlaces to announce COHOUSING-L\n\t  <– Date –>    <– Thread –>\n\n\tFrom: Fred H Olson -- WB0"| __truncated__ "\n\n\n\n\n\n\n\n\nDiscussion style\n\t  <– Date –>    <– Thread –>\n\n\tFrom: Fred H Olson -- WB0YQM (FRED%JWHv"| __truncated__ ...
#  - attr(*, "spec")=
#   .. cols(
#   ..   subject = col_character(),
#   ..   author = col_character(),
#   ..   email = col_character(),
#   ..   date = col_datetime(format = ""),
#   ..   msg_body = col_character(),
#   ..   thread = col_character(),
#   ..   content = col_character()
#   .. )
#  - attr(*, "problems")=<externalptr>

msgs2 <- msgs %>%
  mutate_all(as.character) %>% ## I (Will) added this because I was getting an error stemming from read_csv() returning all factor variables. 
  filter(!is.na(content)) %>%
  mutate(
    msg_id = random_id(n = nrow(.)), #  Create a random ID
    email = case_when(
      is.na(email) ~ content %>% 
        str_match("\\(.*\\..*\\)") %>%
        str_sub(2,-2),
      T ~ email
    ),
    content = content %>%
      str_replace_all("\\\n", " ") %>%
      str_squish(),

# Note that this section takes a long time; I recommend patience. It may make sense to save intermittent steps
# instead of sequencing a long chain of pipes.

    msg_body = msg_body %>% 
                stringi::stri_trans_general("Latin-ASCII") %>% # transliterate accented characters to plain ASCII
                replace_html() %>%                             # strip HTML tags and entities
                replace_emoticon() %>%                         # swap emoticons for word equivalents
                replace_time(replacement = '<<TIME>>') %>%     # standardise clock times
                replace_number(remove = TRUE) %>%              # drop numerals
                replace_url() %>%                              # remove URLs
                replace_tag() %>%                              # remove @handle tags
                replace_email(),                               # remove email addresses
    
    date = as.POSIXct(date),
    
    date = case_when(
      is.na(date) ~ str_match(
        content,
        "[0-9]{1,2} [A-Za-z]{3} [0-9]{2,4} [0-9]{2}:[0-9]{2}"
      ) %>%
      lubridate::dmy_hm(),
      T ~ date
    ),
    author = author %>% tolower %>% str_replace_all("[^a-z]", " "),
    email = email %>% tolower %>% str_replace_all("[^a-z\\.@_\\d ]", "")
  )

A Digression into Regular Expressions

In the above code, you see a number of regular expressions (also referred to as regex) that are used to find and manipulate particular sequences of characters. Both the textclean and stringr packages provide functions (e.g., replace_emoticon and str_match, respectively) that leverage regular expressions. Regular expressions provide a very powerful and concise syntax for working with text, but they can be very difficult to read, and if you’re not careful, they can return unintended results. Use them sparingly and add comments for clarity.

At their core, regular expressions match patterns in text. A pattern can be as simple as “abc,” or can be significantly more complicated. For a quick introduction to regular expressions in R, read through the corresponding R for Data Science chapter: https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions.
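As a quick illustration (a minimal sketch with a made-up string; none of these objects appear in the analysis), here is how a few stringr functions apply patterns:

library(stringr)

x <- "Meeting moved to 7:30 pm in the common house" # hypothetical example string

str_detect(x, "common")                # TRUE: a literal pattern
str_extract(x, "[0-9]{1,2}:[0-9]{2}")  # "7:30": one or two digits, a colon, two digits
str_replace_all(x, "[aeiou]", "_")     # character class: replace every vowel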


Exercise

  • In the above code, we have used the following regular expressions. Can you tell what pattern each of these regular expressions will match?

    • "\\(.*\\..*\\)"
    • "\\\n"
    • "[0-9]{1,2} [A-Za-z]{3} [0-9]{2,4} [0-9]{2}:[0-9]{2}"
    • "[^a-zA-Z]"
    • "[^a-zA-Z\\.@_\\d ]"

Hint: see https://cheatography.com/davechild/cheat-sheets/regular-expressions/

  • Develop a regular expression that capitalizes the first letter of each sentence of the following string_lowercase (taken from https://en.wikipedia.org/wiki/Regular_expression). Is there an alternate or better way to do this without using regular expressions?
string_lowercase <- "a regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. usually such patterns are used by string-searching algorithms for 'find' or 'find and replace' operations on strings, or for input validation. it is a technique developed in theoretical computer science and formal language theory. the concept arose in the 1950s when the american mathematician stephen cole kleene formalized the description of a regular language. the concept came into common use with unix text-processing utilities. different syntaxes for writing regular expressions have existed since the 1980s, one being the posix standard and another, widely used, being the perl syntax. regular expressions are used in search engines, search and replace dialogs of word processors and text editors, in text processing utilities such as sed and awk and in lexical analysis. many programming languages provide regex capabilities either built-in or via libraries."

Hint: Read the documentation for str_replace and str_replace_all. Explore ?case for case conversion. You may need to use a different regular expression or different approach to capitalize the first sentence.


Tokenization, Stemming & Lemmatization

Tokenization

Tokenization is the task of chopping up your text into pieces, called tokens; it can also involve throwing away certain characters, such as punctuation. Tokenization is important in that it defines the smallest unit of analysis at which you can examine your text. In the simplest cases, tokens are simply words. However, there are a number of rules that you may want to employ that would work in certain instances and not in others. For example, aren't can be tokenised as are and n't, or aren and t, or as arent, depending on which rules you want to use.
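As a quick check (a minimal sketch using a made-up one-row data frame), you can pass a contraction through tidytext's default word tokenizer and see how it handles punctuation and case:

library(tidytext)
library(dplyr)

# Hypothetical example text; inspect how the contraction and the question mark are treated
tibble(text = "Aren't shared meals the best part of cohousing?") %>%
  unnest_tokens(word, text, token = "words")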

library(tidytext)

data(stop_words)

stop_words
# # A tibble: 1,149 × 2
#    word        lexicon
#    <chr>       <chr>  
#  1 a           SMART  
#  2 a's         SMART  
#  3 able        SMART  
#  4 about       SMART  
#  5 above       SMART  
#  6 according   SMART  
#  7 accordingly SMART  
#  8 across      SMART  
#  9 actually    SMART  
# 10 after       SMART  
# # … with 1,139 more rows

other_stop_words <- tibble(
  word = c(
    "cohousing",
    "mailing",
    "list",
    "unsubscribe",
    "mailman",
    "listinfo",
    "º",
    "org",
    "rob",
    "ann",
    "sharon",
    "villines",
    "sandelin",
    "zabaldo",
    "fholson"
  ),
  lexicon = "CUSTOM"
)


stop_words <- bind_rows(stop_words, other_stop_words)

body_tokens <- msgs2 %>%
  unnest_tokens(word, msg_body, token='words') %>%
  anti_join(stop_words)

Other types of tokens include characters, n-grams, sentences, lines, paragraphs, and tweets; explore these. anti_join is a useful function for keeping only the words that are not in the stop_words tibble. Notice why we needed to name the output column word: anti_join joins on columns with matching names, and stop_words stores its words in a column called word.
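For instance (a minimal sketch; the output column name sentence is my choice), sentence-level tokens are produced the same way by changing the token argument:

sentence_tokens <- msgs2 %>%
  select(msg_id, msg_body) %>%
  unnest_tokens(sentence, msg_body, token = "sentences")

head(sentence_tokens)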

A naive word count would look as follows.

top_word_counts <- body_tokens %>%
  filter(
      !str_detect(word,"\\d"),
    !str_detect(word, "_")
  ) %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  select(word = word, count) %>%
  arrange(desc(count))

Exercise

In the body_tokens above, we tried to remove common words that might skew the frequencies. Looking at the top_word_counts, iteratively build a list to remove more words to make the analysis more compelling and interesting.


Stemming

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. There are different algorithms that can be used in the stemming process, but the most common in English is the Porter stemmer. The purpose of stemming is to reduce similar words to their shared root; e.g., “talking,” “talked,” and “talks” might all be reduced to “talk.” We can use the SnowballC package to stem our tokens.

library(SnowballC)

top_stem_counts <- body_tokens %>%
                      select(word)%>%
                      mutate(stem_word = wordStem(word)) %>%
                       group_by(stem_word) %>%
                       summarise(count = n()) %>%
                       select(word = stem_word, count) %>%
                       arrange(desc(count))

top_stem_counts
# # A tibble: 73,015 × 2
#    word     count
#    <chr>    <int>
#  1 commun  104195
#  2 time     50521
#  3 peopl    50314
#  4 skeptic  47595
#  5 hous     42853
#  6 stick    37263
#  7 tongu    36576
#  8 common   27568
#  9 live     26204
# 10 info     24591
# # … with 73,005 more rows

Lemmatization

Lemmatization takes into consideration the morphological analysis of the words. For example, “ran” and “run” are derived from the same lemma. Lemmatization requires a language-specific dictionary for translating words to their lemmas; we will use a dictionary provided in the textstem package.

library(textstem)

top_lemm_counts <- body_tokens %>%
                      select(word)%>%
                       mutate(lemm_word = lemmatize_words(word))%>%
                       group_by(lemm_word) %>%
                       summarise(count = n()) %>%
                       select(word = lemm_word, count) %>%
                       arrange(desc(count))

g1 <- top_lemm_counts %>%
        top_n(30) %>%
        ggplot() + 
        geom_bar(aes(x=  reorder(word, count), y = count), stat = 'identity') +
        coord_flip() + 
        xlab("")+
        theme_bw()

library(plotly)

ggplotly(g1)
What’s going on with stick and tongue? Is this an issue with replacing the emoticons?
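A quick way to check (a minimal sketch with made-up text): pass a string containing a few common emoticons through textclean's replace_emoticon() and inspect what it substitutes them with.

library(textclean)

# Hypothetical example; compare the input and the output
replace_emoticon("Great meeting tonight :-) see you at the common house :-P")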

N-grams

Looking at words in isolation has all sorts of problems. For example, ‘not happy’ expresses a single concept rather than two separate concepts of negation and happiness. This will affect sentiment analysis later on. To deal with this issue, we could potentially use n-grams. An n-gram reflects a token of n sequenced units. By looking at multiple units of text as a single token, we can overcome some of the challenges of looking at single words in isolation from their context. Here, we construct bigrams, or two-word n-grams.

bigrams <- msgs2 %>%
  select(-thread) %>%
  unnest_tokens(
    bigram,
    msg_body,
    token = "ngrams",
    n = 2
  )

negation_words <- c(
  "not",
  "no",
  "never", 
  "without",
  "don't",
  "cannot",
  "can't",
  "isn't",
  "wasn't",
  "hadn't",
  "couldn't",
  "wouldn't",
  "won't"
)

modified_stops <- stop_words %>%
  filter(!(word %in% negation_words))

refined_bigrams <- bigrams %>%
  separate(bigram, c("word1", "word2")) %>%
  filter(
    !word1 %in% modified_stops$word,
    !word2 %in% modified_stops$word
  ) %>%
  mutate(lemm_word1 = lemmatize_words(word1),
         lemm_word2 = lemmatize_words(word2))

refined_bigrams <- refined_bigrams %>%
  count(lemm_word1, lemm_word2, sort = T) %>%
  unite(bigram, lemm_word1, lemm_word2, sep = " ")

Exercise

  • Visualise the top 20 bigrams using ggplot.
  • Wordclouds are bad statistical graphics. However, they are popular. Create a wordcloud for these bigrams using the wordcloud2 package.

There is no reason to think that bigrams are the right tokens; you can use any combination of words. But you should be cognizant of the trade-off between increasing n and the marginal value to the analysis. As you can see, bigrams form a substantially larger dataset than words: for a vocabulary of 1,000 terms, the universe of potential bigrams is 1,000,000, though you will see a much smaller set in practice because of linguistic conventions.
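As a quick check (a minimal sketch; it assumes the body_tokens and refined_bigrams objects created above), you can compare the observed vocabulary size to the number of distinct bigrams:

n_distinct(body_tokens$word) # vocabulary size (distinct unigram tokens)
nrow(refined_bigrams)        # distinct bigrams actually observed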

Sentiment Analysis

Sentiment analysis is the interpretation and classification of emotions (positive, negative, and neutral) within text data. It is notoriously unreliable without a proper understanding of the context and of linguistic patterns such as sarcasm and subtweeting.

There are a number of dictionaries that exist for evaluating opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are:

  • AFINN from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., tokens of single words. These lexicons contain many English words, and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. There are also specialized lexicons, such as the loughran lexicon, which is designed for analysis of financial documents. This lexicon labels words with six possible sentiments important in financial contexts: “negative,” “positive,” “litigious,” “uncertainty,” “constraining,” or “superfluous.”

You can access these lexicons via:

library(textdata)
nrc_sentiment <- get_sentiments('nrc')

You can estimate the sentiment of each message by looking at the frequency of words with a particular sentiment using the following code.

body_tokens %>%
      select(word, msg_id) %>%
      mutate(lemm_word = lemmatize_words(word)) %>%
     inner_join(nrc_sentiment, by=c('lemm_word' = 'word')) %>%
     group_by(msg_id, sentiment) %>%
     summarize(count = n()) %>%
     mutate(freq = count/sum(count)) %>%
     pivot_wider(id_cols = msg_id, values_from=freq, names_from=sentiment, values_fill = 0) %>%
    top_n(5) 
# # A tibble: 44,609 × 11
# # Groups:   msg_id [44,609]
#    msg_id      anger anticipation disgust  fear negative positive sadness  trust
#    <chr>       <dbl>        <dbl>   <dbl> <dbl>    <dbl>    <dbl>   <dbl>  <dbl>
#  1 0000c5777… 0.0870       0.217   0.0435 0.130   0.217     0.174  0.0870 0.0435
#  2 000169c3f… 0            0.143   0      0       0         0.429  0      0.143 
#  3 0001b31d1… 0            0.15    0      0       0.05      0.5    0      0.2   
#  4 000364e2c… 0.02         0.16    0.01   0.02    0.1       0.34   0.08   0.15  
#  5 0004d9c5f… 0            0.2     0      0       0.2       0.4    0      0.1   
#  6 000551075… 0.02         0.24    0      0       0.08      0.34   0.02   0.1   
#  7 0007194c3… 0            0.2     0      0.133   0.0667    0.267  0      0.133 
#  8 00087d159… 0.0819       0.111   0.0234 0.111   0.129     0.240  0.0292 0.158 
#  9 0009e398e… 0.130        0.0926  0.0556 0.148   0.130     0.231  0.0648 0.0648
# 10 000a0aa90… 0            0.08    0.12   0       0.2       0.24   0.12   0.12  
# # … with 44,599 more rows, and 2 more variables: joy <dbl>, surprise <dbl>

Exercise

  • What are the problems with the above approach?
    • Hint 1: There’s a one-to-many relationship in NRC. Does this need fixing?
    • Hint 2: We know that the sentiment of a single word is often context-dependent. How can we address this (e.g., by using bigrams)?

In many instances, n-grams are better than unigrams at detecting sentiment, but they have their own challenges. For example, it is not clear what the appropriate value of n is. In such instances, it may be useful to think about the sentiment of the sentence as a whole. Since the set of possible sentences is denumerably infinite, it is not possible to create a sentiment dictionary for sentences.

Furthermore, the presence of valence shifters (negation, amplifiers, deamplifiers, etc.) changes the meaning and sentiment of a sentence. In such instances, it may be useful to use a package that can consider the entire sentence rather than combinations of words. sentimentr is one such package.
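For example (a minimal sketch with made-up sentences), sentimentr scores whole sentences and accounts for negators and amplifiers around polarised words:

library(sentimentr)

# Compare how negation and amplification shift the score of the same base sentence
sentiment(c(
  "I am happy with the common house design.",
  "I am not happy with the common house design.",
  "I am really very happy with the common house design."
))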

library(sentimentr)

msgs2 %>% 
  top_n(50) %>% 
  get_sentences() %>%
  sentiment_by(by=c('date', 'author')) %>% 
  top_n(10)
#                    date            author word_count   sd ave_sentiment
#  1: 1997-12-16 14:15:00      paul viscuso        164 0.37          0.26
#  2: 2001-10-02 04:00:00    molly williams        724 0.33          0.28
#  3: 2006-07-12 04:00:00     martin sheehy        372 0.32          0.41
#  4: 2008-02-22 05:00:00     craig ragland        252 0.23          0.28
#  5: 2008-04-11 04:00:00      steven hecht        435 0.35          0.55
#  6: 2011-04-13 04:00:00     craig ragland        305 0.27          0.40
#  7: 2012-09-21 04:00:00    jerry mcintire        193 0.23          0.26
#  8: 2014-07-14 04:00:00 fred list manager        200 0.34          0.24
#  9: 2015-05-05 04:00:00       allison tom        315 0.27          0.32
# 10: 2015-08-26 04:00:00             diane        705 0.21          0.24

Exercise

  • Use a different package, such as syuzhet instead of sentimentr. What are the similarities and differences?

Conclusions

This barely scratches the surface of the vast field of text mining and natural language processing. As you may have noticed, much of this analysis is domain specific and, more importantly, language specific. While some principles are transferable, it is always a good idea to learn about a domain prior to devising an analytical strategy.
