Fitting Topic Models

Data preparation and model fitting code for topics.

Kris Sankaran (UW Madison)
01-07-2024

Reading, Recording, Rmarkdown

library("dplyr")
library("ggplot2")
library("gutenbergr")
library("stringr")
library("tidyr")
library("tidytext")
library("topicmodels")
theme479 <- theme_minimal() + 
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.border = element_rect(fill = NA, color = "#0c0c0c", linewidth = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)
  1. There are several packages in R that can be used to fit topic models. We will use latent Dirichlet allocation (LDA) as implemented in the topicmodels package, which expects input to be structured as a DocumentTermMatrix, a special type of matrix that stores the counts of words (columns) across documents (rows); a toy example is sketched just below. In practice, most of the effort required to fit a topic model goes into transforming the raw data into a suitable DocumentTermMatrix.
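
For intuition, here is a sketch of this structure built from a tiny, made-up table of word counts. The tibble is invented purely for illustration; the cast_dtm step mirrors what we do with the real data later in this post.

# toy word counts, one row per (document, word) pair
toy_counts <- tibble(
  document = c("doc1", "doc1", "doc2"),
  word     = c("whale", "sea", "love"),
  n        = c(3, 1, 2)
)
cast_dtm(toy_counts, document, word, n) # 2 documents (rows) x 3 terms (columns)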

  2. To illustrate this process, let’s consider the “Great Library Heist” example from the reading. We imagine that a thief has taken four books (Great Expectations, Twenty Thousand Leagues under the Sea, The War of the Worlds, and Pride and Prejudice) and torn all the chapters out. We are left with isolated pieces of text and have to determine which book each came from. The block below downloads all four books into an R object.

titles <- c("Twenty Thousand Leagues under the Sea", 
            "The War of the Worlds",
            "Pride and Prejudice", 
            "Great Expectations")
books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")
books
# A tibble: 6,372 × 3
   gutenberg_id text                    title                
          <int> <chr>                   <chr>                
 1           36 "cover "                The War of the Worlds
 2           36 ""                      The War of the Worlds
 3           36 ""                      The War of the Worlds
 4           36 ""                      The War of the Worlds
 5           36 ""                      The War of the Worlds
 6           36 "The War of the Worlds" The War of the Worlds
 7           36 ""                      The War of the Worlds
 8           36 "by H. G. Wells"        The War of the Worlds
 9           36 ""                      The War of the Worlds
10           36 ""                      The War of the Worlds
# ℹ 6,362 more rows
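
As an optional sanity check, we can count the number of lines associated with each title, to confirm that all four books were actually retrieved:

books %>%
  count(title)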
  3. Since we imagine that the word distributions are not equal across the books, topic modeling is a reasonable approach for discovering the books associated with each chapter. Note that, in principle, other clustering and dimensionality reduction procedures could also work.

  4. First, let’s simulate the process of tearing the chapters out. We split the raw texts any time the word “Chapter” appears; a toy demonstration of this counter follows the code below. We keep track of the book names for each chapter, but this information is not passed to the topic modeling algorithm.

by_chapter <- books %>%
  group_by(title) %>%
  mutate(
    chapter = cumsum(str_detect(text, regex("chapter", ignore_case = TRUE)))
  ) %>%
  group_by(title, chapter) %>%
  mutate(n = n()) %>%
  filter(n > 5) %>%
  ungroup() %>%
  unite(document, title, chapter)
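
To see how the chapter counter inside mutate works, here is the same cumsum + str_detect pattern applied to a made-up vector (toy data, purely for illustration):

x <- c("CHAPTER I", "It was a dark night.", "More text here.",
       "Chapter II", "The end.")
cumsum(str_detect(x, regex("chapter", ignore_case = TRUE)))
# returns 1 1 1 2 2: each heading starts a new chapter ID, and any
# front matter before the first heading would be labeled chapter 0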
  5. As it is, the text data are long character strings, giving actual text from the novels. To fit LDA, we only need counts of each word within each chapter – the algorithm throws away information related to word order. To derive word counts, we first split the raw text into separate words using the unnest_tokens function in the tidytext package, and we drop common stop words (like “the” and “of”) with anti_join, since they carry little topical information. Then, we can count the number of times each word appeared in each document using count, a shortcut for the usual group_by and summarize(n = n()) pattern.
word_counts <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(document, word) # shortcut for group_by(document, word) %>% summarise(n = n())

word_counts
# A tibble: 8,825 × 3
   document                word            n
   <chr>                   <chr>       <int>
 1 The War of the Worlds_0 10,000,000      1
 2 The War of the Worlds_0 12              1
 3 The War of the Worlds_0 140,000,000     1
 4 The War of the Worlds_0 1894            1
 5 The War of the Worlds_0 2               1
 6 The War of the Worlds_0 35,000,000      1
 7 The War of the Worlds_0 8th             1
 8 The War of the Worlds_0 _been_          1
 9 The War of the Worlds_0 _brutes_        1
10 The War of the Worlds_0 _daily          3
# ℹ 8,815 more rows
  6. These word counts are still not in a format compatible with conversion to a DocumentTermMatrix. The issue is that the DocumentTermMatrix expects words to be arranged along columns, but currently they are stored across rows. The line below converts the original “long” word counts into a “wide” DocumentTermMatrix in one step. The printed summary reports the number of documents (rows) and terms (columns) in the result, along with its sparsity, the fraction of entries that are zero.
chapters_dtm <- word_counts %>%
  cast_dtm(document, word, n)
chapters_dtm
<<DocumentTermMatrix (documents: 4, terms: 6352)>>
Non-/sparse entries: 8825/16583
Sparsity           : 65%
Maximal term length: 17
Weighting          : term frequency (tf)
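
To see a few of the raw counts, we can densify a small slice of the matrix. (Avoid calling as.matrix on the full object, since it expands every zero entry.)

# counts for the first 2 documents and first 4 terms
as.matrix(chapters_dtm[1:2, 1:4])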
  7. Once the data are in this format, we can use the LDA function to fit a topic model. We choose \(K = 4\) topics because we expect each topic to match one book. Different hyperparameters can be set using the control argument; a sketch of an alternative configuration appears after the fit below.
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
A LDA_VEM topic model with 4 topics.
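
For example, we could swap the default variational algorithm (VEM) for collapsed Gibbs sampling and fix the Dirichlet hyperparameter alpha on the document-topic proportions. The settings below are illustrative, not recommendations; see ?LDAcontrol for the full set of options.

chapters_lda_gibbs <- LDA(
  chapters_dtm, k = 4, method = "Gibbs",
  control = list(seed = 1234, alpha = 0.1, iter = 1000) # illustrative values
)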
  8. There are two types of outputs produced by the LDA model: the topic word distributions (for each topic, which words are common?) and the document-topic memberships (which topics does a document draw from?). For visualization, it is easiest to extract these parameters using the tidy function, specifying whether we want the topics (beta) or memberships (gamma).
topics <- tidy(chapters_lda, matrix = "beta")
memberships <- tidy(chapters_lda, matrix = "gamma")
  9. This tidy approach is preferable to extracting the parameters directly from the fitted model (e.g., using chapters_lda@gamma) because it ensures the output is a tidy data.frame, rather than a matrix. Tidy data.frames are easier to visualize using ggplot2; a sketch of one such plot follows the printouts below.
# highest weight words per topic
topics %>%
  arrange(topic, -beta)
# A tibble: 25,408 × 3
   topic term        beta
   <int> <chr>      <dbl>
 1     1 martians 0.0118 
 2     1 martian  0.0105 
 3     1 time     0.0102 
 4     1 looked   0.00950
 5     1 day      0.00888
 6     1 eyes     0.00888
 7     1 machine  0.00756
 8     1 food     0.00752
 9     1 green    0.00732
10     1 hill     0.00696
# ℹ 25,398 more rows
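
As a sketch of the kind of figure these tidy outputs support, the block below plots the ten highest-weight words in each topic; reorder_within and scale_y_reordered are tidytext helpers that order the bars within each facet, and the choice of ten words is arbitrary.

top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% # ten highest-weight words per topic
  ungroup()

ggplot(top_terms, aes(beta, reorder_within(term, beta, topic))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ topic, scales = "free") +
  labs(x = expression(beta), y = NULL)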
# topic memberships per document
memberships %>%
  arrange(document, topic)
# A tibble: 16 × 3
   document                topic      gamma
   <chr>                   <int>      <dbl>
 1 The War of the Worlds_0     1 0.00000205
 2 The War of the Worlds_0     2 0.00127   
 3 The War of the Worlds_0     3 0.999     
 4 The War of the Worlds_0     4 0.00000205
 5 The War of the Worlds_1     1 0.0000204 
 6 The War of the Worlds_1     2 1.00      
 7 The War of the Worlds_1     3 0.0000204 
 8 The War of the Worlds_1     4 0.0000204 
 9 The War of the Worlds_2     1 0.219     
10 The War of the Worlds_2     2 0.00000307
11 The War of the Worlds_2     3 0.00000307
12 The War of the Worlds_2     4 0.781     
13 The War of the Worlds_3     1 0.998     
14 The War of the Worlds_3     2 0.000599  
15 The War of the Worlds_3     3 0.000599  
16 The War of the Worlds_3     4 0.000599  
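
As one last check on the memberships, since each chapter should come from a single book, we can keep only the dominant topic per document; slice_max retains the row with the largest gamma within each group:

memberships %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()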

Citation

For attribution, please cite this work as

Sankaran (2024, Jan. 7). STAT 436 (Spring 2024): Fitting Topic Models. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week11-2/

BibTeX citation

@misc{sankaran2024fitting,
  author = {Sankaran, Kris},
  title = {STAT 436 (Spring 2024): Fitting Topic Models},
  url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week11-2/},
  year = {2024}
}