Data preparation and model fitting code for topic models.
library("dplyr")
library("ggplot2")
library("gutenbergr")
library("stringr")
library("tidyr")
library("tidytext")
library("topicmodels")
theme479 <- theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)
There are several packages in R that can be used to fit topic models. We will use LDA as implemented in the topicmodels package, which expects input to be structured as a DocumentTermMatrix, a special type of matrix that stores the counts of words (columns) across documents (rows). In practice, most of the effort required to fit a topic model goes into transforming the raw data into a suitable DocumentTermMatrix.
To illustrate this process, let’s consider the “Great Library Heist” example from the reading. We imagine that a thief has taken four books (Great Expectations, Twenty Thousand Leagues under the Sea, The War of the Worlds, and Pride and Prejudice) and torn all the chapters out. We are left with isolated pieces of text and have to determine which book each piece came from. The block below downloads all four books into an R object.
titles <- c("Twenty Thousand Leagues under the Sea",
"The War of the Worlds",
"Pride and Prejudice",
"Great Expectations")
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title")
books
# A tibble: 53,724 × 3
   gutenberg_id text                    title
          <int> <chr>                   <chr>
 1           36 "cover "                The War of the Worlds
 2           36 ""                      The War of the Worlds
 3           36 ""                      The War of the Worlds
 4           36 ""                      The War of the Worlds
 5           36 ""                      The War of the Worlds
 6           36 "The War of the Worlds" The War of the Worlds
 7           36 ""                      The War of the Worlds
 8           36 "by H. G. Wells"        The War of the Worlds
 9           36 ""                      The War of the Worlds
10           36 ""                      The War of the Worlds
# … with 53,714 more rows
Since we imagine that the word distributions differ across the books, topic modeling is a reasonable approach for discovering the book associated with each chapter. Note that, in principle, other clustering and dimensionality reduction procedures could also work.
First, let’s simulate the process of tearing the chapters out. We split the raw texts anytime the word “Chapter” appears. We will keep track of the book names for each chapter, but this information is not passed into the topic modeling algorithm.
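A minimal sketch of this splitting step, assuming a new document starts at every occurrence of the word “chapter” (the exact regex is an assumption); unite from tidyr builds labels like Great Expectations_0:
by_chapter <- books %>%
  group_by(title) %>%
  # increment the chapter counter whenever "chapter" appears in a line
  mutate(chapter = cumsum(str_detect(text, regex("chapter", ignore_case = TRUE)))) %>%
  ungroup() %>%
  unite(document, title, chapter) # e.g., "Great Expectations_0"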
Next, we tokenize the text into single words using the unnest_tokens function in the tidytext package, dropping common stop words along the way. Then, we can count the number of times each word appeared in each document using count, a shortcut for the usual group_by and summarize(n = n()) pattern.
word_counts <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(document, word) # shortcut for group_by(document, word) %>% summarise(n = n())
word_counts
# A tibble: 101,279 × 3
   document               word             n
   <chr>                  <chr>        <int>
 1 Great Expectations_0   1867             1
 2 Great Expectations_0   charles          1
 3 Great Expectations_0   contents         1
 4 Great Expectations_0   dickens          1
 5 Great Expectations_0   edition          1
 6 Great Expectations_0   expectations     1
 7 Great Expectations_0   illustration     1
 8 Great Expectations_100 age              1
 9 Great Expectations_100 arose            1
10 Great Expectations_100 barnard’s        1
# … with 101,269 more rows
The last step of data preparation is to convert these counts into a DocumentTermMatrix. The issue is that the DocumentTermMatrix expects words to be arranged along columns, but currently they are stored across rows. The line below converts the original “long” word counts into a “wide” DocumentTermMatrix in one step. Across these 4 books, we have 195 chapters and a vocabulary of size 18744.
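A minimal sketch of this conversion, using cast_dtm from tidytext (the object name chapters_dtm is an assumption):
chapters_dtm <- word_counts %>%
  cast_dtm(document, word, n) # documents along rows, words along columns
chapters_dtm

<<DocumentTermMatrix (documents: 195, terms: 18744)>>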
Non-/sparse entries: 101279/3553801
Sparsity           : 97%
Maximal term length: 19
Weighting          : term frequency (tf)
Once the counts are stored in a DocumentTermMatrix, we can use the LDA function to fit a topic model. We choose \(K = 4\) topics because we expect that each topic will match a book. Different hyperparameters can be set using the control argument.
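A minimal sketch of the fitting step, assuming the matrix from the previous step is stored as chapters_dtm; the seed is an arbitrary choice to make the fit reproducible:
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234)) # seed value is an assumption
chapters_lda

A LDA_VEM topic model with 4 topics.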
The fitted parameters can be extracted using the tidy function, specifying whether we want the topics (beta) or memberships (gamma). We prefer this approach over reading slots off the fitted model directly (e.g., chapters_lda@gamma) because it ensures the output is a tidy data.frame, rather than a matrix. Tidy data.frames are easier to visualize using ggplot2.
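A minimal sketch of the extraction step, using tidytext’s tidy method for LDA models (the object name topics is an assumption):
topics <- tidy(chapters_lda, matrix = "beta") # per-topic word probabilities
topics

# A tibble: 74,976 × 3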
   topic term          beta
   <int> <chr>        <dbl>
 1     1 captain    0.0154
 2     1 _nautilus_ 0.0127
 3     1 sea        0.00911
 4     1 nemo       0.00873
 5     1 ned        0.00798
 6     1 conseil    0.00683
 7     1 water      0.00624
 8     1 land       0.00605
 9     1 sir        0.00486
10     1 feet       0.00373
# … with 74,966 more rows
# A tibble: 780 × 3
   document               topic     gamma
   <chr>                  <int>     <dbl>
 1 Great Expectations_0       1 0.00402
 2 Great Expectations_0       2 0.988
 3 Great Expectations_0       3 0.00402
 4 Great Expectations_0       4 0.00402
 5 Great Expectations_100     1 0.000607
 6 Great Expectations_100     2 0.000607
 7 Great Expectations_100     3 0.603
 8 Great Expectations_100     4 0.396
 9 Great Expectations_101     1 0.0000201
10 Great Expectations_101     2 0.0000201
# … with 770 more rows
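Since the tidied parameters are ordinary data.frames, a quick faceted view of each topic’s top words takes only a few ggplot2 lines. A minimal sketch, assuming the betas are stored in topics as above:
top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% # ten highest-probability words per topic
  ungroup()

ggplot(top_terms, aes(beta, reorder_within(term, beta, topic))) +
  geom_col() +
  scale_y_reordered() + # strip reorder_within()'s ordering suffixes from the axis
  facet_wrap(~ topic, scales = "free_y") +
  labs(y = "term")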