An overview of dimensionality reduction via topics.
Topic modeling is a type of dimensionality reduction method that is especially useful for high-dimensional count matrices. For example, it can be applied to collections of text documents, where each row counts the words appearing in one document, or to biological data, where each row records feature counts for one sample.
For clarity, we will refer to samples as documents and features as words. However, keep in mind that these methods can be used more generally – we will see a biological application three lectures from now.
These models are useful to know about because they provide a compromise between clustering and PCA. Like clustering, they learn a small set of interpretable prototypes (the topics); like PCA, they describe each sample with continuous weights rather than a single hard assignment.
Without going into mathematical detail, topic models perform dimensionality reduction by supposing,
1. Each document is a partial member of several topics, rather than belonging to exactly one.
2. Each topic is a probability distribution over words, and different topics place high probability on different words.
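To make these two assumptions concrete, here is a minimal generative sketch in numpy; the vocabulary, topics, and membership weights are all hypothetical placeholders, and a real topic model infers these quantities from the observed counts rather than fixing them by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and topics: each topic is a probability
# distribution over the same set of words.
vocab = ["congress", "law", "stock", "trade"]
topics = np.array([
    [0.45, 0.45, 0.05, 0.05],  # a "politics"-like topic
    [0.05, 0.05, 0.45, 0.45],  # a "business"-like topic
])

# Each document mixes the topics according to its membership weights.
memberships = np.array([
    [1.0, 0.0],  # purely politics
    [0.0, 1.0],  # purely business
    [0.5, 0.5],  # partial member of both topics
])

# A document's word counts are drawn from its blended word distribution.
word_probs = memberships @ topics  # documents x words
counts = np.vstack([rng.multinomial(100, p) for p in word_probs])
print(counts)
```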
Figure 1: An overview of the topic modeling process. Topics are distributions over words, and the word counts of new documents are determined by their degrees of membership across a set of underlying topics. In an ordinary clustering model, each membership bar would have to be either pure purple or pure orange; here, each document is a mixture.
To illustrate the first point, consider modeling a collection of newspaper articles. Some articles might belong primarily to the “politics” topic and others to the “business” topic, while articles describing monetary policy at the Federal Reserve might belong partially to both.
For the second point, consider the difference in words that would appear in politics and business articles. Articles about politics might frequently include words like “congress” and “law,” but only rarely words like “stock” and “trade.”
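A model like this can be fit directly from the count matrix. Below is a minimal sketch using scikit-learn's LatentDirichletAllocation on a tiny, hypothetical corpus; on a real article collection, the highest-weight words in each fitted topic would begin to resemble the “politics” and “business” vocabularies described above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A tiny, hypothetical corpus standing in for a real article collection.
docs = [
    "congress passed the new law",
    "stock prices rose on trade news",
    "congress debated the trade law",
]

# Build the document-by-word count matrix.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit a two-topic model; n_components is the number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
memberships = lda.fit_transform(counts)  # documents x topics, rows sum to ~1

# Inspect the highest-weight words in each topic.
vocab = np.array(vectorizer.get_feature_names_out())
for k, weights in enumerate(lda.components_):
    top = np.argsort(weights)[::-1][:3]
    print(f"topic {k}:", list(vocab[top]))
```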
Geometrically, LDA can be represented by the following picture. The corners of the simplex correspond to the topics, and each document is a point inside the simplex whose position is given by its topic memberships: documents near a corner belong almost entirely to that topic, while documents in the interior are mixtures.
Figure 2: A geometric interpretation of LDA, from the original paper by Blei, Ng, and Jordan.
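For three topics, this picture is easy to draw by hand: a membership vector sums to one, so it can serve as barycentric coordinates that place each document inside a triangle whose corners are the topics. The sketch below uses hypothetical membership vectors.

```python
import numpy as np
import matplotlib.pyplot as plt

# Corners of the triangle, one per topic; the values below are a
# convenient equilateral layout, and the memberships are hypothetical.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
memberships = np.array([
    [0.8, 0.1, 0.1],  # mostly topic 1
    [0.1, 0.8, 0.1],  # mostly topic 2
    [0.3, 0.3, 0.4],  # a fairly even mixture, near the center
])

# Barycentric -> planar coordinates: weight each corner by membership.
points = memberships @ corners

triangle = np.vstack([corners, corners[:1]])  # close the outline
plt.plot(triangle[:, 0], triangle[:, 1], color="gray")
plt.scatter(points[:, 0], points[:, 1])
plt.axis("equal")
plt.show()
```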