An overview of dimensionality reduction via topics.
Topic modeling is a dimensionality reduction method that is especially useful for high-dimensional count matrices. For example, it can be applied to text data, where each row is a document and each column counts how often a given word appears in it.
For clarity, we will refer to samples as documents and features as words. However, keep in mind that these methods can be used more generally – we will see a biological application three lectures from now.
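To make the "count matrix" idea concrete, here is a minimal sketch of how a documents-by-words count matrix could be built. The toy corpus, the whitespace tokenization, and the helper name `count_row` are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Hypothetical toy corpus; any tokenized text would work the same way.
documents = [
    "congress passed a law",
    "stock trade and stock prices",
    "congress debated a trade law",
]

# Vocabulary: one column per distinct word, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.split()})


def count_row(doc, vocabulary):
    """Return the word counts of one document, aligned to the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


# Rows are documents (samples), columns are words (features).
count_matrix = [count_row(doc, vocabulary) for doc in documents]

for doc, row in zip(documents, count_matrix):
    print(row, "<-", doc)
```

With a realistic vocabulary, each row would have thousands of columns, most of them zero, which is the high-dimensional setting topic models are meant for.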
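These models are useful to know about because they provide a compromise between clustering and PCA. Like clustering, they summarize the data with a small number of groups (the topics). Like PCA, they let samples vary continuously, since a document can belong partially to several topics at once.
Without going into mathematical detail, topic models perform dimensionality reduction by supposing that (1) each document is a mixture of topics, and (2) each topic is a distribution over words.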
Figure 1: An overview of the topic modeling process. Topics are distributions over words, and the word counts of new documents are determined by their degree of membership over a set of underlying topics. In an ordinary clustering model, the bars for the memberships would have to be either pure purple or orange. Here, each document is a mixture.
To illustrate the first point, consider modeling a collection of newspaper articles. Some articles might belong primarily to the “politics” topic, and others to the “business” topic. Articles that describe monetary policy at the Federal Reserve might belong partially to both the “politics” and the “business” topics.
For the second point, consider the difference in words that would appear in politics and business articles. Articles about politics might frequently include words like “congress” and “law,” but only rarely words like “stock” and “trade,” while business articles would show the reverse pattern.
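The newspaper example can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not a full topic model: the four-word vocabulary, the topic probabilities, and the membership values are made up, and the memberships are fixed by hand rather than estimated from data. The function names `word_distribution` and `simulate_counts` are hypothetical.

```python
import random

vocabulary = ["congress", "law", "stock", "trade"]

# Each topic is a probability distribution over the vocabulary
# (illustrative values, chosen so that each list sums to 1).
topics = {
    "politics": [0.45, 0.45, 0.05, 0.05],
    "business": [0.05, 0.05, 0.45, 0.45],
}


def word_distribution(memberships):
    """Mix the topic distributions, weighting each by the document's membership."""
    return [
        sum(memberships[name] * topics[name][v] for name in topics)
        for v in range(len(vocabulary))
    ]


def simulate_counts(memberships, n_words, rng):
    """Draw n_words words from the document's mixed distribution and count them."""
    probs = word_distribution(memberships)
    words = rng.choices(vocabulary, weights=probs, k=n_words)
    return {word: words.count(word) for word in vocabulary}


rng = random.Random(0)  # seeded only so repeated runs give the same draws

# A pure politics article versus one that mixes both topics equally.
politics_article = {"politics": 1.0, "business": 0.0}
fed_article = {"politics": 0.5, "business": 0.5}

print(simulate_counts(politics_article, 100, rng))
print(simulate_counts(fed_article, 100, rng))
```

A clustering model would only allow memberships like `politics_article`, where one topic has weight 1. The mixed memberships in `fed_article` are what a topic model adds.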
Geometrically, Latent Dirichlet Allocation (LDA), a widely used topic model, can be represented by the following picture. The corners of the simplex¹ represent different words (in reality, there would be \(V\) different corners to this simplex, one for each word). A topic is a point on this simplex. The closer the topic is to one of the corners, the more frequently that word appears in the topic.
Figure 2: A geometric interpretation of LDA, from the original paper by Blei, Ng, and Jordan.
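The geometric picture can be checked numerically. The sketch below assumes a vocabulary of \(V = 3\) words and uses made-up topic vectors. The helper names `is_on_simplex`, `distance_to_corner`, and `mix` are hypothetical. It shows that a topic close to a corner puts most of its probability on that word, and that mixing two topics gives another point on the simplex.

```python
import math


def is_on_simplex(p, tol=1e-9):
    """A point is on the simplex if its entries are nonnegative and sum to 1."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) <= tol


def distance_to_corner(p, v):
    """Euclidean distance from p to the corner that puts all mass on word v."""
    corner = [1.0 if i == v else 0.0 for i in range(len(p))]
    return math.dist(p, corner)


def mix(weight, topic_a, topic_b):
    """Convex combination of two topics: weight on topic_a, the rest on topic_b."""
    return [weight * a + (1 - weight) * b for a, b in zip(topic_a, topic_b)]


# Illustrative topics over V = 3 words.
topic_a = [0.8, 0.1, 0.1]  # close to the corner for word 0
topic_b = [0.1, 0.1, 0.8]  # close to the corner for word 2

print(is_on_simplex([0.2, 0.3, 0.5]))   # probabilities summing to 1
print(is_on_simplex([0.3, 0.1, 9]))     # entries sum to more than 1
print(distance_to_corner(topic_a, 0), distance_to_corner(topic_a, 2))

# A document that belongs equally to both topics lies between them.
document = mix(0.5, topic_a, topic_b)
print(document, is_on_simplex(document))
```

Because a document's word distribution is a weighted average of the topics, it always lies inside the region of the simplex spanned by those topics.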
¹ A simplex is the geometric object describing the set of probability vectors over \(V\) elements. For example, if \(V = 3\), then \(\left(0.1, 0, 0.9\right)\) and \(\left(0.2, 0.3, 0.5\right)\) belong to the simplex, but not \(\left(0.3, 0.1, 9\right)\), since its entries sum to a number larger than 1.