An overview of dimensionality reduction via topics.
Topic modeling is a type of dimensionality reduction method that is especially useful for high-dimensional count matrices. For example, it can be applied to,

- counts of words across a collection of documents
- counts of features across a collection of biological samples
For clarity, we will refer to samples as documents and features as words. However, keep in mind that these methods can be used more generally – we will see a biological application three lectures from now.
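To make the document-word count matrix concrete, here is a minimal sketch of how one might be built from raw text. It assumes scikit-learn is available, and the three toy documents are invented for illustration.

```python
# Build a small documents x words count matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "congress passed a new law on trade",
    "stock prices rose after the trade deal",
    "the federal reserve discussed monetary policy",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)  # rows: documents, columns: words

print(counts.shape)                        # (3 documents, V words)
print(vectorizer.get_feature_names_out())  # the V words labeling the columns
```

Each row of this matrix is one document, and each column counts how often a word appears in it; this is the kind of high-dimensional count matrix that topic models are designed to summarize.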
These models are useful to know about because they provide a compromise between clustering and PCA: like clustering, they summarize the data through a small number of interpretable prototypes (the topics), but like PCA, each sample can be a blend of several of them rather than being assigned to exactly one.
Without going into mathematical detail, topic models perform dimensionality reduction by supposing,

1. Each document is a mixture of a small number of topics, rather than belonging to exactly one.
2. Each topic is a probability distribution over words, so that different topics tend to use different words.
To illustrate the first point, consider modeling a collection of newspaper articles. Some articles might belong primarily to the “politics” topic, and others to the “business” topic. Articles about monetary policy at the Federal Reserve might belong partially to both the “politics” and the “business” topics.
For the second point, consider the difference in words that would appear in politics and business articles. Articles about politics might frequently include words like “congress” and “law,” but only rarely words like “stock” and “trade.”
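These two assumptions can be written down as a short simulation. Below is a toy sketch using numpy; the vocabulary size, number of topics, and Dirichlet parameters are arbitrary choices made for illustration, not values prescribed by LDA.

```python
# Simulate count data under the two topic-model assumptions above.
import numpy as np

rng = np.random.default_rng(0)
V, K, n_docs, doc_len = 6, 2, 3, 20   # words, topics, documents, words per document

# Assumption 2: each topic is a probability distribution over the V words.
topics = rng.dirichlet(alpha=np.full(V, 0.5), size=K)            # K x V, rows sum to 1

# Assumption 1: each document mixes the K topics in its own proportions.
memberships = rng.dirichlet(alpha=np.full(K, 0.5), size=n_docs)  # n_docs x K, rows sum to 1

# A document's word probabilities are the topic distributions weighted by its
# memberships; its observed counts are drawn from that mixture.
doc_word_probs = memberships @ topics                             # n_docs x V
counts = np.stack(
    [rng.multinomial(doc_len, p / p.sum()) for p in doc_word_probs]
)
print(counts)  # simulated documents x words count matrix
```

Fitting a topic model goes in the opposite direction: given only the observed counts, it estimates both the topics and the per-document memberships.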
Geometrically, LDA can be represented by the following picture. The corners of the simplex[^1] represent different words (in reality, there would be \(V\) different corners to this simplex, one for each word). A topic is a point on this simplex. The closer the topic is to one of the corners, the more frequently that word appears in the topic.
[^1]: A simplex is the geometric object describing the set of probability vectors over \(V\) elements. For example, if \(V = 3\), then \(\left(0.1, 0, 0.9\right)\) and \(\left(0.2, 0.3, 0.5\right)\) belong to the simplex, but \(\left(0.3, 0.1, 0.9\right)\) does not, since its entries sum to more than 1.
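The footnote's examples can be checked directly. The helper below is a hypothetical function written for this note (not part of any library) that tests whether a vector lies on the simplex.

```python
# Check whether a vector is a valid probability vector (lies on the simplex).
import numpy as np

def on_simplex(v, tol=1e-8):
    v = np.asarray(v, dtype=float)
    # Entries must be nonnegative and sum to 1 (up to numerical tolerance).
    return bool(np.all(v >= 0) and abs(v.sum() - 1.0) < tol)

print(on_simplex([0.1, 0.0, 0.9]))  # True
print(on_simplex([0.2, 0.3, 0.5]))  # True
print(on_simplex([0.3, 0.1, 0.9]))  # False: entries sum to more than 1
```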