Topic Modeling and Document Clustering: What’s the Difference?

When you have a huge collection of documents, you need a method to organize it. Two common options are topic modeling and clustering. In topic modeling, a topic is defined by a cluster of words, with each word in the cluster having a probability of occurrence for that topic; different topics have their own clusters of words and corresponding probabilities. Different topics may share some words, and a document can have more than one topic associated with it. A popular topic modeling approach is based on latent Dirichlet allocation (LDA), wherein each document is considered a mixture of topics and each word in a document is considered to be randomly drawn from one of the document’s topics. The topics are hidden and must be uncovered by using the joint distribution of the hidden and observed variables to compute the conditional distribution of the hidden variables (topics) given the observed variables (the words in the documents). Non-negative matrix factorization (NMF) is another way to find topics in a collection of documents. Irrespective of the approach, the output of a topic modeling algorithm is a list of topics with their associated clusters of words.
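To make the joint and conditional distributions mentioned above concrete, here is how they are commonly written for LDA. The symbols are assumptions introduced only for this sketch and do not come from the text above: the topics are written as beta_1..beta_K, the topic proportions of document d as theta_d, the topic assignment of the n-th word of document d as z_{d,n}, and the observed word itself as w_{d,n}.

```latex
% Joint distribution of hidden variables (beta, theta, z) and observed words w
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\beta_k)
    \prod_{d=1}^{D} p(\theta_d)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \beta_{1:K}, z_{d,n})

% Conditional (posterior) distribution of the hidden variables given the words
p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D})
  = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}
```

The denominator, the marginal probability of the observed words, cannot be computed exactly for realistic collections, which is why LDA implementations typically rely on approximations such as sampling or variational inference.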

In clustering, the basic idea is to partition documents into groups based on a suitable similarity measure. To perform the grouping, each document is represented by a vector of weights assigned to the words in the document. It is common to compute these weights using the tf-idf (term frequency-inverse document frequency) scheme. The end result of clustering is a list of clusters, with every document showing up in one of the clusters.
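The clustering pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names (`tfidf_vectors`, `spherical_kmeans`), the smoothed idf formula, the whitespace tokenizer, and the choice of seeding the centroids with the first k documents are all choices made for this example, and the four toy documents are invented.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def tfidf_vectors(docs):
    """Return (vocab, vectors): one unit-length tf-idf vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed idf; one common variant (an assumption), not the only definition.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(normalize([tf[w] * idf[w] for w in vocab]))
    return vocab, vectors


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(vectors, k, n_iter=10):
    """Hard clustering by cosine similarity; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


# Toy collection (invented for illustration).
docs = [
    "the cat sat on the mat",
    "stocks fell as markets closed",
    "my cat chased the dog",
    "investors sold stocks and bonds",
]
vocab, vectors = tfidf_vectors(docs)
labels = spherical_kmeans(vectors, k=2)
print(labels)
```

Each document ends up with exactly one label, which is the hard-clustering outcome the paragraph describes. In practice a library implementation with random restarts would be preferable to the deterministic seeding used here.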

The basic difference between topic modeling and clustering thus can be illustrated by the following figure.

Since topic modeling yields the topics present in each document, one can say that topic modeling generates a representation of documents in the topic space. As the number of topics is much smaller than the vocabulary of the document collection, the topic space representation can also be viewed as a form of dimensionality reduction. One can use this topic space representation of documents to perform clustering. Conversely, one can analyze the frequency of words in each cluster to determine the topic associated with that cluster. With hard clustering, this will yield only one topic associated with each document. Soft clustering, on the other hand, allows multiple topics to be associated with each document, similar to the result obtained via topic modeling. Thus, topic modeling and soft clustering are quite similar to each other; the difference lies in how the problem is approached.
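As a rough illustration of reading a topic off each cluster by word frequency, the sketch below counts words per cluster and reports the most frequent ones. Everything specific here is an assumption for the example: the function name `top_words_per_cluster`, the tiny stopword list, the toy documents, and the hard-coded cluster labels, which stand in for the output of whatever clustering step was run beforehand.

```python
from collections import Counter

# Small hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "on", "as", "and", "my", "a", "of"}


def top_words_per_cluster(docs, labels, n_top=3):
    """Map each cluster label to its n_top most frequent non-stopword terms."""
    counts = {}
    for doc, label in zip(docs, labels):
        words = (w for w in doc.lower().split() if w not in STOPWORDS)
        counts.setdefault(label, Counter()).update(words)
    return {
        label: [word for word, _ in counter.most_common(n_top)]
        for label, counter in sorted(counts.items())
    }


# Toy documents and hard cluster labels (both invented for illustration).
docs = [
    "the cat sat on the mat",
    "stocks fell as markets closed",
    "my cat chased the dog",
    "investors sold stocks and bonds",
]
labels = [0, 1, 0, 1]  # assumed output of an earlier hard-clustering step

topics = top_words_per_cluster(docs, labels)
for label, words in topics.items():
    print(label, words)
```

Because the labels are hard assignments, each document contributes to exactly one cluster and therefore inherits exactly one topic. With soft clustering, each document would instead carry a weight per cluster, and its words could be counted toward several clusters in proportion to those weights.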