Semi-Supervised Clustering with K-means


Clustering is the process of grouping a collection of examples, i.e. data vectors without class labels, into distinct groups such that examples grouped together are similar to each other while those in different groups are dissimilar. This process of finding patterns in the data via grouping is also known as unsupervised learning. Since the need for clustering arises in many different disciplines, a large number of clustering algorithms are available to us.

The K-means clustering algorithm is the most widely used clustering procedure. Basically, the algorithm consists of initializing K cluster centers at random. All the example vectors are then assigned to their respective closest centers. Next, the centers are updated and the examples are reassigned to the closest centers. This process of updating cluster centers and reassigning examples continues until the cluster centers stop changing. You can see a simple K-means demo via Excel in one of my earlier blog posts.
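To make the assign-and-update loop concrete, here is a minimal sketch of the K-means iteration in base R. The function name simple_kmeans and the stopping tolerance are my own choices for illustration, not part of any package.

```r
# Minimal K-means (Lloyd's algorithm) sketch; X is an n x d numeric matrix.
simple_kmeans <- function(X, K, max_iter = 100) {
  # Initialize K cluster centers by picking K random examples
  centers <- X[sample(nrow(X), K), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Assign each example to its closest center (Euclidean distance)
    d  <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]
    cl <- max.col(-d)
    # Update each center as the mean of the examples assigned to it
    new_centers <- t(sapply(1:K, function(k) colMeans(X[cl == k, , drop = FALSE])))
    # Stop when the centers no longer change
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}
```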

While clustering implies the absence of class labels, it is possible to exploit domain knowledge to improve clustering results. Domain knowledge not only helps in guessing the number of expected clusters, i.e. K in K-means, but it can also be used to specify the cluster membership of some of the observations or data vectors. In other words, domain knowledge can be used to label some of the data vectors. Clustering in such a situation, where some examples have labels attached to them, is known as semi-supervised clustering. If, on the other hand, we were to build a classifier using such a set of partially labeled examples, we would term the situation semi-supervised classification. You can check my previous blog post on semi-supervised classification; here we will focus on semi-supervised clustering through K-means.

Constrained K-means Clustering

The most common way to use domain knowledge is to specify pairwise relationships between certain examples. For example, we can specify that a certain pair of examples must be grouped together; this can be done not just for a single pair but for many pairs. In the same fashion, we might know that two data vectors represent different phenomena and thus must be placed in different clusters. These pairwise relationships are typically specified in the form of two lists of example pairs. The first list is commonly referred to as Must-link; it specifies all example pairs that should be placed in the same cluster. The second list is known as Cannot-link; it specifies pairs that should be placed in different clusters. These two lists of pairwise relationships impose constraints on how K-means performs grouping, and the resulting clustering procedure is known as constrained K-means clustering.
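If I recall the conclust interface correctly, the two lists are passed as two-column matrices in which each row holds the indices of a pair of examples in the data matrix. The specific index values below are made up purely for illustration.

```r
# Rows of mustLink are pairs that must end up in the same cluster;
# rows of cantLink are pairs that must end up in different clusters.
mustLink <- rbind(c(1, 37), c(12, 58), c(101, 150))
cantLink <- rbind(c(5, 120), c(60, 200))
```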

There is an R package called conclust that provides three constrained K-means clustering algorithms, which I will briefly describe here along with some results obtained using them. The first of these three algorithms is the ckmeans algorithm, which is a minor variation of traditional K-means. The only difference from standard K-means is that, while assigning examples to their nearest cluster centers, it checks whether any of the constraints are being violated. If the method finds that assigning an example to its nearest center would violate a constraint, it terminates without producing any result.

The other two constrained K-means algorithms in the package, lcvqe and mpckm, strive to satisfy as many constraints as possible. Thus, these two methods always yield a clustering result, while the ckmeans algorithm may or may not yield clusters. The lcvqe algorithm uses a set of rules to deal with example pairs whose constraints cannot be satisfied. The mpckm algorithm adds a penalty term to the standard K-means criterion, and the clustering solution sought minimizes this modified criterion. This algorithm is generally regarded as the best choice for constrained K-means clustering.
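Assuming a numeric data matrix dat and the constraint matrices sketched above, the three methods are called in the same way, as I understand the package's interface; each returns a vector of cluster assignments.

```r
library(conclust)

k <- 2
cl_ck <- ckmeans(dat, k, mustLink, cantLink)  # may fail to produce a result if a constraint cannot be honored
cl_lc <- lcvqe(dat, k, mustLink, cantLink)    # rule-based handling of unsatisfiable constraints
cl_mp <- mpckm(dat, k, mustLink, cantLink)    # penalized K-means criterion
```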

I will show the performance of these three methods using an artificial two-dimensional data set of 226 examples. Before proceeding with constrained K-means, I first applied the basic K-means with K equal to two. The data with the ground truth labels and the K-means result are shown below.

Since the ground truth is known, I evaluated the K-means result using the adjusted Rand index (ARI). It turned out to be 0.4501191. Note that ARI cannot exceed 1, with 1 indicating perfect agreement with the ground truth.
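The original 226-point data set is not reproduced here, so the sketch below stands in for it with two simulated Gaussian blobs; the adjustedRandIndex function from the mclust package is one convenient way to compute ARI against the ground truth.

```r
set.seed(42)

# Stand-in for the artificial two-dimensional data: two overlapping blobs
n1 <- 113; n2 <- 113
dat <- rbind(matrix(rnorm(2 * n1, mean = 0),   ncol = 2),
             matrix(rnorm(2 * n2, mean = 1.5), ncol = 2))
truth <- c(rep(1, n1), rep(2, n2))

# Plain K-means with K = 2
km <- kmeans(dat, centers = 2, nstart = 20)

# Adjusted Rand index of the K-means result against the ground truth
library(mclust)
adjustedRandIndex(km$cluster, truth)
```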

Next, I decided to use the above-mentioned constrained K-means methods. I ran two experiments. In the first, I specified 10 pairs of data points as Must-link pairs and 20 pairs as Cannot-link pairs; the points were selected randomly. In the second experiment, I increased the Must-link pairs to 15 and the Cannot-link pairs to 25, again selected randomly. The clustering results of these two runs using the three methods, ckmeans, lcvqe, and mpckm, are shown below in that order from left to right. The top row corresponds to constraints from 30 pairs and the bottom row to constraints from 40 pairs.
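Random constraint pairs can be drawn from the ground-truth labels: a Must-link pair is two examples sharing a label, a Cannot-link pair is two examples with different labels. A sketch of one such run, using the simulated data and 10 Must-link plus 20 Cannot-link pairs as in the first experiment, might look like this; sample_pairs is a hypothetical helper, not part of conclust.

```r
# Draw random same-label pairs (Must-link) or different-label pairs (Cannot-link)
sample_pairs <- function(truth, n_pairs, same_label = TRUE) {
  pairs <- matrix(NA_integer_, nrow = n_pairs, ncol = 2)
  i <- 1
  while (i <= n_pairs) {
    p <- sample(length(truth), 2)
    if ((truth[p[1]] == truth[p[2]]) == same_label) {
      pairs[i, ] <- p
      i <- i + 1
    }
  }
  pairs
}

mustLink <- sample_pairs(truth, 10, same_label = TRUE)
cantLink <- sample_pairs(truth, 20, same_label = FALSE)

# Run each constrained method and score its result against the ground truth
# (ckmeans may terminate without a result if the constraints clash)
for (nm in c("ckmeans", "lcvqe", "mpckm")) {
  cl <- get(nm)(dat, 2, mustLink, cantLink)
  cat(nm, "ARI:", adjustedRandIndex(cl, truth), "\n")
}
```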

Looking at the above results, you can note that with constraints over 30 pairs, ckmeans and lcvqe yield clustering results almost identical to those of plain K-means. This is also indicated by their respective ARI values of 0.4264356 and 0.4381882. The mpckm method gives a slightly better result, with an ARI value of 0.4864635. When the constraints are specified using 40 pairs of points, ckmeans and mpckm show significant improvements, with corresponding ARI values of 0.5908327 and 0.6325625. The result from lcvqe worsens, yielding an ARI value of 0.3071103.

While this simple experiment cannot be used to make a definitive statement about which of the three constrained clustering methods is best, mpckm appears relatively better and should be preferred. This also seems to be the viewpoint of the conclust package developers.