# How to Build a Labelled Training Data Set with a Few Labelled Examples

When building a classifier, we assume a large enough training data set with labels is available. This situation is what we call as supervised learning. In a real world setting, such training examples with labels need to be acquired. In any application domain where labeling requires domain expertise such as in medicine, gathering a large training set with labels is an expensive and time consuming task.  In such cases, it is not uncommon to use a small set of correctly labeled examples to label the rest of training examples. This type of learning is referred as semi-supervised learning and it falls somewhere between supervised and unsupervised learning. Often the term semi-supervised classification is used to describe the process of labeling training examples using a small set of labeled examples to differentiate from semi-supervised clustering. In semi-supervised clustering, the goal is to group a given set of examples into different clusters with the condition that certain examples must be clustered together and certain examples must be put in different clusters. In other words, some kind of constraints are imposed on resulting clusters in terms of cluster memberships of certain specified examples.

In this blog post, I am going to illustrate semi-supervised classification leaving semi-supervised clustering for another post. When we have a small set of labeled examples and we want to rely on them to label a much larger set of unlabeled training examples, we need to make some assumptions. For example, we might make an assumption that training examples close to each other are likely to have similar class labels, an assumption made when applying k-nearest neighbor classification. Instead one might assume classes to have Gaussian distribution and we may try to iteratively find the distribution parameters. We must remember that our result will be as good as our assumptions are.

### Label Propagation Algorithm (LPA)

One of the semi-supervised classification method is label propagation that I will explain here. This method is based on the assumption that examples near each other are likely to have similar class labels. The basic idea of this method is to consider all examples, labeled and unlabeled, as interconnected nodes in a network. Each node in the network tries to propagate its label to other nodes. How much of a node’s label influences other nodes is determined by their respective closeness or proximity.

We will work through a series of steps to illustrate the working of the label propagation algorithm. Let us consider the following nine training examples, each with two-features:

[5.4 3.9], [4.8 3.0], [5.1 3.3], [5.7 2.8], [5.7 3.0], [5.9 3.2], [6.9 3.2], [6.4 2.7], [6.7 3.0]

We know labels of three examples only. These three examples are shown above in color; each color representing a different class label. What label propagation algorithm does is that it tries to determine the labels of the six unlabeled examples.

The first step is to calculate closeness between each pair of examples. In LPA, the closeness between examples is measured by the following formula:

$w_{ij} = \exp(-\frac{d_{ij}^2}{\sigma^2})$,

where $d_{ij}^2$ is the squared euclidean distance between the example pair ‘i-j’ and $\sigma^2$ is a parameter to scale proximity. Since we have nine examples, we end up with a 9×9 symmetric weight matrix W. The following code snippets show the calculations for W. The examples are in array X and sigma equals 0.4.

from sklearn.metrics.pairwise import euclidean_distances
D = euclidean_distances(X, X, squared = True)
W =np.exp(-D/0.16)
print(W)

[[1.   0.   0.06 0.   0.   0.01 0.   0.   0.  ]
[0.   1.   0.32 0.   0.01 0.   0.   0.   0.  ]
[0.06 0.32 1.   0.02 0.06 0.02 0.   0.   0.  ]
[0.   0.   0.02 1.   0.78 0.29 0.   0.04 0.  ]
[0.   0.01 0.06 0.78 1.   0.61 0.   0.03 0.  ]
[0.01 0.   0.02 0.29 0.61 1.   0.   0.04 0.01]
[0.   0.   0.   0.   0.   0.   1.   0.04 0.61]
[0.   0.   0.   0.04 0.03 0.04 0.04 1.   0.32]
[0.   0.   0.   0.   0.   0.01 0.61 0.32 1.  ]]

Next, we associate with each node a c-dimensional label vector, c being the number of classes, that reflects the class probabilities associated with the node training example. In our example, we have three classes. We set the label vectors to ensure the sum of each vector being one and the examples/nodes with known labels have o’s and 1’s only in there respective vectors. These initial nine, three-dimensional vectors are shown below as a matrix Y.

You can check that each column adds up to one to show the probabilities of three class labels for each node. We also note that probability values have begun to move either upwards towards one or downwards towards zero. The above steps of updating the Y matrix, normalizing its rows, and resetting label vectors of known node labels complete one iteration of label propagation. Repeating these steps a few times moves probabilities sufficiently in upward and downward directions to let us terminate the algorithm. In the present example, we obtain the following Y matrix after 10 iterations.
[[0.58 0.93 1.   0.04 0.   0.04 0.01 0.02 0.  ]
[0.24 0.05 0.   0.91 1.   0.89 0.02 0.22 0.  ]
[0.18 0.02 0.   0.05 0.   0.07 0.97 0.76 1.  ]]
Looking at the Y matrix, we can safely assign class 1 label to first two examples, class 2 label to fourth and sixth examples, and class 3 label to seventh and eighth examples, thus completing the task of assigning labels to training examples with no labels.
You might wonder whether the algorithm will converge or not. It turns out that there is a theoretical result that establishes the convergence of LPA. The choice of sigma is important as it affects the convergence rate as well as the final result. This method is available in scikit-learn and you may want to check it out.