Mastering Machine Learning Algorithms
上QQ阅读APP看书,第一时间看更新

Label spreading

The last algorithm (proposed by Zhou et al.) that we need to analyze is called label spreading, and it's based on the normalized graph Laplacian:

This matrix has each a diagonal element lii equal to 1, if the degree deg(lii) > 0 (0 otherwise) and all the other elements equal to:

The behavior of this matrix is analogous to a discrete Laplacian operator, whose real-value version is the fundamental element of all diffusion equations. To better understand this concept, let's consider the generic heat equation:

This equation describes the behavior of the temperature of a room when a point is suddenly heated. From basic physics concepts, we know that heat will spread until the temperature reaches an equilibrium point and the speed of variation is proportional to the Laplacian of the distribution. If we consider a bidimensional grid at the equilibrium (the derivative with respect to when time becomes null) and we discretize the Laplacian operator (2 = ∇ · ∇) considering the incremental ratios, we obtain:

Therefore, at the equilibrium, each point has a value that is the mean of the direct neighbors. It's possible to prove the finite-difference equation has a single fixed point that can be found iteratively, starting from every initial condition. In addition to this idea, label spreading adopts a clamping factor α for the labeled samples. If α=0, the algorithm will always reset the labels to the original values (like for label propagation), while with a value in the interval (0, 1], the percentage of clamped labels decreases progressively until α=1, when all the labels are overwritten.

The complete steps of the label spreading algorithm are:

  1. Select an affinity matrix type (KNN or RBF) and compute W
  2. Compute the degree matrix D
  3. Compute the normalized graph Laplacian L
  4. Define Y(0) = Y
  1. Define α in the interval [0, 1]
  2. Iterate until convergence of the following step:

It's possible to show (as demonstrated in Semi-Supervised LearningChapelle O., Schölkopf B., Zien A., (edited by), The MIT Press) that this algorithm is equivalent to the minimization of a quadratic cost function with the following structure:

The first term imposes consistency between original labels and estimated ones (for the labeled samples). The second term acts as a normalization factor, forcing the unlabeled terms to become zero, while the third term, which is probably the least intuitive, is needed to guarantee geometrical coherence in terms of smoothness. As we have seen in the previous paragraph, when a hard-clamping is adopted, the smoothness assumption could be violated. By minimizing this term (μ is proportional to α), it's possible to penalize the rapid changes inside the high-density regions. Also in this case, the proof of convergence is very similar to the one for label propagation algorithms, and will be omitted. The interested reader can find it in Semi-Supervised Learning, Chapelle O., Schölkopf B., Zien A., (edited by), The MIT Press.