Let $x_i$ be the fully observed variables, where $i$ indicates the $i$-th data case, and let $z_i$ be the hidden or missing variables. The objective is to maximize the log likelihood of the observed data:

$$\ell(\theta) = \sum_{i=1}^N \log p(x_i \mid \theta) = \sum_{i=1}^N \log \left[ \sum_{z_i} p(x_i, z_i \mid \theta) \right]$$
This is hard to optimize directly, because the $\log$ cannot be pushed inside the inner sum.
EM defines the complete data log likelihood as:

$$\ell_c(\theta) = \sum_{i=1}^N \log p(x_i, z_i \mid \theta)$$

This also cannot be computed, since $z_i$ is unknown.
Then EM defines the expected complete data log likelihood as:

$$Q(\theta, \theta^{t-1}) = \mathbb{E}\left[ \ell_c(\theta) \mid \mathcal{D}, \theta^{t-1} \right]$$

where $t$ is the current iteration number and $Q$ is the auxiliary function. The expectation is taken w.r.t. the old parameters $\theta^{t-1}$ and the observed data $\mathcal{D}$.
In the E step, compute the expected sufficient statistics (ESS) needed to evaluate $Q(\theta, \theta^{t-1})$.
In the M step, the $Q$ function is optimized w.r.t. $\theta$:

$$\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1})$$

To perform MAP estimation, the M step is modified as:

$$\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1}) + \log p(\theta)$$
The EM algorithm monotonically increases the log likelihood of the observed data (plus the log prior when doing MAP).
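As a minimal sketch of how the two steps alternate (the callables `e_step`, `m_step`, and `log_likelihood` are hypothetical placeholders for a concrete model, not part of the text above):

```python
import numpy as np

def em(init_params, e_step, m_step, log_likelihood, data,
       max_iters=100, tol=1e-6):
    """Generic EM loop (sketch): alternate E and M steps until the
    observed-data log likelihood stops improving."""
    params = init_params
    prev_ll = -np.inf
    for _ in range(max_iters):
        ess = e_step(params, data)          # E step: expected sufficient statistics
        params = m_step(ess, data)          # M step: maximize Q w.r.t. the parameters
        ll = log_likelihood(params, data)   # non-decreasing across iterations
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return params
```

Because each iteration cannot decrease the log likelihood, the loop can safely stop once the improvement falls below `tol`.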
EM for GMMs
Let $K$ be the # of mixture components.
Auxiliary function
The expected complete data log likelihood is:

$$Q(\theta, \theta^{t-1}) = \sum_{i=1}^N \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^N \sum_{k=1}^K r_{ik} \log p(x_i \mid \theta_k)$$

where $r_{ik} = p(z_i = k \mid x_i, \theta^{t-1})$ is the responsibility that cluster $k$ takes for data point $i$, which is computed in the E step.
E step

The E step computes the responsibilities:

$$r_{ik} = \frac{\pi_k \, p(x_i \mid \theta_k^{t-1})}{\sum_{k'=1}^K \pi_{k'} \, p(x_i \mid \theta_{k'}^{t-1})}$$
M step
In the M step, EM optimizes $Q$ w.r.t. $\pi$ and $\theta_k$:

$$\pi_k = \frac{1}{N} \sum_{i=1}^N r_{ik} = \frac{r_k}{N}$$

where $r_k = \sum_{i=1}^N r_{ik}$ is the weighted number of points assigned to cluster $k$.
The log likelihood terms involving $\mu_k$ and $\Sigma_k$ are:

$$\ell(\mu_k, \Sigma_k) = \sum_{i} r_{ik} \log p(x_i \mid \theta_k) = -\frac{1}{2} \sum_{i} r_{ik} \left[ \log |\Sigma_k| + (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right]$$

Thus, setting the derivatives to zero gives the new parameter estimates:

$$\mu_k = \frac{\sum_i r_{ik} x_i}{r_k}, \qquad \Sigma_k = \frac{\sum_i r_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{r_k} = \frac{\sum_i r_{ik} x_i x_i^T}{r_k} - \mu_k \mu_k^T$$
After computing these new estimates, set $\theta^t = (\pi_k, \mu_k, \Sigma_k)$ for $k = 1:K$ and go to the next E step.
The mean of cluster $k$ is just the weighted average of all points assigned to cluster $k$.
The covariance is proportional to the weighted empirical scatter matrix.
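As a concrete illustration of the E and M steps above, here is a rough NumPy sketch of one EM iteration for a GMM; the function name `em_step` and the use of `scipy.stats.multivariate_normal` are my own choices for this example, not part of the original implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM (sketch).
    X: (N, D) data; pi: (K,) weights; mu: (K, D) means; Sigma: (K, D, D) covariances."""
    N, D = X.shape
    K = len(pi)

    # E step: responsibilities r[i, k] ∝ pi_k * N(x_i | mu_k, Sigma_k)
    r = np.zeros((N, K))
    for k in range(K):
        r[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    r /= r.sum(axis=1, keepdims=True)

    # M step: weighted counts, mixing weights, means, covariances
    r_k = r.sum(axis=0)                       # weighted number of points per cluster
    pi_new = r_k / N
    mu_new = (r.T @ X) / r_k[:, None]         # weighted averages
    Sigma_new = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (r[:, k, None] * diff).T @ diff / r_k[k]   # weighted scatter

    return pi_new, mu_new, Sigma_new
```

The TensorFlow snippet below follows the same iterate-until-the-likelihood-stops-improving pattern, with the per-step update delegated to `self.train_op`.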
```python
def update(self, points, n_iters=1000, TOLERANCE=1e-8):
    prev_ll = -np.inf
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        for i in range(n_iters):
            cur_ll, _ = sess.run(
                [self.mean_ll, self.train_op],
                feed_dict={self.points: points}
            )
            if i > 0:
                difference = np.abs(cur_ll - prev_ll)
                print(f'GMM step-{i}:\t mean likelihood {cur_ll} \t difference {difference}')
                if difference < TOLERANCE:
                    break
            else:
                print(f'GMM step-{i}:\t mean likelihood {cur_ll}')
            prev_ll = cur_ll
        mu = self.mu.eval(sess)
        var = self.var.eval(sess)
    return mu, var
```
```python
if __name__ == '__main__':
    import numpy as np

    n_points = 200
    n_clusters = 3
    D = 100
    points = np.random.uniform(0, 10, (n_points, D))  # random

    cluster = GMM(D, n_clusters, init_centroids=None)
    mu, var = cluster.update(points)
```
Multiplying many small probabilities can cause arithmetic underflow, i.e. the product becomes smaller than the smallest representable floating-point number and is rounded to zero.

To circumvent this, a common trick called the log-sum-exp trick is used: work in log space and compute

$$\log \sum_{k=1}^K e^{a_k} = B + \log \sum_{k=1}^K e^{a_k - B}, \qquad B = \max_k a_k,$$

so that the largest term inside the sum is $e^0 = 1$ and nothing underflows before the summation.
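For illustration, a small NumPy version of the trick (in practice `scipy.special.logsumexp` provides an equivalent, battle-tested implementation); the example values are made up to show how normalization still works when the raw probabilities would underflow:

```python
import numpy as np

def log_sum_exp(a, axis=-1):
    """Numerically stable log(sum(exp(a))): subtract the max before
    exponentiating so the largest term is exp(0) = 1."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))

# Example: log(pi_k) + log N(x_i | theta_k) for three clusters;
# exponentiating these directly would underflow to zero.
log_joint = np.array([-1000.0, -1001.0, -1002.0])
log_r = log_joint - log_sum_exp(log_joint)   # normalize in log space
print(np.exp(log_r))                          # ≈ [0.665, 0.245, 0.090]
```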
K-means
The K-means algorithm is a variant of the EM algorithm for GMMs: it can be seen as a GMM with the assumptions that $\Sigma_k = \sigma^2 I_D$ and $\pi_k = 1/K$ are fixed, so only the cluster centers $\mu_k$ are estimated.
In the E step,

$$p(z_i = k \mid x_i, \theta) \approx \mathbb{I}\left(k = z_i^*\right)$$

where $z_i^* = \arg\max_k p(z_i = k \mid x_i, \theta)$. This is also called hard EM, since K-means makes a hard assignment of points to clusters.
E step
Since an equal spherical covariance matrix is assumed, the most probable cluster for $x_i$ can be computed using the Euclidean distance between the $N$ data points and the $K$ cluster centroids:

$$z_i^* = \arg\min_k \|x_i - \mu_k\|_2^2$$

This is equivalent to minimizing the pairwise squared deviations of points within the same cluster.
M step
Given the hard cluster assignments, the M step updates each cluster centroid by computing the mean of all points assigned to it:

$$\mu_k = \frac{1}{N_k} \sum_{i : z_i = k} x_i, \qquad N_k = |\{i : z_i = k\}|$$
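Combining the two steps, here is a hedged NumPy sketch of K-means as hard EM; the random-sample initialization and the function name `kmeans` are illustrative choices:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """K-means as hard EM (sketch): E step assigns each point to its
    nearest centroid, M step recomputes centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E step: hard assignment via squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        labels = d2.argmin(axis=1)
        # M step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```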
DEC (Deep Embedded Clustering) works in two phases:

1. parameter initialization with a deep autoencoder;
2. clustering, by computing an auxiliary target distribution and minimizing a KL divergence.
DEC Clustering
Given initial cluster centroids $\{\mu_j\}_{j=1}^K$, DEC first computes a soft assignment between the embedded points and the cluster centroids; it then updates the deep mapping with stochastic gradient descent and refines the cluster centroids using an auxiliary target distribution. This process is repeated until convergence.
Soft Assignment
DEC applies the Student's $t$-distribution as a kernel to measure the similarity between embedded point $z_i$ and centroid $\mu_j$:

$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}$$

where $z_i$ corresponds to $x_i$ after embedding, $\alpha$ (set to 1) is the degrees of freedom of the Student's $t$-distribution, and $q_{ij}$ can be interpreted as the probability of assigning sample $i$ to cluster $j$ (i.e., a soft assignment).
KL Divergence
DEC iteratively refines the clusters by learning from their high-confidence assignments with the auxiliary target distribution. The KL divergence is computed between the soft assignments $q_{ij}$ and the auxiliary distribution $p_{ij}$:

$$L = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
The choice of target distribution $P$ is critical.
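As an illustrative sketch (not the reference implementation), the soft assignment, the target distribution proposed in the DEC paper ($p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}$ with cluster frequencies $f_j = \sum_i q_{ij}$), and the KL objective can be written in NumPy as follows; in DEC itself the loss is minimized with SGD through the encoder rather than evaluated this way:

```python
import numpy as np

def soft_assignment(Z, centroids, alpha=1.0):
    """Student's t kernel q_ij between embedded points z_i and centroids mu_j."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary target p_ij: square q and renormalize by cluster frequency,
    which sharpens high-confidence assignments."""
    weight = q ** 2 / q.sum(axis=0)            # q_ij^2 / f_j
    return weight / weight.sum(axis=1, keepdims=True)

def kl_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over points and clusters."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))
```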