Variational Autoencoders

This is a concise introduction to the Variational Autoencoder (VAE).

Background

  • PixelCNN defines a tractable density function and fits it directly by maximum likelihood:

    $$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1})$$

  • VAE defines an intractable density function with a latent variable $z$:

    $$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$$

This density cannot be optimized directly; VAEs instead derive and optimize a lower bound on the likelihood.

Autoencoder

An autoencoder (AE) encodes the input into a lower-dimensional latent representation $z$ that captures meaningful factors of variation in the data, then uses $z$ to reconstruct the original input (a minimal sketch follows the list below).

  • After training, throw away the decoder and only retain the encoder.
  • The encoder can then be used to initialize a supervised model on downstream tasks.
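
For concreteness, here is a minimal PyTorch sketch of such a plain autoencoder. It assumes flattened 784-dimensional inputs (e.g. MNIST) and a 32-dimensional latent code; the layer sizes are illustrative, not prescribed by the text above.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress x into a low-dimensional code z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct x from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
x = torch.randn(16, 784)                # dummy batch of flattened inputs
loss = ((model(x) - x) ** 2).mean()     # L2 reconstruction loss
```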

Variational Autoencoder

Assume the training data $\{x^{(i)}\}_{i=1}^{N}$ is generated from an underlying unobserved (latent) representation $z$.

Intuition:

  • x -> image
  • z -> latent factors used to generate x: attributes, orientation, pose, degree of smile, etc. Choose the prior $p(z)$ to be simple, e.g. a Gaussian.

Training

Problem

Maximum likelihood estimation on the training data involves an intractable integral:

$$p_\theta(x) = \int \underbrace{p_\theta(z)}_{\text{Gaussian prior}}\, \underbrace{p_\theta(x \mid z)}_{\text{decoder NN}}\, dz$$

It is intractable to compute $p_\theta(x \mid z)$ for every $z$, so the integral cannot be evaluated.

Thus, the posterior density is also intractable due to the intractable data likelihood:

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)}$$

(Figure: VAE decoder network [5])

Solution

Encoder -> "recognition / inference" network.

  • Define an encoder network $q_\phi(z \mid x)$ that approximates the intractable true posterior $p_\theta(z \mid x)$. VAE makes the variational approximate posterior a multivariate Gaussian with diagonal covariance for data point $x^{(i)}$:

    $$\log q_\phi(z \mid x^{(i)}) = \log \mathcal{N}\big(z; \mu^{(i)}, \sigma^{2(i)} I\big)$$

where

  • For a Gaussian MLP encoder or decoder [4]:

    $$\begin{align} \mu &= W_4 h + b_4 \tag{1}\\ \log \sigma^2 &= W_5 h + b_5 \tag{2}\\ h &= \tanh(W_3 z + b_3) \tag{3} \end{align}$$

    (for the encoder, the hidden layer takes $x$ as input instead of $z$)

The network outputs $\log \sigma^2$ instead of $\sigma^2$ because $\log \sigma^2 \in (-\infty, \infty)$ is unconstrained, whereas $\sigma^2 \geq 0$.
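
A minimal PyTorch sketch of this Gaussian MLP encoder, following Eqs. (1)-(3): a tanh hidden layer followed by separate linear heads for $\mu$ and $\log\sigma^2$. The 784/400/20 sizes are illustrative assumptions (they match [4]'s MNIST setup but are not fixed by the text above).

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.W3 = nn.Linear(input_dim, hidden_dim)    # h = tanh(W3 x + b3)
        self.W4 = nn.Linear(hidden_dim, latent_dim)   # mu = W4 h + b4
        self.W5 = nn.Linear(hidden_dim, latent_dim)   # log sigma^2 = W5 h + b5

    def forward(self, x):
        h = torch.tanh(self.W3(x))
        mu = self.W4(h)
        logvar = self.W5(h)   # unconstrained output; exp() recovers sigma^2 >= 0
        return mu, logvar
```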

Decoder -> "generation" network $p_\theta(x \mid z)$.

$$\begin{align}
\log p_\theta(x^{(i)}) &= \mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})}\Big[\log p_\theta(x^{(i)})\Big] && \text{($p_\theta(x^{(i)})$ does not depend on $z$)} \tag{4}\\
&= \mathbb{E}_z\left[\log \frac{p_\theta(x^{(i)} \mid z)\, p_\theta(z)}{p_\theta(z \mid x^{(i)})}\right] && \text{(Bayes' rule)} \tag{5}\\
&= \mathbb{E}_z\left[\log \frac{p_\theta(x^{(i)} \mid z)\, p_\theta(z)}{p_\theta(z \mid x^{(i)})} \cdot \frac{q_\phi(z \mid x^{(i)})}{q_\phi(z \mid x^{(i)})}\right] && \text{(multiply by constant)} \tag{6}\\
&= \mathbb{E}_z\Big[\log p_\theta(x^{(i)} \mid z)\Big] - \mathbb{E}_z\left[\log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z)}\right] + \mathbb{E}_z\left[\log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z \mid x^{(i)})}\right] && \text{(logarithms)} \tag{7}\\
&= \underbrace{\mathbb{E}_z\Big[\log \underbrace{p_\theta(x^{(i)} \mid z)}_{\text{decoder}}\Big] - \mathrm{KL}\Big(\underbrace{q_\phi(z \mid x^{(i)})}_{\text{encoder}} \,\Big\|\, \underbrace{p_\theta(z)}_{z\text{ prior}}\Big)}_{\mathcal{L}(x^{(i)},\, \theta,\, \phi)} + \underbrace{\mathrm{KL}\Big(q_\phi(z \mid x^{(i)}) \,\Big\|\, p_\theta(z \mid x^{(i)})\Big)}_{\text{intractable, but } \geq 0} \tag{8}
\end{align}$$
  • The first bracketed term is the tractable lower bound $\mathcal{L}(x^{(i)}, \theta, \phi)$, in which both $p_\theta(x \mid z)$ and the KL term are differentiable.
  • The last KL term (against the true posterior) is intractable, but it is always $\geq 0$, so the variational lower bound (ELBO) follows:

    $$\log p_\theta(x^{(i)}) \geq \mathcal{L}(x^{(i)}, \theta, \phi)$$

  • Training: maximize the lower bound

    $$\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(x^{(i)}, \theta, \phi)$$

$$\mathcal{L}(x^{(i)}, \theta, \phi) = \mathbb{E}_z\Big[\log \underbrace{p_\theta(x^{(i)} \mid z)}_{\text{decoder}}\Big] - \mathrm{KL}\Big(\underbrace{q_\phi(z \mid x^{(i)})}_{\text{encoder}} \,\Big\|\, \underbrace{p_\theta(z)}_{z\text{ prior}}\Big)$$

where

  • the first term $\mathbb{E}_z\big[\log p_\theta(x^{(i)} \mid z)\big]$ reconstructs the input data; it is the negative reconstruction error.
  • the second term $\mathrm{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big)$ makes the approximate posterior close to the prior; it acts as a regularizer.

The derived estimator, when using an isotropic multivariate Gaussian prior $p_\theta(z) = \mathcal{N}(z; 0, I)$, is:

$$\mathcal{L}(\theta, \phi; x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{D} \Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x^{(i)} \mid z^{(i,l)}\big)$$

where $z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}$ with $\epsilon^{(l)} \sim \mathcal{N}(0, I)$, and $\mu_j^{(i)}$, $\sigma_j^{(i)}$ denote the $j$-th elements of the mean and standard deviation vectors.
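
A sketch of the single-sample ($L = 1$) version of this estimator in PyTorch, assuming a Bernoulli decoder so that $\log p_\theta(x \mid z)$ becomes a negative binary cross-entropy; `x_recon_logits`, `mu`, and `logvar` are assumed to come from a decoder and encoder like those above.

```python
import torch
import torch.nn.functional as F

def elbo(x, x_recon_logits, mu, logvar):
    """x: targets in [0, 1]; x_recon_logits: decoder output for one sampled z."""
    # Reconstruction term E_z[log p(x|z)], estimated with a single z sample.
    recon = -F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="sum")
    # Closed-form -KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior:
    # 1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    neg_kl = 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + neg_kl   # maximize this lower bound (or minimize its negative)
```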

Reparameterization trick

Given the deterministic mapping $z = g_\phi(\epsilon, x)$, we know that

$$q_\phi(z \mid x) \prod_i dz_i = p(\epsilon) \prod_i d\epsilon_i$$

Thus,

$$\begin{align}
\int q_\phi(z \mid x)\, f(z)\, dz &= \int p(\epsilon)\, f\big(g_\phi(\epsilon, x)\big)\, d\epsilon \tag{9}\\
&\simeq \frac{1}{L} \sum_{l=1}^{L} f\big(g_\phi(x, \epsilon^{(l)})\big) \quad \text{where } \epsilon^{(l)} \sim p(\epsilon) \tag{10}
\end{align}$$

Take the univariate Gaussian case as an example: for $z \sim p(z \mid x) = \mathcal{N}(\mu, \sigma^2)$, a valid reparameterization is $z = \mu + \sigma\epsilon$, where the auxiliary noise variable $\epsilon \sim \mathcal{N}(0, 1)$.
Thus,

$$\mathbb{E}_{\mathcal{N}(z;\, \mu, \sigma^2)}\big[f(z)\big] = \mathbb{E}_{\mathcal{N}(\epsilon;\, 0, 1)}\big[f(\mu + \sigma\epsilon)\big] \tag{11}$$
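
The trick in code: a hypothetical `reparameterize` helper that draws $\epsilon$ from a fixed standard normal and transforms it deterministically, so gradients can flow back through $\mu$ and $\log\sigma^2$.

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)    # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)      # epsilon ~ N(0, I), no gradient needed
    return mu + std * eps            # z = mu + sigma * epsilon
```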

Generation

  • After training, remove the encoder network and use the decoder network to generate new samples (sketched after this list).
  • Sample z from prior as the input!
  • Diagonal prior on z -> independent latent variables!

  • Different dimensions of z encode interpretable factors of variation.
  • A good feature representation can be computed using $q_\phi(z \mid x)$.
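
A minimal sketch of this sampling procedure, where `decoder` stands for whatever decoder network was trained and `latent_dim` is assumed to match its input size (both are placeholders, not names from the text).

```python
import torch

@torch.no_grad()
def generate(decoder, num_samples=16, latent_dim=20):
    z = torch.randn(num_samples, latent_dim)   # sample z from the prior N(0, I)
    return torch.sigmoid(decoder(z))           # decoded samples in [0, 1] (Bernoulli decoder)
```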

Pros & cons

  • Probabilistic spin to traditional autoencoders => allows generating data
  • Defines an intractable density => derive and optimize a (variational) lower bound

Pros:

  • Principled approach to generative models
  • Allows inference of $q(z \mid x)$, which can be a useful feature representation for downstream tasks

Cons:

  • Maximizes a lower bound on the likelihood: okay, but the evaluation is not as direct as with PixelRNN / PixelCNN
  • Samples tend to be of lower quality compared to the state of the art (GANs)

Variational Graph Auto-Encoder (VGAE)

Definition

Given an undirected, unweighted graph $G = (V, E)$ with $N = |V|$ nodes, define the adjacency matrix $A$ with self-connections (i.e., the diagonal is set to 1), the degree matrix $D$, stochastic latent variables $z_i$ collected in the matrix $Z \in \mathbb{R}^{N \times F}$, and node features $X \in \mathbb{R}^{N \times D}$ [7].

Inference model

Apply a two-layer Graph Convolutional Network (GCN) for the parameterization:

$$\begin{align}
q(Z \mid X, A) &= \prod_{i=1}^{N} q(z_i \mid X, A) \tag{12}\\
q(z_i \mid X, A) &= \mathcal{N}\big(z_i \mid \mu_i, \operatorname{diag}(\sigma_i^2)\big) \tag{13}
\end{align}$$

where

  • Mean: $\mu = \mathrm{GCN}_\mu(X, A)$
  • Variance: $\log \sigma = \mathrm{GCN}_\sigma(X, A)$

The two-layer GCN is defined as $\mathrm{GCN}(X, A) = \tilde{A}\, \mathrm{ReLU}(\tilde{A} X W_0)\, W_1$,
where $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the symmetrically normalized adjacency matrix.
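
A minimal PyTorch sketch of this two-layer GCN encoder, assuming a dense precomputed $\tilde{A}$ is passed in; following [7], the first-layer weights $W_0$ are shared between the $\mu$ and $\log\sigma$ branches, while the hidden and latent sizes here are illustrative.

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    def __init__(self, in_dim, hidden_dim=32, latent_dim=16):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)         # shared first layer
        self.W1_mu = nn.Linear(hidden_dim, latent_dim, bias=False)
        self.W1_sigma = nn.Linear(hidden_dim, latent_dim, bias=False)

    def forward(self, X, A_norm):
        """X: (N, in_dim) node features; A_norm: (N, N) normalized adjacency."""
        h = torch.relu(A_norm @ self.W0(X))     # first GCN layer
        mu = A_norm @ self.W1_mu(h)             # GCN_mu(X, A)
        logsigma = A_norm @ self.W1_sigma(h)    # GCN_sigma(X, A)
        return mu, logsigma
```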

Generative model

The generative model applies an inner product between latent variables:

$$\begin{align}
p(A \mid Z) &= \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \mid z_i, z_j) \tag{14}\\
p(A_{ij} = 1 \mid z_i, z_j) &= \sigma(z_i^\top z_j) \tag{15}
\end{align}$$

where $A_{ij}$ are the elements of the adjacency matrix $A$ and $\sigma(\cdot)$ denotes the sigmoid function.
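
In code, this inner-product decoder is a single line over the latent matrix $Z$:

```python
import torch

def decode_adjacency(Z):
    # p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j), computed for all pairs at once.
    return torch.sigmoid(Z @ Z.t())
```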

Learning

Optimize the variational lower bound (ELBO) $\mathcal{L}$ w.r.t. the variational parameters $W_i$:

$$\mathcal{L} = \mathbb{E}_{q(Z \mid X, A)}\big[\log p(A \mid Z)\big] - \mathrm{KL}\big[q(Z \mid X, A) \,\|\, p(Z)\big]$$

where the Gaussian prior is $p(Z) = \prod_i p(z_i) = \prod_i \mathcal{N}(z_i \mid 0, I)$.
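
A minimal sketch of this objective, assuming a dense binary target adjacency matrix (with self-loops) and the $\log\sigma$ parameterization used above; real implementations typically also reweight edges to handle graph sparsity, which is omitted here.

```python
import torch
import torch.nn.functional as F

def vgae_loss(A, Z_mu, Z_logsigma):
    """A: (N, N) float adjacency with self-loops; Z_mu, Z_logsigma: (N, F) encoder outputs."""
    std = torch.exp(Z_logsigma)
    Z = Z_mu + std * torch.randn_like(std)       # reparameterized sample of Z
    A_logits = Z @ Z.t()                         # inner-product decoder (pre-sigmoid)
    recon = F.binary_cross_entropy_with_logits(A_logits, A, reduction="sum")
    # KL(q(Z|X,A) || N(0, I)) with a log(sigma) (not log sigma^2) parameterization.
    kl = -0.5 * torch.sum(1 + 2 * Z_logsigma - Z_mu.pow(2) - (2 * Z_logsigma).exp())
    return recon + kl   # minimizing this is equivalent to maximizing the ELBO
```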

References


  1. Stanford CS231n: Generative Models
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  3. Goodfellow, I. (2016). Tutorial: Generative Adversarial Networks. In NIPS.
  4. Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114.
  5. Doersch, C. (2016). Tutorial on Variational Autoencoders. arXiv:1606.05908.
  6. Stanford CS236: VAE notes
  7. Kipf, T. N., & Welling, M. (2016). Variational Graph Auto-Encoders. arXiv:1611.07308.