A brief introduction to probability theory and information theory.
Introduction
The Bayesian interpretation of probability models uncertainty about events.
Probability theory
Discrete random variables
We denote the probability of the event that $X=x$ by $p(X=x)$, or just $p(x)$ for short. The expression $p(A)$ denotes the probability that the event $A$ is true. Here $p()$ is called a probability mass function (pmf).
Fundamental rules
Union of two events: $$p(A \vee B) = p(A) + p(B) - p(A \wedge B)$$
Joint probabilities (product rule): $$p(A, B) = p(A \wedge B) = p(A|B)\, p(B)$$
Marginal distribution (sum rule): $$p(A) = \sum_b p(A, B=b) = \sum_b p(A|B=b)\, p(B=b)$$
Chain rule of probability: $$p(X_{1:D}) = p(X_1)\, p(X_2|X_1)\, p(X_3|X_1, X_2) \cdots p(X_D|X_{1:D-1})$$
Conditional probability: $$p(A|B) = \frac{p(A, B)}{p(B)} \quad \text{if } p(B) > 0$$
Bayes rule (a.k.a. Bayes Theorem): $$p(X=x|Y=y) = \frac{p(X=x)\, p(Y=y|X=x)}{\sum_{x'} p(X=x')\, p(Y=y|X=x')}$$
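As a quick numerical sanity check, here is a minimal Python sketch applying Bayes rule; the disease-test numbers are made up purely for illustration.

```python
# Hypothetical numbers, purely for illustration: a test for a rare condition.
p_disease = 0.01            # prior p(D=1)
p_pos_given_disease = 0.95  # likelihood p(T=1 | D=1)
p_pos_given_healthy = 0.05  # false positive rate p(T=1 | D=0)

# Marginal p(T=1) via the sum rule.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes rule: p(D=1 | T=1) = p(T=1 | D=1) p(D=1) / p(T=1).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ~0.161: a positive test is far from conclusive
```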
Continuous random variables
Cumulative distribution function (cdf) of $X$: define $$F(q) \triangleq p(X \leq q)$$
Probability density function (pdf): define $$f(x) \triangleq \frac{d}{dx} F(x)$$ so that $p(a < X \leq b) = \int_a^b f(x)\, dx$.
Mean and Variance
Mean (expected value)
Let $\mu$ denote the mean (expected value).
- For discrete rv's: $\mathbb{E}[X] \triangleq \sum_{x \in \mathcal{X}} x\, p(x)$
- For continuous rv's: $\mathbb{E}[X] \triangleq \int_{\mathcal{X}} x\, p(x)\, dx$
Variance
Let $\sigma^2$ denote the variance, a measure of the "spread" of a distribution: $$\text{var}[X] \triangleq \mathbb{E}[(X-\mu)^2] = \int (x-\mu)^2\, p(x)\, dx$$
Expanding the square and using $\mathbb{E}[X]=\mu$ gives a useful result: $$\text{var}[X] = \mathbb{E}[X^2] - \mu^2$$
The standard deviation is: $$\text{std}[X] \triangleq \sqrt{\text{var}[X]}$$
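A minimal numpy check of the identity $\text{var}[X] = \mathbb{E}[X^2] - \mu^2$ on simulated samples (the distribution and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any distribution will do

mu = x.mean()
var_direct = np.mean((x - mu) ** 2)       # E[(X - mu)^2]
var_identity = np.mean(x ** 2) - mu ** 2  # E[X^2] - mu^2

print(var_direct, var_identity)  # the two estimates agree
```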
Common discrete distributions
The binomial and Bernoulli distributions
Suppose we toss a coin $n$ times. Let $X \in \{0,\dots,n\}$ be the number of heads. If the probability of heads is $\theta$, then $X$ has a binomial distribution, written $X \sim \text{Bin}(n, \theta)$, with pmf $$\text{Bin}(k|n,\theta) \triangleq \binom{n}{k} \theta^k (1-\theta)^{n-k}$$
Here $$\binom{n}{k} \triangleq \frac{n!}{(n-k)!\, k!}$$
is the number of ways to choose $k$ items from $n$ (a.k.a. the binomial coefficient, pronounced "n choose k").
Now suppose we toss a coin only once. Let $X \in \{0,1\}$ be a binary random variable, with probability of "success" or "heads" $\theta$. We say that $X$ has a Bernoulli distribution, written $X \sim \text{Ber}(\theta)$, where the pmf is defined as: $$\text{Ber}(x|\theta) = \theta^{\mathbb{I}(x=1)} (1-\theta)^{\mathbb{I}(x=0)}$$
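A small sketch of the binomial pmf in Python; the Bernoulli pmf is the special case $n = 1$ (the parameter values are arbitrary):

```python
from math import comb

def binom_pmf(k: int, n: int, theta: float) -> float:
    """Bin(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# Probability of 3 heads in 10 tosses of a fair coin.
print(binom_pmf(3, 10, 0.5))          # ~0.117

# Bernoulli is the n = 1 special case: Ber(1 | theta) = theta.
print(binom_pmf(1, 1, 0.3), binom_pmf(0, 1, 0.3))  # 0.3, 0.7
```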
The multinomial and multinoulli distributions
The binomial distribution only models the outcomes of coin tosses (two possible results per trial). We use the multinomial distribution to model the outcomes of tossing a $K$-sided die. Let $\mathbf{x} = (x_1,\dots,x_K)$ be a random vector, where $x_j$ is the number of times side $j$ of the die occurs. The pmf is: $$\text{Mu}(\mathbf{x}|n,\boldsymbol{\theta}) \triangleq \binom{n}{x_1 \cdots x_K} \prod_{j=1}^K \theta_j^{x_j}$$
where $\theta_j$ is the probability that side $j$ shows up, and the multinomial coefficient $\binom{n}{x_1 \cdots x_K}$ is: $$\binom{n}{x_1 \cdots x_K} \triangleq \frac{n!}{x_1!\, x_2! \cdots x_K!}$$
Suppose $n = 1$, i.e. we roll a $K$-sided die once. Then $\mathbf{x}$ is a binary vector with exactly one nonzero entry; this is called one-hot encoding, and the resulting distribution is the multinoulli (categorical) distribution.
The Poisson distribution
We say that $X \sim \text{Poi}(\lambda)$, with parameter $\lambda > 0$, if its pmf is: $$\text{Poi}(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}$$
where the first term, $e^{-\lambda}$, is just the normalization constant, required to ensure the pmf sums to one.
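A quick check, in plain Python, that this pmf is normalized and has mean $\lambda$ (the value of $\lambda$ is arbitrary):

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """Poi(x | lambda) = exp(-lambda) * lambda^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3.5
probs = [poisson_pmf(x, lam) for x in range(100)]
print(sum(probs))                                      # ~1.0 (normalization)
print(sum(x * p for x, p in zip(range(100), probs)))   # ~3.5 (mean equals lambda)
```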
The empirical distribution
Given a set of data $\mathcal{D} = \{x_1,\dots,x_N\}$, define the empirical distribution (a.k.a. empirical measure): $$p_{\text{emp}}(A) \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{x_i}(A)$$
where $\delta_x(A)$ is the Dirac measure, defined by: $$\delta_x(A) = \begin{cases} 0 & \text{if } x \notin A \\ 1 & \text{if } x \in A \end{cases}$$
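A sketch of the empirical measure on a small made-up data set: the probability of a set $A$ is just the fraction of data points falling in $A$.

```python
import numpy as np

data = np.array([1.2, 3.4, 3.4, 0.5, 2.1])  # made-up data set D

def p_emp(A) -> float:
    """Empirical measure: average of the Dirac measures delta_{x_i}(A)."""
    return float(np.mean([1.0 if A(x) else 0.0 for x in data]))

# Probability assigned to the event {x : x > 2} under the empirical distribution.
print(p_emp(lambda x: x > 2))  # 0.6 (3 of the 5 points exceed 2)
```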
Common continuous distributions
Gaussian (normal) distribution
Let $X \sim \mathcal{N}(\mu, \sigma^2)$ denote that $X$ has a Gaussian (normal) distribution, with pdf $$\mathcal{N}(x|\mu,\sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$
The precision of a Gaussian is the inverse variance, $\lambda = 1/\sigma^2$. A high precision means a narrow distribution (low variance) centered on $\mu$.
- Problem: the Gaussian distribution is sensitive to outliers, since its log-probability only decays quadratically with the distance from the center.
- A more robust alternative is the Student $t$ distribution.
Laplace distribution
A.k.a. the double-sided exponential distribution, with pdf: $$\text{Lap}(x|\mu,b) \triangleq \frac{1}{2b} \exp\left(-\frac{|x-\mu|}{b}\right)$$
Here $\mu$ is a location parameter and $b>0$ is a scale parameter.
Gamma distribution
The gamma distribution is a flexible distribution for positive real valued rv's, $x>0$. It is defined in terms of a shape parameter $a>0$ and a rate parameter $b>0$: $$\text{Ga}(x|a,b) \triangleq \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-xb}$$
where $\Gamma(a)$ is the gamma function: $$\Gamma(a) \triangleq \int_0^\infty u^{a-1} e^{-u}\, du$$
Beta distribution
The beta distribution has support over the interval $[0,1]$: $$\text{Beta}(x|a,b) = \frac{1}{B(a,b)}\, x^{a-1} (1-x)^{b-1}$$
Here $B(a,b)$ is the beta function, $$B(a,b) \triangleq \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
Pareto distribution
The Pareto distribution is used to model the distribution of quantities that exhibit long tails (heavy tails).
For example, the most frequent word in English occurs approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word. This is Zipf's law.
Its pdf is: $$\text{Pareto}(x|k,m) = k m^k x^{-(k+1)}\, \mathbb{I}(x \geq m)$$
Joint probability distributions
A joint probability distribution has the form $p(x_1,\dots,x_D)$ for a set of $D > 1$ variables, and models the (stochastic) relationships between the variables.
Covariance and correlation
The covariance between two rv's $X$ and $Y$ measures the degree to which $X$ and $Y$ are (linearly) related: $$\text{cov}[X,Y] \triangleq \mathbb{E}\left[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$
If $\mathbf{x}$ is a $d$-dimensional random vector, its covariance matrix is the following symmetric, positive semi-definite matrix: $$\text{cov}[\mathbf{x}] \triangleq \mathbb{E}\left[(\mathbf{x}-\mathbb{E}[\mathbf{x}])(\mathbf{x}-\mathbb{E}[\mathbf{x}])^T\right]$$
Covariances can take any value in $(-\infty, \infty)$. The Pearson correlation coefficient is a normalized measure with finite bounds: $$\text{corr}[X,Y] \triangleq \frac{\text{cov}[X,Y]}{\sqrt{\text{var}[X]\,\text{var}[Y]}}$$
A correlation matrix has the form: $$\mathbf{R} = \begin{pmatrix} \text{corr}[X_1,X_1] & \cdots & \text{corr}[X_1,X_d] \\ \vdots & \ddots & \vdots \\ \text{corr}[X_d,X_1] & \cdots & \text{corr}[X_d,X_d] \end{pmatrix}$$
where $\text{corr}[X,Y] \in [-1, 1]$.
independent $\Rightarrow$ uncorrelated,
uncorrelated $\nRightarrow$ independent.
A more general way to measure the dependence between rv's is mutual information (discussed below).
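A minimal numpy illustration of covariance and correlation, and of the fact that uncorrelated does not imply independent (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)   # linearly related to x

print(np.cov(x, y))        # 2x2 covariance matrix
print(np.corrcoef(x, y))   # correlation close to +0.89

# Uncorrelated but dependent: z = x^2 for symmetric x has corr ~ 0.
z = x ** 2
print(np.corrcoef(x, z)[0, 1])  # ~0, even though z is a deterministic function of x
```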
Multivariate Gaussian
The Multivariate Gaussian or multivariate normal (MVN) is the most widely used pdf for continuous variables.
Its pdf is: $$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$$
where $\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}] \in \mathbb{R}^D$ is the mean vector, and $\boldsymbol{\Sigma} = \text{cov}[\mathbf{x}]$ is the $D \times D$ covariance matrix. The precision matrix or concentration matrix is the inverse covariance matrix, $\boldsymbol{\Lambda} \triangleq \boldsymbol{\Sigma}^{-1}$. The normalization constant $(2\pi)^{-D/2} |\boldsymbol{\Lambda}|^{1/2}$ just ensures that the pdf integrates to 1.
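A direct numpy implementation of the MVN density formula above (the mean and covariance used are arbitrary examples):

```python
import numpy as np

def mvn_pdf(x: np.ndarray, mu: np.ndarray, Sigma: np.ndarray) -> float:
    """Evaluate N(x | mu, Sigma) directly from the formula above."""
    D = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return float(norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))  # density at a single point
```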
Multivariate Student $t$ distribution
Dirichlet distribution
Transformations of random variables
Linear transformation
Suppose $f()$ is a linear function: $$\mathbf{y} = f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$$
By linearity of expectation, the mean of $\mathbf{y}$ is: $$\mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$$
where $\boldsymbol{\mu}=\mathbb{E}[\mathbf{x}]$.
The covariance is: $$\text{cov}[\mathbf{y}] = \text{cov}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T$$
where $\boldsymbol{\Sigma} = \text{cov}[\mathbf{x}]$.
If $f()$ is a scalar-valued function, $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$, the mean is: $$\mathbb{E}[\mathbf{a}^T\mathbf{x} + b] = \mathbf{a}^T\boldsymbol{\mu} + b$$
and the variance is: $$\text{var}[\mathbf{a}^T\mathbf{x} + b] = \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a}$$
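A sampling check of the identities $\mathbb{E}[\mathbf{A}\mathbf{x}+\mathbf{b}] = \mathbf{A}\boldsymbol{\mu}+\mathbf{b}$ and $\text{cov}[\mathbf{A}\mathbf{x}+\mathbf{b}] = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T$; the choices of $\mathbf{A}$, $\mathbf{b}$, and the input distribution are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # samples of x
y = x @ A.T + b                                       # y = A x + b, row by row

print(y.mean(axis=0), A @ mu + b)       # empirical vs. analytic mean
print(np.cov(y.T), A @ Sigma @ A.T)     # empirical vs. analytic covariance
```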
Central limit theorem
Consider $N$ random variables with pdf's $p(x_i)$, each with mean $\mu$ and variance $\sigma^2$. We assume the variables are iid (independent and identically distributed). Let $$S_N = \sum_{i=1}^N X_i$$ be the sum of the rv's.
As $N$ increases, the distribution of this sum approaches a Gaussian: $$p(S_N = s) = \frac{1}{\sqrt{2\pi N\sigma^2}} \exp\left(-\frac{(s - N\mu)^2}{2N\sigma^2}\right)$$
Hence the distribution of the quantity $$Z_N \triangleq \frac{S_N - N\mu}{\sigma\sqrt{N}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}}$$
converges to the standard normal, where $\bar{X} = \frac{1}{N}\sum_{i=1}^N x_i$ is the sample mean. This result is the central limit theorem.
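A small simulation of the CLT: standardized sums of iid uniforms (deliberately non-Gaussian) behave like a standard normal; $N$ and the number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                      # number of iid terms in each sum
trials = 100_000

# Uniform(0, 1): mean 0.5, variance 1/12 -- deliberately non-Gaussian.
mu, sigma = 0.5, np.sqrt(1 / 12)
S = rng.uniform(size=(trials, N)).sum(axis=1)   # S_N for each trial
Z = (S - N * mu) / (sigma * np.sqrt(N))         # standardized sums Z_N

print(Z.mean(), Z.var())           # ~0 and ~1
print(np.mean(np.abs(Z) < 1.96))   # ~0.95, as for a standard normal
```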
Monte Carlo approximation
Computing the distribution of a function of an rv using the change of variables formula can be difficult. A simple but powerful alternative is Monte Carlo approximation.
Monte Carlo approximation: first generate $S$ samples from the distribution, call them $x_1,\dots,x_S$. Given the samples, we approximate the distribution of $f(X)$ by the empirical distribution of $\{f(x_s)\}_{s=1}^S$.
We can approximate the expected value with the arithmetic mean of the function applied to the samples: $$\mathbb{E}[f(X)] = \int f(x)\, p(x)\, dx \approx \frac{1}{S} \sum_{s=1}^S f(x_s)$$
where $x_s \sim p(X)$. This is called Monte Carlo integration.
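A sketch of Monte Carlo integration for $f(x) = x^2$ with $X \sim \mathcal{N}(0,1)$, whose exact expectation is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 100_000

x = rng.normal(size=S)   # samples x_s from p(X) = N(0, 1)
f = x ** 2               # f applied to each sample

print(f.mean())          # ~1.0 = E[X^2] for a standard normal
```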
Information theory
Information theory is concerned with representing data in a compact fashion, as well as with transmitting and storing it in a way that is robust to errors.
Entropy
The entropy of a rv $X$ with distribution $p$, denoted by $\mathbb{H}(X)$, is a measure of its uncertainty. For a discrete rv with $K$ states: $$\mathbb{H}(X) \triangleq -\sum_{k=1}^K p(X=k) \log_2 p(X=k)$$
Usually we use log base 2, in which case the units are called bits (short for binary digits); if we use log base $e$, the units are called nats. The discrete distribution with maximum entropy is the uniform distribution.
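A small Python helper for the entropy of a discrete distribution in bits; as noted above, the uniform distribution attains the maximum:

```python
import numpy as np

def entropy_bits(p) -> float:
    """H(X) = -sum_k p_k log2 p_k, treating 0 log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

print(entropy_bits([0.5, 0.5]))   # 1.0 bit (fair coin)
print(entropy_bits([0.9, 0.1]))   # ~0.469 bits (biased coin)
print(entropy_bits([0.25] * 4))   # 2.0 bits (uniform over 4 states, the maximum)
```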
KL divergence
The Kullback-Leibler (KL) divergence or relative entropy measures the dissimilarity of two probability distributions, $p$ and $q$: $$\mathbb{KL}(p||q) \triangleq \sum_{k=1}^K p_k \log \frac{p_k}{q_k}$$
where the sum gets replaced by an integral for pdf's. We can rewrite this as $$\mathbb{KL}(p||q) = \sum_k p_k \log p_k - \sum_k p_k \log q_k = -\mathbb{H}(p) + \mathbb{H}(p,q)$$
where $\mathbb{H}(p,q) \triangleq -\sum_k p_k \log q_k$ is called the cross entropy.
The cross entropy is the average number of bits needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook. Hence the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we used distribution $q$ to encode the data instead of the true distribution $p$.
The "extra number of bits" interpretation makes it clear that $\mathbb{KL}(p||q) \geq 0$, with equality iff $q = p$.
Information inequality: $$\mathbb{KL}(p||q) \geq 0 \text{ with equality iff } p = q$$
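A sketch that computes the KL divergence and checks the decomposition $\mathbb{KL}(p||q) = \mathbb{H}(p,q) - \mathbb{H}(p)$ (the two distributions are arbitrary):

```python
import numpy as np

def entropy(p):                 # H(p), in nats
    p = np.asarray(p, float)
    return float(-(p * np.log(p)).sum())

def cross_entropy(p, q):        # H(p, q), in nats
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-(p * np.log(q)).sum())

def kl(p, q):                   # KL(p || q) = sum_k p_k log(p_k / q_k)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float((p * np.log(p / q)).sum())

p = np.array([0.6, 0.3, 0.1])
q = np.array([1 / 3, 1 / 3, 1 / 3])
print(kl(p, q), cross_entropy(p, q) - entropy(p))  # equal, and >= 0
print(kl(p, p))                                    # 0: KL(p||p) = 0
```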
Mutual information
Mutual information (MI) determines how similar the joint distribution $p(X,Y)$ is to the factored distribution $p(X)p(Y)$: $$\mathbb{I}(X;Y) \triangleq \mathbb{KL}\left(p(X,Y)\,||\,p(X)p(Y)\right) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$
Here $\mathbb{I}(X;Y) \geq 0$ with equality iff $p(X,Y) = p(X) p(Y)$, i.e. the MI is zero iff the variables are independent.
Equivalently, $$\mathbb{I}(X;Y) = \mathbb{H}(X) - \mathbb{H}(X|Y) = \mathbb{H}(Y) - \mathbb{H}(Y|X)$$ where $\mathbb{H}(X|Y)$ is the conditional entropy, defined as $\mathbb{H}(X|Y) = \sum_y p(y)\, \mathbb{H}(X|Y=y)$.
Hence we interpret the MI between $X$ and $Y$ as the reduction in uncertainty about $X$ after observing $Y$, or by symmetry, the reduction in uncertainty about $Y$ after observing $X$.
Pointwise mutual information (PMI) measures the discrepancy between events $x$ and $y$ occurring together compared to what would be expected by chance: $$\text{PMI}(x,y) \triangleq \log \frac{p(x,y)}{p(x)p(y)} = \log \frac{p(x|y)}{p(x)} = \log \frac{p(y|x)}{p(y)}$$
The MI of $X$ and $Y$ is the expected value of the PMI.
The PMI can be interpreted as the amount we learn from updating the prior $p(x)$ into the posterior $p(x|y)$, or equivalently, from updating the prior $p(y)$ into $p(y|x)$.
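A sketch computing the MI of a small made-up joint distribution as the expected PMI:

```python
import numpy as np

# Made-up joint distribution p(X, Y) over 2 x 2 outcomes.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

pmi = np.log2(p_xy / np.outer(p_x, p_y))   # pointwise mutual information, in bits
mi = float((p_xy * pmi).sum())             # MI = expected PMI

print(pmi)
print(mi)   # > 0, since X and Y are clearly dependent here
```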