
Review: Backpropagation step by step

A quick note on MLP implementation using numpy.

MLP step by step

Here, we implement forward and backward propagation for one-layer and two-layer Multi-Layer Perceptrons (MLPs) using numpy. The formulas for forward and backward propagation are given below, along with the corresponding Python code.

1. One-Layer MLP

Forward Propagation

Given:

  • $X$ is the input matrix of shape $(m, n)$, where $m$ is the number of samples and $n$ is the number of input features.
  • $W$ is the weight matrix of shape $(n, k)$, where $k$ is the number of output units.
  • $b$ is the bias term of shape $(1, k)$.

Forward Pass Formula:

$$Z = XW + b, \qquad A = \sigma(Z)$$

Where:

  • $\sigma$ is the sigmoid activation function: $\sigma(x) = \frac{1}{1 + e^{-x}}$.
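
A quick aside that the backward pass will rely on: the derivative of the sigmoid can be written entirely in terms of its own output,

$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

This is why the sigmoid_derivative helper in the code below takes the already-computed activation as its argument rather than recomputing anything from $Z$.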

Code:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Expects the sigmoid *output* (the activation), so this is sigma'(z) = sigma(z) * (1 - sigma(z))
    return x * (1 - x)

class OneLayerMLP:
    def __init__(self, input_size, output_size):
        # Initialize weights and biases
        self.weights = np.random.randn(input_size, output_size)  # Input to output weights
        self.bias = np.zeros((1, output_size))  # Bias

    def forward(self, X):
        # Forward pass
        self.X = X
        self.Z = np.dot(X, self.weights) + self.bias  # Z = X * W + b
        self.A = sigmoid(self.Z)  # Activation function output
        return self.A
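
As a minimal smoke test (the shapes and data here are illustrative assumptions, not from the original post), the forward pass can be exercised on random inputs:

# Hypothetical smoke test: 5 samples, 3 features, 2 outputs
np.random.seed(0)
X = np.random.randn(5, 3)
mlp = OneLayerMLP(input_size=3, output_size=2)
A = mlp.forward(X)
print(A.shape)  # (5, 2); every entry lies in (0, 1) because of the sigmoid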

Backward Propagation

Loss Function: Mean Squared Error (MSE):

$$L = \frac{1}{m} \sum_{i=1}^{m} \left(A_i - Y_i\right)^2$$

Where:

  • $A$ is the predicted output from the network and $Y$ is the target.

Gradient of Loss with Respect to Output:

$$\frac{\partial L}{\partial A} = A - Y$$

(The constant factor of 2 is dropped here; it is effectively absorbed into the learning rate, which matches mean_squared_error_derivative in the code below.)

Gradient with Respect to $Z$:

$$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial A} \cdot \sigma'(Z) = \frac{\partial L}{\partial A} \cdot A\,(1 - A)$$

Gradient with Respect to Weights and Bias:

$$\frac{\partial L}{\partial W} = \frac{1}{m}\, X^T \frac{\partial L}{\partial Z}, \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\frac{\partial L}{\partial Z}\right)_i$$

Weight and Bias Update:

$$W \leftarrow W - \eta\, \frac{\partial L}{\partial W}, \qquad b \leftarrow b - \eta\, \frac{\partial L}{\partial b}$$

Where $\eta$ is the learning rate.
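
One way to sanity-check these formulas is a numerical gradient check: nudge a single weight, recompute the loss, and compare the finite-difference slope against the analytic gradient. The sketch below is an illustrative addition of mine (the helper name and shapes are hypothetical, not from the original post); it keeps the factor of 2 explicit so the analytic and numerical values match exactly.

def numerical_gradient_check():
    # Illustrative check (not part of the original post): compare the analytic
    # MSE gradient w.r.t. one weight entry against a central finite difference.
    np.random.seed(1)
    X = np.random.randn(4, 3)   # 4 samples, 3 features (hypothetical shapes)
    Y = np.random.rand(4, 1)    # single output, so np.mean averages over samples only
    W = np.random.randn(3, 1)
    b = np.zeros((1, 1))

    def loss(W):
        A = sigmoid(np.dot(X, W) + b)
        return np.mean((Y - A) ** 2)

    # Analytic gradient, keeping the factor of 2 from d/dA of (A - Y)^2
    A = sigmoid(np.dot(X, W) + b)
    dZ = 2 * (A - Y) * A * (1 - A)
    dW = np.dot(X.T, dZ) / Y.shape[0]

    # Central finite difference for the single entry W[0, 0]
    eps = 1e-6
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, 0] += eps
    W_minus[0, 0] -= eps
    numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)
    print(dW[0, 0], numeric)    # the two numbers should agree closely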

Code:

def mean_squared_error_derivative(y_true, y_pred):
    return y_pred - y_true

class OneLayerMLP:
    def __init__(self, input_size, output_size):
        self.weights = np.random.randn(input_size, output_size)
        self.bias = np.zeros((1, output_size))

    def forward(self, X):
        self.X = X
        self.Z = np.dot(X, self.weights) + self.bias
        self.A = sigmoid(self.Z)
        return self.A

    def backward(self, Y, learning_rate=0.1):
        m = Y.shape[0]

        # Gradient for output layer
        dA = mean_squared_error_derivative(Y, self.A)
        dZ = dA * sigmoid_derivative(self.A)

        # Gradients for weights and bias
        dW = np.dot(self.X.T, dZ) / m
        db = np.sum(dZ, axis=0, keepdims=True) / m

        # Update weights and bias
        self.weights -= learning_rate * dW
        self.bias -= learning_rate * db

        loss = np.mean((Y - self.A) ** 2)  # MSE loss
        return loss
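
To see forward and backward working together, here is a minimal training loop on a toy dataset; the data, seed, and hyperparameters are illustrative assumptions, not from the original post.

# Illustrative training loop on synthetic data
np.random.seed(42)
X = np.random.randn(100, 2)
Y = sigmoid(X[:, [0]] - X[:, [1]])    # a target the one-layer model can represent exactly

model = OneLayerMLP(input_size=2, output_size=1)
for epoch in range(1000):
    model.forward(X)                  # caches X, Z, A for the backward pass
    loss = model.backward(Y, learning_rate=0.5)
    if epoch % 200 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")

The loss should decrease steadily, since the target was constructed to be representable by a single sigmoid layer.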

2. Two-Layer MLP

Forward Propagation

For a two-layer MLP, the architecture consists of:

  • Input layer: $X$
  • Hidden layer: $A_1$, with weights $W_1$ and bias $b_1$
  • Output layer: $A_2$, with weights $W_2$ and bias $b_2$

  1. First Layer (Input to Hidden Layer):

$$Z_1 = X W_1 + b_1, \qquad A_1 = \sigma(Z_1)$$

  2. Second Layer (Hidden to Output Layer):

$$Z_2 = A_1 W_2 + b_2, \qquad A_2 = \sigma(Z_2)$$

Where $A_2$ is the final output.

Code:

class TwoLayerMLP:
    def __init__(self, input_size, hidden_size, output_size):
        self.W1 = np.random.randn(input_size, hidden_size)  # Input to hidden weights
        self.b1 = np.zeros((1, hidden_size))  # Hidden layer bias
        self.W2 = np.random.randn(hidden_size, output_size)  # Hidden to output weights
        self.b2 = np.zeros((1, output_size))  # Output layer bias

    def forward(self, X):
        self.X = X
        self.Z1 = np.dot(X, self.W1) + self.b1  # Hidden layer linear output
        self.A1 = sigmoid(self.Z1)  # Hidden layer activation
        self.Z2 = np.dot(self.A1, self.W2) + self.b2  # Output layer linear output
        self.A2 = sigmoid(self.Z2)  # Output layer activation
        return self.A2

Backward Propagation

We compute the gradients of the loss with respect to weights and biases for both layers.

Gradient for Output Layer:

$$\frac{\partial L}{\partial A_2} = A_2 - Y, \qquad \frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial A_2} \cdot A_2\,(1 - A_2)$$

Gradients for Weight $W_2$ and Bias $b_2$:

$$\frac{\partial L}{\partial W_2} = \frac{1}{m}\, A_1^T \frac{\partial L}{\partial Z_2}, \qquad \frac{\partial L}{\partial b_2} = \frac{1}{m} \sum_{i=1}^{m} \left(\frac{\partial L}{\partial Z_2}\right)_i$$

Gradients for Hidden Layer:

To propagate the error back to the hidden layer:

$$\frac{\partial L}{\partial A_1} = \frac{\partial L}{\partial Z_2}\, W_2^T, \qquad \frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_1} \cdot A_1\,(1 - A_1)$$

Gradients for Weight $W_1$ and Bias $b_1$:

$$\frac{\partial L}{\partial W_1} = \frac{1}{m}\, X^T \frac{\partial L}{\partial Z_1}, \qquad \frac{\partial L}{\partial b_1} = \frac{1}{m} \sum_{i=1}^{m} \left(\frac{\partial L}{\partial Z_1}\right)_i$$

Code:

class TwoLayerMLP:
    def __init__(self, input_size, hidden_size, output_size):
        self.W1 = np.random.randn(input_size, hidden_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size)
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        self.X = X
        self.Z1 = np.dot(X, self.W1) + self.b1
        self.A1 = sigmoid(self.Z1)
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

    def backward(self, Y, learning_rate=0.1):
        m = Y.shape[0]

        # Output layer gradients
        dA2 = mean_squared_error_derivative(Y, self.A2)
        dZ2 = dA2 * sigmoid_derivative(self.A2)
        dW2 = np.dot(self.A1.T, dZ2) / m
        db2 = np.sum(dZ2, axis=0, keepdims=True) / m

        # Hidden layer gradients
        dA1 = np.dot(dZ2, self.W2.T)
        dZ1 = dA1 * sigmoid_derivative(self.A1)
        dW1 = np.dot(self.X.T, dZ1) / m
        db1 = np.sum(dZ1, axis=0, keepdims=True) / m

        # Update weights and biases
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

        loss = np.mean((Y - self.A2) ** 2)  # MSE loss
        return loss
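
As a closing usage sketch (the dataset, seed, and hyperparameters are my own illustrative choices, not from the original post), a two-layer MLP with a small hidden layer can fit XOR, which no one-layer model can:

# XOR: a classic task that requires the hidden layer
np.random.seed(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

model = TwoLayerMLP(input_size=2, hidden_size=8, output_size=1)
for epoch in range(20000):
    model.forward(X)
    loss = model.backward(Y, learning_rate=2.0)
    if epoch % 5000 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")

# Predictions typically end up close to [[0], [1], [1], [0]], though how quickly
# depends on the random initialization.
print(np.round(model.forward(X), 2))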