
Review: Backpropagation step by step

A quick note on MLP implementation using numpy.

MLP step by step

Here, we implement forward and backward propagation for one-layer and two-layer Multi-Layer Perceptrons (MLPs) using numpy. The formulas for forward and backward propagation are given below, along with the corresponding Python code.

1. One-Layer MLP

Forward Propagation

Given:

  • $X$ is the input matrix of shape $(m, n)$, where $m$ is the number of samples and $n$ is the number of input features.
  • $W$ is the weight matrix of shape $(n, k)$, where $k$ is the number of output units.
  • $b$ is the bias term of shape $(1, k)$.

Forward Pass Formula:

$$Z = XW + b, \qquad A = \sigma(Z)$$

Where:

  • $\sigma$ is the sigmoid activation function: $\sigma(x) = \frac{1}{1 + e^{-x}}$.
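
A quick aside that the backward pass will rely on: the derivative of the sigmoid can be written entirely in terms of its own output,

$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

This is why the sigmoid_derivative helper in the code below takes the already-computed activation as its argument rather than recomputing anything from $Z$.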

Code:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Expects the sigmoid *output* (the activation), so this is sigma'(z) = sigma(z) * (1 - sigma(z))
    return x * (1 - x)

class OneLayerMLP:
    def __init__(self, input_size, output_size):
        # Initialize weights and biases
        self.weights = np.random.randn(input_size, output_size)  # Input to output weights
        self.bias = np.zeros((1, output_size))  # Bias

    def forward(self, X):
        # Forward pass
        self.X = X
        self.Z = np.dot(X, self.weights) + self.bias  # Z = X * W + b
        self.A = sigmoid(self.Z)  # Activation function output
        return self.A
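
As a minimal smoke test (the shapes and data here are illustrative assumptions, not from the original post), the forward pass can be exercised on random inputs:

# Hypothetical smoke test: 5 samples, 3 features, 2 outputs
np.random.seed(0)
X = np.random.randn(5, 3)
mlp = OneLayerMLP(input_size=3, output_size=2)
A = mlp.forward(X)
print(A.shape)  # (5, 2); every entry lies in (0, 1) because of the sigmoid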

Backward Propagation

Loss Function: Mean Squared Error (MSE):

$$L = \frac{1}{m} \sum_{i=1}^{m} \left(A_i - Y_i\right)^2$$

Where:

  • $A$ is the predicted output from the network and $Y$ is the target.

Gradient of Loss with Respect to Output:

$$\frac{\partial L}{\partial A} = A - Y$$

(The constant factor of 2 is dropped here; it is effectively absorbed into the learning rate, which matches mean_squared_error_derivative in the code below.)

Gradient with Respect to $Z$:

$$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial A} \cdot \sigma'(Z) = \frac{\partial L}{\partial A} \cdot A\,(1 - A)$$

Gradient with Respect to Weights and Bias:

$$\frac{\partial L}{\partial W} = \frac{1}{m}\, X^T \frac{\partial L}{\partial Z}, \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\frac{\partial L}{\partial Z}\right)_i$$

Weight and Bias Update:

$$W \leftarrow W - \eta\, \frac{\partial L}{\partial W}, \qquad b \leftarrow b - \eta\, \frac{\partial L}{\partial b}$$

Where $\eta$ is the learning rate.
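
One way to sanity-check these formulas is a numerical gradient check: nudge a single weight, recompute the loss, and compare the finite-difference slope against the analytic gradient. The sketch below is an illustrative addition of mine (the helper name and shapes are hypothetical, not from the original post); it keeps the factor of 2 explicit so the analytic and numerical values match exactly.

def numerical_gradient_check():
    # Illustrative check (not part of the original post): compare the analytic
    # MSE gradient w.r.t. one weight entry against a central finite difference.
    np.random.seed(1)
    X = np.random.randn(4, 3)   # 4 samples, 3 features (hypothetical shapes)
    Y = np.random.rand(4, 1)    # single output, so np.mean averages over samples only
    W = np.random.randn(3, 1)
    b = np.zeros((1, 1))

    def loss(W):
        A = sigmoid(np.dot(X, W) + b)
        return np.mean((Y - A) ** 2)

    # Analytic gradient, keeping the factor of 2 from d/dA of (A - Y)^2
    A = sigmoid(np.dot(X, W) + b)
    dZ = 2 * (A - Y) * A * (1 - A)
    dW = np.dot(X.T, dZ) / Y.shape[0]

    # Central finite difference for the single entry W[0, 0]
    eps = 1e-6
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, 0] += eps
    W_minus[0, 0] -= eps
    numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)
    print(dW[0, 0], numeric)    # the two numbers should agree closely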

Code:

def mean_squared_error_derivative(y_true, y_pred):
    return y_pred - y_true

class OneLayerMLP:
    def __init__(self, input_size, output_size):
        self.weights = np.random.randn(input_size, output_size)
        self.bias = np.zeros((1, output_size))

    def forward(self, X):
        self.X = X
        self.Z = np.dot(X, self.weights) + self.bias
        self.A = sigmoid(self.Z)
        return self.A

    def backward(self, Y, learning_rate=0.1):
        m = Y.shape[0]

        # Gradient for output layer
        dA = mean_squared_error_derivative(Y, self.A)
        dZ = dA * sigmoid_derivative(self.A)

        # Gradients for weights and bias
        dW = np.dot(self.X.T, dZ) / m
        db = np.sum(dZ, axis=0, keepdims=True) / m

        # Update weights and bias
        self.weights -= learning_rate * dW
        self.bias -= learning_rate * db

        loss = np.mean((Y - self.A) ** 2)  # MSE loss
        return loss
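
To see forward and backward working together, here is a minimal training loop on a toy dataset; the data, seed, and hyperparameters are illustrative assumptions, not from the original post.

# Illustrative training loop on synthetic data
np.random.seed(42)
X = np.random.randn(100, 2)
Y = sigmoid(X[:, [0]] - X[:, [1]])    # a target the one-layer model can represent exactly

model = OneLayerMLP(input_size=2, output_size=1)
for epoch in range(1000):
    model.forward(X)                  # caches X, Z, A for the backward pass
    loss = model.backward(Y, learning_rate=0.5)
    if epoch % 200 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")

The loss should decrease steadily, since the target was constructed to be representable by a single sigmoid layer.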

2. Two-Layer MLP

Forward Propagation

For a two-layer MLP, the architecture consists of:

  • Input layer: $X$
  • Hidden layer: $A_1$, with weights $W_1$ and bias $b_1$
  • Output layer: $A_2$, with weights $W_2$ and bias $b_2$

  1. First Layer (Input to Hidden Layer):

$$Z_1 = X W_1 + b_1, \qquad A_1 = \sigma(Z_1)$$

  2. Second Layer (Hidden to Output Layer):

$$Z_2 = A_1 W_2 + b_2, \qquad A_2 = \sigma(Z_2)$$

Where $A_2$ is the final output.

Code:

class TwoLayerMLP:
    def __init__(self, input_size, hidden_size, output_size):
        self.W1 = np.random.randn(input_size, hidden_size)  # Input to hidden weights
        self.b1 = np.zeros((1, hidden_size))  # Hidden layer bias
        self.W2 = np.random.randn(hidden_size, output_size)  # Hidden to output weights
        self.b2 = np.zeros((1, output_size))  # Output layer bias

    def forward(self, X):
        self.X = X
        self.Z1 = np.dot(X, self.W1) + self.b1  # Hidden layer linear output
        self.A1 = sigmoid(self.Z1)  # Hidden layer activation
        self.Z2 = np.dot(self.A1, self.W2) + self.b2  # Output layer linear output
        self.A2 = sigmoid(self.Z2)  # Output layer activation
        return self.A2

Backward Propagation

We compute the gradients of the loss with respect to weights and biases for both layers.

Gradient for Output Layer:

$$\frac{\partial L}{\partial A_2} = A_2 - Y, \qquad \frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial A_2} \cdot A_2\,(1 - A_2)$$

Gradients for Weight $W_2$ and Bias $b_2$:

$$\frac{\partial L}{\partial W_2} = \frac{1}{m}\, A_1^T \frac{\partial L}{\partial Z_2}, \qquad \frac{\partial L}{\partial b_2} = \frac{1}{m} \sum_{i=1}^{m} \left(\frac{\partial L}{\partial Z_2}\right)_i$$

Gradients for Hidden Layer:

To propagate the error back to the hidden layer:

$$\frac{\partial L}{\partial A_1} = \frac{\partial L}{\partial Z_2}\, W_2^T, \qquad \frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_1} \cdot A_1\,(1 - A_1)$$

Gradients for Weight $W_1$ and Bias $b_1$:

$$\frac{\partial L}{\partial W_1} = \frac{1}{m}\, X^T \frac{\partial L}{\partial Z_1}, \qquad \frac{\partial L}{\partial b_1} = \frac{1}{m} \sum_{i=1}^{m} \left(\frac{\partial L}{\partial Z_1}\right)_i$$

Code:

class TwoLayerMLP:
    def __init__(self, input_size, hidden_size, output_size):
        self.W1 = np.random.randn(input_size, hidden_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size)
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        self.X = X
        self.Z1 = np.dot(X, self.W1) + self.b1
        self.A1 = sigmoid(self.Z1)
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

    def backward(self, Y, learning_rate=0.1):
        m = Y.shape[0]

        # Output layer gradients
        dA2 = mean_squared_error_derivative(Y, self.A2)
        dZ2 = dA2 * sigmoid_derivative(self.A2)
        dW2 = np.dot(self.A1.T, dZ2) / m
        db2 = np.sum(dZ2, axis=0, keepdims=True) / m

        # Hidden layer gradients
        dA1 = np.dot(dZ2, self.W2.T)
        dZ1 = dA1 * sigmoid_derivative(self.A1)
        dW1 = np.dot(self.X.T, dZ1) / m
        db1 = np.sum(dZ1, axis=0, keepdims=True) / m

        # Update weights and biases
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

        loss = np.mean((Y - self.A2) ** 2)  # MSE loss
        return loss
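
As a closing usage sketch (the dataset, seed, and hyperparameters are my own illustrative choices, not from the original post), a two-layer MLP with a small hidden layer can fit XOR, which no one-layer model can:

# XOR: a classic task that requires the hidden layer
np.random.seed(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

model = TwoLayerMLP(input_size=2, hidden_size=8, output_size=1)
for epoch in range(20000):
    model.forward(X)
    loss = model.backward(Y, learning_rate=2.0)
    if epoch % 5000 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")

# Predictions typically end up close to [[0], [1], [1], [0]], though how quickly
# depends on the random initialization.
print(np.round(model.forward(X), 2))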