Day 2 Calculus in Deep Learning Along With Custom Auto Differential Project
Self Intro :¶
Hai
Bonjour Ciao
I’m Rohan Sai, alias Aiknight!
Welcome back to Day 2 of my 120 Days of Deep Learning. Today, we’ll cover Calculus: Derivatives, chain rule, gradients in Deep Learning, focusing on how derivatives and gradients are essential for model optimization.
I’ve developed a Custom Auto-Differentiation Library from scratch to help you understand these concepts better. Check it out on GitHub!
Fun Fact:¶
Gradient Descent is the optimization algorithm that powers most deep learning models. It’s like finding the lowest point in a valley by repeatedly taking a step in the direction of steepest descent.
Calculus for Deep Learning¶
Introduction¶
Calculus plays a pivotal role in deep learning, particularly in optimization algorithms used to train models. The core of calculus in deep learning revolves around:
- Derivatives: Measure how a function changes as its input changes.
- Chain Rule: Essential for computing derivatives of composite functions.
- Gradients: Generalize derivatives to multivariable functions and are critical in backpropagation.
1. Derivatives¶
A derivative measures the rate of change of a function with respect to its variable. For a neural network, derivatives help understand how a small change in inputs affects the outputs, which is critical for training models using backpropagation.
Types of Derivatives¶
First Derivative
Measures the slope or rate of change of a function. $ f'(x) = \frac{dy}{dx} $
Second Derivative
Measures the rate of change of the rate of change. Useful for analyzing the curvature of a function and identifying minima or maxima. $ f''(x) = \frac{d^2y}{dx^2} $
Partial Derivative
Measures the rate of change of a multivariable function with respect to one variable, keeping others constant. $ \frac{\partial f}{\partial x} $
Directional Derivative
Measures the rate of change of a function in the direction of a unit vector $ u $. $ D_u f(x) = \nabla f(x) \cdot u $
Formulas¶
Constant Rule:
$
\frac{d}{dx}[c] = 0
$
Power Rule:
$
\frac{d}{dx}[x^n] = n \cdot x^{n-1}
$
Sum Rule:
$
\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)
$
Product Rule:
$
\frac{d}{dx}[u \cdot v] = u'v + uv'
$
Quotient Rule:
$
\frac{d}{dx}\left[\frac{u}{v}\right] = \frac{u'v - uv'}{v^2}
$
Chain Rule:
$
\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)
$
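The rules above can be checked symbolically. The sketch below is only an illustration: the choices $ u = \sin(x) $ and $ v = x^2 + 1 $ are arbitrary assumptions, and each difference should simplify to zero if the rule holds.

```python
import sympy as sp

x = sp.symbols('x')
u = sp.sin(x)      # illustrative choice of u(x)
v = x**2 + 1       # illustrative choice of v(x)

# Product rule: (u*v)' = u'v + uv'
product_lhs = sp.diff(u * v, x)
product_rhs = sp.diff(u, x) * v + u * sp.diff(v, x)

# Quotient rule: (u/v)' = (u'v - uv') / v^2
quotient_lhs = sp.diff(u / v, x)
quotient_rhs = (sp.diff(u, x) * v - u * sp.diff(v, x)) / v**2

# Chain rule: d/dx sin(v(x)) = cos(v(x)) * v'(x)
chain_lhs = sp.diff(sp.sin(v), x)
chain_rhs = sp.cos(v) * sp.diff(v, x)

# Each difference simplifies to 0 when the corresponding rule holds
product_check = sp.simplify(product_lhs - product_rhs)
quotient_check = sp.simplify(quotient_lhs - quotient_rhs)
chain_check = sp.simplify(chain_lhs - chain_rhs)
print(product_check, quotient_check, chain_check)
```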
Example¶
Find the first and second derivatives of $ f(x) = 3x^2 + 5x + 7 $.
Solution¶
First Derivative: $ f'(x) = 6x + 5 $
Second Derivative: $ f''(x) = 6 $
Python Code¶
import sympy as sp
x = sp.symbols('x')
f = 3*x**2 + 5*x + 7
# First Derivative
first_derivative = sp.diff(f, x)
print(f"First Derivative: {first_derivative}")
# Second Derivative
second_derivative = sp.diff(first_derivative, x)
print(f"Second Derivative: {second_derivative}")
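As a sanity check, the symbolic result can be compared against a numerical estimate. This is a minimal sketch using a central difference; the helper name `central_difference`, the step size `h`, and the test point `x0` are assumptions chosen for illustration.

```python
def f(x):
    return 3 * x**2 + 5 * x + 7

def central_difference(func, x, h=1e-5):
    # Approximates func'(x) from two nearby function values
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 2.0
numeric = central_difference(f, x0)
analytic = 6 * x0 + 5   # f'(x) = 6x + 5 from the solution above

print(f"Numeric: {numeric:.6f}, Analytic: {analytic:.6f}")
```

The same idea (often called gradient checking) is a common way to test a hand-written backward pass.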
Applications¶
- Optimization: Used to minimize or maximize loss functions.
- Sensitivity Analysis: Shows how strongly the output responds to a small change in each input feature.
- Regularization: Penalty terms such as L1 and L2 contribute their own derivatives to the weight update.
Benefits¶
- Provides analytical insights into model behavior.
- Critical for backpropagation in deep learning.
Demerits¶
- Higher-order derivatives may be computationally expensive.
- Sensitive to small changes in input, leading to numerical instability.
2. Chain Rule¶
The chain rule is used to differentiate composite functions of the form $ f(g(x)) $. It is foundational in backpropagation, where functions are composed of multiple layers.
Types¶
Simple Chain Rule
For two nested functions $ f(g(x)) $: $ \frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} $
Multivariable Chain Rule
For $ f(x, y, z) = g(h(x), k(y), m(z)) $: $ \frac{\partial f}{\partial x} = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x} $
Example¶
Find the derivative of $ f(x) = (3x^2 + 2)^5 $.
Solution¶
- Outer function: $ f(u) = u^5 $, where $ u = 3x^2 + 2 $.
- Derivative of outer: $ \frac{df}{du} = 5u^4 $.
- Derivative of inner: $ \frac{du}{dx} = 6x $.
- Multiply: $ \frac{dy}{dx} = 5(3x^2 + 2)^4 \cdot 6x $
Python Code¶
u = 3*x**2 + 2
f = u**5
chain_rule_derivative = sp.diff(f, x)
print(f"Chain Rule Derivative: {chain_rule_derivative}")
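The same example can be written the way backpropagation works: a forward pass that stores the intermediate value, then a backward pass that multiplies local derivatives. This is a small sketch; the function name `forward_and_backward` is made up for illustration.

```python
def forward_and_backward(x):
    # Forward pass: compute the inner value u, then the output y
    u = 3 * x**2 + 2
    y = u**5
    # Backward pass: multiply the local derivatives (chain rule)
    dy_du = 5 * u**4      # derivative of the outer function
    du_dx = 6 * x         # derivative of the inner function
    return y, dy_du * du_dx

y, dy_dx = forward_and_backward(1.0)
print(y, dy_dx)  # 3125.0 18750.0
```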
Applications¶
- Neural Networks: Enables calculation of gradients across layers.
- Dynamic Systems: Analyzing changes in dependent variables.
Benefits¶
- Simplifies derivative calculations for complex functions.
- Handles multiple layers of abstraction in neural networks.
Demerits¶
- Computationally expensive for deep networks with many layers.
- Prone to vanishing gradients in deep models.
3. Gradients¶
A gradient is a vector that collects the partial derivatives of a function with respect to all its variables. The gradient points in the direction of steepest ascent and its norm gives the rate of increase in that direction, so gradient descent steps in the opposite direction to reduce the loss.
Types¶
Gradient Vector
Represents all partial derivatives of a function: $ \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) $
Jacobian Matrix
Generalizes the gradient to vector-valued functions.
Hessian Matrix
Represents second-order partial derivatives.
Formulas¶
Gradient Vector:
$
\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right)
$
Jacobian Matrix:
$
J_{ij} = \frac{\partial y_i}{\partial x_j}
$
Hessian Matrix:
$
H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}
$
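To make the Jacobian and Hessian formulas concrete, here is a short SymPy sketch. The functions `f` and `F` are arbitrary examples chosen for illustration, not taken from the post.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# Scalar function: its Hessian holds all second-order partial derivatives
f = x1**2 * x2 + x2**3
hess = sp.hessian(f, (x1, x2))

# Vector-valued function: row i of the Jacobian is the gradient of output y_i
F = sp.Matrix([x1 * x2, x1 + x2**2])
J = F.jacobian([x1, x2])

print("Hessian:", hess)
print("Jacobian:", J)
```

Note that the Hessian is symmetric here, as expected for a function with continuous second derivatives.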
Example¶
Find the gradient of $ f(x, y) = x^2 + y^2 $.
Solution¶
Partial derivative w.r.t $ x $: $ \frac{\partial f}{\partial x} = 2x $
Partial derivative w.r.t $ y $: $ \frac{\partial f}{\partial y} = 2y $
Gradient: $ \nabla f = (2x, 2y) $
Python Code¶
x, y = sp.symbols('x y')
f = x**2 + y**2
grad_x = sp.diff(f, x)
grad_y = sp.diff(f, y)
print(f"Gradient: (∂f/∂x = {grad_x}, ∂f/∂y = {grad_y})")
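Since the gradient $ (2x, 2y) $ points uphill, stepping against it should walk toward the minimum at the origin. A minimal sketch, assuming a starting point of $ (3, -4) $, a learning rate of 0.1, and 50 steps (all arbitrary choices):

```python
import numpy as np

def grad_f(point):
    # Gradient of f(x, y) = x^2 + y^2 is (2x, 2y)
    return 2 * point

point = np.array([3.0, -4.0])   # assumed starting point
learning_rate = 0.1             # assumed step size

for step in range(50):
    # Move against the gradient to decrease f
    point = point - learning_rate * grad_f(point)

print("Final point:", point)
print("Distance from minimum:", np.linalg.norm(point))
```

Each update here multiplies the point by 0.8, so the distance to the minimum shrinks geometrically.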
Topics for Basics in Deep Learning and Calculus¶
1. Weight Initialization¶
Weight initialization is critical to ensure that gradients flow effectively during training. Improper initialization can lead to:
- Vanishing Gradients: Gradients become too small, slowing training.
- Exploding Gradients: Gradients become excessively large, causing instability.
Techniques¶
Xavier Initialization:
- Designed for sigmoid or tanh activations.
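- Aims to keep the variance of activations roughly constant from layer to layer.
- Formula: $ W \sim \mathcal{N}(0, \frac{1}{n_{in}}) $
- Where $ n_{in} $ is the number of input neurons. This is a simplified variant; the original Glorot formulation uses variance $ \frac{2}{n_{in} + n_{out}} $, where $ n_{out} $ is the number of output neurons.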
He Initialization:
- Specifically for ReLU activation.
- Formula: $ W \sim \mathcal{N}(0, \frac{2}{n_{in}}) $
Code Implementation: Weight Initialization¶
import numpy as np
def xavier_initialization(input_size, output_size):
    """
    Xavier Initialization for weights.
    """
    return np.random.randn(output_size, input_size) * np.sqrt(1 / input_size)
def he_initialization(input_size, output_size):
    """
    He Initialization for weights.
    """
    return np.random.randn(output_size, input_size) * np.sqrt(2 / input_size)
# Example Usage
input_size = 64 # Number of input neurons
output_size = 32 # Number of output neurons
# Xavier Initialization
xavier_weights = xavier_initialization(input_size, output_size)
print("Xavier Initialized Weights:")
print(xavier_weights)
# He Initialization
he_weights = he_initialization(input_size, output_size)
print("\nHe Initialized Weights:")
print(he_weights)
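Printing the raw weights does not show much, so it can help to check the variances empirically. The sketch below is self-contained and uses assumed layer sizes (512 inputs, 256 outputs) and a fixed seed. It also pushes unit-variance inputs through a Xavier-initialized linear layer to show that the pre-activation variance stays close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility
n_in, n_out = 512, 256           # assumed layer sizes

w_xavier = rng.standard_normal((n_out, n_in)) * np.sqrt(1 / n_in)
w_he = rng.standard_normal((n_out, n_in)) * np.sqrt(2 / n_in)

print("Xavier variance:", w_xavier.var(), "target:", 1 / n_in)
print("He variance:    ", w_he.var(), "target:", 2 / n_in)

# Unit-variance inputs through a linear layer (no activation applied)
inputs = rng.standard_normal((n_in, 1000))
pre_activation = w_xavier @ inputs
print("Pre-activation variance:", pre_activation.var())
```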
2. Optimization Algorithms¶
Concept¶
Optimization is the process of updating model parameters to minimize the loss function. Extensions to vanilla gradient descent improve convergence speed and stability.
Types¶
Stochastic Gradient Descent (SGD):
- Updates weights for each training example.
- Formula: $ W = W - \eta \cdot \nabla L(W) $
- Where $ \eta $ is the learning rate.
Momentum-Based Methods:
- Accelerates SGD by considering previous gradients.
- Formula: $ v = \gamma v + \eta \nabla L(W) $ $ W = W - v $
Adaptive Methods (Adam):
Combines momentum with adaptive learning rates.
Formula: $ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(W) $
$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(W))^2 $
$ \hat{m_t} = \frac{m_t}{1 - \beta_1^t}, \; \hat{v_t} = \frac{v_t}{1 - \beta_2^t} $
$ W = W - \eta \frac{\hat{m_t}}{\sqrt{\hat{v_t}} + \epsilon} $
Code Implementation: SGD, Momentum, Adam¶
def sgd(weights, gradients, learning_rate=0.01):
    """
    Stochastic Gradient Descent
    """
    return weights - learning_rate * gradients
def momentum(weights, gradients, velocity, learning_rate=0.01, gamma=0.9):
    """
    Momentum-Based Gradient Descent
    """
    velocity = gamma * velocity + learning_rate * gradients
    return weights - velocity, velocity
def adam(weights, gradients, m, v, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Adam Optimization Algorithm
    """
    m = beta1 * m + (1 - beta1) * gradients
    v = beta2 * v + (1 - beta2) * (gradients ** 2)
    # Bias correction
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    weights = weights - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return weights, m, v
# Example Usage
weights = np.array([0.5, 0.3]) # Example weights
gradients = np.array([0.01, -0.02]) # Example gradients
velocity = np.zeros_like(weights)
m, v, t = np.zeros_like(weights), np.zeros_like(weights), 1
# SGD
new_weights_sgd = sgd(weights, gradients, learning_rate=0.1)
print("SGD Updated Weights:", new_weights_sgd)
# Momentum
new_weights_momentum, velocity = momentum(weights, gradients, velocity, learning_rate=0.1)
print("Momentum Updated Weights:", new_weights_momentum)
# Adam
new_weights_adam, m, v = adam(weights, gradients, m, v, t, learning_rate=0.1)
print("Adam Updated Weights:", new_weights_adam)
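A single update does not show how the optimizer state evolves, so here is a sketch of Adam running in a loop on a toy problem. The loss $ \|W - \text{target}\|^2 $, the `target` values, and the helper name `adam_step` are assumptions for illustration. The key detail is that the step counter `t` starts at 1 and increases every iteration, because the bias correction depends on it.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Same update equations as the formulas above
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

target = np.array([1.0, -2.0])   # assumed minimizer of the toy loss
w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)

for t in range(1, 501):          # t starts at 1 so the bias correction is defined
    grad = 2 * (w - target)      # gradient of ||w - target||^2
    w, m, v = adam_step(w, grad, m, v, t)

print("Weights after 500 steps:", w)
```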
3. Gradient Challenges¶
Vanishing Gradients¶
Vanishing gradients occur when the gradients become very small, close to zero, during backpropagation. This slows down training, especially in deep neural networks, because weight updates in the early layers become negligible. The phenomenon is typically seen with activation functions like sigmoid or tanh, whose derivatives shrink toward zero when the input is large in magnitude (strongly positive or strongly negative).
Cause:¶
- Activation functions like sigmoid and tanh saturate, meaning the gradients are very small for extreme inputs.
- Gradients are multiplied by small values as they propagate back, leading to an exponentially diminishing gradient.
Formula:¶
Consider the sigmoid activation function: $ f(x) = \frac{1}{1 + e^{-x}} $
The derivative of the sigmoid function is:
$ f'(x) = f(x)(1 - f(x)) $
The derivative peaks at $ f'(0) = 0.25 $, and for large $ |x| $ it becomes very small, leading to vanishing gradients.
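A few numbers make the saturation visible. This sketch evaluates the sigmoid derivative at some arbitrary inputs and then prints $ 0.25^n $, which is an upper bound on the product of $ n $ sigmoid derivatives along a backpropagation path.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# The derivative is largest at 0 and shrinks quickly as |x| grows
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}, f'(x) = {sigmoid_grad(x):.6f}")

# Upper bound on a product of n sigmoid derivatives
for depth in [5, 10, 20]:
    print(f"depth = {depth:2d}, bound = {0.25 ** depth:.3e}")
```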
Exploding Gradients¶
Exploding gradients occur when gradients grow exponentially, leading to extremely large weight updates. This can cause numerical instability and prevent the model from converging.
Cause:¶
- When using large weights, backpropagating through many layers with high gradients can cause the gradients to grow.
- This is more common in deep networks, especially with activation functions like ReLU that do not saturate for large inputs.
Formula:¶
Consider a neural network with $ n $ layers, where $ z_l $ is the output of layer $ l $. By the chain rule, the gradient of the loss with respect to the first layer's weights is: $ \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_n} \cdot \left( \prod_{l=2}^{n} \frac{\partial z_l}{\partial z_{l-1}} \right) \cdot \frac{\partial z_1}{\partial W_1} $ If the factors in the product are consistently larger than 1 in magnitude, the gradient grows exponentially with depth and can "explode."
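The effect of repeated multiplication can be simulated directly. The sketch below is a simplification: it assumes a purely linear network (no activations) with random weight matrices, and the layer width, depth, and `weight_scale` values are arbitrary. It tracks the norm of a gradient vector as it is pushed back through the layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, depth = 64, 30          # assumed width and depth

def backprop_norm(weight_scale):
    # Start from a gradient of ones and push it back through `depth` layers
    grad = np.ones(n_units)
    for _ in range(depth):
        W = rng.standard_normal((n_units, n_units)) * weight_scale / np.sqrt(n_units)
        grad = W.T @ grad
    return np.linalg.norm(grad)

# Scale < 1 shrinks the gradient, scale > 1 blows it up
norms = {scale: backprop_norm(scale) for scale in (0.5, 1.0, 2.0)}
for scale, norm in norms.items():
    print(f"weight scale = {scale}, gradient norm = {norm:.3e}")
```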
Solutions to Vanishing and Exploding Gradients¶
Gradient Clipping:
- A technique to prevent exploding gradients by setting a threshold for gradients during backpropagation.
- If the gradient norm exceeds the threshold, the gradient is rescaled so that its norm equals the threshold.
Formula: $ \text{gradient} = \text{gradient} \cdot \min\left(1, \frac{\text{threshold}}{\|\text{gradient}\|}\right) $
Proper Weight Initialization:
- Xavier and He Initialization (discussed earlier) help mitigate vanishing gradients by ensuring that the variance of weights and gradients is properly controlled.
Using Activation Functions Like ReLU:
- ReLU activation function does not saturate for positive values, which reduces the risk of vanishing gradients.
Code Implementation: Gradient Clipping¶
import numpy as np
def gradient_clipping(gradients, threshold):
    """
    Clip gradients to the specified threshold.
    """
    norm = np.linalg.norm(gradients)
    if norm > threshold:
        gradients = gradients * (threshold / norm)
    return gradients
# Example Usage
gradients = np.array([0.1, 0.2, 0.3, 5.0, 10.0]) # Example gradients
threshold = 2.0 # Gradient threshold
clipped_gradients = gradient_clipping(gradients, threshold)
print("Clipped Gradients:", clipped_gradients)
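There are two common flavors of clipping, and it is worth seeing them side by side. This sketch reuses the same example gradients: clipping by value caps each component independently (which can change the gradient's direction), while clipping by norm rescales the whole vector and keeps its direction.

```python
import numpy as np

gradients = np.array([0.1, 0.2, 0.3, 5.0, 10.0])
threshold = 2.0

# Clip by value: each component is limited to [-threshold, threshold]
clipped_by_value = np.clip(gradients, -threshold, threshold)

# Clip by norm: rescale the vector only if its norm exceeds the threshold
norm = np.linalg.norm(gradients)
clipped_by_norm = gradients * min(1.0, threshold / norm)

print("By value:", clipped_by_value)   # components: 0.1, 0.2, 0.3, 2.0, 2.0
print("By norm: ", clipped_by_norm)
print("Norm after clipping by norm:", np.linalg.norm(clipped_by_norm))
```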
4. Regularization Techniques¶
Regularization techniques are used to prevent overfitting, typically by penalizing large weights or adding noise during training. In the context of deep learning, regularization helps the model generalize to unseen data.
Types of Regularization:¶
L1 and L2 Regularization:
- L1 Regularization (Lasso) adds the absolute values of the weights to the loss function.
- L2 Regularization (Ridge) adds the squared values of the weights to the loss function.
Formulas:
- L1 Regularization: $ L_{\text{L1}} = \lambda \sum_{i} |W_i| $
- L2 Regularization: $ L_{\text{L2}} = \lambda \sum_{i} W_i^2 $
Dropout:
- During training, randomly drop a certain percentage of neurons to prevent overfitting.
- Dropout rate (percentage) is a hyperparameter.
Gradient Clipping (already explained above):
- Used to cap the gradients to a fixed value to control large updates.
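Dropout is easy to sketch in NumPy. The version below is "inverted dropout", a common variant in which the surviving activations are scaled by $ 1 / (1 - \text{rate}) $ during training so that nothing needs to change at inference time. The function name `dropout_forward` and the 0.5 rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_rate, training=True):
    # At inference time (or with rate 0) the activations pass through unchanged
    if not training or drop_rate == 0.0:
        return activations
    keep_prob = 1.0 - drop_rate
    # Each neuron is kept independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # Rescale so the expected activation stays the same
    return activations * mask / keep_prob

activations = np.ones((1000, 100))
dropped = dropout_forward(activations, drop_rate=0.5)

print("Fraction zeroed:", np.mean(dropped == 0))
print("Mean activation:", dropped.mean())
```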
Code Implementation: L2 Regularization¶
def l2_regularization(weights, lambda_reg):
    """
    L2 Regularization (Ridge).
    """
    return lambda_reg * np.sum(weights ** 2)
# Example Usage
weights = np.array([0.5, 0.1, -0.3]) # Example weights
lambda_reg = 0.01 # Regularization strength
l2_loss = l2_regularization(weights, lambda_reg)
print("L2 Regularization Loss:", l2_loss)
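For completeness, here is a sketch of the L1 penalty together with the gradients that both penalties add to the weight update. The function names are made up for illustration. The L1 gradient uses `np.sign`, which is a subgradient because $ |W| $ is not differentiable at 0.

```python
import numpy as np

def l1_penalty(weights, lam):
    return lam * np.sum(np.abs(weights))

def l1_grad(weights, lam):
    # Subgradient of lam * sum(|W_i|): constant-size push toward zero
    return lam * np.sign(weights)

def l2_grad(weights, lam):
    # Gradient of lam * sum(W_i^2): push proportional to the weight
    return 2 * lam * weights

weights = np.array([0.5, 0.1, -0.3])
lam = 0.01

print("L1 penalty:", l1_penalty(weights, lam))
print("L1 gradient:", l1_grad(weights, lam))
print("L2 gradient:", l2_grad(weights, lam))
```

The constant-size L1 push is why L1 tends to drive small weights exactly to zero, while L2 only shrinks them.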
5. Training Techniques¶
Learning Rate Scheduling¶
Learning rate scheduling involves changing the learning rate dynamically during training to improve convergence.
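Exponential Decay: $ \eta_t = \eta_0 \cdot \gamma^t $ Where $ \eta_0 $ is the initial learning rate, $ \gamma $ is the decay rate, and $ t $ is the epoch. An equivalent form is $ \eta_t = \eta_0 \cdot e^{-kt} $ with $ \gamma = e^{-k} $, which is the form used in the code below.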
Step Decay: The learning rate is reduced by a factor after a fixed number of epochs.
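Step decay can be sketched in one line. The parameter names `drop_factor` and `epochs_per_drop`, and the values used in the example, are assumptions for illustration.

```python
def step_decay(initial_lr, drop_factor, epochs_per_drop, epoch):
    # The rate is multiplied by drop_factor once every epochs_per_drop epochs
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)

# Halve the learning rate every 10 epochs
for epoch in [0, 9, 10, 19, 20, 25]:
    lr = step_decay(initial_lr=0.1, drop_factor=0.5, epochs_per_drop=10, epoch=epoch)
    print(f"Epoch {epoch}, Learning Rate: {lr}")
```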
Batch Normalization¶
Batch normalization normalizes the input to each layer by adjusting and scaling activations. This reduces internal covariate shift and stabilizes training.
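A minimal sketch of the batch normalization forward pass in training mode is shown below. It normalizes each feature over the batch and then applies a learnable scale `gamma` and shift `beta`. The running statistics used at inference time and the backward pass are left out, and the input data is random, generated only for illustration.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Statistics are computed per feature, across the batch dimension
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))   # batch of 128, 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print("Input mean/std: ", x.mean(axis=0), x.std(axis=0))
print("Output mean/std:", out.mean(axis=0), out.std(axis=0))
```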
Code Implementation: Learning Rate Scheduling (Exponential Decay)¶
def exponential_decay_learning_rate(initial_lr, decay_rate, epoch):
    """
    Learning rate scheduling using exponential decay.
    """
    return initial_lr * np.exp(-decay_rate * epoch)
# Example Usage
initial_lr = 0.1 # Initial learning rate
decay_rate = 0.01 # Decay rate
epochs = 10 # Number of epochs
for epoch in range(epochs):
    lr = exponential_decay_learning_rate(initial_lr, decay_rate, epoch)
    print(f"Epoch {epoch+1}, Learning Rate: {lr}")
Neural Network Example¶
We will now walk through the complete process of building a basic neural network from scratch, including forward propagation, backpropagation, optimization with gradient descent, and the application of techniques like weight initialization, learning rate scheduling, and regularization.
Steps to Implement the Neural Network¶
- Forward Propagation: Compute activations layer by layer.
- Loss Calculation: Compute loss (e.g., Mean Squared Error or Cross-Entropy).
- Backpropagation: Compute gradients for weights and biases.
- Weight Update: Update the weights using an optimization algorithm (e.g., Gradient Descent, Adam).
- Regularization: Add regularization terms (L1/L2) to avoid overfitting.
- Learning Rate Scheduling: Apply a learning rate schedule during training to help convergence.
Neural Network with Iris Dataset Implementation¶
Let's test it out on a real dataset.
We'll do the following:
- Use the Iris dataset from sklearn.datasets.
- Handle multi-class classification (Iris dataset has three classes).
- Adjust the output layer to have three neurons (one for each class).
- Apply one-hot encoding to the target labels.
Code Implementation¶
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

# Sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to be the sigmoid OUTPUT s, so this returns s * (1 - s)
    return x * (1 - x)

# Mean Squared Error Loss and its derivative
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_loss_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size

# Xavier Initialization (uniform variant)
def xavier_initialization(input_size, output_size):
    limit = np.sqrt(6 / (input_size + output_size))
    return np.random.uniform(-limit, limit, (input_size, output_size))

# One-hot Encoding
def one_hot_encode(y, n_classes):
    # n_classes is kept in the signature; the encoder infers the classes from y.
    # scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False instead.
    encoder = OneHotEncoder(sparse_output=False)
    return encoder.fit_transform(y.reshape(-1, 1))

# Neural Network Class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases
        self.weights_input_hidden = xavier_initialization(input_size, hidden_size)
        self.weights_hidden_output = xavier_initialization(hidden_size, output_size)
        self.bias_hidden = np.zeros((1, hidden_size))
        self.bias_output = np.zeros((1, output_size))
        self.learning_rate = 0.01

    def forward(self, X):
        # Forward pass
        self.input = X
        self.hidden_input = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = sigmoid(self.hidden_input)
        self.output_input = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.output = sigmoid(self.output_input)
        return self.output

    def backward(self, X, y, reg_lambda=0.01):
        # Backward pass and gradient computation
        output_error = mse_loss_derivative(y, self.output)
        # Gradient for the output layer
        output_delta = output_error * sigmoid_derivative(self.output)
        # Gradient for the hidden layer
        hidden_error = output_delta.dot(self.weights_hidden_output.T)
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden_output)
        # Compute gradients for weights and biases (with L2 regularization term)
        d_weights_input_hidden = self.input.T.dot(hidden_delta) + reg_lambda * self.weights_input_hidden
        d_weights_hidden_output = self.hidden_output.T.dot(output_delta) + reg_lambda * self.weights_hidden_output
        d_bias_hidden = np.sum(hidden_delta, axis=0, keepdims=True)
        d_bias_output = np.sum(output_delta, axis=0, keepdims=True)
        # Update weights and biases
        self.weights_input_hidden -= self.learning_rate * d_weights_input_hidden
        self.weights_hidden_output -= self.learning_rate * d_weights_hidden_output
        self.bias_hidden -= self.learning_rate * d_bias_hidden
        self.bias_output -= self.learning_rate * d_bias_output

    def train(self, X, y, epochs=1000, batch_size=32, reg_lambda=0.01, lr_schedule=False):
        # Note: batch_size is accepted but not used; each epoch is one full-batch update
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            # Backpropagation
            self.backward(X, y, reg_lambda)
            # Learning rate scheduling (exponential decay)
            if lr_schedule:
                self.learning_rate = self.learning_rate * 0.99
            # Print loss every 100 epochs
            if epoch % 100 == 0:
                loss = mse_loss(y, output)
                print(f"Epoch {epoch}, Loss: {loss}")

    def predict(self, X):
        # Predict function to get output for new data
        return self.forward(X)

# Load Iris dataset
iris = datasets.load_iris()
X = iris.data # Features (150 samples, 4 features)
y = iris.target # Labels (150 samples)
# One-hot encode the target labels
y_onehot = one_hot_encode(y, n_classes=3)
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)
# Create and train the Neural Network
nn = NeuralNetwork(input_size=4, hidden_size=10, output_size=3)
nn.train(X_train, y_train, epochs=1000, batch_size=32, reg_lambda=0.01, lr_schedule=True)
# Make predictions on the test set
predictions = nn.predict(X_test)
# Get the predicted class by selecting the index with the maximum value
predicted_classes = np.argmax(predictions, axis=1)
# Convert one-hot encoded labels back to class labels
true_classes = np.argmax(y_test, axis=1)
# Calculate accuracy
accuracy = accuracy_score(true_classes, predicted_classes)
print(f"Accuracy on test set: {accuracy * 100:.2f}%")
Explanation¶
Dataset:
- The Iris dataset is loaded using sklearn.datasets.load_iris(). This dataset contains 150 samples with 4 features each (sepal length, sepal width, petal length, petal width).
- The target labels are integers, so for neural network classification we one-hot encode them using OneHotEncoder.
One-Hot Encoding:
- The target labels (0, 1, 2 for each class) are converted into one-hot vectors, where each class is represented by a binary vector of length 3. For instance:
  - Class 0 becomes [1, 0, 0]
  - Class 1 becomes [0, 1, 0]
  - Class 2 becomes [0, 0, 1]
Neural Network:
- Input Layer: The network has 4 input neurons, corresponding to the 4 features in the dataset.
- Hidden Layer: We use 10 neurons in the hidden layer. The choice of 10 is arbitrary and can be tuned.
- Output Layer: The output layer has 3 neurons, corresponding to the 3 classes of the Iris dataset.
Training:
- The model is trained using full-batch gradient descent with an exponentially decaying learning rate and L2 regularization to reduce overfitting.
- MSE Loss: MSE is used here for simplicity, even though this is a multi-class classification task; cross-entropy with a softmax output layer is the more common choice.
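For reference, here is a sketch of the softmax and cross-entropy alternative. It is not wired into the network above; the `logits` and labels are made-up values for illustration. A convenient property is that the gradient of the mean cross-entropy with respect to the logits reduces to `(probs - y_onehot) / batch_size`.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, probs, eps=1e-12):
    # Mean negative log-probability of the true class
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

def softmax_ce_grad(y_onehot, probs):
    # Gradient of the mean cross-entropy with respect to the logits
    return (probs - y_onehot) / y_onehot.shape[0]

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])      # made-up scores
y_onehot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])    # made-up labels

probs = softmax(logits)
loss = cross_entropy(y_onehot, probs)
grad = softmax_ce_grad(y_onehot, probs)

print("Probabilities:", probs)
print("Loss:", loss)
print("Gradient w.r.t. logits:", grad)
```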
Prediction:
- After training, the network is tested on the test set, and predictions are made.
- The predicted classes are compared with the true labels (converted back from one-hot encoding).
- Accuracy is calculated using accuracy_score from scikit-learn.
Output Example¶
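Running the above code prints the loss every 100 epochs, followed by the test accuracy. The output from my run is shown below; exact values vary from run to run because the weights are initialized randomly without a fixed seed: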
Epoch 0, Loss: 0.4763318366939601
Epoch 100, Loss: 0.10983893449877996
Epoch 200, Loss: 0.055340366936174645
Epoch 300, Loss: 0.02982775728113701
Epoch 400, Loss: 0.01676481971119307
Epoch 500, Loss: 0.010170911481507208
Epoch 600, Loss: 0.006261236264779814
Epoch 700, Loss: 0.004026467472803571
Epoch 800, Loss: 0.002804340643518603
Epoch 900, Loss: 0.0019245433583275765
Accuracy on test set: 100.00%
That’s it for Day 2! 🚀
Explore the Custom Auto-Differentiation Project on GitHub, and don’t forget to subscribe to my blog
Also Make Sure to Follow me on LinkedIn and X
Happy Learning...