# Day 3: Probability, Statistics, and Optimization Theory + Project: Optimizer Visualizer
## Self-Introduction
Hai, Bonjour, Ciao!
I’m Rohan Sai, also known as Aiknight!
Welcome to Day 3 of my 120 Days of Deep Learning journey! Today, we’re diving deep into the fascinating world of Probability, Statistics, and Optimization Theory in Deep Learning.
To make learning these concepts engaging, I’ve built an Optimizer Visualizer to help you understand these fundamental concepts better!
Explore my project: Optimizer Visualizer
## Did You Know?
The chain rule is a mathematical ninja! It enables deep learning models to efficiently calculate gradients during backpropagation, making them scalable to thousands of layers.
In essence, the chain rule allows us to compute the gradient of the loss function with respect to each parameter by breaking it into smaller, manageable parts!
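To make this concrete, here is a minimal sketch (with made-up toy values, not from any real model) that computes the gradient of a one-weight model by multiplying local derivatives, then checks the result with finite differences:

```python
# Chain rule on a toy model: L(w) = (sigmoid(w * x) - y)^2
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y, w = 2.0, 1.0, 0.5
a = sigmoid(w * x)  # forward pass

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y)
da_dz = a * (1 - a)
dz_dw = x
grad = dL_da * da_dz * dz_dw

# Numerical check via central finite differences
eps = 1e-6
grad_num = ((sigmoid((w + eps) * x) - y) ** 2
            - (sigmoid((w - eps) * x) - y) ** 2) / (2 * eps)
print("Analytic gradient:", grad, "| Numerical gradient:", grad_num)
```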
## Probability & Statistics in Deep Learning: Distributions and Sampling
### 1. Introduction to Probability and Statistics in Deep Learning
Probability and statistics are foundational to deep learning, helping model uncertainty, understand data distributions, and evaluate predictions. In deep learning, these concepts are crucial for:
- Initializing weights (random sampling).
- Understanding loss landscapes (optimization).
- Working with probabilistic models (e.g., Variational Autoencoders, Bayesian Neural Networks).
### 2. Probability Distributions
A probability distribution describes how values are distributed. There are two main types:
- Discrete Distributions: Probability is assigned to distinct outcomes.
- Continuous Distributions: Probability is assigned over a range of values.
#### 2.1 Discrete Probability Distributions
##### 2.1.1 Bernoulli Distribution
Definition: Models binary outcomes (e.g., success/failure).
Formula: $ P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}, 0 \leq p \leq 1 $
Example: Flipping a biased coin.
Python Code:
```python
import numpy as np
from scipy.stats import bernoulli

# Define probability of success
p = 0.7

# Generate 10 samples
samples = bernoulli.rvs(p, size=10)
print("Bernoulli Samples:", samples)
```
##### 2.1.2 Binomial Distribution
Definition: Models the number of successes in $n$ independent Bernoulli trials.
Formula: $ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} $
Example: Tossing a coin 10 times.
Python Code:
```python
from scipy.stats import binom

n, p = 10, 0.5  # 10 trials, 50% success probability
samples = binom.rvs(n, p, size=10)
print("Binomial Samples:", samples)
```
#### 2.2 Continuous Probability Distributions
##### 2.2.1 Uniform Distribution
- Definition: All outcomes in the range $[a, b]$ are equally likely.
- Formula: $ f(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases} $
- Example: Random initialization of neural network weights.
- Python Code:
```python
from scipy.stats import uniform

# Generate samples
a, b = 0, 1  # Uniform range
samples = uniform.rvs(loc=a, scale=b - a, size=10)
print("Uniform Samples:", samples)
```
##### 2.2.2 Normal (Gaussian) Distribution
- Definition: Bell-shaped curve, defined by mean $\mu$ and standard deviation $\sigma$.
- Formula: $ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $
- Example: Weight initialization with Xavier or He initialization.
- Python Code:
```python
from scipy.stats import norm

# Generate samples
mu, sigma = 0, 1  # Mean and standard deviation
samples = norm.rvs(mu, sigma, size=10)
print("Normal Samples:", samples)
```
### 3. Sampling Methods
Sampling is the process of selecting data points to approximate the population distribution.
#### 3.1 Random Sampling
- Definition: Each data point is selected randomly, independent of others.
- Example:
```python
import numpy as np

np.random.seed(42)
data = np.arange(10)
sample = np.random.choice(data, size=5, replace=False)
print("Random Sample:", sample)
```
#### 3.2 Stratified Sampling
- Definition: Ensures that samples represent different subgroups proportionally.
- Example:
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Example dataset
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 25 + [1] * 25)  # Two classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
print("Train Class Distribution:", np.bincount(y_train))
print("Test Class Distribution:", np.bincount(y_test))
```
#### 3.3 Importance Sampling
- Definition: Samples are drawn more frequently from important regions.
- Formula: Importance weight: $ w(x) = \frac{p(x)}{q(x)} $ where $p(x)$ is the target distribution, and $q(x)$ is the sampling distribution.
- Example: Monte Carlo integration.
- Python Code:
```python
import numpy as np
from scipy.stats import norm

def importance_sampling(f, q, p, size):
    # Draw from the proposal q, then reweight by p(x) / q(x)
    samples = q.rvs(size)
    weights = p.pdf(samples) / q.pdf(samples)
    return np.mean(f(samples) * weights)

# Define distributions
q = norm(0, 2)  # Proposal distribution
p = norm(0, 1)  # Target distribution
f = lambda x: x**2  # Function to integrate

result = importance_sampling(f, q, p, size=10000)
print("Estimated Integral:", result)
```
### 4. Applications in Deep Learning
- Weight Initialization: Sampling from uniform/normal distributions (e.g., Xavier/He initialization; see the sketch after this list).
- Dropout Regularization: Sampling neurons to drop during training.
- Bayesian Neural Networks: Placing probabilistic distributions over weights.
- Monte Carlo Sampling: Estimating integrals or gradients.
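As a quick illustration of the first item, here is a minimal sketch of Xavier and He initialization; the layer sizes are made up for the example:

```python
# Xavier and He initialization for a fully connected layer,
# assuming illustrative fan_in / fan_out sizes
import numpy as np

fan_in, fan_out = 256, 128

# Xavier/Glorot: variance scaled by fan_in + fan_out (suits tanh/sigmoid)
xavier_std = np.sqrt(2.0 / (fan_in + fan_out))
W_xavier = np.random.normal(0.0, xavier_std, size=(fan_in, fan_out))

# He: variance scaled by fan_in only (suits ReLU)
he_std = np.sqrt(2.0 / fan_in)
W_he = np.random.normal(0.0, he_std, size=(fan_in, fan_out))

print("Xavier std:", W_xavier.std(), "| He std:", W_he.std())
```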
### 5. Benefits and Limitations
**Benefits:**
- Enables efficient approximation of large datasets.
- Supports uncertainty modeling in predictions.
- Facilitates robust model evaluation and optimization.
**Limitations:**
- Sampling bias can lead to poor generalization.
- Computational overhead in importance sampling.
### 6. Advanced Probability and Statistics in Deep Learning
In deep learning, we often encounter probabilistic models and statistical methods that require deeper understanding. This section explores:
- Complex Distributions and Their Uses in Deep Learning
- Advanced Sampling Techniques
- Bayesian Inference and Its Applications
- Markov Chain Monte Carlo (MCMC)
- Variational Inference (VI)
- Information Theory Concepts in Deep Learning
- Practical Code Implementations
#### 6.1 Complex Probability Distributions
##### 6.1.1 Multivariate Gaussian Distribution
- Definition: A generalization of the normal distribution for multiple variables.
- Formula:
$
f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)}
$
- $\mu$: Mean vector
- $\Sigma$: Covariance matrix
- Usage in Deep Learning:
- Latent space sampling in Variational Autoencoders (VAEs).
- Modeling correlated features.
- Python Code:
```python
import numpy as np
from scipy.stats import multivariate_normal

# Parameters
mean = [0, 0]
covariance = [[1, 0.5], [0.5, 1]]  # Correlation between variables

# Multivariate normal distribution
dist = multivariate_normal(mean, covariance)
samples = dist.rvs(size=10)
print("Multivariate Gaussian Samples:", samples)
```
##### 6.1.2 Dirichlet Distribution
- Definition: Generalization of the Beta distribution for multiple variables, often used in probabilistic models like Latent Dirichlet Allocation (LDA).
- Formula: $ f(x_1, \ldots, x_k) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i-1} $ where $\alpha_0 = \sum_{i=1}^k \alpha_i$.
- Usage in Deep Learning:
- Topic modeling.
- Generating mixture distributions.
- Python Code:
```python
from numpy.random import dirichlet

# Parameters
alpha = [1, 2, 3]  # Concentration parameters
samples = dirichlet(alpha, size=5)
print("Dirichlet Samples:", samples)
```
##### 6.1.3 Exponential Family Distributions
These distributions share a common functional form and include the Gaussian, Bernoulli, Binomial, and Poisson distributions. They are crucial in probabilistic models like:
- Logistic Regression
- Naive Bayes
#### 6.2 Advanced Sampling Techniques
##### 6.2.1 Markov Chain Monte Carlo (MCMC)
- Definition: MCMC generates samples from a complex distribution by constructing a Markov Chain.
- Algorithm (Metropolis-Hastings):
  1. Start with an initial state $x_0$.
  2. Propose a new state $x'$ from a proposal distribution $q(x'|x_t)$.
  3. Accept $x'$ with probability: $ A = \min\left(1, \frac{p(x')q(x_t|x')}{p(x_t)q(x'|x_t)}\right) $
- Usage:
- Sampling from posterior distributions in Bayesian Neural Networks.
- Python Code:
```python
import numpy as np
from scipy.stats import norm

def metropolis_hastings(target_dist, step_dist, initial, n_samples):
    samples = []
    current = initial
    for _ in range(n_samples):
        # Random-walk proposal: symmetric, so the q terms cancel in the
        # acceptance ratio and the simple Metropolis rule below is valid
        proposal = current + step_dist.rvs()
        acceptance_ratio = min(1, target_dist.pdf(proposal) / target_dist.pdf(current))
        if np.random.rand() < acceptance_ratio:
            current = proposal
        samples.append(current)
    return np.array(samples)

# Define target and random-walk step distributions
target = norm(0, 1)  # Standard normal
step = norm(0, 0.5)  # Narrow random-walk step

samples = metropolis_hastings(target, step, initial=0, n_samples=1000)
print("MCMC Samples:", samples[:10])
```
##### 6.2.2 Importance Sampling (Advanced)
- Improves accuracy by assigning weights to samples based on their importance.
- Usage in Deep Learning:
- Estimating expectations in high-dimensional spaces.
#### 6.3 Bayesian Inference in Deep Learning
##### Bayesian Neural Networks
- Concept: Introduces probability distributions over weights instead of point estimates.
- Benefits:
- Uncertainty quantification.
- Regularization via prior distributions.
- Implementation: Variational Inference is commonly used for training Bayesian models.
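Training a full Bayesian network is beyond a short snippet, but Monte Carlo dropout, a cheap and popular *approximation* to Bayesian inference, gives a feel for uncertainty quantification. A minimal PyTorch sketch with made-up dimensions:

```python
# MC dropout: keep dropout active at inference and sample many forward
# passes; the spread of the predictions approximates model uncertainty.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(32, 1))

x = torch.randn(1, 4)
model.train()  # .train() keeps dropout stochastic even for "test" inputs

preds = torch.stack([model(x) for _ in range(100)])  # 100 stochastic passes
print("Predictive mean:", preds.mean().item())
print("Predictive std :", preds.std().item())
```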
#### 6.4 Variational Inference (VI)
- Goal: Approximate a target posterior distribution $p(z|x)$ with a simpler distribution $q(z)$.
- Objective: Minimize the Kullback-Leibler (KL) divergence: $ \text{KL}(q(z) \| p(z|x)) = \mathbb{E}_{q(z)}[\log q(z) - \log p(z|x)] $
- Usage:
- Variational Autoencoders.
- Generative models.
- Python Code Example:
```python
# Variational Inference Example
import torch
import torch.nn as nn
import torch.optim as optim

# Define simple VAE components
class Encoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.fc_mu = nn.Linear(input_dim, latent_dim)       # mean of q(z|x)
        self.fc_log_var = nn.Linear(input_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        return self.fc_mu(x), self.fc_log_var(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim, output_dim):
        super().__init__()
        self.fc = nn.Linear(latent_dim, output_dim)

    def forward(self, z):
        return self.fc(z)

# Loss function: reconstruction + KL divergence between q(z|x) and N(0, I)
def vae_loss(recon_x, x, mu, log_var):
    recon_loss = nn.MSELoss()(recon_x, x)
    kl_div = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_div

# Initialize model and optimizer
encoder = Encoder(10, 2)
decoder = Decoder(2, 10)
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

# Dummy data and one training step
x = torch.randn(5, 10)
mu, log_var = encoder(x)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
recon_x = decoder(z)

loss = vae_loss(recon_x, x, mu, log_var)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("VAE loss:", loss.item())
```
#### 6.5 Information Theory Concepts
##### Entropy and Cross-Entropy
- Entropy: Measure of uncertainty. $ H(X) = -\sum_x p(x) \log p(x) $
- Cross-Entropy: Measures the difference between two distributions. $ H(p, q) = -\sum_x p(x) \log q(x) $
- Application:
- Loss functions in classification models.
- Information bottleneck in representation learning.
##### KL Divergence
- Measures the difference between two distributions: $ \text{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} $
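A quick numeric check of these three quantities and the identity $H(p, q) = H(p) + \text{KL}(p \| q)$, using two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model's predicted distribution

entropy = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # KL(p || q)

print("H(p)      :", entropy)
print("H(p, q)   :", cross_entropy)
print("H(p) + KL :", entropy + kl)  # matches H(p, q)
```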
## Optimization Theory in Deep Learning: Gradient Descent Variants
### 1. Introduction to Gradient Descent
Gradient Descent (GD) is the foundation of optimization in deep learning. Its purpose is to minimize the loss function by iteratively adjusting the model's parameters in the direction of the steepest descent of the loss function.
#### 1.1 Gradient Descent Concept
- Mathematical Objective:
$
\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta)
$
Where:
- $\theta$: Parameters (weights) of the model.
- $\eta$: Learning rate.
- $J(\theta)$: Loss function.
- $\nabla_\theta J(\theta)$: Gradient of the loss with respect to $\theta$.
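A minimal numeric sketch of this update rule, minimizing the toy loss $J(\theta) = (\theta - 3)^2$ (the values are chosen purely for illustration):

```python
# Gradient descent on J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta, eta = 0.0, 0.1
for t in range(50):
    grad = 2 * (theta - 3)
    theta = theta - eta * grad  # theta_{t+1} = theta_t - eta * gradient
print("theta after 50 steps:", theta)  # converges toward the minimum at 3
```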
#### 1.2 Key Components
- Loss Function ($J(\theta)$): Measures the model's error. Examples include Mean Squared Error (MSE) and Cross-Entropy Loss.
- Learning Rate ($\eta$): Controls the step size for parameter updates.
#### 1.3 Challenges with Basic Gradient Descent
- Slow convergence.
- Poor handling of noisy gradients.
- Getting stuck in saddle points or local minima.
- Inefficiency on large datasets.
### 2. Types of Gradient Descent
Batch Gradient Descent (BGD): Processes the entire dataset at once for parameter updates.
- Pros: Accurate gradient calculation.
- Cons: Computationally expensive for large datasets.
```python
# Note: compute_gradient, model, and the hyperparameters below are
# placeholders used throughout this post's pseudocode snippets.
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, model)  # gradient over the full dataset
    model.weights -= learning_rate * grad
```
Stochastic Gradient Descent (SGD): Processes one sample at a time.
- Pros: Faster updates, computationally efficient.
- Cons: Noisy updates, risk of instability.
```python
for epoch in range(num_epochs):
    for i in range(len(X_train)):
        grad = compute_gradient(X_train[i], y_train[i], model)
        model.weights -= learning_rate * grad
```
Mini-Batch Gradient Descent: Processes a subset (batch) of the dataset.
- Pros: Balances efficiency and stability.
- Cons: Requires tuning batch size.
```python
for epoch in range(num_epochs):
    for batch in generate_batches(X_train, y_train, batch_size):
        grad = compute_gradient(batch, model)
        model.weights -= learning_rate * grad
```
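`generate_batches` is left undefined above; a minimal version, assuming NumPy arrays, could look like this:

```python
import numpy as np

def generate_batches(X, y, batch_size, shuffle=True):
    """Yield (X_batch, y_batch) pairs that cover the dataset once."""
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)  # reshuffle each epoch for better mixing
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]
```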
### 3. Advanced Gradient Descent Variants
Gradient descent has evolved to address the limitations of basic approaches. Below are popular variants with in-depth explanations and implementations.
#### 3.1 Momentum-Based Gradient Descent
- Concept: Adds momentum to smooth updates and accelerate convergence.
- Update Rule:
$
v_t = \beta v_{t-1} + (1 - \beta) \nabla_\theta J(\theta)
$
$
\theta_{t+1} = \theta_t - \eta v_t
$
Where:
- $v_t$: Velocity vector (accumulated gradient).
- $\beta$: Momentum coefficient (e.g., 0.9).
- Usage: Handles noisy gradients and helps escape saddle points.
- Code Implementation:
```python
velocity = np.zeros_like(model.weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, model)
    velocity = beta * velocity + (1 - beta) * grad  # accumulate smoothed gradient
    model.weights -= learning_rate * velocity
```
#### 3.2 Nesterov Accelerated Gradient (NAG)
- Concept: Looks ahead before updating parameters.
- Update Rule: $ v_t = \beta v_{t-1} + \eta \nabla_\theta J(\theta - \beta v_{t-1}) $ $ \theta_{t+1} = \theta_t - v_t $
- Pros: Faster convergence than momentum.
- Code Implementation:
```python
velocity = np.zeros_like(weights)
for epoch in range(num_epochs):
    lookahead_weights = weights - beta * velocity  # gradient at the lookahead point
    grad = compute_gradient(X_train, y_train, lookahead_weights)
    velocity = beta * velocity + learning_rate * grad
    weights -= velocity
```
#### 3.3 Adaptive Gradient Methods (AdaGrad, RMSProp, Adam)
##### 3.3.1 AdaGrad
- Concept: Adapts learning rate for each parameter based on past gradients.
- Update Rule:
$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta J(\theta)
$
Where:
- $G_t = \sum_{i=1}^t (\nabla_\theta J(\theta))^2$.
- Pros: Handles sparse features well.
- Cons: Learning rate diminishes too quickly.
- Code Implementation:
```python
G = np.zeros_like(model.weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, model)
    G += grad**2  # accumulate squared gradients
    adjusted_grad = grad / (np.sqrt(G) + epsilon)
    model.weights -= learning_rate * adjusted_grad
```
##### 3.3.2 RMSProp
- Concept: Introduced to address AdaGrad's diminishing learning rate.
- Update Rule:
$
G_t = \beta G_{t-1} + (1 - \beta) (\nabla_\theta J(\theta))^2
$
$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta J(\theta)
$
- $G_t$: Exponential moving average of squared gradients.
- Code Implementation:
```python
G = np.zeros_like(model.weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, model)
    G = beta * G + (1 - beta) * grad**2  # exponential moving average
    adjusted_grad = grad / (np.sqrt(G) + epsilon)
    model.weights -= learning_rate * adjusted_grad
```
#### 3.4 Adam (Adaptive Moment Estimation)
- Concept: Combines momentum and RMSProp for robust performance.
- Update Rules: $ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta) $ $ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J(\theta))^2 $ $ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $ $ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $
- Pros: Works well in most scenarios.
- Code Implementation:
```python
m, v = np.zeros_like(weights), np.zeros_like(weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, weights)
    m = beta1 * m + (1 - beta1) * grad     # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2  # second moment (RMSProp-style)
    m_hat = m / (1 - beta1**(epoch + 1))   # bias correction
    v_hat = v / (1 - beta2**(epoch + 1))
    weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
```
### 4. Advanced Gradient Methods
- AdamW: Introduces decoupled weight decay regularization into Adam.
- AMSGrad: A variant of Adam that keeps a running maximum of the second-moment estimate, preventing the effective learning rate from growing (see the sketch after this list).
- L-BFGS: Quasi-Newton optimization for complex deep learning tasks.
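AMSGrad's only change to Adam is that running maximum, which keeps the denominator from shrinking between steps; a minimal sketch, reusing the same pseudocode placeholders as the snippets above:

```python
# AMSGrad: Adam with a running maximum of the second-moment estimate
m, v = np.zeros_like(weights), np.zeros_like(weights)
v_max = np.zeros_like(weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, weights)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_max = np.maximum(v_max, v)  # never let the denominator decrease
    weights -= learning_rate * m / (np.sqrt(v_max) + epsilon)
```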
## Advanced Optimization Theory in Deep Learning: Gradient Descent Variants
This section delves deeper into advanced optimization concepts in deep learning, focusing on gradient descent variants. We'll explore their theoretical foundations, mathematical formulations, practical benefits, drawbacks, and advanced implementations.
### 1. Beyond Basic Gradient Descent
While basic gradient descent methods (Batch GD, Mini-Batch GD, and SGD) form the backbone of optimization, they have limitations like slow convergence, sensitivity to hyperparameters (e.g., learning rate), and susceptibility to local minima or saddle points. To address these issues, advanced optimization techniques have been developed. These include momentum-based methods, adaptive learning rates, second-order methods, and combinations of these approaches.
### 2. Advanced Momentum-Based Methods
#### 2.1 Momentum
- Concept Recap: Momentum accumulates past gradients to maintain directionality and overcome oscillations in steep or noisy terrains.
- Mathematical Recap:
$
v_t = \beta v_{t-1} + (1 - \beta) \nabla_\theta J(\theta)
$
$
\theta_{t+1} = \theta_t - \eta v_t
$
- $\beta \in [0, 1]$: Momentum coefficient. Common values: $0.9$ or $0.99$.
- Larger $\beta$: Greater smoothing of oscillations but slower response to rapid gradient changes.
- Practical Notes:
Momentum works well for convex problems but may struggle in highly non-convex settings.
#### 2.2 Nesterov Accelerated Gradient (NAG)
- Advanced Insight: NAG not only smooths updates but also predicts the future gradient by "looking ahead" along the momentum path.
- Update Rule:
$
v_t = \beta v_{t-1} + \eta \nabla_\theta J(\theta - \beta v_{t-1})
$
- The term $\theta - \beta v_{t-1}$ is the lookahead position.
- Advantages:
- Faster convergence than standard momentum.
- More precise parameter updates, especially in curved loss landscapes.
- Limitations:
- Hyperparameter tuning for $\beta$ and learning rate $\eta$ is critical.
- Implementation:
```python
velocity = np.zeros_like(weights)
for epoch in range(num_epochs):
    lookahead = weights - beta * velocity
    grad = compute_gradient(X_train, y_train, lookahead)
    velocity = beta * velocity + learning_rate * grad
    weights -= velocity
```
### 3. Adaptive Learning Rate Methods
Adaptive learning rate algorithms adjust the learning rate dynamically for each parameter during training.
#### 3.1 AdaGrad
- Advanced Understanding: AdaGrad adapts the learning rate for each parameter based on the accumulated squared gradients.
- Update Rule:
$
G_t = \sum_{i=1}^t (\nabla_\theta J(\theta))^2
$
$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta J(\theta)
$
- $G_t$: Accumulated squared gradients.
- $\epsilon$: Small constant for numerical stability.
- Benefits:
- Effective for sparse data (e.g., NLP tasks).
- Drawbacks:
- Learning rate decreases too quickly for dense gradients.
- Code:
```python
G = np.zeros_like(weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, weights)
    G += grad**2
    weights -= learning_rate * grad / (np.sqrt(G) + epsilon)
```
#### 3.2 RMSProp
- Refinement Over AdaGrad: RMSProp resolves AdaGrad's diminishing learning rate by introducing an exponentially weighted moving average of past gradients.
- Update Rule:
$
G_t = \beta G_{t-1} + (1 - \beta)(\nabla_\theta J(\theta))^2
$
$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta J(\theta)
$
- Typical $\beta = 0.9$.
- Advantages:
- Suited for non-stationary objectives.
- Stable updates even with noisy gradients.
- Code:
```python
G = np.zeros_like(weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, weights)
    G = beta * G + (1 - beta) * grad**2
    weights -= learning_rate * grad / (np.sqrt(G) + epsilon)
```
#### 3.3 Adam (Adaptive Moment Estimation)
Adam combines the benefits of momentum and RMSProp, adapting both the learning rate and direction using moment estimates of gradients.
- Mathematical Formulation: $ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta) $ $ v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta J(\theta))^2 $ Bias correction: $ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $ Parameter update: $ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $
- Parameters:
- $\beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}$.
- Strengths:
- Robust to noisy gradients.
- Fast convergence.
- Code:
```python
m, v = np.zeros_like(weights), np.zeros_like(weights)
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, weights)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**(epoch + 1))
    v_hat = v / (1 - beta2**(epoch + 1))
    weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
```
### 4. Regularization Techniques in Gradient Descent
#### 4.1 AdamW
- Concept: Combines Adam with weight decay to improve generalization.
- Update Rule:
$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda \theta_t
$
- $\lambda$: Weight decay coefficient.
- Code:
```python
# Decoupled weight decay shrinks the weights directly, separately from the
# gradient step. A plain gradient step is shown for brevity; full AdamW uses
# the Adam update from Section 3.3 as the gradient step.
for epoch in range(num_epochs):
    grad = compute_gradient(X_train, y_train, weights)
    weights = (1 - weight_decay * learning_rate) * weights - learning_rate * grad
```
#### 4.2 Lookahead Optimizer
- Works on the principle of tracking "fast weights" (local changes) and "slow weights" (stable updates).
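A minimal sketch of this idea (Zhang et al., 2019), reusing the pseudocode placeholders from earlier snippets; `num_steps`, `k`, and `alpha` are illustrative values:

```python
# Lookahead: fast weights explore with an inner optimizer (plain SGD here);
# slow weights are pulled toward them every k steps.
slow_weights = weights.copy()
k, alpha = 5, 0.5  # sync period and slow-weights step size
for step in range(num_steps):
    grad = compute_gradient(X_train, y_train, weights)
    weights -= learning_rate * grad  # fast (inner) update
    if (step + 1) % k == 0:
        slow_weights += alpha * (weights - slow_weights)  # slow update
        weights = slow_weights.copy()  # reset fast weights to the slow ones
```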
### 5. Second-Order Optimization Methods
#### 5.1 Newton’s Method
- Uses the Hessian matrix ($H$) for more precise updates.
- Update Rule: $ \theta_{t+1} = \theta_t - H^{-1} \nabla_\theta J(\theta) $
- Limitation: Computationally expensive for large models, since forming, storing, and inverting the Hessian scales at least quadratically with the number of parameters.
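For a quadratic loss, a single Newton step lands exactly on the minimum; a small numeric sketch with a made-up positive-definite Hessian:

```python
# Newton's method on J(theta) = 0.5 * theta^T A theta - b^T theta,
# whose Hessian is simply A (values are illustrative)
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])  # positive-definite Hessian
b = np.array([1.0, 2.0])

theta = np.zeros(2)
grad = A @ theta - b
theta = theta - np.linalg.solve(A, grad)  # H^{-1} grad via a linear solve

print("theta after one Newton step:", theta)  # exact minimum for a quadratic
```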
#### 5.2 L-BFGS (Limited-memory BFGS)
- Approximates second-order optimization without explicitly calculating $H$.
- Historically common for smaller, full-batch problems such as classical embedding pre-training.
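In practice, SciPy ships a ready-made implementation; a minimal example, using the Rosenbrock function as a stand-in objective:

```python
# L-BFGS via SciPy on the Rosenbrock test function
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print("Converged:", result.success, "| Minimum at:", result.x)
```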
### 6. Best Practices and Hyperparameter Tuning
- Choosing the Optimizer:
- Adam: General-purpose and robust.
- SGD with Momentum: Often preferred for computer vision tasks.
- RMSProp: Suitable for recurrent models.
- Learning Rate Schedulers (see the sketch after this list):
- Step Decay.
- Cosine Annealing.
- Learning Rate Warm-up.
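A minimal PyTorch sketch of the schedulers named above; the model, epoch count, and hyperparameters are placeholders:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Pick one scheduler per training run, e.g.:
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # step decay
# scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)    # cosine annealing
# Warm-up can be expressed with LambdaLR, ramping the LR over the first 5 epochs:
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lambda e: min(1.0, (e + 1) / 5))

for epoch in range(50):
    # ... forward pass, loss.backward(), etc. would go here ...
    optimizer.step()
    scheduler.step()  # update the learning rate once per epoch
    if epoch % 10 == 0:
        print(epoch, optimizer.param_groups[0]["lr"])
```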
That’s all for Day 3!
Don’t forget to explore my Optimizer Visualizer to see optimization in action.
Also, stay connected:
- LinkedIn
- X (Twitter)
Let’s keep learning and growing together. Happy Learning, and see you on Day 4!