Vanishing Gradient Problem
Demonstrate the vanishing gradient problem by computing the gradient magnitude at each layer of a deep sigmoid network. In deep networks with sigmoid activations, gradients shrink exponentially as they are backpropagated through many layers because the maximum value of sigmoid's derivative is 0.25. This makes early layers learn extremely slowly.
Demonstration:
For a network with N sigmoid layers, the gradient at layer k (from the output)
is approximately:
gradient_magnitude[k] ≈ (max_sigmoid_derivative)^k = 0.25^k
This means:
Layer 1 (closest to output): gradient ≈ 0.25^1 = 0.25
Layer 2: gradient ≈ 0.25^2 = 0.0625
Layer 5: gradient ≈ 0.25^5 = 0.000977
Layer 10: gradient ≈ 0.25^10 ≈ 0.00000095
The gradient vanishes exponentially with depth!
Example:
Input: num_layers = 5
sigmoid_derivative_max = 0.25
Gradient magnitudes per layer (from output to input):
Layer 5 (output): 0.25^1 = 0.25
Layer 4: 0.25^2 = 0.0625
Layer 3: 0.25^3 = 0.015625
Layer 2: 0.25^4 = 0.00390625
Layer 1 (input): 0.25^5 = 0.0009765625
Output: [0.0009765625, 0.00390625, 0.015625, 0.0625, 0.25]
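The example above can be computed with a short function. This is a minimal sketch, not a reference solution; the function name `vanishing_gradients` and the rounding behavior are assumptions:

```python
def vanishing_gradients(num_layers):
    # Worst-case gradient magnitude per layer, assuming the maximum
    # sigmoid derivative (0.25) is applied at every layer.
    SIGMOID_DERIVATIVE_MAX = 0.25
    # Layer 1 (input) accumulates num_layers factors of 0.25;
    # the layer closest to the output accumulates just one.
    return [SIGMOID_DERIVATIVE_MAX ** k for k in range(num_layers, 0, -1)]

print(vanishing_gradients(5))
# [0.0009765625, 0.00390625, 0.015625, 0.0625, 0.25]
```

Powers of 0.25 are exact in binary floating point, so no rounding error appears in the output list.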
**Explanation:** The vanishing gradient problem occurs because during backpropagation, gradients are multiplied by the derivatives of activation functions at each layer. For sigmoid, the maximum derivative is 0.25 (at x=0), so each layer multiplication shrinks the gradient by at least 75%. After many layers, the gradient reaching the early layers is essentially zero, preventing those layers from learning. This is why modern deep networks use ReLU activations (derivative = 1 for positive inputs) and skip connections (ResNet) to maintain gradient flow.
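The 0.25 ceiling on the sigmoid derivative can be verified directly from the identity σ'(x) = σ(x)(1 − σ(x)), which peaks at x = 0 where σ(0) = 0.5. A minimal check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Maximum at x = 0: sigmoid(0) = 0.5, so 0.5 * 0.5 = 0.25.
print(sigmoid_derivative(0.0))  # 0.25
# Away from zero the derivative only shrinks, so 0.25 is the worst case.
print(sigmoid_derivative(3.0) < 0.25)  # True
```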
Constraints:
Test Cases:
Input: num_layers = 5
Output: [0.000977, 0.003906, 0.015625, 0.0625, 0.25]
Input: num_layers = 3
Output: [0.015625, 0.0625, 0.25]
Input: num_layers = 1
Output: [0.25]