Neural Networks & Deep Learning

Chapter 8: Activation Functions - Adding Non-Linearity

The Tiny Non-Linear Functions That Give Neural Networks Their Power

⏱️ Reading Time: ~2 hours  |  📖 Unit 3: The Shallow Network  |  🧠 Theory + Code Chapter

📋 Prerequisites: Chapter 7 (Deep Neural Networks), Derivatives, Chain Rule

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall the formula, range, and derivative of each activation function: sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU, Swish, softmax
🔵 Understand | Explain why non-linearity is essential: prove that stacked linear layers collapse to a single linear transformation
🟢 Apply | Implement all 8 activation functions and their derivatives from scratch in NumPy; use PyTorch equivalents
🟡 Analyze | Analyze the vanishing gradient problem in sigmoid/tanh and the dying ReLU problem: trace how gradients flow
🟠 Evaluate | Choose the right activation function for a given architecture (CNN, Transformer, binary classifier, multi-class) using a decision tree
🔴 Create | Design and run an experiment comparing all activations on a real dataset; interpret gradient flow visualizations
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Prove mathematically that a neural network with only linear activations is equivalent to a single linear transformation, regardless of depth
  • State the formula, derivative, output range, and computational cost for each of the 8 activation functions covered
  • Explain the vanishing gradient problem in sigmoid and tanh, and why ReLU largely solved it
  • Diagnose the dying ReLU problem: identify when neurons die, detect it in training logs, and apply fixes
  • Compare GELU and Swish with ReLU, and explain why modern Transformer architectures (BERT, GPT) prefer GELU
  • Derive the softmax function from a log-linear model and compute its Jacobian matrix
  • Implement all activation functions and their derivatives from scratch in NumPy, and verify against PyTorch
  • Select the right activation function for any given task using a systematic decision tree
Section 2

Opening Hook

🧠 The Story of the Function That Changed Everything

In 2012, Alex Krizhevsky and his colleagues at the University of Toronto were building what would become AlexNet, the neural network that launched the deep learning revolution. Deep networks of that era were notoriously hard to train: gradients shrank layer after layer, and the saturating sigmoid and tanh activations that had been standard for decades were a large part of the problem.

Then they made a deceptively simple change. They replaced the saturating activation with a function a first-year student could write: f(x) = max(0, x). That's it. No exponentials, no divisions, no complex math. Just "if positive, keep it; if negative, zero it."

The result? In the AlexNet paper, a four-layer convolutional network with ReLUs reached 25% training error on CIFAR-10 about six times faster than the same network with tanh units. AlexNet won the 2012 ImageNet competition by a landslide, with a top-5 error of 15.3% against 26.2% for the runner-up. The ReLU activation, long overlooked because it seemed "too simple", became the most widely used activation function in deep learning.

But here's the twist: a few years later, when OpenAI built GPT and Google built BERT (both 2018), they didn't use ReLU. They used GELU, a smooth, probabilistic cousin of ReLU. Why? One common explanation is that a smooth activation is easier to optimize in very large Transformers than ReLU with its sharp corner at zero.

Without activation functions, a 100-layer neural network is just a fancy linear regression. This chapter is about the tiny non-linear functions that give neural networks their power, and knowing which one to pick can be the difference between a model that learns and one that's dead on arrival.

Section 3

The Intuition First

The Valve Analogy

Imagine you're building a water distribution network for a city. You have pipes (weights) connecting various junctions (neurons), and water (data) flows through. If every junction is just a straight-through connection, with no valves and no gates, then no matter how complex your pipe network is, the relationship between water in and water out is always linear. Add more pipes? Still linear. Make the network deeper? Still linear.

An activation function is like putting a valve at each junction. The valve can:

  • Block flow entirely (like ReLU zeroing out negatives)
  • Regulate flow (like sigmoid squashing it between 0 and 1)
  • Pass selectively (like ELU letting small negative signals through while capping large ones)

With valves, suddenly your network can create incredibly complex flow patterns (eddies, branches, feedback loops) that a straight pipe network never could.

The human brain uses non-linear activation too! A neuron doesn't fire proportionally to its input. It either fires or doesn't (roughly), and its firing rate as a function of input follows an S-shaped curve remarkably similar to the sigmoid function. Nature discovered activation functions 500 million years before us.

The "Aha" Question

🤔 If ReLU is just max(0, x), a function you could explain to a 10-year-old, why did it take until 2012 for the deep learning community to embrace it? And why did Google Brain spend years searching for something better?

By the end of this chapter, you'll not only understand the answer, but you'll be able to derive why certain activations work better for certain architectures, and make that choice yourself.

Section 4

Mathematical Foundation: Why Non-Linearity is Essential

The Collapse Theorem: Stacked Linear Layers = Single Linear Layer

Theorem: A neural network of any depth L, with linear (identity) activation functions at every layer, computes a function that is equivalent to a single linear transformation.

Step 1: Set up a 2-layer network with linear activations

Layer 1: z₁ = W₁x + b₁, and a₁ = z₁ (linear activation)
Layer 2: z₂ = W₂a₁ + b₂, and a₂ = z₂ (linear activation)

Step 2: Substitute a₁ into Layer 2

a₂ = W₂(W₁x + b₁) + b₂
a₂ = W₂W₁x + W₂b₁ + b₂

Step 3: Define collapsed parameters

Let W' = W₂W₁ (a single matrix) and b' = W₂b₁ + b₂ (a single bias vector)
Then: a₂ = W'x + b'

Step 4: Generalize to L layers by induction

For L layers: aₗ = Wₗ(Wₗ₋₁(...(W₁x + b₁)...) + bₗ₋₁) + bₗ
Inductive step: if the first L−1 layers collapse to aₗ₋₁ = W'x + b', then aₗ = Wₗ(W'x + b') + bₗ = (WₗW')x + (Wₗb' + bₗ), which again has the form W*x + b*.
This always collapses to: aₗ = W*x + b*, where W* = WₗWₗ₋₁...W₁ and b* collects every bias term multiplied by the weight matrices of the layers after it

Step 5: Conclusion

No matter how many layers you stack, without non-linear activation, your network is just doing y = Wx + b. All those extra parameters are wasted: they add computational cost without adding representational power.

The Collapse Result:
y = Wₗ(Wₗ₋₁(...(W₁x + b₁)...) + bₗ₋₁) + bₗ = W*x + b*
where W* = WₗWₗ₋₁...W₁ ← just one matrix multiplication!
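
The collapse for two layers can be checked numerically. The sketch below uses only the Python standard library; the layer sizes, the random weights, and the helper names (`matvec`, `matmul`, `vadd`, `rand_matrix`) are illustrative assumptions, not part of the chapter's own code.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vadd(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def rand_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix with entries in [-1, 1]."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
W1, b1 = rand_matrix(4, 3), rand_matrix(1, 4)[0]   # layer 1: 3 inputs -> 4 units
W2, b2 = rand_matrix(2, 4), rand_matrix(1, 2)[0]   # layer 2: 4 units -> 2 outputs
x = [0.5, -1.2, 2.0]

# Forward pass through both layers with identity (linear) activations.
a1 = vadd(matvec(W1, x), b1)
a2 = vadd(matvec(W2, a1), b2)

# Collapsed parameters from Step 3: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = vadd(matvec(W2, b1), b2)
collapsed = vadd(matvec(W_prime, x), b_prime)

max_diff = max(abs(p - q) for p, q in zip(a2, collapsed))
print("two-layer output:", a2)
print("collapsed output:", collapsed)
print("largest difference:", max_diff)
```

The two outputs agree up to floating-point rounding, which is the two-layer case of the theorem; inserting any non-linear function between the layers breaks the equality.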

What Non-Linearity Buys You

The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) states: a neural network with a single hidden layer and a suitable non-linear activation function (sigmoidal in Cybenko's proof) can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units.

The key phrase is "non-linear". Without it, you're stuck with linear functions, whose decision boundaries are lines in 2D, planes in 3D, and hyperplanes in higher dimensions. With it, you can learn spirals, circles, XOR, and other curved decision boundaries.

[Figure: two scatter plots side by side. Left, "Linear activations (can only learn lines)": ❌ can only separate the classes with a straight line. Right, "Non-linear activations (can learn any shape)": ✅ can learn curved decision boundaries.]

Q: Why is a 100-layer network with linear activations equivalent to a single layer?
A: Because the composition of linear functions is linear. W₁₀₀·W₉₉·...·W₁ = W* (one matrix). So all 100 layers collapse into ŷ = W*x + b*. Non-linear activations break this collapse, allowing each layer to compute new non-linear features.
Section 5

Activation 1: Sigmoid - The Classic S-Curve

Formula
σ(z) = 1 / (1 + e^(−z))
Derivative (Elegant Self-Referential Form)
Derivation of σ'(z):

σ(z) = (1 + e^(−z))^(−1)

Using the chain rule:

σ'(z) = −(1 + e^(−z))^(−2) · (−e^(−z))

σ'(z) = e^(−z) / (1 + e^(−z))²

Now the trick: split the fraction into a product of two factors:

σ'(z) = [1/(1+e^(−z))] · [e^(−z)/(1+e^(−z))]

The first factor is σ(z). The second is 1 − σ(z), because e^(−z)/(1+e^(−z)) = (1 + e^(−z) − 1)/(1+e^(−z)) = 1 − 1/(1+e^(−z)). So:

σ'(z) = σ(z) · [1 − σ(z)]

Key Result: σ'(z) = σ(z) · (1 − σ(z))
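
The key result can be checked against a finite-difference estimate. This is a minimal scalar sketch using only the standard library; `numeric_grad` is a hypothetical helper, and the branch on the sign of z is an added precaution against overflow, not something the derivation requires.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar z, written to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    """Closed form from the derivation: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, h=1e-6):
    """Central-difference estimate of f'(z), used only as a cross-check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  closed form={sigmoid_grad(z):.6f}  numeric={numeric_grad(sigmoid, z):.6f}")
```

The closed form and the numeric estimate should agree to several decimal places, with the largest value at z = 0.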
Properties
Property | Value
Output Range | (0, 1)
Zero-Centered? | ❌ No: outputs always positive
Max Gradient | 0.25 (at z=0)
Monotonic? | ✅ Yes
Saturates? | ✅ Yes: for |z| > 5, gradient ≈ 0
[ASCII graph: σ(z) for z from −6 to 6, an S-curve rising from 0.0 through 0.5 at z = 0 toward 1.0. σ'(z) peaks at 0.25 when z = 0 and vanishes at the tails.]
✅ Pros
  • Smooth, differentiable everywhere
  • Output bounded in (0, 1), which gives a natural interpretation as a probability
  • Historically important: the basis of logistic regression
❌ Cons
  • Vanishing Gradient: max derivative is only 0.25. With L layers, the activation derivatives alone shrink gradients by a factor of at most 0.25^L → 0
  • Not zero-centered: outputs always > 0, causing zig-zag gradient updates
  • Computationally expensive: requires an exponential, where ReLU needs only a comparison
When to Use

✅ Output layer for binary classification (P(y=1|x))
❌ Hidden layers of deep networks (vanishing gradients)

❌ MYTH: "Sigmoid is dead, never use it."

✅ TRUTH: Sigmoid is still the correct choice for binary classification output layers and gating mechanisms (LSTM forget/input gates).

🔍 WHY IT MATTERS: In LSTM networks, sigmoid gates control information flow. Replacing them with ReLU would break the [0,1] gating logic entirely.

Section 6

Activation 2: Tanh - Zero-Centered Sigmoid

Formula & Relationship to Sigmoid
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
tanh(z) = 2σ(2z) − 1

Tanh is a scaled and shifted version of sigmoid! Sigmoid's qualitative behavior (smooth, monotonic, saturating) carries over, rescaled to the range (−1, 1).

Derivative
Key Result: tanh'(z) = 1 − tanh²(z)
Quick derivation:

Let t = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

Using quotient rule or the identity tanh(z) = 1 − 2/(e^(2z)+1):

d/dz tanh(z) = sech²(z) = 1 − tanh²(z)

Maximum at z=0: tanh'(0) = 1 − 0² = 1.0 (4× larger than sigmoid's 0.25!)
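
Both claims in this section, the identity tanh(z) = 2σ(2z) − 1 and the fourfold larger peak gradient, can be confirmed in a few lines. This is a sketch with the standard library only; the test points in `zs` are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z):
    """tanh'(z) = 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

zs = [-3.0, -1.0, 0.0, 0.5, 2.0]   # arbitrary test points

# Largest gap between tanh(z) and the rescaled sigmoid 2*sigma(2z) - 1.
identity_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in zs)

# Ratio of the peak gradients: tanh'(0) over sigma'(0) = sigma(0) * (1 - sigma(0)).
peak_ratio = tanh_grad(0.0) / (sigmoid(0.0) * (1.0 - sigmoid(0.0)))

print("largest identity gap:", identity_gap)
print("tanh'(0) / sigma'(0):", peak_ratio)
```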

Properties
Property | Value
Output Range | (−1, 1)
Zero-Centered? | ✅ Yes: this is its main advantage over sigmoid
Max Gradient | 1.0 (at z=0), 4× sigmoid's maximum
Saturates? | ✅ Yes: the gradient still vanishes for |z| > 5
[ASCII graph: tanh(z) for z from −6 to 6, an S-curve rising from −1.0 through 0.0 at z = 0 toward +1.0. Note: steeper than sigmoid, symmetric around the origin.]
Why Zero-Centered Matters

When a layer's outputs are always positive, as with sigmoid (0 to 1), the gradients with respect to all incoming weights of any one neuron in the next layer share the same sign. This forces gradient descent to zig-zag toward the optimum. Tanh, centered at zero, allows gradients of mixed signs, leading to more direct paths to the optimum.

When to Use

✅ Hidden layers when you need bounded, zero-centered activations (e.g., RNN hidden states)
✅ When inputs are expected to have both positive and negative values
❌ Deep networks where vanishing gradients are a concern

Tanh vs. Sigmoid cheat: tanh is almost always better than sigmoid for hidden layers because it's zero-centered. Andrew Ng's rule of thumb: "The only place I'd use sigmoid is the output layer of binary classification."

Section 7

Activation 3: ReLU - The Game Changer

Formula
ReLU(z) = max(0, z) = { z if z > 0 ; 0 if z ≤ 0 }
Derivative
ReLU'(z) = { 1 if z > 0 ; 0 if z < 0 ; undefined at z = 0 }

In practice, we define ReLU'(0) = 0 (or sometimes 0.5). For continuous inputs, z lands exactly on 0 with probability zero, so the choice of convention rarely matters.

Properties
Property | Value
Output Range | [0, ∞)
Zero-Centered? | ❌ No: outputs always ≥ 0
Gradient in active region | Exactly 1.0, so the activation itself causes no vanishing gradient
Computational Cost | Extremely cheap: just a comparison
Sparse Activation | ✅ ~50% of neurons output zero on average (when pre-activations are roughly symmetric around zero)
[ASCII graph: ReLU(z) for z from −6 to 6. Flat at 0 for z < 0 (dead zone, gradient = 0), then a straight line of slope 1 for z > 0 (active zone, gradient = 1).]
Why ReLU Works: Three Reasons
  1. No Vanishing Gradient: In the active region (z > 0), the gradient is exactly 1, so the activation itself does not shrink gradients as they propagate through deep networks.
  2. Sparse Activation: About 50% of neurons output exactly zero (when pre-activations are roughly symmetric around zero), creating a sparse representation. This acts as a form of regularization and is biologically plausible (not all brain neurons fire simultaneously).
  3. Computationally Trivial: Just a comparison, with no exponentials and no divisions. The often-quoted "6×" figure comes from the AlexNet paper and refers to training convergence (reaching 25% training error on CIFAR-10 six times faster than tanh), not to the cost of a single evaluation.
❌ The Dying ReLU Problem

If a neuron's weights update such that Wx + b < 0 for all training inputs, that neuron will always output zero. Its local gradient is then zero as well, so its weights stop updating and the neuron is effectively dead.

When it happens most:

  • Large learning rate → weights overshoot → many neurons go negative
  • Poor weight initialization (too large)
  • Large negative bias terms
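
To make the mechanism concrete, here is a minimal sketch of a single dead neuron. The inputs, weight, and bias are hypothetical values chosen so that the pre-activation is negative for every input; the derivative at z = 0 follows the convention ReLU'(0) = 0 stated earlier.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention from the text: the derivative at z = 0 is taken to be 0.
    return 1.0 if z > 0 else 0.0

# Hypothetical single neuron: one weight, one bias, four training inputs.
inputs = [0.5, 1.0, 1.5, 2.0]
w, b = 0.8, -5.0   # large negative bias: w*x + b < 0 for every input

outputs = [relu(w * x + b) for x in inputs]

# Gradient of the neuron's output with respect to its parameters, summed
# over the inputs: d relu(w*x + b)/dw = relu'(w*x + b) * x, and
# d relu(w*x + b)/db = relu'(w*x + b).
grad_w = sum(relu_grad(w * x + b) * x for x in inputs)
grad_b = sum(relu_grad(w * x + b) for x in inputs)

print("outputs:", outputs)
print("grad w.r.t. w:", grad_w, " grad w.r.t. b:", grad_b)
```

Because both gradients are zero whatever the upstream loss gradient is, no gradient-descent step can move `w` or `b` back into the active region.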
When to Use

✅ Default choice for hidden layers in most networks (CNNs, MLPs)
✅ When computational efficiency matters
❌ When you're losing many neurons (switch to Leaky ReLU)

The rectifier appeared as early as 2000 in Hahnloser et al., in a neuroscience context, but it saw little use in the ML community until Nair & Hinton (2010) showed it worked well in Restricted Boltzmann Machines. It then became mainstream through AlexNet (2012). A decade of overlooking the simplest possible activation!

Paper: "Rectified Linear Units Improve Restricted Boltzmann Machines", Nair & Hinton, ICML 2010. The paper that started the ReLU revolution. Key insight: ReLU creates sparse representations similar to biological neurons, and its constant gradient prevents the vanishing gradient problem that had limited deep network training for years.

Section 8

Activation 4: Leaky ReLU - Fixing the Dead Neuron Problem

Formula
LeakyReLU(z) = max(αz, z) = { z if z > 0 ; αz if z ≤ 0 } where α is small (typically 0.01)
Derivative
LeakyReLU'(z) = { 1 if z > 0 ; α if z ≤ 0 }
Properties
Property | Value
Output Range | (−∞, ∞)
Gradient for z < 0 | α (small but non-zero, so neurons never die completely)
Variant: PReLU | α is a learnable parameter (He et al., 2015)
Variant: Randomized | α sampled randomly during training
[ASCII graph: Leaky ReLU(z) with α = 0.1 (exaggerated for visibility), z from −6 to 6. Slope 1 for z > 0 and a small slope α for z < 0.]
Why the Small Leak Matters

That tiny α = 0.01 slope means even neurons with negative inputs get some gradient. It's just 1% of the positive slope, but it's enough to keep gradients flowing and potentially revive a neuron during training.
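
A small sketch of what the leak buys: a neuron whose pre-activation is negative for every input still receives a non-zero gradient. The inputs, weight, and bias below are hypothetical, picked only so that plain ReLU would give a gradient of exactly zero.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for z > 0, alpha * z otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for z > 0, alpha otherwise."""
    return 1.0 if z > 0 else alpha

# Hypothetical neuron whose pre-activation w*x + b is negative for every input.
inputs = [0.5, 1.0, 1.5, 2.0]
w, b = 0.8, -5.0

pre_acts = [w * x + b for x in inputs]

# d leaky_relu(w*x + b)/dw = leaky_relu'(w*x + b) * x, summed over the inputs.
grad_w = sum(leaky_relu_grad(z) * x for z, x in zip(pre_acts, inputs))

print("pre-activations:", pre_acts)
print("gradient w.r.t. w:", grad_w)   # small but non-zero, so w can still move
```

The gradient is α times the sum of the inputs, which is tiny but enough for an optimizer to gradually pull the neuron back toward the active region.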

PReLU: Making α Learnable

Parametric ReLU (He et al., 2015) lets the network learn the value of α via backpropagation, either one α per channel or one shared across a layer. This adds very few extra parameters but can improve performance. The PReLU paper reported 4.94% top-5 error on ImageNet classification, the first published result to beat the reported human-level figure of 5.1%.

When to Use

✅ When you observe dying ReLU in your training (many neurons stuck at zero)
✅ As a safer default when you can't diagnose dead neurons easily
✅ PReLU when you want max flexibility with minimal parameter overhead

Section 9

Activation 5: ELU - Exponential Linear Unit

Formula
ELU(z) = { z if z > 0 ; α(e^z − 1) if z ≤ 0 }, typically α = 1.0
Derivative
ELU'(z) = { 1 if z > 0 ; α·e^z = ELU(z) + α if z ≤ 0 }
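
The shortcut ELU'(z) = ELU(z) + α for z ≤ 0 means the backward pass can reuse the forward output instead of recomputing the exponential. A minimal scalar sketch, using only the standard library:

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z > 0, alpha * (e^z - 1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """ELU derivative, reusing the forward value: ELU(z) + alpha for z <= 0."""
    return 1.0 if z > 0 else elu(z, alpha) + alpha

for z in (-50.0, -2.0, -0.5, 0.0, 2.0):
    direct = 1.0 if z > 0 else math.exp(z)   # alpha * e^z with alpha = 1
    print(f"z={z:+6.1f}  ELU={elu(z):+.6f}  ELU'={elu_grad(z):.6f}  direct={direct:.6f}")
```

The last two columns should match, and the row for z = −50 shows the output saturating at −α.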
Properties
Property | Value
Output Range | (−α, ∞)
Zero-Centered? | ≈ Yes (mean activation closer to zero)
Smooth at z=0? | ✅ Yes for α = 1 (the left and right slopes match), unlike ReLU's sharp corner
Saturates for z ≪ 0? | ✅ Approaches −α (provides noise robustness)
[ASCII graph: ELU(z) with α = 1.0, z from −6 to 6. Slope 1 for z > 0 and a smooth exponential curve for z < 0 that levels off at −1.]
ELU's Advantages Over ReLU and Leaky ReLU
  1. Smooth at the origin (for α = 1): No sharp corner at z=0, which can help optimization
  2. Negative saturation: For very negative inputs, ELU saturates at −α, which makes the unit less sensitive to noise in strongly negative inputs
  3. Near-zero mean activations: Pushes the mean of activations closer to zero, reducing the bias shift effect
When to Use

✅ When you want activations with a mean close to zero without bounding the positive side
✅ Deep networks where slight accuracy gains over ReLU justify the extra compute
❌ When computational budget is tight (exponential is expensive)

Section 10

Activation 6: GELU - The Transformer's Choice

Formula
GELU(z) = z · P(Z ≤ z) = z · Φ(z), where Z is a standard normal random variable and Φ(z) = ½[1 + erf(z/√2)] is its CDF
Approximate Formula (used in practice)
GELU(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)])

This tanh approximation is what the original BERT and GPT-2 code computes. It avoids evaluating the error function while staying numerically very close to the exact form.
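
How close is "numerically almost identical"? The sketch below compares the exact erf form with the tanh approximation on a grid and also locates the minimum of the exact curve. The grid from −6 to 6 in steps of 0.01 is an arbitrary choice; only the standard library is used.

```python
import math

def gelu_exact(z):
    """GELU via the normal CDF: z * 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh approximation of GELU with the constant 0.044715."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-600, 601)]   # -6.00, -5.99, ..., 6.00

max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
z_min = min(grid, key=gelu_exact)              # grid point with the lowest output

print("largest |exact - tanh approx| on the grid:", max_gap)
print("minimum of exact GELU on the grid: z =", z_min, " value =", gelu_exact(z_min))
```

The printed gap lets you judge the approximation for yourself, and the location and depth of the minimum can be compared with the figures quoted for GELU's dip below zero.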

Intuition: The Probabilistic Gate

Think of GELU as a "stochastic ReLU": instead of the hard decision "if positive keep, if negative drop," GELU makes a soft, probabilistic decision. Inputs that are very positive pass through almost unchanged (Φ(z) ≈ 1). Inputs that are very negative are almost zeroed (Φ(z) ≈ 0). But inputs near zero get a weighted pass, the weight being the probability that a standard normal random variable would be less than z.

Derivative
GELU'(z) = Φ(z) + z · φ(z) where φ(z) is the standard normal PDF
Properties
Property | Value
Output Range | (≈ −0.17, ∞)
Smooth? | ✅ Infinitely differentiable
Non-monotonic? | ✅ Has a small dip, with its minimum near z ≈ −0.75
Used in | BERT, GPT-2, GPT-3, ViT
[ASCII graph: GELU(z) for z from −6 to 6. Close to ReLU for large |z| but smooth, with a slight dip below zero (minimum ≈ −0.17 near z ≈ −0.75). This non-monotonic dip is a key difference from ReLU.]
Why Transformers Prefer GELU Over ReLU
  1. Smoothness: ReLU's derivative jumps from 0 to 1 at z=0, whereas GELU's derivative is continuous. A smooth activation is often argued to make optimization of very large attention-based models better behaved
  2. Non-monotonicity: The small negative region lets GELU output slightly negative values for small negative inputs, which may help layers learn more expressive representations
  3. Probabilistic interpretation: GELU is the expected value of a stochastic gate that keeps the input with probability Φ(z), which ties it to the dropout-style regularization used in Transformers
  4. Empirical wins: GELU is commonly reported to outperform ReLU on NLP benchmarks, typically by a small margin (roughly 0.5-2%)

Paper: "Gaussian Error Linear Units (GELUs)", Dan Hendrycks & Kevin Gimpel, 2016 (arXiv:1606.08415). Originally a workshop paper, GELU became the default activation in many Transformer models. The key insight: instead of deterministically zeroing out inputs (ReLU), scale them by their percentile in a Gaussian distribution. This "soft gating" is more compatible with the stochastic nature of dropout.

Section 11

Activation 7: Swish / SiLU - The Neural Architecture Search Discovery

Formula
Swish(z) = z · σ(z) = z / (1 + e^(−z))
Derivative
Derivation using product rule:

Swish(z) = z · σ(z)

Swish'(z) = σ(z) + z · σ'(z)

Swish'(z) = σ(z) + z · σ(z)(1 − σ(z))

Swish'(z) = σ(z) + z · σ(z) − z · σ²(z)

Swish'(z) = σ(z)(1 + z(1 − σ(z))) = σ(z) + Swish(z)(1 − σ(z))
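
The two closed forms on the last line of the derivation can be checked against each other and against a finite-difference estimate. This is a scalar sketch with the standard library only; `numeric_grad`, the test points, and the grid used to locate the minimum are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def swish_grad_factored(z):
    """sigma(z) * (1 + z * (1 - sigma(z)))"""
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def swish_grad_reuse(z):
    """sigma(z) + Swish(z) * (1 - sigma(z)), which reuses the forward output."""
    s = sigmoid(z)
    return s + swish(z) * (1.0 - s)

def numeric_grad(f, z, h=1e-6):
    """Central-difference estimate of f'(z), used only as a cross-check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

points = [-4.0, -1.0, 0.0, 1.0, 4.0]
form_gap = max(abs(swish_grad_factored(z) - swish_grad_reuse(z)) for z in points)
numeric_gap = max(abs(swish_grad_factored(z) - numeric_grad(swish, z)) for z in points)

grid = [i / 100.0 for i in range(-600, 601)]
z_min = min(grid, key=swish)   # grid point where Swish is lowest

print("gap between the two closed forms:", form_gap)
print("gap to the numeric estimate:", numeric_gap)
print("minimum on the grid: z =", z_min, " value =", swish(z_min))
```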

Properties
Property | Value
Output Range | (≈ −0.278, ∞)
Smooth? | ✅ Infinitely differentiable
Non-monotonic? | ✅ Similar to GELU
Self-gated? | ✅ The input is gated by its own sigmoid
Discovered by | Google Brain via NAS (2017)
[ASCII graph: Swish(z) = z · σ(z) for z from −6 to 6. Close to ReLU for large |z|, with a slightly deeper dip below zero than GELU (minimum ≈ −0.28).]
The NAS Origin Story

Google Brain (Ramachandran, Zoph & Le, 2017) used automated search techniques from Neural Architecture Search to test thousands of activation functions. They parametrized activations as compositions of unary and binary operations, then searched over this space. Swish (z·σ(z)) emerged as the winner, beating ReLU on ImageNet, CIFAR, and machine translation tasks. One caveat to the "no human designed it" story: the same function had already been proposed by hand as the SiLU (Hendrycks & Gimpel, 2016; Elfwing et al., 2017), so the search rediscovered it.

GELU vs. Swish: Nearly Twins

GELU and Swish look almost identical graphically. The key difference: GELU uses the normal CDF Φ(z) as its gate, while Swish uses σ(z). For most practical purposes, their performance is interchangeable. GELU tends to win in NLP/Transformers; Swish tends to win in vision models (EfficientNet uses Swish).

ML Engineer at Google/EfficientNet team: Swish is the default activation in the entire EfficientNet family (B0-B7, V2). If you're fine-tuning EfficientNet for production, understanding Swish's gradient properties helps you set learning rates correctly. Many Google production vision models use Swish.

Section 12

Activation 8: Softmax - Multi-Class Output

Formula
Softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ), with the sum over j = 1, ..., K, for i = 1, 2, ..., K

Unlike all other activations in this chapter, softmax operates on an entire vector, not element-wise. It converts a vector of K raw scores (logits) into a probability distribution.

Derivation from Log-Linear Model
Step 1: Start with a log-linear model

We want: P(class = i | x) ∝ exp(zᵢ), where the score zᵢ = wᵢᵀx + bᵢ

Step 2: Normalize to get valid probabilities

P(class = i | x) = exp(zᵢ) / Σⱼ exp(zⱼ)

This ensures: (a) all outputs ∈ (0,1), and (b) they sum to exactly 1.

Step 3: Connection to maximum entropy

Among all distributions whose expected feature values match the observed ones, the softmax (log-linear) form is the one with maximum entropy. It's the "least biased" way to turn scores into probabilities.

Step 4: Temperature scaling

Dividing the logits by a temperature T > 0 gives Softmax(zᵢ/T): as T→0 it approaches a one-hot vector on the argmax; as T→∞ it approaches the uniform distribution (1/K).
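
The two temperature limits are easy to see numerically. In this sketch the logits and the two temperature values are arbitrary illustrative choices, and the `T` parameter is an assumption of this example, not a standard signature. Subtracting the largest scaled logit before exponentiating does not change the result and keeps `exp` from overflowing at small T.

```python
import math

def softmax(z, T=1.0):
    """Softmax of the logits z divided by temperature T (T > 0)."""
    scaled = [zi / T for zi in z]
    m = max(scaled)                          # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]

sharp = softmax(logits, T=0.05)    # low temperature: close to one-hot on the argmax
plain = softmax(logits, T=1.0)     # ordinary softmax
flat = softmax(logits, T=100.0)    # high temperature: close to uniform (1/3 each)

for name, probs in (("T=0.05", sharp), ("T=1", plain), ("T=100", flat)):
    print(name, [round(p, 4) for p in probs])
```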

Jacobian (Derivative)

Since softmax maps a vector to a vector, its derivative is a Jacobian matrix:

∂Softmax(zᵢ)/∂zⱼ = { Sᵢ(1 − Sᵢ) if i = j ; −SᵢSⱼ if i ≠ j }, where Sᵢ = Softmax(zᵢ)
Compactly: ∂Sᵢ/∂zⱼ = Sᵢ(δᵢⱼ − Sⱼ), where δᵢⱼ is 1 if i = j and 0 otherwise
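
As a sanity check on the compact formula, the sketch below builds the Jacobian from Sᵢ(δᵢⱼ − Sⱼ) and compares it with a finite-difference estimate. The logits are arbitrary and `numeric_jacobian` is a hypothetical helper used only for checking.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = dS_i/dz_j = S_i * (delta_ij - S_j)."""
    s = softmax(z)
    k = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(k)]
            for i in range(k)]

def numeric_jacobian(z, h=1e-6):
    """Central-difference estimate of the same matrix, for checking only."""
    k = len(z)
    J = [[0.0] * k for _ in range(k)]
    for j in range(k):
        up = softmax([zi + (h if idx == j else 0.0) for idx, zi in enumerate(z)])
        down = softmax([zi - (h if idx == j else 0.0) for idx, zi in enumerate(z)])
        for i in range(k):
            J[i][j] = (up[i] - down[i]) / (2.0 * h)
    return J

z = [2.0, 1.0, 0.1]
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)

max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
row_sums = [sum(row) for row in J]

print("largest gap to the numeric Jacobian:", max_err)
print("row sums (each should be ~0):", row_sums)
```

Each row sums to zero because the outputs always sum to 1: raising one probability must lower the others by the same total amount.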
Numerical Stability Trick

Computing e^z for large z causes overflow. The fix:

Softmax(zᵢ) = e^(zᵢ − max(z)) / Σⱼ e^(zⱼ − max(z))

Subtracting max(z) doesn't change the result (the common factor e^(−max(z)) cancels between numerator and denominator) but prevents overflow, because every exponent is now ≤ 0.
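
A minimal sketch of the trick in plain Python. The function names `softmax_naive` and `softmax_stable` and the large logits are illustrative. Note that Python's `math.exp` raises `OverflowError` on overflow, whereas array libraries typically return `inf` and then `nan`.

```python
import math

def softmax_naive(z):
    """Direct formula: overflows once any logit is large."""
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """Same function, with max(z) subtracted before exponentiating."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]   # hypothetical large logits

try:
    softmax_naive(big)
    naive_failed = False
except OverflowError:
    naive_failed = True          # math.exp(1000.0) is out of float range

stable = softmax_stable(big)
print("naive version overflowed:", naive_failed)
print("stable version:", [round(p, 4) for p in stable])

# On small logits the two versions agree up to rounding.
small = [2.0, 1.0, 0.1]
agreement = max(abs(a - b) for a, b in zip(softmax_naive(small), softmax_stable(small)))
print("largest difference on small logits:", agreement)
```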

Properties
Property | Value
Output Range | (0, 1) for each element; sum = 1
Input | Vector of K logits
Output | Probability distribution over K classes
When K=2 | Reduces to sigmoid!
Softmax with K=2 Equals Sigmoid: Proof

For K=2 classes, logits z = [z₁, z₂]:

Softmax(z₁) = e^(z₁) / (e^(z₁) + e^(z₂))

Divide numerator and denominator by e^(z₁):

= 1 / (1 + e^(z₂−z₁))

= 1 / (1 + e^(−(z₁−z₂)))

= σ(z₁ − z₂)

This is exactly sigmoid, applied to the difference of the two logits! So binary classification with softmax (2 outputs) ≡ sigmoid (1 output).
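
The equivalence can be spot-checked for a few logit pairs. The pairs below are arbitrary; only the standard library is used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

pairs = [(2.0, 1.0), (-0.5, 3.0), (0.0, 0.0)]   # arbitrary (z1, z2) logit pairs

# Gap between the first softmax output and sigmoid of the logit difference.
gaps = [abs(softmax([z1, z2])[0] - sigmoid(z1 - z2)) for z1, z2 in pairs]

for (z1, z2), gap in zip(pairs, gaps):
    print(f"z1={z1:+.1f}, z2={z2:+.1f}: softmax={softmax([z1, z2])[0]:.6f}, "
          f"sigmoid(z1 - z2)={sigmoid(z1 - z2):.6f}, gap={gap:.1e}")
```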

When to Use

✅ Output layer for multi-class classification (exactly one class per input)
✅ Attention mechanisms in Transformers (softmax over attention scores)
❌ Multi-label classification (use sigmoid per output instead)

❌ MYTH: "Softmax is an activation function like ReLU."

✅ TRUTH: Softmax operates on the entire output vector, not element-wise. It creates competition between classes: increasing one probability necessarily decreases others.

🔍 WHY IT MATTERS: If you accidentally apply softmax as a hidden-layer activation, you're forcing a probability distribution at each layer, destroying information. Apart from attention weights, softmax belongs at the output layer for classification.

Section 13

Activation Selection Guide - Decision Tree

The Master Comparison Table

Activation | Formula | Range | Derivative | Vanishes? | Zero-Centered?
Sigmoid | 1/(1+e^(−z)) | (0,1) | σ(1−σ) | ✅ Yes | ❌
Tanh | (e^z−e^(−z))/(e^z+e^(−z)) | (−1,1) | 1−tanh² | ✅ Yes | ✅
ReLU | max(0,z) | [0,∞) | 0 or 1 | ❌ (active) | ❌
Leaky ReLU | max(αz,z) | (−∞,∞) | α or 1 | ❌ | ~
ELU | z or α(e^z−1) | (−α,∞) | 1 or ELU+α | ❌ | ≈✅
GELU | z·Φ(z) | (−0.17,∞) | Φ+z·φ | ❌ | ≈✅
Swish | z·σ(z) | (−0.28,∞) | σ+Swish(1−σ) | ❌ | ≈✅
Softmax | e^(zᵢ)/Σe^(zⱼ) | (0,1), Σ=1 | S(δ−S) | N/A | N/A

Decision Tree: Which Activation to Choose?

What's your layer?
  OUTPUT LAYER
    Binary? → Sigmoid
    K ≥ 3? → Softmax
    Regression? → None
  HIDDEN LAYER
    Transformer? → GELU (BERT, GPT)
    CNN/MLP? → ReLU (default)
      Dying ReLU? → Leaky ReLU or PReLU
      Need extra accuracy? → Swish/ELU
    RNN? → Tanh (hidden)
  GATING MECHANISM
    → Sigmoid (LSTM, GRU)

The 80/20 Rule for Activation Functions: In 80% of cases, use ReLU for hidden layers. Use sigmoid for binary output, softmax for multi-class output. This default gets you 95% of the way there. Only experiment with GELU/Swish/ELU when you've exhausted other hyperparameters first, unless you're building Transformers, where GELU is the standard.

🇮🇳 India - GATE/Placement Focus

What exams ask: Sigmoid/tanh derivatives, vanishing gradient definition, ReLU formula. GATE 2023 asked: "Which activation causes vanishing gradient?" (Sigmoid & Tanh).

Typical interview: TCS, Infosys, Wipro ask about sigmoid vs ReLU. Flipkart, Razorpay, Swiggy go deeper: GELU, dying ReLU investigation.

🇺🇸 USA - Industry Focus

What jobs need: Understanding why GELU is used in Transformers (FAANG interview staple). Debugging dying ReLU in production models. Knowing when Swish helps in EfficientNet fine-tuning.

Typical interview: Google, Meta, OpenAI ask about GELU intuition. Apple asks about activation trade-offs for on-device models (ReLU preferred for speed).

Section 14

Dying ReLU Investigation

What Is Dying ReLU?

A "dead" ReLU neuron is one where Wx + b < 0 for every single training example. Since ReLU(negative) = 0 and ReLU'(negative) = 0, the neuron outputs zero, receives zero gradient, and its weights never update. It's permanently stuck.

When Does It Happen?

  1. Large learning rate: A big gradient update can push weights to a region where the neuron becomes negative for all inputs. Think of it as the neuron "jumping off a cliff."
  2. Bad initialization: If weights are initialized too large (or with large negative bias), neurons start dead.
  3. Input distribution shift: If the data distribution changes during training, previously active neurons can die.

How to Detect Dead Neurons

Python
import numpy as np

# After a forward pass, check what fraction of neurons are dead
def check_dead_neurons(activations):
    """activations: dict of layer_name -> array of shape (batch, neurons)"""
    for name, act in activations.items():
        act = np.asarray(act)  # a PyTorch tensor needs .detach().cpu() first
        # A neuron is "dead" if it outputs 0 for ALL examples in the batch
        # (one batch is only an estimate; use a large batch or several)
        dead_mask = (act == 0).all(axis=0)  # per-neuron check
        dead_frac = dead_mask.mean()
        print(f"{name}: {dead_frac*100:.1f}% neurons dead")
        if dead_frac > 0.5:
            print(f"  ⚠️ WARNING: More than 50% dead in {name}!")

# Healthy: 0-10% dead. Concerning: 10-30%. Critical: >50%

How to Fix Dying ReLU

Fix | How | Why It Works
Lower learning rate | Reduce by 2-10× | Prevents weight overshooting into dead regions
Use Leaky ReLU | Replace ReLU with LeakyReLU(α=0.01) | Dead neurons get α gradient, can recover
He initialization | W ~ N(0, σ²) with σ = √(2/n_in) | Calibrates variance so activations keep a stable scale across layers and ~50% of neurons start active
Batch Normalization | Add BN before ReLU | Centers pre-activation around zero, keeping ~50% active
Use PReLU | Learnable leak parameter | Network adapts the leak per neuron
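
The He initialization row can be illustrated with a small simulation. This is a sketch under stated assumptions: the layer sizes, the standard-normal inputs, and the zero biases are hypothetical choices, and the roughly 50% active fraction comes from the weights being zero-mean and symmetric, a property He initialization shares with other zero-mean schemes.

```python
import math
import random

random.seed(0)
n_in, n_out, n_samples = 64, 256, 50   # hypothetical layer and batch sizes

# He initialization: zero-mean Gaussian with standard deviation sqrt(2 / n_in).
std = math.sqrt(2.0 / n_in)
W = [[random.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

# The sample variance of the weights should be close to 2 / n_in.
flat = [w for row in W for w in row]
mean_w = sum(flat) / len(flat)
var_w = sum((w - mean_w) ** 2 for w in flat) / len(flat)

# Fraction of ReLU units that are active (pre-activation > 0) on random
# inputs, with all biases set to zero.
active = 0
for _ in range(n_samples):
    x = [random.gauss(0.0, 1.0) for _ in range(n_in)]
    active += sum(1 for row in W if sum(w * xi for w, xi in zip(row, x)) > 0)
active_frac = active / (n_samples * n_out)

print(f"target variance 2/n_in = {2.0 / n_in:.5f}, sample variance = {var_w:.5f}")
print(f"fraction of active units at initialization: {active_frac:.3f}")
```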

Bug: A student trains a 5-layer ReLU network. After epoch 10, accuracy plateaus at 52% (close to chance for a balanced binary classification problem). They print activations and see this:

Layer 1: 48.2% neurons dead
Layer 2: 67.1% neurons dead
Layer 3: 85.4% neurons dead
Layer 4: 97.3% neurons dead
Layer 5: 99.8% neurons dead

Your task: (1) What's happening? (2) Identify the root cause. (3) Propose 3 fixes in order of priority.

Answer: (1) Cascading neuron death: dead neurons in layer L pass fewer non-zero inputs to layer L+1, so pre-activations there are more likely to stay below zero for every example, and more neurons die. (2) Root cause is likely a learning rate that's too high, combined with poor initialization. (3) Fixes in priority: ① Reduce learning rate by 10× ② Add Batch Normalization before each ReLU ③ Switch to He initialization if not already used. If still dead, switch to Leaky ReLU.
Section 15

Worked Examples

Example 1: By-Hand Computation of All Activations for z = −2, 0, 2

Input values: z = −2, z = 0, z = 2

Let's compute each activation function by hand.

Sigmoid: σ(z) = 1/(1+e^(−z))

σ(−2) = 1/(1+e²) = 1/(1+7.389) = 1/8.389 ≈ 0.1192

σ(0) = 1/(1+1) = 0.5

σ(2) = 1/(1+e^(−2)) = 1/(1+0.1353) = 1/1.1353 ≈ 0.8808

Sigmoid derivative: σ'(z) = σ(z)(1−σ(z))

σ'(−2) = 0.1192 × 0.8808 ≈ 0.1050

σ'(0) = 0.5 × 0.5 = 0.25 ← maximum!

σ'(2) = 0.8808 × 0.1192 ≈ 0.1050

Tanh:

tanh(−2) ≈ −0.9640

tanh(0) = 0

tanh(2) ≈ 0.9640

ReLU:

ReLU(−2) = max(0, −2) = 0

ReLU(0) = max(0, 0) = 0

ReLU(2) = max(0, 2) = 2

Leaky ReLU (α=0.01):

LReLU(−2) = 0.01 × (−2) = −0.02

LReLU(0) = 0

LReLU(2) = 2

Swish:

Swish(−2) = (−2) × σ(−2) = −2 × 0.1192 ≈ −0.2384

Swish(0) = 0 × 0.5 = 0

Swish(2) = 2 × σ(2) = 2 × 0.8808 ≈ 1.7616
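
The hand computations above can be checked with a few lines of standard-library Python. This is a small verification sketch; the helper names are chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def swish(z):
    return z * sigmoid(z)

# Print every activation from the worked example, rounded like the hand computation
for z in (-2.0, 0.0, 2.0):
    s = sigmoid(z)
    print(f"z={z:+.0f}  sigmoid={s:.4f}  sigmoid'={s * (1 - s):.4f}  "
          f"tanh={math.tanh(z):+.4f}  relu={max(0.0, z):.2f}  "
          f"leaky={leaky_relu(z):+.2f}  swish={swish(z):+.4f}")
```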

Example 2: Gradient Flow in a 5-Layer Network

Setup: A 5-layer network. Let's trace how a gradient signal of 1.0 at the output gets attenuated as it flows backward. The trace multiplies only the activation derivatives at each layer; the weight matrices are left out to isolate the effect of the activation.

Layer | Sigmoid (×0.25 best case) | Tanh (×1.0 best case) | ReLU (×1.0 if active)
Layer 5 (output) | 1.0000 | 1.0000 | 1.0000
Layer 4 | 0.2500 | 1.0000 | 1.0000
Layer 3 | 0.0625 | 1.0000 | 1.0000
Layer 2 | 0.0156 | 1.0000 | 1.0000
Layer 1 | 0.0039 | 1.0000 | 1.0000

With sigmoid, the gradient reaching Layer 1 is at most 0.39% of its original value (0.25⁴ ≈ 0.0039). This is the vanishing gradient problem. With ReLU, gradients pass through unchanged (as long as the neuron is active). Note: tanh's best case is 1.0, but in practice tanh'(z) < 1 for z ≠ 0, so it also vanishes, just more slowly than sigmoid.
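
The sigmoid column is just repeated multiplication by 0.25, which a short sketch makes easy to extend to other depths. It is an upper bound under the same simplification as the trace above (activation derivatives only, weights ignored); the function name is illustrative.

```python
def attenuation(max_derivative, n_layers):
    """Best-case gradient scale after passing back through n_layers activations."""
    return max_derivative ** n_layers

for depth in (1, 2, 3, 4, 10):
    print(f"{depth:2d} layers: sigmoid <= {attenuation(0.25, depth):.2e}, "
          f"ReLU (active path) = {attenuation(1.0, depth):.1f}")
```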

Example 3: Softmax Computation

Given: Logits z = [2.0, 1.0, 0.1] for 3 classes

Step 1: Compute exponentials

e^2.0 = 7.389, e^1.0 = 2.718, e^0.1 = 1.105

Step 2: Sum

Σ = 7.389 + 2.718 + 1.105 = 11.212

Step 3: Normalize

Softmax([2.0, 1.0, 0.1]) = [7.389/11.212, 2.718/11.212, 1.105/11.212]

= [0.659, 0.242, 0.099]

Verification: 0.659 + 0.242 + 0.099 = 1.000 ✅

With numerical stability trick (subtract max = 2.0):

z' = [0.0, −1.0, −1.9]

e^0.0 = 1.000, e^(−1.0) = 0.368, e^(−1.9) = 0.150

Σ = 1.518

Result: [0.659, 0.242, 0.099] ← Same answer, no overflow risk!
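
The claim that subtracting the maximum leaves the result unchanged can be checked directly. The sketch below uses only the standard library; the `shift` flag is a hypothetical switch added here to compare the two variants.

```python
import math

def softmax(logits, shift=True):
    """Softmax over a list of floats; optionally subtract the max first."""
    offset = max(logits) if shift else 0.0
    exps = [math.exp(z - offset) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]
plain = softmax(z, shift=False)    # direct definition
stable = softmax(z, shift=True)    # max-subtracted version
print([round(p, 3) for p in plain])
print([round(p, 3) for p in stable])
```

The shift matters for large logits: `softmax([1000.0, 1001.0, 1002.0])` works with the shift, while `shift=False` raises an `OverflowError` inside `math.exp`.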

Section 16

Python Implementation: From Scratch (NumPy)

All 8 Activations + Derivatives

Python: NumPy
import numpy as np

# ═══════════════════════════════════════
# 1. SIGMOID
# ═══════════════════════════════════════
def sigmoid(z):
    """Numerically stable sigmoid."""
    # np.where evaluates both branches, so exponentiate -|z| only:
    # e is always in (0, 1] and np.exp can never overflow.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# ═══════════════════════════════════════
# 2. TANH
# ═══════════════════════════════════════
def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2

# ═══════════════════════════════════════
# 3. ReLU
# ═══════════════════════════════════════
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(np.float64)

# ═══════════════════════════════════════
# 4. LEAKY ReLU
# ═══════════════════════════════════════
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

# ═══════════════════════════════════════
# 5. ELU
# ═══════════════════════════════════════
def elu(z, alpha=1.0):
    # Clamp the exponent at 0 so np.exp never overflows on the unused branch
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0)))

def elu_derivative(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0)))

# ═══════════════════════════════════════
# 6. GELU (approximate)
# ═══════════════════════════════════════
def gelu(z):
    """Approximate GELU used in BERT/GPT."""
    return 0.5 * z * (1 + np.tanh(
        np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)
    ))

def gelu_derivative(z):
    """Numerical derivative for simplicity."""
    h = 1e-7
    return (gelu(z + h) - gelu(z - h)) / (2 * h)

# ═══════════════════════════════════════
# 7. SWISH / SiLU
# ═══════════════════════════════════════
def swish(z):
    return z * sigmoid(z)

def swish_derivative(z):
    s = sigmoid(z)
    return s + z * s * (1 - s)

# ═══════════════════════════════════════
# 8. SOFTMAX
# ═══════════════════════════════════════
def softmax(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
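
A useful habit after writing derivatives by hand is a finite-difference check. The sketch below compares a few analytic derivatives with central differences; the functions are restated in compact form so the block runs on its own, and the step size `h` and test grid are assumptions chosen for illustration.

```python
import numpy as np

def numerical_derivative(f, z, h=1e-5):
    """Central-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# (function, analytic derivative) pairs
pairs = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "tanh":    (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "swish":   (lambda z: z * sigmoid(z),
                lambda z: sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))),
    "elu":     (lambda z: np.where(z > 0, z, np.expm1(np.minimum(z, 0))),
                lambda z: np.where(z > 0, 1.0, np.exp(np.minimum(z, 0)))),
}

# 80 evenly spaced points on [-4, 4]; this grid does not contain z = 0 exactly
z = np.linspace(-4, 4, 80)
max_errors = {}
for name, (f, df) in pairs.items():
    max_errors[name] = float(np.max(np.abs(numerical_derivative(f, z) - df(z))))
    print(f"{name:8s} max |numerical - analytic| = {max_errors[name]:.2e}")
```

Errors far below the size of the derivative itself suggest the formula is right; an error of order 1 usually means a sign or factor mistake.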

Visualize All Activations on One Plot

Python: Matplotlib
import matplotlib.pyplot as plt

z = np.linspace(-6, 6, 500)

activations = {
    'Sigmoid':    (sigmoid(z),    sigmoid_derivative(z),    '#6366f1'),
    'Tanh':       (tanh(z),       tanh_derivative(z),       '#0891b2'),
    'ReLU':       (relu(z),       relu_derivative(z),       '#16a34a'),
    'Leaky ReLU': (leaky_relu(z), leaky_relu_derivative(z), '#ea580c'),
    'ELU':        (elu(z),        elu_derivative(z),        '#0d9488'),
    'GELU':       (gelu(z),       gelu_derivative(z),       '#7c3aed'),
    'Swish':      (swish(z),      swish_derivative(z),      '#d946ef'),
}

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: Activation functions
for name, (act, deriv, color) in activations.items():
    axes[0].plot(z, act, label=name, color=color, linewidth=2)
axes[0].set_title('Activation Functions', fontweight='bold')
axes[0].set_xlabel('z'); axes[0].set_ylabel('f(z)')
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[0].legend(); axes[0].set_ylim(-2, 6)

# Right: Derivatives
for name, (act, deriv, color) in activations.items():
    axes[1].plot(z, deriv, label=f"{name}'(z)", color=color, linewidth=2)
axes[1].set_title('Derivatives (Gradient Flow)', fontweight='bold')
axes[1].set_xlabel('z'); axes[1].set_ylabel("f'(z)")
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1].axhline(y=1, color='gray', linestyle=':', alpha=0.4)
axes[1].legend(); axes[1].set_ylim(-0.5, 1.5)

plt.tight_layout()
plt.savefig('activation_functions_comparison.png', dpi=150)
plt.show()

Compare Gradient Flow Through 20 Layers

Python: Gradient Experiment
def gradient_flow_experiment(activation_fn, deriv_fn, n_layers=20, hidden_size=64, seed=42):
    """Backpropagate a unit gradient through n_layers random layers with the given activation."""
    rng = np.random.default_rng(seed)

    # He initialization for all layers: std = sqrt(2 / fan_in)
    weights = [rng.standard_normal((hidden_size, hidden_size)) * np.sqrt(2.0 / hidden_size)
               for _ in range(n_layers)]

    # Forward pass on one random input, keeping each layer's pre-activation
    a = rng.standard_normal(hidden_size)
    pre_activations = []
    for W in weights:
        z = W @ a
        pre_activations.append(z)
        a = activation_fn(z)

    # Backward pass: start with a gradient of 1.0 at the output
    grad = np.ones(hidden_size)
    gradients = []
    for W, z in zip(reversed(weights), reversed(pre_activations)):
        # Chain rule: multiply by the activation derivative, then by W transposed
        grad = W.T @ (grad * deriv_fn(z))
        gradients.append(np.mean(np.abs(grad)))

    return gradients

# Run for each activation (forward pass uses the function, backward pass its derivative)
results = {}
for name, act_fn, deriv_fn in [('Sigmoid', sigmoid, sigmoid_derivative),
                               ('Tanh', tanh, tanh_derivative),
                               ('ReLU', relu, relu_derivative),
                               ('Leaky ReLU', leaky_relu, leaky_relu_derivative),
                               ('GELU', gelu, gelu_derivative),
                               ('Swish', swish, swish_derivative)]:
    results[name] = gradient_flow_experiment(act_fn, deriv_fn)

# Plot gradient magnitude vs layer depth
plt.figure(figsize=(10, 6))
for name, grads in results.items():
    plt.plot(range(1, 21), grads, label=name, linewidth=2, marker='o', markersize=3)
plt.yscale('log')
plt.xlabel('Layer (from output to input)')
plt.ylabel('Mean |gradient|')
plt.title('Gradient Flow: Vanishing Gradient Demonstration')
plt.legend(); plt.grid(True, alpha=0.3)
plt.show()
Expected pattern (exact numbers depend on the seed, width, and initialization): the sigmoid curve drops by many orders of magnitude across the 20 layers, because each layer scales the gradient by at most 0.25 before the weights are applied. ReLU and Leaky ReLU with He initialization stay within roughly an order of magnitude of their starting value. Tanh, GELU, and Swish are more sensitive to the initialization scale, so read their curves from your own run.
Section 17

Library Implementations: PyTorch & TensorFlow

PyTorch

PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

z = torch.linspace(-6, 6, 100, requires_grad=True)

# All activations as one-liners
sig   = torch.sigmoid(z)               # or F.sigmoid(z)
tan   = torch.tanh(z)                  # or F.tanh(z)
rel   = F.relu(z)                      # or torch.relu(z)
lrel  = F.leaky_relu(z, 0.01)         # α = 0.01
elu_  = F.elu(z, alpha=1.0)            # α = 1.0
gel   = F.gelu(z)                      # exact or approximate='tanh'
swi   = F.silu(z)                      # SiLU = Swish(β=1)
sft   = F.softmax(z.unsqueeze(0), dim=-1)

# Using as nn.Module layers in a network
class FlexibleNet(nn.Module):
    def __init__(self, activation='relu'):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

        # Activation selection
        act_map = {
            'relu':       nn.ReLU(),
            'leaky_relu': nn.LeakyReLU(0.01),
            'elu':        nn.ELU(alpha=1.0),
            'gelu':       nn.GELU(),
            'silu':       nn.SiLU(),        # Swish
            'sigmoid':    nn.Sigmoid(),
            'tanh':       nn.Tanh(),
            'prelu':      nn.PReLU(),       # learnable α
        }
        self.act = act_map[activation]

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return self.fc3(x)   # No activation on output (use CrossEntropyLoss)

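# Build one model per activation; the parameter count is identical,
# so any accuracy gap on MNIST comes from the activation itself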
for act_name in ['relu', 'sigmoid', 'gelu', 'silu']:
    model = FlexibleNet(activation=act_name)
    print(f"{act_name}: {sum(p.numel() for p in model.parameters())} params")

TensorFlow / Keras

TensorFlow / Keras
import tensorflow as tf
from tensorflow.keras import layers, models

# Build model with any activation
def build_model(activation='relu'):
    model = models.Sequential([
        layers.Dense(256, activation=activation, input_shape=(784,)),
        layers.Dense(128, activation=activation),
        layers.Dense(10, activation='softmax')
    ])
    return model

# Keras supports these strings directly:
# 'relu', 'sigmoid', 'tanh', 'elu', 'selu', 'gelu', 'swish'
# For LeakyReLU, use: layers.LeakyReLU(alpha=0.01)
#   (Keras 3 renames this argument to negative_slope)
# For PReLU: layers.PReLU()

# Custom activation example
@tf.function
def mish(z):
    """Mish activation: z * tanh(softplus(z))"""
    return z * tf.math.tanh(tf.math.softplus(z))

model = models.Sequential([
    layers.Dense(256, input_shape=(784,)),
    layers.Activation(mish),
    layers.Dense(10, activation='softmax')
])
Section 18

Visual Diagrams

All Activations Side-by-Side

[Figure: six small plots of the activation curves, summarized below]

  • Sigmoid: range (0, 1), max gradient 0.25
  • Tanh: range (−1, 1), max gradient 1.0
  • ReLU: range [0, ∞), gradient 0 or 1
  • Leaky ReLU: range (−∞, ∞), gradient α or 1
  • ELU: range (−α, ∞), smooth at 0 (for α = 1)
  • GELU / Swish: range (~−0.2, ∞), smooth and non-monotonic

Gradient Flow Through a Deep Network

[Figure: forward and backward pass through a 3-layer network]

Forward pass: x → [W₁x+b₁, then f(z)] → [W₂a₁+b₂, then f(z)] → [W₃a₂+b₃, then f(z)] → ŷ

Backward pass (gradients), per-layer factor through 3 layers:

  • Sigmoid: ×0.25 ×0.25 ×0.25 → 0.25³ = 0.016 😱
  • ReLU: ×1.0 ×1.0 ×1.0 → 1.0³ = 1.0 ✅
  • GELU: ×~0.84 ×~0.84 ×~0.84 → ~0.59 ✅

After 10 layers:

  • Sigmoid gradient: 0.25¹⁰ ≈ 0.000001 ← VANISHED!
  • ReLU gradient: 1.0¹⁰ = 1.0 ← Perfect

Softmax Visualization

[Figure: raw logits mapped through softmax to a probability bar chart]

  • z₁ = 2.0 → P₁ = 0.659
  • z₂ = 1.0 → P₂ = 0.242
  • z₃ = 0.1 → P₃ = 0.099
  • Sum = 1.000 ✅

Temperature Scaling (softmax of z/T):

  • T = 0.1 (sharp): [1.000, 0.000, 0.000] ← nearly one-hot
  • T = 1.0 (normal): [0.659, 0.242, 0.099] ← balanced
  • T = 10 (soft): [0.366, 0.331, 0.303] ← nearly uniform
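
Temperature scaling divides the logits by T before applying softmax. The following NumPy sketch recomputes the three temperature settings for the logits [2.0, 1.0, 0.1]; the function name is chosen here for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, with the max subtracted for stability."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

logits = [2.0, 1.0, 0.1]
for T in (0.1, 1.0, 10.0):
    print(f"T = {T:>4}: {np.round(softmax_with_temperature(logits, T), 3)}")
```

Small T sharpens the distribution toward the largest logit; large T flattens it toward uniform.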
Section 19

Industry Case Studies

🇮🇳 India: Flipkart Product Categorization (ReLU vs Sigmoid in Hidden Layers)

Case Study: Flipkart's Product Classification Pipeline

Context: Flipkart handles 150M+ products across 80+ categories. Their product categorization pipeline uses a deep neural network that takes product title, description, and image embeddings as input and outputs one of 80 leaf categories.

The Problem

The initial model (2019) used sigmoid activation in hidden layers (a legacy decision from when the team adapted a logistic regression model). The 6-layer network showed:

  • Training accuracy: 78% (plateau after epoch 15)
  • Gradient magnitude at layer 1: ~10⁻⁸ (effectively zero)
  • Training time: 14 hours on 4× V100 GPUs

The Fix

Replaced sigmoid with ReLU in all 6 hidden layers. Added He initialization and Batch Normalization.

Results

Metric | Sigmoid Hidden | ReLU Hidden | Improvement
Top-1 Accuracy | 78.2% | 91.7% | +13.5 points
Training Time | 14 hours | 3.2 hours | 4.4× faster
Layer 1 Gradient | ~10⁻⁸ | ~10⁻² | 10⁶× stronger
Convergence Epoch | Epoch 40+ | Epoch 12 | 3× fewer epochs

Key Takeaway

The network's depth and layer sizes were identical in both runs. The gain came from a change of a few lines: swapping the hidden activation from sigmoid to ReLU, together with He initialization and Batch Normalization. This is why understanding activations matters for production ML engineering.

The One-Line Fix
# Before (bad)
self.hidden = nn.Sequential(
    nn.Linear(512, 256), nn.Sigmoid(),  # ❌ Sigmoid in hidden
    nn.Linear(256, 128), nn.Sigmoid(),
)

# After (good)
self.hidden = nn.Sequential(
    nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),  # ✅ ReLU + BN
    nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(),
)

🇺🇸 Global: GPT Architecture (Why GELU Over ReLU in Transformers)

Case Study: OpenAI's GPT and the Choice of GELU

Context: GPT-2 (2019) and GPT-3 (2020) use GELU activation in their feed-forward layers. GPT-4's architecture has not been published, so its activation is not publicly known. Choosing GELU was a deliberate departure from the ReLU that dominated CNNs.

Transformer Feed-Forward Block

Each Transformer layer has a feed-forward network (FFN) with two linear layers and an activation in between:

Architecture
FFN(x) = W₂ · GELU(W₁ · x + b₁) + b₂

# In GPT-3 (175B parameters):
# W₁: [12288 × 49152]  (expand 4×)
# W₂: [49152 × 12288]  (project back)
# GELU applied to 49152-dimensional vector

Why GELU Beats ReLU in Transformers

Property | ReLU in Transformer | GELU in Transformer
Gradient at z=0 | Discontinuous (0 → 1) | Smooth (≈ 0.5)
Negative inputs | Hard zero: information lost | Soft suppression: some signal preserved
Attention compatibility | Creates hard sparsity patterns | Soft sparsity matches attention's soft weighting
Training stability | Can cause loss spikes at scale | Smoother loss landscape
GLUE benchmark | Baseline | +0.5-2% on most tasks

The Smoothness Argument

At billion-parameter scale, the sharp corner of ReLU at z=0 makes the gradient discontinuous (the function itself is continuous, but its slope jumps from 0 to 1). A common argument is that with millions of neurons sitting near z≈0 at once, these jumps add up and contribute to training instability (loss spikes). GELU's smooth transition avoids the jump.
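
The difference in smoothness is easy to see numerically. This sketch uses the exact GELU, z·Φ(z), computed with `math.erf`, and estimates slopes by central differences; the step size and the search grid are assumptions chosen for illustration.

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def relu(z):
    return max(0.0, z)

def slope(f, z, h=1e-6):
    """Central-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Just left and right of zero: ReLU's slope jumps, GELU's changes gradually
for z in (-0.01, 0.01):
    print(f"z={z:+.2f}  relu'={slope(relu, z):.3f}  gelu'={slope(gelu, z):.3f}")
print(f"gelu'(0) = {slope(gelu, 0.0):.3f}")

# Locate GELU's minimum on a fine grid over [-3, 0]
grid = [-3.0 + i * 0.001 for i in range(3001)]
z_min = min(grid, key=gelu)
print(f"GELU minimum ~ {gelu(z_min):.3f} at z ~ {z_min:.2f}")
```

The grid search also shows the small negative dip that makes GELU non-monotonic.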

Code: GPT-Style FFN with GELU

PyTorch
class TransformerFFN(nn.Module):
    """Feed-forward network as used in GPT-2/3."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.gelu = nn.GELU()     # ← THE key choice
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        x = self.gelu(self.fc1(x))  # Expand + activate
        x = self.dropout(self.fc2(x))  # Project back
        return x

Paper: "Searching for Activation Functions" by Ramachandran, Zoph, Le (Google Brain, 2017). Used reinforcement learning to search over a space of activation functions. Swish (x·σ(x)) emerged as the best performer across multiple benchmarks, beating ReLU by 0.6-0.9% on ImageNet. This paper pioneered the idea of "learning to design activation functions."

Section 20

Common Misconceptions

❌ MYTH: "ReLU neurons die because the function is zero for negative inputs."

✅ TRUTH: Having zero output for some inputs is fine; that is the sparsity feature. The problem is when a neuron outputs zero for ALL inputs. That happens because of bad weight updates (too large a learning rate), not because of the activation function's definition.

🔍 WHY IT MATTERS: Students sometimes switch to Leaky ReLU preemptively. Understand the cause first: often a learning rate fix or better initialization is sufficient.

❌ MYTH: "Newer activations (GELU, Swish) are always better than ReLU."

✅ TRUTH: ReLU is still the best default for CNNs and standard MLPs. GELU wins in Transformers specifically. Swish wins in EfficientNet specifically. There is no universally "best" activation; it depends on the architecture.

🔍 WHY IT MATTERS: Blindly using GELU in a ResNet or Swish in an LSTM wastes compute without guaranteed improvement. Match the activation to the architecture.

❌ MYTH: "The vanishing gradient problem means gradients become exactly zero."

✅ TRUTH: Gradients become exponentially small (e.g., 10⁻¹²) but not zero. They're small enough that weight updates become negligible relative to floating-point precision, making training practically impossible.

🔍 WHY IT MATTERS: Seeing it as a problem of scale helps you pick the right fix. Non-saturating activations, careful initialization, normalization, and residual connections stop the shrinkage at its source, and loss scaling in mixed-precision training keeps small gradients from underflowing. Gradient clipping addresses the opposite problem, exploding gradients.

❌ MYTH: "Softmax makes the highest-scoring class approach probability 1."

✅ TRUTH: Only with extreme logit differences. If logits are [2.0, 1.9, 1.8], softmax gives [0.367, 0.332, 0.301], which is nearly uniform. Softmax amplifies differences but doesn't create certainty from ambiguity.

🔍 WHY IT MATTERS: Overconfident softmax predictions (poor calibration) are a major issue in production ML. A model can output P=0.99 and still be wrong far more often than 1% of the time.

❌ MYTH: "ReLU is not differentiable at z=0, so gradient descent shouldn't work."

✅ TRUTH: The probability of z being exactly 0 is zero for continuous inputs (measure zero). In practice, we use a subgradient (define the derivative as 0 at z=0), and training works fine.

🔍 WHY IT MATTERS: This is a classic GATE/exam question designed to trick students who confuse theoretical differentiability with practical computability.

Section 21

GATE / Exam Corner

Formula Sheet

Activation Functions: Quick Reference
Function | f(z) | f'(z) | Range
Sigmoid | 1/(1+e^(−z)) | σ(1−σ) | (0, 1)
Tanh | (e^z − e^(−z))/(e^z + e^(−z)) | 1 − tanh²(z) | (−1, 1)
ReLU | max(0, z) | 0 or 1 | [0, ∞)
Leaky ReLU | max(αz, z) | α or 1 | (−∞, ∞)
Softmax | e^(z_i) / Σ_j e^(z_j) | ∂S_i/∂z_j = S_i(δ_ij − S_j) | (0, 1), Σ = 1

Key identity: tanh(z) = 2σ(2z) − 1

Vanishing gradient: σ'_max = 0.25; after L layers the factor is at most 0.25^L
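
The key identity above is worth verifying once, since it also explains why tanh's maximum slope is four times sigmoid's. A short NumPy check, with the test grid chosen arbitrarily:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Compare tanh(z) with 2*sigmoid(2z) - 1 on a grid of points
z = np.linspace(-5, 5, 101)
identity_gap = np.max(np.abs(np.tanh(z) - (2 * sigmoid(2 * z) - 1)))
print(f"max |tanh(z) - (2*sigmoid(2z) - 1)| = {identity_gap:.2e}")

# Differentiating the identity gives tanh'(z) = 4 * sigmoid'(2z),
# so tanh'(0) = 4 * 0.25 = 1
tanh_slope_at_zero = 4 * sigmoid(0.0) * (1 - sigmoid(0.0))
print(tanh_slope_at_zero)
```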

GATE Previous Year Style Questions

GATE Q1

Which activation function has a maximum derivative value of 0.25?

  1. ReLU
  2. Tanh
  3. Sigmoid
  4. Leaky ReLU
Answer: C (option 3). σ'(z) = σ(z)(1−σ(z)). Maximum at z=0: 0.5×0.5 = 0.25. Tanh's max derivative is 1.0. ReLU's is 1 (in active region). Leaky ReLU's max is 1.
Remember · GATE CS 2023
GATE Q2

A neural network with 10 hidden layers uses only linear activation functions. The network has 784 input features and 10 outputs. What is the maximum number of learnable parameters needed to achieve the same representational power?

  1. 10 × 784 + 10 = 7,850
  2. 784 × 10 + 10 = 7,850
  3. Same as the 10-layer network
  4. Cannot be determined
Answer: B (option 2). With linear activations, 10 layers collapse to a single linear transformation: y = W*x + b* where W* ∈ ℝ^(10×784) and b* ∈ ℝ^10. Total parameters = 784×10 + 10 = 7,850. Note that option 1 is the same number with the factors swapped.
Analyze · GATE CS 2022
GATE Q3

For the softmax function applied to logits z = [3, 1, −2], what is the approximate probability of class 1?

  1. 0.50
  2. 0.88
  3. 0.66
  4. 0.95
Answer: B (option 2). e³ = 20.09, e¹ = 2.72, e^(−2) = 0.14. Sum = 22.95. P(class 1) = 20.09/22.95 ≈ 0.875 ≈ 0.88.
Apply · Numerical
GATE Q4

The "dying ReLU" problem occurs when:

  1. The learning rate is too small
  2. All inputs to a neuron produce negative pre-activations
  3. The gradient becomes too large
  4. The activation output exceeds a threshold
Answer: B (option 2). A "dead" ReLU neuron has Wx+b ≤ 0 for all training examples. Output is always 0, gradient is always 0, so weights never update. This typically happens due to large learning rates causing weight overshooting or bad initialization.
Understand · GATE CS 2024
GATE Q5

Which of the following is NOT a property of the tanh activation function?

  1. Output is zero-centered
  2. It saturates for large |z|
  3. It is equivalent to 2σ(2z) − 1
  4. Its maximum derivative is 0.5
Answer: D (option 4). The maximum derivative of tanh is 1.0 (at z=0), not 0.5. tanh'(z) = 1 − tanh²(z), and tanh(0) = 0, so tanh'(0) = 1 − 0 = 1. All other statements are true.
Remember · GATE DA

Prediction Table: High-Probability GATE Topics

Topic | Probability | Typical Format
Sigmoid derivative computation | ⭐⭐⭐⭐⭐ | MCQ / NAT
Vanishing gradient identification | ⭐⭐⭐⭐⭐ | MCQ
Softmax probability computation | ⭐⭐⭐⭐ | NAT (numerical answer)
Linear vs non-linear activation | ⭐⭐⭐⭐ | MCQ / MSQ
ReLU properties / dying ReLU | ⭐⭐⭐ | MCQ
GELU / Swish (advanced) | ⭐⭐ | MSQ (if asked)
Section 22

Interview Prep

Conceptual Questions

Q1: "Why ReLU over sigmoid for hidden layers?"

Strong Answer (2 minutes)

Three reasons, in order of importance:

1. Vanishing gradient: Sigmoid's max derivative is 0.25. In a 10-layer network, gradients shrink by up to 0.25^10 ≈ 10^−6. ReLU's gradient is exactly 1.0 in the active region, so gradients pass through unchanged, enabling training of much deeper networks.

2. Computational efficiency: Sigmoid requires computing e^(−z), an expensive operation. ReLU is just max(0, z), a simple comparison. ReLU is noticeably cheaper per element (the exact speedup depends on hardware and framework), which matters when you're training on millions of images.

3. Sparse activation: About 50% of ReLU neurons output zero at any given time, creating a sparse representation. This acts as implicit regularization and is biologically motivated: human neurons are also sparsely active.

Caveat I'd add: Sigmoid is still correct for output layers in binary classification and for gating mechanisms in LSTMs.

Q2: "What's the dying ReLU problem and how do you fix it?"

Strong Answer

The problem: If a neuron's weights update such that the pre-activation Wx + b < 0 for every training example, it outputs zero permanently. With zero output, gradient is zero, weights never update. The neuron is dead.

Detection: After a forward pass, check what fraction of neurons in each layer output all zeros across the batch. Healthy: 0-10%. Concerning: 10-30%. Critical: 50%+.

Fixes, in priority order:

  1. Reduce learning rate (most common cause is overshooting)
  2. Use He initialization: W ~ N(0, σ²) with σ = √(2/n)
  3. Add Batch Normalization before ReLU
  4. Switch to Leaky ReLU (α=0.01) or PReLU

Q3: "Why does GPT use GELU instead of ReLU?"

Strong Answer (for FAANG / OpenAI interviews)

1. Smoothness at zero: ReLU has a discontinuous gradient at z=0. In Transformers with billions of parameters, many neurons are near z≈0 simultaneously. The accumulated discontinuities cause training instability: loss spikes that don't occur with GELU's smooth transition.

2. Soft gating: GELU = z·Φ(z) can be interpreted as scaling each input by its own percentile in a Gaussian distribution. This soft gating is philosophically consistent with the soft attention mechanism in Transformers (which uses softmax, another soft gate).

3. Non-monotonicity: GELU has a small negative region (minimum ≈ −0.17 at z ≈ −0.75). This allows the network to create "anti-features" (neurons that weakly respond to things that are not present), which helps in NLP where absence of a word can be informative.

4. Empirical: Consistently +0.5-2% improvement over ReLU on GLUE, SuperGLUE, and other NLP benchmarks.

Coding Interview Question

Coding: "Implement softmax that handles numerical overflow"

Python: Interview Solution
import numpy as np

def stable_softmax(z):
    """
    Numerically stable softmax.
    Args: z: numpy array of shape (batch_size, n_classes)
    Returns: probability distribution, same shape as z
    """
    # Subtract max for numerical stability (prevents overflow)
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Test
z = np.array([[1000, 1001, 1002]])  # Would overflow without stability trick
print(stable_softmax(z))
# Output (rounded): [[0.0900, 0.2447, 0.6652]]

# Without the shift, np.exp(1000) = inf and inf/inf → NaN. Our version avoids this.
Follow-up questions the interviewer might ask:
  • "What happens without the max subtraction?" → Overflow to inf, then inf/inf = NaN
  • "Why subtract max specifically?" → Any constant works (cancels in numerator/denominator), but max ensures all exponents are ≤ 0, preventing overflow
  • "Implement the Jacobian of softmax" → More advanced, shows you understand the S(δ−S) formula
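
For the last follow-up, one possible answer is sketched below. It builds the Jacobian J[i, j] = S_i(δ_ij − S_j) for a single logit vector as diag(S) − S Sᵀ; the function names are illustrative, and a batched version would need an extra axis.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D logit vector."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def softmax_jacobian(z):
    """J[i, j] = dS_i/dz_j = S_i * (delta_ij - S_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian(z)
print(np.round(J, 3))

# Sanity checks: rows sum to 0 (probabilities always sum to 1) and J is symmetric
print(np.allclose(J.sum(axis=1), 0.0), np.allclose(J, J.T))
```

The rows sum to zero because shifting all logits by the same constant leaves the output unchanged.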

Case Study Interview Question

Case: "Your model's accuracy is stuck at random chance. Diagnose."

Framework for answering:
  1. Check activation-related issues first (most common):
    • Are you using sigmoid/tanh in hidden layers of a deep network? → Vanishing gradients → Switch to ReLU
    • Print percentage of dead neurons per layer → If high, dying ReLU → Lower LR or use Leaky ReLU
  2. Check gradient flow:
    • Print gradient norms per layer. If they decrease exponentially → vanishing gradient problem
    • If they increase exponentially → exploding gradient problem → add gradient clipping
  3. Check output layer activation:
    • Binary classification → sigmoid output + BCE loss
    • Multi-class → softmax output + CE loss (or no activation + CrossEntropyLoss in PyTorch)
    • Regression → no activation (linear output)
🇮🇳 India Interview Focus

TCS/Infosys: "Define sigmoid, derivative, range" (textbook recall)

Flipkart/Swiggy: "Compare ReLU variants. When would you choose ELU?" (analysis)

GATE: Numerical computation, e.g. "compute σ(2)" or "softmax of [1,2,3]"

🇺🇸 USA Interview Focus

Google/Meta: "Explain GELU intuitively. Why in Transformers?" (deep understanding)

Apple: "Which activation is cheapest for on-device inference?" (ReLU: no exponentials)

OpenAI: "Design an experiment to find the best activation for your task" (research mindset)

Roles that need deep activation function knowledge:

  • ML Engineer (India/US): Choosing activations for production models, debugging dead neurons
  • Research Scientist: Designing new activations (like Google Brain's Swish search)
  • MLOps Engineer: Understanding why certain activations are faster (ReLU vs GELU on specific hardware)
  • NLP Engineer: Understanding Transformer internals, where GELU is everywhere
Section 23

Hands-On Lab / Mini-Project

🔬 Project: "The Great Activation Function Bake-Off"

Objective: Train the same neural network architecture with 7 different activation functions on the same dataset and compare: convergence speed, final accuracy, gradient health, and dead neuron count.

Setup

Python: Project Template
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Dataset: Fashion-MNIST (10 classes, 28×28 grayscale)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_data = datasets.FashionMNIST('./data', train=True,
                                     download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

# Architecture: 5-layer MLP (same for all activations)
class ActivationTestNet(nn.Module):
    def __init__(self, act_fn):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),  act_fn(),
            nn.Linear(512, 256),  act_fn(),
            nn.Linear(256, 128),  act_fn(),
            nn.Linear(128, 64),   act_fn(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.layers(x.view(-1, 784))

# Activations to test
activations = {
    'Sigmoid':    nn.Sigmoid,
    'Tanh':       nn.Tanh,
    'ReLU':       nn.ReLU,
    'LeakyReLU': lambda: nn.LeakyReLU(0.01),
    'ELU':        nn.ELU,
    'GELU':       nn.GELU,
    'SiLU':       nn.SiLU,  # Swish
}

# Training loop (per activation)
def train_and_evaluate(act_name, act_fn, epochs=20):
    model = ActivationTestNet(act_fn)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    history = {'loss': [], 'acc': [], 'grad_norms': []}

    for epoch in range(epochs):
        total_loss, correct, total, grad_norm_sum = 0, 0, 0, 0.0
        for X, y in train_loader:
            out = model(X)
            loss = criterion(out, y)
            optimizer.zero_grad()
            loss.backward()

            # Record gradient norms (sum of per-parameter L2 norms for this batch)
            grad_norm_sum += sum(p.grad.norm().item()
                                 for p in model.parameters()
                                 if p.grad is not None)
            optimizer.step()
            total_loss += loss.item()
            correct += (out.argmax(1) == y).sum().item()
            total += y.size(0)

        history['loss'].append(total_loss / len(train_loader))
        history['acc'].append(correct / total)
        # Average the gradient norm over all batches in the epoch
        history['grad_norms'].append(grad_norm_sum / len(train_loader))
        print(f"{act_name} Epoch {epoch+1}: Loss={history['loss'][-1]:.4f}, Acc={history['acc'][-1]:.4f}")
    return history

# Run all experiments
all_results = {}
for name, fn in activations.items():
    print(f"\n{'='*50}\nTraining with {name}\n{'='*50}")
    all_results[name] = train_and_evaluate(name, fn)

Rubric (100 points)

Criterion | Points | What to Demonstrate
Correct Implementation | 20 | All 7 activations train without errors, same architecture
Convergence Comparison Plot | 20 | Loss vs epoch for all activations on one plot, clearly labeled
Gradient Flow Analysis | 20 | Gradient norm vs layer depth for sigmoid vs ReLU vs GELU
Dead Neuron Analysis | 15 | Count and visualize dead ReLU neurons across training
Written Analysis | 15 | 1-page writeup explaining which activation won and why
Bonus: Custom Activation | 10 | Implement and test your own custom activation (e.g., Mish)
Section 24

Exercises (22 Questions)

Section A: Conceptual (5 Questions)

A1: Remember

State the output range and maximum derivative value for each: sigmoid, tanh, ReLU.

Sigmoid: range (0,1), max derivative 0.25. Tanh: range (−1,1), max derivative 1.0. ReLU: range [0,∞), max derivative 1.0 (in active region).
A2: Understand

Explain in your own words why a 50-layer network with linear activations is no more powerful than a 1-layer network. Use a matrix multiplication argument.

Each linear layer computes a_l = W_l a_{l−1} + b_l. Without non-linearity, this chain collapses: a_50 = W_50 W_49 ⋯ W_1 x + b* = W* x + b*, where W* is the product of all 50 weight matrices and b* collects the bias terms. The product of 50 matrices is just one matrix, so 50 linear layers represent exactly the same functions as 1 linear layer. The extra parameters add no expressive power.
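The collapse argument can also be checked numerically. The sketch below is a dependency-free illustration, not the chapter's own code: the helper names `matvec` and `matmul`, the 3-layer depth, and the width of 2 are assumptions chosen to keep it short. It runs an input through a stack of linear layers with no activation, folds the same stack into a single pair (W*, b*), and compares the two outputs.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

random.seed(0)
n = 2  # width of every layer (illustrative)
layers = [
    ([[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)],  # W_l
     [random.uniform(-1, 1) for _ in range(n)])                      # b_l
    for _ in range(3)
]
x = [0.5, -1.5]

# Forward pass through the stack with no activation function.
a = x
for W, b in layers:
    a = [z + bi for z, bi in zip(matvec(W, a), b)]

# Fold the whole stack into one weight matrix and one bias vector.
W_star = [[1.0, 0.0], [0.0, 1.0]]
b_star = [0.0, 0.0]
for W, b in layers:
    W_star = matmul(W, W_star)
    b_star = [z + bi for z, bi in zip(matvec(W, b_star), b)]

collapsed = [z + bi for z, bi in zip(matvec(W_star, x), b_star)]
max_diff = max(abs(p - q) for p, q in zip(a, collapsed))
print(max_diff)  # differs from zero only by floating-point rounding
```

Inserting any non-linear function between the layers breaks the folding step, which is the point of the exercise.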
A3: Understand

Why is tanh preferred over sigmoid for hidden layers? Give two specific reasons.

1. Zero-centered: tanh outputs range from −1 to 1, so mean activation is near zero. This allows gradient updates of mixed signs, leading to more direct optimization paths. Sigmoid outputs are always positive (0 to 1), which forces all weight gradients of a neuron to share one sign and causes zig-zag updates. 2. Stronger gradients: tanh's max derivative is 1.0 vs sigmoid's 0.25, so the per-layer gradient shrinkage is up to 4× milder.
A4: Understand

Describe the "dying ReLU" problem. Why can't a dead neuron recover during training?

A dead ReLU neuron has Wx+b < 0 for all training inputs, so it always outputs 0. The gradient of ReLU for negative inputs is also 0. With zero gradient, the weight update is ΔW = −η · ∂L/∂W = 0. Since its weights never change, the neuron cannot leave the dead region through its own updates: it stays stuck.
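A minimal sketch of that zero-update argument for a single neuron follows. The inputs, the weight and bias values, and the helper `weight_gradient` are hypothetical, and the upstream gradient is fixed at 1 purely for illustration. It contrasts ReLU with Leaky ReLU (slope 0.01), one of the fixes mentioned in this chapter.

```python
def relu_grad(z):
    """Derivative of ReLU (using the convention ReLU'(0) = 0)."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, slope=0.01):
    """Derivative of Leaky ReLU with a small negative-side slope."""
    return 1.0 if z > 0 else slope

inputs = [0.2, 1.0, 3.5]  # illustrative training inputs
w, b = -2.0, -0.5         # puts every pre-activation w*x + b below zero
upstream = 1.0            # assume dL/da = 1 for every example

def weight_gradient(act_grad):
    # dL/dw = sum over examples of dL/da * f'(z) * x
    return sum(upstream * act_grad(w * x + b) * x for x in inputs)

dead_grad = weight_gradient(relu_grad)         # no signal reaches w
leaky_grad = weight_gradient(leaky_relu_grad)  # small but nonzero signal
print(dead_grad, leaky_grad)
```

With ReLU the weight gradient is exactly zero, so gradient descent leaves `w` untouched; the leaky variant keeps a small gradient that can eventually move the neuron back into the active region.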
A5: Understand

Explain the intuition behind GELU as a "stochastic gate" and why it pairs well with Transformers.

GELU(z) = z · Φ(z) scales each input by the probability that a standard normal random variable is less than z. Large positive inputs pass through (Φ≈1), large negative inputs are zeroed (Φ≈0), and inputs near zero are partially passed. This soft gating resembles the soft attention mechanism in Transformers, which also uses smooth weighting (softmax) rather than hard selection. The smoothness is often credited with helping to avoid loss spikes at billion-parameter scale.

Section B: Mathematical (8 Questions)

B1: Apply

Derive the derivative of sigmoid from first principles: show that σ'(z) = σ(z)(1−σ(z)).

σ(z) = (1+e^{−z})^{−1}. Using the chain rule: σ'(z) = −(1+e^{−z})^{−2} · (−e^{−z}) = e^{−z}/(1+e^{−z})². Factor: = [1/(1+e^{−z})] · [e^{−z}/(1+e^{−z})]. The first factor is σ(z). For the second: e^{−z}/(1+e^{−z}) = (1+e^{−z}−1)/(1+e^{−z}) = 1 − σ(z). So σ'(z) = σ(z)(1−σ(z)). ∎
B2: Apply

Compute σ(0), σ(3), σ(−3), and their derivatives. Show all work.

σ(0) = 1/(1+1) = 0.5. σ'(0) = 0.5×0.5 = 0.25. σ(3) = 1/(1+e^{−3}) = 1/(1+0.0498) ≈ 0.9526. σ'(3) = 0.9526×0.0474 ≈ 0.0452. σ(−3) = 1/(1+e³) = 1/(1+20.086) ≈ 0.0474. σ'(−3) = 0.0474×0.9526 ≈ 0.0452.
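The hand computations can be cross-checked with a few lines of standard-library Python. This is a sketch rather than the chapter's reference implementation: `central_difference` is a helper introduced here to confirm the closed-form derivative numerically, and the plain `math.exp(-z)` form is adequate only for moderate |z| such as these test points.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-6):
    # Numerical derivative used only as an independent check.
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (0.0, 3.0, -3.0):
    print(z, round(sigmoid(z), 4), round(sigmoid_prime(z), 4),
          round(central_difference(sigmoid, z), 4))
```

The last two columns agree, and the derivative is symmetric in z, matching σ'(3) = σ'(−3) in the worked answer.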
B3: Apply

Prove that tanh(z) = 2σ(2z) − 1.

RHS: 2σ(2z) − 1 = 2/(1+e^{−2z}) − 1 = [2 − (1+e^{−2z})]/(1+e^{−2z}) = (1−e^{−2z})/(1+e^{−2z}). Multiply numerator and denominator by e^z: = (e^z−e^{−z})/(e^z+e^{−z}) = tanh(z) = LHS. ∎
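A quick numerical spot-check of the identity, using only the standard library, is sketched below. The sample points are arbitrary choices, and a check at a handful of points complements the proof rather than replacing it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

points = [-4.0, -1.0, 0.0, 0.5, 2.0]  # arbitrary sample points
max_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in points)
print(max_gap)  # on the order of floating-point rounding error
```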
B4: Apply

Compute softmax([1.0, 2.0, 3.0]). Verify the outputs sum to 1.

e¹ = 2.718, e² = 7.389, e³ = 20.086. Sum = 30.193. Softmax = [2.718/30.193, 7.389/30.193, 20.086/30.193] = [0.0900, 0.2447, 0.6652]. Sum = 0.0900 + 0.2447 + 0.6652 = 0.9999 ≈ 1.0 (the 0.0001 gap is rounding) ✅
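The same computation in code, as a minimal standard-library sketch: subtracting the maximum logit before exponentiating is a common stability trick added here, and it leaves the result unchanged because softmax is invariant to shifting all logits by a constant.

```python
import math

def softmax(logits):
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs], sum(probs))
```

Without intermediate rounding the probabilities sum to 1 up to floating-point error, which confirms that the 0.9999 in the hand calculation comes from rounding each term to four decimals.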
B5: Analyze

In a 10-layer network with sigmoid activations, the gradient at the output is 1.0. What is the maximum possible gradient at layer 1? What is a typical (not best-case) gradient?

Maximum gradient: Each sigmoid derivative is at most 0.25. Through 10 layers (counting only the activation derivatives and ignoring the weight matrices): 0.25^10 ≈ 9.54 × 10^{−7}. In practice, most neurons are not at z=0 (where the derivative peaks), so typical gradients are even smaller, often around 0.1^10 = 10^{−10} or less. This is the vanishing gradient problem.
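The two estimates can be reproduced with a short sketch. It models only the product of activation derivatives and ignores the weight matrices, and it assumes, purely for illustration, that every layer sits at the same pre-activation: z = 0 for the best case and z = 2 (mild saturation, where σ' ≈ 0.105) for a "typical" case. The function name `activation_gradient_factor` is introduced here.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def activation_gradient_factor(z, depth):
    """Product of sigmoid derivatives over `depth` layers, all at pre-activation z."""
    return sigmoid_prime(z) ** depth

best_case = activation_gradient_factor(0.0, 10)  # every neuron exactly at the peak
typical = activation_gradient_factor(2.0, 10)    # every neuron mildly saturated
print(f"{best_case:.3e}  {typical:.3e}")
```

Even the best case leaves less than one millionth of the output gradient at layer 1, and a modest amount of saturation costs roughly four more orders of magnitude.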
B6: Analyze

Derive the Jacobian matrix entry ∂S_i/∂z_j for softmax, for both cases i=j and i≠j.

Write S_i = e^{z_i}/Σ with Σ = Σ_k e^{z_k}. Case i=j: ∂S_i/∂z_i = ∂(e^{z_i}/Σ)/∂z_i = [e^{z_i}·Σ − e^{z_i}·e^{z_i}]/Σ² = S_i − S_i² = S_i(1−S_i). Case i≠j: ∂S_i/∂z_j = ∂(e^{z_i}/Σ)/∂z_j = [0·Σ − e^{z_i}·e^{z_j}]/Σ² = −S_i S_j. Combined: ∂S_i/∂z_j = S_i(δ_ij − S_j).
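One way to gain confidence in the combined formula is to compare it against finite differences. The sketch below uses plain Python lists; `numerical_jacobian` and the step size `h` are choices made here for illustration, not part of the chapter's code.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = S_i * (delta_ij - S_j)."""
    s = softmax(z)
    k = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(k)]
            for i in range(k)]

def numerical_jacobian(z, h=1e-6):
    """Central-difference estimate of dS_i/dz_j."""
    k = len(z)
    J = [[0.0] * k for _ in range(k)]
    for j in range(k):
        up, dn = list(z), list(z)
        up[j] += h
        dn[j] -= h
        s_up, s_dn = softmax(up), softmax(dn)
        for i in range(k):
            J[i][j] = (s_up[i] - s_dn[i]) / (2 * h)
    return J

z = [1.0, 2.0, 3.0]
analytic = softmax_jacobian(z)
numeric = numerical_jacobian(z)
max_err = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(3) for j in range(3))
print(max_err)
```

Each row of the Jacobian sums to zero, which reflects the constraint that the softmax outputs always sum to 1.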
B7: Apply

Compute the derivative of Swish at z = 1.0. Show all intermediate steps.

Swish'(z) = σ(z) + z·σ(z)(1−σ(z)). At z=1: σ(1) = 1/(1+e^{−1}) = 1/1.3679 ≈ 0.7311. Swish'(1) = 0.7311 + 1×0.7311×(1−0.7311) = 0.7311 + 0.7311×0.2689 = 0.7311 + 0.1966 ≈ 0.9277.
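The result can be confirmed in a few lines. This sketch implements the product-rule formula above and compares it with a central-difference estimate; the step size is an arbitrary small value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def swish_prime(z):
    # Product rule: d/dz [z * sigma(z)] = sigma(z) + z * sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

h = 1e-6
numeric = (swish(1.0 + h) - swish(1.0 - h)) / (2 * h)
print(round(swish_prime(1.0), 4), round(numeric, 4))
```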
B8: Analyze

Show that the derivative of ELU is continuous at z = 0 (when α = 1).

For z > 0: ELU'(z) = 1. For z < 0: ELU'(z) = α·e^z. At z = 0 from the left: lim_{z→0⁻} α·e^z = α·e⁰ = α = 1. At z = 0 from the right: lim_{z→0⁺} 1 = 1. Since left limit = right limit = 1, ELU' is continuous at z = 0. ∎ (Compare with ReLU where left limit = 0 ≠ 1 = right limit.)

Section C: Coding (4 Questions)

C1: Apply

Implement the GELU activation function from scratch in NumPy using the tanh approximation. Verify your implementation matches PyTorch's F.gelu() for z = [−3, −1, 0, 1, 3].

See Section 16's gelu() implementation. Verification: compute gelu(np.array([-3,-1,0,1,3])) and compare with torch.nn.functional.gelu(torch.tensor([-3.,-1.,0.,1.,3.]), approximate='tanh'). Values should match to 4+ decimal places. Against the default (exact, erf-based) F.gelu, expect agreement only to about 3 decimal places, because the tanh form is itself an approximation.
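To see how close the tanh form is to the exact definition z·Φ(z) without installing anything, the sketch below uses only the standard library. It is not Section 16's implementation: `gelu_exact` writes Φ through `math.erf`, and `gelu_tanh` uses the usual constants √(2/π) and 0.044715 from the GELU paper.

```python
import math

def gelu_exact(z):
    # z * Phi(z), with the standard normal CDF expressed through erf
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # tanh approximation of GELU
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * z * (1.0 + math.tanh(c * (z + 0.044715 * z ** 3)))

points = [-3.0, -1.0, 0.0, 1.0, 3.0]
gaps = [abs(gelu_exact(z) - gelu_tanh(z)) for z in points]
for z, g in zip(points, gaps):
    print(z, round(gelu_exact(z), 5), round(gelu_tanh(z), 5), f"{g:.1e}")
```

At these points the two forms differ by a few times 10⁻⁴ at most, so a comparison is only meaningful to many decimals when both sides use the same variant.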
C2: Apply

Write a function count_dead_neurons(model, dataloader) that runs one epoch of data through a model and returns the percentage of dead ReLU neurons in each layer.

Track per-neuron activation counts across all batches. A neuron is "dead" if it outputs 0 for every single example in the dataset. Return the fraction of dead neurons per layer.
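One possible shape for such a function is sketched below, assuming the model contains `nn.ReLU` modules (as `ActivationTestNet` does when built with `nn.ReLU`) and that the dataloader yields `(inputs, labels)` pairs. It relies on PyTorch forward hooks. The tiny demo model and hand-set biases at the end are hypothetical and exist only to produce a known answer.

```python
import torch
import torch.nn as nn

def count_dead_neurons(model, dataloader):
    """Return {relu_layer_name: percent of units that output 0 for every example}."""
    ever_active = {}
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # A unit is active in this batch if it is > 0 for at least one example.
            active = (output > 0).reshape(output.shape[0], -1).any(dim=0)
            if name in ever_active:
                ever_active[name] |= active
            else:
                ever_active[name] = active
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()  # note: leaves the model in eval mode
    with torch.no_grad():
        for X, _ in dataloader:
            model(X)

    for h in hooks:
        h.remove()
    return {name: 100.0 * (~active).float().mean().item()
            for name, active in ever_active.items()}

# Demo with a known answer: unit 0 is forced dead, units 1 and 2 are always on.
demo = nn.Sequential(nn.Linear(4, 3), nn.ReLU())
with torch.no_grad():
    demo[0].weight.zero_()
    demo[0].bias.copy_(torch.tensor([-1.0, 1.0, 1.0]))
batches = [(torch.randn(8, 4), torch.zeros(8)) for _ in range(3)]
dead_pct = count_dead_neurons(demo, batches)
print(dead_pct)
```

The keys are the module names reported by `named_modules()`, which for an `nn.Sequential` are the positional indices as strings. Calling the function once per epoch gives the "dead neurons across training" curve the rubric asks for.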
C3: Create

Create a visualization that shows all 7 activation functions and their derivatives on two subplots (side by side), for z ∈ [−6, 6]. Use different colors for each activation and include a legend.

See Section 16's visualization code. Key: use plt.subplots(1,2), plot activations on the left and derivatives on the right. Include axhline(y=0) and axvline(x=0) for reference.
C4: Create

Implement temperature-scaled softmax: softmax(z/T). Plot the output distribution for z = [2.0, 1.0, 0.5] with T = 0.1, 0.5, 1.0, 2.0, 10.0. Explain what happens as T → 0 and T → ∞.

As T → 0: softmax approaches a one-hot vector (argmax). The highest logit gets probability ≈1, all others ≈0. As T → ∞: softmax approaches a uniform distribution (1/K for all classes). Temperature controls the "confidence" of the distribution.
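The limiting behaviour is easy to see in numbers before plotting. The following is a minimal standard-library sketch that covers the computation only, not the plot; `softmax_with_temperature` is a name chosen here for illustration.

```python
import math

def softmax_with_temperature(logits, T):
    scaled = [v / T for v in logits]           # divide logits by the temperature
    m = max(scaled)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.5]
table = {T: softmax_with_temperature(z, T) for T in (0.1, 0.5, 1.0, 2.0, 10.0)}
for T, probs in table.items():
    print(T, [round(p, 4) for p in probs])
```

At T = 0.1 nearly all of the mass sits on the largest logit, while at T = 10 the three probabilities are within a few percentage points of 1/3.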

Section D: Critical Thinking (3 Questions)

D1: Evaluate

A colleague claims: "Since GELU is better than ReLU in Transformers, we should switch all our CNN models to GELU too." Evaluate this claim. Under what conditions might it be true or false?

False in general. ReLU is often preferred for CNNs due to: (1) computational efficiency: ReLU is 2-3× faster than GELU per element, (2) sparsity: ReLU's hard zeros provide stronger regularization which benefits vision models, (3) established best practices: ResNet, VGG, etc. were designed with ReLU. GELU may help in some cases (e.g., ViT which is a vision Transformer), but the compute cost increase often doesn't justify the marginal accuracy gain in standard CNNs. Always benchmark on your specific task.
D2: Evaluate

ReLU is not differentiable at z = 0. Why doesn't this break gradient descent? Would it be better to use a smooth approximation like Softplus: log(1 + e^z)?

It doesn't break gradient descent because: (1) for continuous inputs, the set where z is exactly 0 has measure zero, (2) frameworks use a subgradient convention (ReLU'(0) = 0), (3) in floating-point practice, z lands on exactly 0 only rarely, and the convention in (2) covers those cases. Softplus is smooth and approximates ReLU, but it's slower to compute (requires exp and log) and doesn't provide the exact-zero sparsity that makes ReLU effective. In practice, ReLU's "deficiency" is actually a feature.
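The trade-off can be made concrete with a small comparison. In this sketch `softplus` is written in an overflow-safe form, max(z, 0) + log(1 + e^{−|z|}), which is algebraically equal to log(1 + e^z); that rewrite is a choice made here, not something taken from the chapter.

```python
import math

def relu(z):
    return max(0.0, z)

def softplus(z):
    # log(1 + e^z), rearranged so exp never overflows
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softplus_prime(z):
    # The derivative of softplus is the sigmoid function.
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, relu(z), round(softplus(z), 4), round(softplus_prime(z), 4))
```

Softplus has a well-defined slope of 0.5 at z = 0 and converges to ReLU for large |z|, but it never outputs an exact zero, so the sparsity argument above does not carry over.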
D3: Analyze

Design an activation function that is: (1) zero-centered, (2) has gradient = 1 for positive inputs, (3) doesn't die for negative inputs, and (4) is smooth everywhere. Does such a function already exist? Compare your design to existing activations.

This describes something close to ELU or GELU. ELU satisfies (1) approximately, (2) exactly, (3) yes (exponential for negatives), and (4) in the sense that its first derivative is continuous at z=0 when α=1. GELU is smooth everywhere and keeps a nonzero gradient for moderately negative inputs, but it meets (2) only approximately (its slope approaches 1 as z grows) and is non-monotonic. Students might invent something like f(z) = z·tanh(softplus(z)), which is the Mish activation (Misra, 2019). The key insight: there's a trade-off between smoothness and computational cost.

★ Starred Research Questions (2 Questions)

★ R1: Create

Research Project: Read the paper "Searching for Activation Functions" (Ramachandran et al., 2017). Implement a simplified version of their search: parametrize activations as f(z) = z · g(z) where g is one of {σ, tanh, softplus, identity}. Test all 4 on Fashion-MNIST and compare. Can you find a combination that beats Swish?

★ R2: Create

Research Question: GELU uses the Gaussian CDF, Swish uses sigmoid. What if you used other CDFs? Implement "Laplace-ELU": z · CDF_Laplace(z) and "Cauchy-ELU": z · CDF_Cauchy(z). Compare their gradient flow properties with GELU in a 20-layer network. Does the choice of CDF matter?

Section 25

Connections

How This Chapter Connects

← Builds On
  • Chapter 5 (Logistic Regression): Where we first met sigmoid as the output activation for binary classification
  • Chapter 6 (Shallow Neural Networks): Where we used tanh/ReLU in hidden layers without deeply understanding why
  • Chapter 7 (Deep Neural Networks): Where the vanishing gradient problem first became apparent; this chapter explains the root cause
→ Enables
  • Chapter 9 (Regularization): Dropout interacts with activation functions, and understanding sparse activations (ReLU) helps understand implicit regularization
  • Chapter 10 (Batch Normalization): BN is placed before or after activation, and understanding activations helps you choose
  • Chapter 14 (LSTM/GRU): LSTM gates use sigmoid (why sigmoid and not ReLU?), and now you can answer this
  • Chapter 15 (Transformers): The FFN layer uses GELU, and now you understand why
🔬 Research Frontier
  • Learnable Activations: Instead of fixing the activation, learn it as a B-spline or polynomial (KAN: Kolmogorov-Arnold Networks, 2024)
  • Activation-aware Quantization: How to compress models with different activations for edge deployment
  • Mish Activation: z · tanh(softplus(z)), a self-regularizing activation that won several Kaggle competitions
🏭 Industry Implementation
  • Hardware: ReLU is a single comparison per element and fuses easily into GPU kernels. GELU needs an erf or tanh evaluation, which can make it ~2× slower on older hardware
  • Compilers: TensorRT, ONNX Runtime, and XLA optimize common activations but may not support custom ones efficiently
  • Mobile: ReLU is preferred for on-device inference (TFLite, CoreML) due to its computational simplicity
Section 26

Chapter Summary

7 Key Takeaways

  1. Non-linearity is non-negotiable: Without it, any depth of linear layers collapses to a single linear transformation. Activation functions are what make deep learning "deep."
  2. Sigmoid and tanh suffer from vanishing gradients: Sigmoid's max gradient is only 0.25, meaning the activation derivative alone shrinks gradients by at least 75% at each layer. After 10 layers, that factor is ~10^{−6}. This is why deep networks with sigmoid were so hard to train.
  3. ReLU largely solved the vanishing gradient problem: With a constant gradient of 1.0 in the active region, ReLU enables training of much deeper networks. Its simplicity (max(0,z)) also makes it 6× faster than sigmoid.
  4. Dying ReLU is real but fixable: Neurons can die permanently if all their inputs become negative. Fix with: lower learning rate, He initialization, Batch Normalization, or Leaky ReLU/PReLU.
  5. GELU is the Transformer standard: Its smooth, probabilistic gating (z·Φ(z)) pairs naturally with soft attention and is credited with reducing training instabilities at billion-parameter scale. Used in BERT, GPT, and ViT.
  6. Swish was found by automated search: Google Brain searched over candidate activation functions automatically and arrived at z·σ(z), which outperforms ReLU in many vision tasks. It's the default in EfficientNet.
  7. Softmax converts logits to probabilities: Among the activations covered here, it's the only one that operates on entire vectors (not element-wise), and it reduces to sigmoid when K=2.
THE KEY EQUATION:

σ'(z) = σ(z)(1 − σ(z)) → max = 0.25 → vanishes in deep nets

ReLU'(z) = { 1 if z>0, 0 if z<0 } → preserves gradients

THE KEY INTUITION: Activation functions are the "decision-makers" of a neural network. Linear layers propose (compute weighted sums), activation functions decide (what to keep, what to suppress, and by how much). The evolution from sigmoid → ReLU → GELU is the story of making better decisions, from hard binary choices to soft probabilistic ones.

Section 27

Further Reading

🇮🇳 Indian Resources

  • NPTEL: Deep Learning (IIT Madras, Prof. Mitesh Khapra). Weeks 3-4 cover activation functions with excellent Hindi/English explanations. Free on Swayam.
  • GATE CS/DA Previous Year Papers: 2022-2025 papers include activation function questions. Download from gate.iitk.ac.in.
  • "Neural Networks and Learning Machines" by S. Haykin (Pearson India): Chapter 4 covers activation functions with mathematical rigor suitable for GATE preparation.
  • NPTEL: Machine Learning (IIT Kharagpur, Prof. Sudeshna Sarkar). Lecture on neural networks covers sigmoid and tanh in detail.

Global Resources

  • Paper: "Rectified Linear Units Improve Restricted Boltzmann Machines", Nair & Hinton, ICML 2010. The ReLU origin story.
  • Paper: "Gaussian Error Linear Units (GELUs)", Hendrycks & Gimpel, 2016. The activation behind Transformers.
  • Paper: "Searching for Activation Functions", Ramachandran, Zoph, Le (Google Brain, 2017). How Swish was discovered.
  • Paper: "Delving Deep into Rectifiers" (PReLU), He et al., ICCV 2015. He initialization + PReLU.
  • 3Blue1Brown, "But what is a neural network?" (YouTube): Beautiful visualization of how activation functions transform linear spaces into non-linear ones.
  • Distill.pub: Multiple articles on feature visualization that show what different activations learn.
  • CS231n (Stanford) Notes: "Neural Networks Part 1" has an excellent section on activation functions with pros/cons.

Interactive Tools

  • TensorFlow Playground: playground.tensorflow.org. Toggle between activations and watch how decision boundaries change
  • Desmos Graphing Calculator: desmos.com. Plot any activation function interactively to build intuition
  • PyTorch Documentation: pytorch.org/docs. Complete list of all supported activations with mathematical definitions