Sigmoid vs Tanh Activation Functions (Complete Guide)
Table of Contents
- Introduction
- Sigmoid Function
- Tanh Function
- Mathematical Deep Dive
- Comparison
- When to Use Each
- Modern Perspective (ReLU)
- Key Takeaways
- Conclusion
Introduction
Activation functions are the backbone of neural networks. Without them, a neural network would behave like a simple linear model, no matter how many layers it has.
Sigmoid Function
The Sigmoid (logistic) function maps any real-valued input to a value between 0 and 1, which is why its output is commonly read as a probability.
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

Interpretation
- If \( x \to +\infty \), then output → 1
- If \( x \to -\infty \), then output → 0
- Output range: (0,1)
- Used in binary classification
- Suffers from vanishing gradient
Code Example
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
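To make the interpretation above concrete, here is a small self-contained check (redefining sigmoid so the snippet runs on its own) showing that outputs saturate toward 0 and 1 at the extremes and equal 0.5 at zero:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Outputs saturate toward 0 and 1 at the extremes; sigmoid(0) = 0.5
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.5f}")
```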
Tanh Function
The Tanh (hyperbolic tangent) function widens the output range to (-1, 1), so, unlike Sigmoid, it can produce negative values.
$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$

Interpretation
- If \( x \to +\infty \), output → 1
- If \( x \to -\infty \), output → -1
- Output range: (-1,1)
- Zero-centered
- Better gradient flow than Sigmoid
Code Example
```python
import numpy as np

def tanh(x):
    return np.tanh(x)
```
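A similar check for the zero-centered behavior: tanh is an odd function, so its outputs are symmetric around zero and saturate at -1 and +1.

```python
import numpy as np

# tanh is odd: tanh(-x) = -tanh(x), and tanh(0) = 0
for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"tanh({x:+.1f}) = {np.tanh(x):+.5f}")
```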
Mathematical Deep Dive
Derivative of Sigmoid
$$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$

This derivative becomes very small when \( \sigma(x) \) is near 0 or 1, which causes vanishing gradients.
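A small numerical sketch of this saturation: the derivative peaks at 0.25 at \( x = 0 \) and collapses toward zero as \( |x| \) grows.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

# Peaks at 0.25 for x = 0 and vanishes as |x| grows
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"sigma'({x:.1f}) = {sigmoid_grad(x):.6f}")
```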
Derivative of Tanh
$$ \tanh'(x) = 1 - \tanh^2(x) $$

This maintains stronger gradients near zero compared to Sigmoid.
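Comparing the two derivatives at zero makes the difference concrete: \( \tanh'(0) = 1 \) while \( \sigma'(0) = 0.25 \), so Tanh passes gradients through roughly four times as strongly near the origin. A minimal check:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Peak gradients at x = 0
tanh_grad_at_0 = 1 - np.tanh(0.0) ** 2                 # 1.0
sigmoid_grad_at_0 = sigmoid(0.0) * (1 - sigmoid(0.0))  # 0.25
print(tanh_grad_at_0, sigmoid_grad_at_0)
```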
Vanishing Gradient Concept
Gradient-based learning depends on:
$$ \frac{\partial L}{\partial w} $$

Backpropagation multiplies one local derivative per layer, so if each factor is below 1, the product shrinks exponentially with depth and learning slows dramatically.
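A rough sketch of this compounding effect, using a hypothetical 20-layer chain and assuming each layer contributes Sigmoid's maximum derivative of 0.25: the gradient reaching the earliest layers is vanishingly small.

```python
# Hypothetical illustration: backprop multiplies one local derivative per layer.
# With Sigmoid, each factor is at most 0.25, so the product decays very fast.
num_layers = 20
max_sigmoid_grad = 0.25

gradient = 1.0
for _ in range(num_layers):
    gradient *= max_sigmoid_grad

print(gradient)  # 0.25 ** 20 ≈ 9.1e-13 -- effectively zero for the early layers
```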
Comparison
| Feature | Sigmoid | Tanh |
|---|---|---|
| Range | (0,1) | (-1,1) |
| Zero-centered | No | Yes |
| Peak gradient | 0.25 (weaker) | 1.0 (stronger) |
| Usage | Output layer | Hidden layers |
When to Use Each
- Sigmoid: Output layers of binary classifiers, where the result is read as a probability
- Tanh: Hidden layers, where zero-centered outputs often speed up convergence (see the sketch below)
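To show how the two roles fit together, here is a minimal sketch of a forward pass (hypothetical layer sizes and random weights, NumPy only) that uses Tanh in the hidden layer and Sigmoid at the output:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical sizes: 4 input features, 8 hidden units, 1 output probability
W1 = rng.normal(scale=0.5, size=(4, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

def forward(x):
    hidden = np.tanh(x @ W1 + b1)     # zero-centered hidden activations
    return sigmoid(hidden @ W2 + b2)  # outputs interpretable as probabilities

x = rng.normal(size=(3, 4))  # a batch of 3 examples
print(forward(x))            # three values in (0, 1)
```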
Modern Perspective (ReLU)
Today, ReLU is preferred:
$$ f(x) = \max(0, x) $$

It avoids vanishing gradients for positive inputs.
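A minimal sketch of ReLU and its derivative, showing that the gradient stays at 1 for every positive input instead of saturating:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 for negative inputs
    # (the value at exactly 0 is conventionally taken as 0 here)
    return 1.0 if x > 0 else 0.0

for x in [-5.0, -1.0, 0.5, 5.0, 50.0]:
    print(f"relu({x:+.1f}) = {relu(x):.1f}, gradient = {relu_grad(x):.1f}")
```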
Key Takeaways
- Sigmoid outputs probabilities
- Tanh is zero-centered
- Both suffer from vanishing gradients
- ReLU is the modern default
Conclusion
Sigmoid and Tanh are foundational activation functions that shaped modern deep learning. Understanding their mathematical behavior provides insight into how neural networks learn.