
Monday, October 7, 2024

Sigmoid vs Tanh: Understanding Key Activation Functions in Neural Networks


📌 Table of Contents

  • Introduction
  • Sigmoid Function
  • Tanh Function
  • Mathematical Deep Dive
  • Comparison
  • When to Use Each
  • Modern Perspective (ReLU)
  • Key Takeaways
  • Conclusion

Introduction

Activation functions are the backbone of neural networks. Without them, a neural network would behave like a simple linear model, no matter how many layers it has.

💡 Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.

Sigmoid Function

The Sigmoid (logistic) function maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability.

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

📊 Interpretation

  • If \( x \to +\infty \), then \( \sigma(x) \to 1 \)
  • If \( x \to -\infty \), then \( \sigma(x) \to 0 \)
  • Output range: (0, 1)
  • Commonly used in the output layer for binary classification
  • Suffers from the vanishing gradient problem

💻 Code Example

import numpy as np

def sigmoid(x):
    # Logistic sigmoid: maps any real input into (0, 1)
    return 1 / (1 + np.exp(-x))

Tanh Function

The Tanh (hyperbolic tangent) function has the same S-shape as Sigmoid but expands the output range to (-1, 1), so it also produces negative values.

$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$

📊 Interpretation

  • If \( x \to +\infty \), then \( \tanh(x) \to 1 \)
  • If \( x \to -\infty \), then \( \tanh(x) \to -1 \)
  • Output range: (-1, 1)
  • Zero-centered, which keeps activations balanced around zero
  • Stronger gradient flow near zero than Sigmoid

💻 Code Example

import numpy as np

def tanh(x):
    # Hyperbolic tangent: maps any real input into (-1, 1)
    return np.tanh(x)

📊 Mathematical Deep Dive

Derivative of Sigmoid

$$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$

This derivative peaks at only 0.25 (at \( x = 0 \)) and becomes very small when \( \sigma(x) \) is near 0 or 1, causing vanishing gradients.
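
A quick numerical check (a minimal sketch; sigmoid_grad is an illustrative helper, not a library function) shows how fast the gradient collapses away from zero:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # sigma'(x) = sigma(x) * (1 - sigma(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:4.1f}  sigmoid'={sigmoid_grad(x):.6f}")
# x= 0.0  sigmoid'=0.250000   <- the maximum possible value
# x= 2.0  sigmoid'=0.104994
# x= 5.0  sigmoid'=0.006648
# x=10.0  sigmoid'=0.000045   <- effectively vanished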

Derivative of Tanh

$$ \tanh'(x) = 1 - \tanh^2(x) $$

This derivative equals 1 at \( x = 0 \), four times the Sigmoid's maximum, so Tanh maintains stronger gradients near zero.
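
The same check for Tanh (again a sketch; tanh_grad is an illustrative helper) confirms the stronger peak gradient, though it still saturates for large |x|:

import numpy as np

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2  # tanh'(x) = 1 - tanh^2(x)

for x in [0.0, 2.0, 5.0]:
    print(f"x={x:4.1f}  tanh'={tanh_grad(x):.6f}")
# x= 0.0  tanh'=1.000000   <- 4x the peak gradient of Sigmoid
# x= 2.0  tanh'=0.070651
# x= 5.0  tanh'=0.000182   <- saturation still occurs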

Vanishing Gradient Concept

Gradient-based learning updates each weight \( w \) in proportion to the gradient of the loss \( L \):

$$ \frac{\partial L}{\partial w} $$

By the chain rule, this gradient is a product of per-layer derivatives between \( w \) and the loss. If each factor is small, the product shrinks exponentially with depth and learning slows dramatically.
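
To make the compounding concrete: each Sigmoid layer contributes a derivative factor of at most 0.25, so the gradient reaching the first layer of an n-layer Sigmoid network is bounded by \( 0.25^n \). A sketch of this upper bound (ignoring weight magnitudes):

# Upper bound on the gradient factor after n sigmoid layers
for n in [1, 5, 10, 20]:
    print(f"{n:2d} layers -> gradient factor <= {0.25 ** n:.2e}")
#  1 layers -> gradient factor <= 2.50e-01
#  5 layers -> gradient factor <= 9.77e-04
# 10 layers -> gradient factor <= 9.54e-07
# 20 layers -> gradient factor <= 9.09e-13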


Comparison

Feature          Sigmoid         Tanh
Range            (0, 1)          (-1, 1)
Zero-centered    No              Yes
Gradient         Weak            Stronger
Typical usage    Output layer    Hidden layers

When to Use Each

  • Sigmoid: Output layer for binary classification, where a probability is needed
  • Tanh: Hidden layers, where zero-centered outputs often speed up convergence (see the sketch below)
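
A minimal sketch of how the two are typically combined, with Tanh in the hidden layer and Sigmoid at the output (the weights here are random placeholders, not trained values):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer: 4 units -> 1 value

def forward(x):
    h = np.tanh(W1 @ x + b1)                 # zero-centered hidden activations
    return 1 / (1 + np.exp(-(W2 @ h + b2)))  # sigmoid squashes to a probability

print(forward(np.array([0.5, -1.0, 2.0])))   # a single value in (0, 1)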

Modern Perspective (ReLU)

Today, ReLU (Rectified Linear Unit) is the default choice for hidden layers:

$$ f(x) = \max(0, x) $$

Its gradient is exactly 1 for all positive inputs, so it avoids vanishing gradients in that regime.
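
In code, ReLU is a one-liner (a minimal sketch):

import numpy as np

def relu(x):
    return np.maximum(0, x)  # identity for x > 0, zero otherwise

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]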

💡 Sigmoid & Tanh are still important for understanding neural networks.

🎯 Key Takeaways

  • Sigmoid maps inputs to (0, 1) and is read as a probability
  • Tanh is zero-centered with range (-1, 1)
  • Both saturate, so both suffer from vanishing gradients
  • ReLU is the modern default for hidden layers

Conclusion

Sigmoid and Tanh are foundational activation functions that shaped modern deep learning. Understanding their mathematical behavior provides insight into how neural networks learn.
