
Thursday, October 24, 2024

What Is UCB1 Algorithm? Reinforcement Learning Explained Simply


UCB1 Explained – Exploration vs Exploitation

UCB1 Algorithm

A practical and intuitive solution to the exploration vs. exploitation problem in reinforcement learning and multi-armed bandits.

🎰 The Exploration vs. Exploitation Problem

Imagine playing a slot machine with multiple levers. Each lever gives a different payout, but you don’t know which one is best.

Pulling a new lever helps you learn (exploration), but repeatedly pulling the best-known lever helps you earn (exploitation).

The core challenge: How do you explore enough to learn — without sacrificing too much reward?

📌 What Is UCB1?

UCB1 (Upper Confidence Bound 1) selects the arm with the highest optimistic estimate of its reward, balancing two forces:

  • Exploitation: Prefer arms with high average reward
  • Exploration: Prefer arms with high uncertainty

Arms that are under-explored receive a temporary boost, ensuring they aren’t ignored too early.

🧮 UCB1 Formula

arm_t = argmax (
  mean_reward
  + sqrt( (2 * log(total_pulls)) / pulls_for_this_arm )
)
      
  • mean_reward: Average reward from the arm
  • total_pulls: Total pulls across all arms
  • pulls_for_this_arm: Pull count for the arm
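The formula above can be sketched directly in Python. This is a minimal illustration of the selection rule, not code from any particular library; the function name `ucb1_select` is my own:

```python
import math

def ucb1_select(mean_rewards, pull_counts, total_pulls):
    """Return the index of the arm with the highest UCB1 score.

    mean_rewards[i] -- average observed reward of arm i
    pull_counts[i]  -- times arm i has been pulled (must be >= 1)
    total_pulls     -- total pulls across all arms
    """
    scores = [
        mean + math.sqrt(2 * math.log(total_pulls) / pulls)
        for mean, pulls in zip(mean_rewards, pull_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Note that every arm needs at least one pull before the formula applies, which is why UCB1 implementations typically pull each arm once at the start. With equal pull counts the bonus terms cancel and the highest mean wins; an under-pulled arm can win on its bonus alone.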

💻 CLI Simulation Example

$ python ucb1_simulation.py

Initializing arms...
Pulling each arm once...

Round 10:
Arm 1 | mean=0.50 | UCB=0.91
Arm 2 | mean=0.70 | UCB=0.88
Arm 3 | mean=0.30 | UCB=0.85

Selected Arm → 1

Round 100:
Arm 2 dominates with highest UCB
Exploration bonus shrinking...
    
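A script producing output like the above might look as follows. This is a sketch under assumed conditions: Bernoulli arms with made-up win probabilities, a fixed seed, and an illustrative function name `run_ucb1` (the original `ucb1_simulation.py` is not shown in the post):

```python
import math
import random

def run_ucb1(true_probs, rounds, seed=0):
    """Simulate UCB1 on Bernoulli arms; return per-arm means and pull counts."""
    random.seed(seed)
    n_arms = len(true_probs)
    pulls = [0] * n_arms
    sums = [0.0] * n_arms

    # Initialization: pull each arm once so every count is nonzero.
    for arm in range(n_arms):
        sums[arm] += float(random.random() < true_probs[arm])
        pulls[arm] += 1

    for _ in range(n_arms, rounds):
        total = sum(pulls)
        # Optimistic score: sample mean + exploration bonus.
        scores = [
            sums[a] / pulls[a] + math.sqrt(2 * math.log(total) / pulls[a])
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=scores.__getitem__)
        sums[arm] += float(random.random() < true_probs[arm])
        pulls[arm] += 1

    return [s / p for s, p in zip(sums, pulls)], pulls

means, pulls = run_ucb1([0.3, 0.7, 0.5], rounds=1000)
```

As the round count grows, the `log(total) / pulls[a]` bonus shrinks for frequently pulled arms, so the pull counts concentrate on the arm with the best true probability.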

🚀 Why UCB1 Is Effective

  • No hyperparameters to tune
  • Strong theoretical regret guarantees
  • Simple and computationally efficient

📊 Real-World Use Cases

  • Online advertising (CTR optimization)
  • Clinical trials
  • Game AI and strategy optimization

⚠️ Limitations

  • Assumes stationary reward distributions
  • Does not incorporate contextual information

For changing environments, consider Thompson Sampling or Contextual Bandits.

💡 Key Takeaways

UCB1 offers a clean, mathematically grounded solution to exploration vs. exploitation — ideal when rewards are stable and simplicity matters.
Built for learning • Interactive • No external dependencies
