UCB1 Algorithm
A practical and intuitive solution to the exploration vs. exploitation problem in reinforcement learning and multi-armed bandits.
🎰 The Exploration vs. Exploitation Problem
Imagine playing a slot machine with multiple levers (the "arms" of a multi-armed bandit). Each arm pays out at a different, unknown rate, so you don't know which one is best.
Pulling an unfamiliar arm helps you learn (exploration), while repeatedly pulling the best-known arm helps you earn (exploitation). Every pull spent learning is a pull not spent earning, and balancing the two is the core problem.
🔍 What Is UCB1?
UCB1 (Upper Confidence Bound 1) selects actions by computing an optimistic estimate of each arm's reward: the observed average plus a confidence bonus. It balances two pressures:
- Exploitation: Prefer arms with high average reward
- Exploration: Prefer arms with high uncertainty
Under-explored arms receive a temporary bonus that keeps them from being written off too early; the bonus shrinks as an arm accumulates pulls.
🧮 UCB1 Formula
arm_t = argmax over all arms (
    mean_reward
    + sqrt( (2 * log(total_pulls)) / pulls_for_this_arm )
)
- mean_reward: average reward observed from the arm so far
- total_pulls: total number of pulls across all arms
- pulls_for_this_arm: number of times this arm has been pulled
- log: the natural logarithm
Each arm is pulled once at the start, so pulls_for_this_arm is never zero and the formula is always defined.
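To make this concrete, here is a minimal sketch of the selection rule in Python. The function name ucb1_select and its argument names are illustrative (they simply mirror the formula above), not part of any particular library.

import math

def ucb1_select(mean_reward, pulls, total_pulls):
    """Return the index of the arm with the highest UCB1 score.

    mean_reward[i] -- average reward observed for arm i
    pulls[i]       -- times arm i has been pulled (must be >= 1)
    total_pulls    -- total pulls across all arms
    """
    best_arm, best_score = 0, float("-inf")
    for i, mean in enumerate(mean_reward):
        # Optimistic estimate: observed mean plus exploration bonus.
        score = mean + math.sqrt(2 * math.log(total_pulls) / pulls[i])
        if score > best_score:
            best_arm, best_score = i, score
    return best_arm

The bonus term is what drives exploration: after 100 total pulls, an arm tried only 4 times gets a bonus of sqrt(2 * ln(100) / 4) ≈ 1.52, while an arm tried 50 times gets only ≈ 0.43, so a rarely tried arm can win the argmax even with a lower mean.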
💻 CLI Simulation Example
$ python ucb1_simulation.py
Initializing arms...
Pulling each arm once...
Round 10:
Arm 1 | pulls=2 | mean=0.50 | UCB=2.02
Arm 2 | pulls=5 | mean=0.70 | UCB=1.66
Arm 3 | pulls=3 | mean=0.30 | UCB=1.54
Selected Arm → 1
Round 100:
Arm 2 dominates with highest UCB
Exploration bonus shrinking...
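Notice that at round 10 the algorithm selects Arm 1 even though Arm 2 has the higher mean: Arm 1 has been pulled less often, so its larger exploration bonus lifts its UCB to the top. The script itself is not shown here, but a self-contained simulation in the same spirit might look like the sketch below; the Bernoulli arms and their payout probabilities are illustrative assumptions, not the actual contents of ucb1_simulation.py.

import math
import random

def run_ucb1(probs, rounds, seed=0):
    """Simulate UCB1 on Bernoulli arms with success probabilities probs."""
    rng = random.Random(seed)
    n_arms = len(probs)
    pulls = [0] * n_arms    # pulls_for_this_arm
    sums = [0.0] * n_arms   # cumulative reward per arm
    total = 0               # total_pulls

    def pull(arm):
        nonlocal total
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        total += 1

    # Initialization: pull each arm once so no pull count is zero.
    for arm in range(n_arms):
        pull(arm)

    # Main loop: always pull the arm with the highest UCB score.
    for _ in range(rounds - n_arms):
        scores = [sums[a] / pulls[a]
                  + math.sqrt(2 * math.log(total) / pulls[a])
                  for a in range(n_arms)]
        pull(max(range(n_arms), key=lambda a: scores[a]))

    for a in range(n_arms):
        print(f"Arm {a + 1} | pulls={pulls[a]} | mean={sums[a] / pulls[a]:.2f}")

run_ucb1(probs=[0.5, 0.7, 0.3], rounds=100)

Over a long enough horizon, the arm with the truly best payout rate accumulates the most pulls, matching the behavior summarized at round 100 above.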
🚀 Why UCB1 Is Effective
- No hyperparameters to tune
- Strong theoretical guarantees: expected regret grows only logarithmically in the number of plays (Auer, Cesa-Bianchi & Fischer, 2002)
- Simple and computationally efficient
🌍 Real-World Use Cases
- Online advertising (CTR optimization)
- Clinical trials
- Game AI and strategy optimization (UCB1 underlies the UCT selection rule in Monte Carlo tree search)
⚠️ Limitations
- Assumes stationary reward distributions (arm payouts that do not change over time)
- Does not incorporate contextual information
To address these, consider sliding-window or discounted UCB variants for changing environments, Thompson Sampling as a Bayesian alternative, or Contextual Bandits when side information is available.