Reinforcement Learning & Bandits: Risk, Belief, and the Cost of Exploration
Imagine you are running a large e-commerce platform. Every day, you decide which discount to show, which product to promote, and which recommendation to surface. Each decision is a bet. You get feedback — clicks, purchases, churn — but never full certainty.
This is not a control problem. There is no state transition diagram. This is a bandit problem — and how you explore determines not just revenue, but trust, safety, and long-term truth.
Thompson Sampling vs UCB: Same Goal, Different Risk Profiles
Upper Confidence Bound (UCB) and Thompson Sampling are often taught as interchangeable. They are not.
UCB selects actions optimistically: it adds a confidence bonus and chooses the action that could plausibly be best. This encodes deterministic optimism about the unknown, grounded in the frequentist confidence bounds behind regret-optimal, UCB-style exploration.
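A minimal sketch of that rule, assuming Bernoulli rewards (clicks) and the standard UCB1 bonus; the arm count, rates, and loop below are illustrative rather than taken from any real system:

```python
import math
import random

def ucb1_choice(counts, means, t):
    """Pick the arm with the highest optimistic estimate: mean + confidence bonus."""
    # Play every arm once before trusting the bonus term.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    bonuses = [math.sqrt(2 * math.log(t) / n) for n in counts]
    scores = [m + b for m, b in zip(means, bonuses)]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative loop with made-up click-through rates.
true_ctr = [0.05, 0.04, 0.07]
counts, means = [0] * 3, [0.0] * 3
for t in range(1, 5001):
    arm = ucb1_choice(counts, means, t)
    reward = 1.0 if random.random() < true_ctr[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running average
print(counts)  # pulls should concentrate on arm 2, the best one
```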
Thompson Sampling, by contrast, samples from a posterior belief. It does not ask “what could be best?” It asks “what do I currently believe?” This Bayesian framing aligns exploration directly with uncertainty, the idea at the heart of Bayesian treatments of reinforcement learning.
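For contrast, here is the same kind of loop driven by posterior sampling, assuming Beta(1, 1) priors over Bernoulli arms; again the numbers are invented for illustration:

```python
import random

def thompson_choice(successes, failures):
    """Sample one plausible mean per arm from its Beta posterior, then act greedily on the sample."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

true_ctr = [0.05, 0.04, 0.07]
successes, failures = [0] * 3, [0] * 3
for _ in range(5000):
    arm = thompson_choice(successes, failures)
    if random.random() < true_ctr[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1
print(successes, failures)  # posterior mass, and pulls, concentrate on arm 2
```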
Operationally, this means UCB bets deterministically on optimism early, while Thompson Sampling spreads its risk in proportion to posterior probability. Same objective. Very different failure modes.
Why UCB Fails in Non-Stationary Environments
UCB assumes the world is stationary: reward distributions do not change. But pricing sensitivity shifts. User preferences drift. Competitors react.
In gradual drift, confidence bounds shrink around outdated beliefs. In abrupt regime shifts, UCB becomes overconfident precisely when it should doubt itself. This mismatch is the broader problem of non-stationary environments in miniature.
UCB fails not because it explores too little, but because it explores with certainty about assumptions that no longer hold.
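A small simulation makes this concrete. The sketch below (rates and shift time are invented) runs plain UCB1 through an abrupt regime shift and counts how the once-best arm keeps absorbing pulls after it has stopped being best, because its confidence bound has already collapsed:

```python
import math
import random

def ucb1_choice(counts, means, t):
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))

def ctr(arm, t):
    # Regime shift at t = 3000: the previously best arm degrades sharply.
    if arm == 0:
        return 0.7 if t < 3000 else 0.2
    return 0.5

counts, means = [0, 0], [0.0, 0.0]
pulls_after_shift = [0, 0]
for t in range(1, 6001):
    arm = ucb1_choice(counts, means, t)
    reward = 1.0 if random.random() < ctr(arm, t) else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
    if t >= 3000:
        pulls_after_shift[arm] += 1
print(pulls_after_shift)  # arm 0 typically keeps receiving many pulls long after it stopped being best
```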
Exploration Is Not Randomness
Epsilon-greedy exploration flips coins. Bandit exploration reasons statistically.
Directed exploration uses uncertainty estimates to decide *where* ignorance is valuable. Undirected exploration wastes samples on actions already understood. The distinction becomes clear when comparing naive policies with regret-aware, structured exploration strategies.
Randomness is noise. Exploration is intent.
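To make the contrast concrete, here are the two selection rules side by side as a sketch; the function names are mine, and the UCB-style bonus stands in for any uncertainty estimate:

```python
import math
import random

def epsilon_greedy_choice(means, epsilon=0.1):
    """Undirected: with probability epsilon, explore a uniformly random arm."""
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=means.__getitem__)

def directed_choice(counts, means, t):
    """Directed: explore where uncertainty is still large, via a per-arm bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
```

Epsilon-greedy keeps spending a fixed fraction of traffic on arms it already understands; the directed rule's bonus shrinks automatically as an arm's count grows.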
Bayesian vs Frequentist Worldviews
Thompson Sampling encodes belief uncertainty directly. The posterior is the model.
UCB encodes uncertainty indirectly through confidence bounds — a statement about long-run behavior, not belief. This philosophical divide matters when decisions affect people, not just metrics.
In clinical trials, Bayesian methods allow ethical stopping. Frequentist bounds do not encode belief — only guarantees.
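As a toy illustration of what a belief-based stopping rule can look like (not a prescription for trial design, and with invented interim counts), one can estimate the posterior probability that one arm beats another under Beta priors and stop once it crosses a threshold:

```python
import random

def prob_arm_a_better(a_succ, a_fail, b_succ, b_fail, draws=50_000):
    """Monte Carlo estimate of P(mean_A > mean_B) under independent Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        if random.betavariate(a_succ + 1, a_fail + 1) > random.betavariate(b_succ + 1, b_fail + 1):
            wins += 1
    return wins / draws

# Invented interim data: 40/100 responses on treatment vs 22/100 on control.
p = prob_arm_a_better(40, 60, 22, 78)
print(p)                               # posterior probability the treatment is better
if p > 0.99:
    print("stopping rule triggered")   # a belief statement, not a long-run frequency guarantee
```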
Regret Decomposition: Short-Term Pain, Long-Term Truth
Regret measures lost opportunity relative to the best fixed action. Instantaneous regret can be high on any single step while cumulative regret still grows near-optimally, often only logarithmically in the horizon.
This is why short-term “mistakes” can be rational. Optimizing reward at every step often traps systems in false certainty, a theme echoed in long-horizon reinforcement learning.
Minimizing regret is not maximizing happiness — it is buying information.
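The decomposition is easy to track in simulation. The sketch below runs Thompson Sampling on invented conversion rates and accumulates both quantities; individual steps incur visible regret while the running total stays far below the worst case:

```python
import random

# Instantaneous regret: the gap between the best arm's mean and the chosen arm's mean at step t.
# Cumulative regret: the running sum of those gaps over the whole horizon.
true_means = [0.3, 0.5, 0.7]               # invented conversion rates
best = max(true_means)

successes, failures = [0] * 3, [0] * 3
cumulative = 0.0
for _ in range(5000):
    arm = max(range(3), key=lambda a: random.betavariate(successes[a] + 1, failures[a] + 1))
    cumulative += best - true_means[arm]    # a single exploratory pull can cost up to 0.4 here
    if random.random() < true_means[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1
print(cumulative)                           # typically tiny compared to the worst case of 0.4 * 5000
```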
Asymptotics vs Finite Time Reality
Many bandit algorithms are asymptotically optimal. Businesses are not asymptotic.
Finite horizons magnify early priors, initialization bias, and variance explosions. What converges eventually may destroy trust today.
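A toy experiment shows how hard a short horizon leans on its first impressions. The sketch below (rates, horizon, and policy are all invented for illustration) initializes a purely greedy learner from a single pull per arm and measures how often 200 steps are not enough to escape that start:

```python
import random

true_ctr = [0.05, 0.10]                     # illustrative: arm 1 is clearly better
horizon, runs = 200, 2000
locked_on_worse_arm = 0

for _ in range(runs):
    # Initialize each estimate from a single pull, then act greedily with no exploration.
    means = [1.0 if random.random() < p else 0.0 for p in true_ctr]
    counts = [1, 1]
    for _ in range(horizon - 2):
        arm = max(range(2), key=means.__getitem__)
        reward = 1.0 if random.random() < true_ctr[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    if counts[0] > counts[1]:
        locked_on_worse_arm += 1

print(locked_on_worse_arm / runs)  # a large fraction of short-horizon runs never recover from their first pulls
```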
Delayed Rewards and Feedback Mismatch
In advertising, clicks come immediately but purchases come days later. In medicine, outcomes take months.
Delayed rewards break the core assumption that feedback reflects the last action. Bandits misattribute success, reinforcing noise. This is the reward-attribution (credit-assignment) problem of sequential decision making surfacing inside a supposedly stateless model.
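One common way to model this, sketched below with an invented 50-step delay, is to queue feedback by its arrival time and accept that the posterior lags the world; every decision made before the first feedback arrives is made on the prior alone:

```python
import random
from collections import defaultdict

DELAY = 50                                    # illustrative: purchases arrive 50 steps after the click
true_rate = [0.05, 0.07]
successes, failures = [0, 0], [0, 0]
pending = defaultdict(list)                   # arrival step -> list of (arm, reward)

for t in range(5000):
    # Credit feedback only when it actually arrives; until then the posterior is stale.
    for arm, reward in pending.pop(t, []):
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    arm = max(range(2), key=lambda a: random.betavariate(successes[a] + 1, failures[a] + 1))
    reward = 1 if random.random() < true_rate[arm] else 0
    pending[t + DELAY].append((arm, reward))

print(successes, failures)
```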
Adversarial vs Stochastic Worlds
Some environments are noisy. Others react.
Competitors change prices. Users adapt to recommendations. The world becomes adversarial, not stochastic. Classic stochastic bandits fail here because their confidence assumptions no longer hold: against a reactive opponent, uncertainty never shrinks the way the theory promises. The same failure mode appears in any adaptive system under pressure.
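Algorithms built for adversarial bandits, such as EXP3, drop the i.i.d. assumption entirely: they keep weights over arms and update them with importance-weighted reward estimates, so the guarantees hold against any bounded reward sequence. A sketch, with an invented "reactive" reward function standing in for a competitor:

```python
import math
import random

def exp3_step(weights, gamma, reward_fn):
    """One round of EXP3: mix the weight distribution with uniform, then
    reweight the pulled arm by an importance-weighted reward estimate."""
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    arm = random.choices(range(k), weights=probs)[0]
    reward = reward_fn(arm)                       # assumed to lie in [0, 1]
    estimate = reward / probs[arm]                # unbiased estimate even under adversarial rewards
    weights[arm] *= math.exp(gamma * estimate / k)
    return arm, reward

# Invented adversary: whichever arm was pulled last pays less the next time it is pulled.
last = {"arm": None}
def adversarial_reward(arm):
    payoff = 0.3 if arm == last["arm"] else 0.7
    last["arm"] = arm
    return payoff

weights = [1.0, 1.0, 1.0]
for _ in range(2000):
    exp3_step(weights, gamma=0.07, reward_fn=adversarial_reward)
print(weights)
```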
Cold Start, Priors, and Self-Confirmation
Early priors dominate behavior. Early randomness becomes destiny.
Feedback loops reinforce beliefs, even when wrong. Systems stop exploring because they think they already know. This is how recommendation engines create filter bubbles.
Scaling to Large Action Spaces
Thousands of products. Millions of users. Naive bandits collapse.
Structure must be introduced — via embeddings, hierarchies, or context — or exploration becomes impossible.
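One standard way to add that structure is a linear contextual bandit in the LinUCB style: arms share a feature space, so pulling one product updates beliefs about every similar product. The sketch below assumes items come with embedding vectors; the dimension, alpha, and data are illustrative:

```python
import numpy as np

class LinUCB:
    """Linear contextual bandit: share statistics across arms through feature vectors,
    so exploring one product informs beliefs about similar ones."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(dim)          # ridge-regularized Gram matrix
        self.b = np.zeros(dim)        # accumulated reward-weighted features

    def score(self, x):
        theta = np.linalg.solve(self.A, self.b)           # current estimate of the reward weights
        bonus = self.alpha * np.sqrt(x @ np.linalg.solve(self.A, x))
        return x @ theta + bonus                           # optimism in feature space, not per arm

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Illustrative use: items are described by embeddings instead of enumerated one by one.
model = LinUCB(dim=8)
item_embeddings = np.random.randn(10_000, 8)
chosen = max(range(len(item_embeddings)), key=lambda i: model.score(item_embeddings[i]))
model.update(item_embeddings[chosen], reward=1.0)
```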
From Bandits to Reinforcement Learning and Causality
Bandits are reinforcement learning without state. That simplification hides dynamics, delayed effects, and causality.
Exploration is not just about reward. It is how systems learn what causes what. Without exploration, correlation masquerades as truth.
The Big Picture: Exploration as a Moral Choice
In pricing, exploration costs money. In medicine, it costs lives.
There is no universally optimal bandit algorithm. Optimality is contextual, ethical, and constrained.
Bandit algorithms fail when they pretend certainty exists. They succeed when they respect doubt.