🎯 Regret & Regret Optimality in Reinforcement Learning
In reinforcement learning (RL), a key objective is for the agent to learn to maximize cumulative reward while interacting with its environment. Achieving this is rarely straightforward, and that is where the concept of regret comes into play.
Regret measures how much additional reward the agent could have earned had it followed the optimal policy from the very beginning.
It represents the opportunity cost of learning — the gap between ideal performance and actual performance.
Optimal policy reward: 1000
Agent collected reward: 850
Regret = 1000 - 850 = 150
The regret after T time steps is defined as:
R(T) = T · V*(s₀) − Σ V^πₜ(s₀)   (sum over t = 1, …, T)
- T: total number of time steps
- V*(s₀): value of the optimal policy from the initial state s₀
- πₜ: the policy the agent follows at step t
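A minimal sketch of this definition in Python; the optimal value of 1.0 and the per-step policy values are made-up numbers for illustration:

```python
# Sketch of R(T) = T * V*(s0) - sum_t V^{pi_t}(s0).
# All values here are illustrative, not from a real experiment.

def cumulative_regret(optimal_value: float, policy_values: list[float]) -> float:
    T = len(policy_values)
    return T * optimal_value - sum(policy_values)

# Hypothetical run: the agent's per-step value improves as it learns.
values = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95]
print(cumulative_regret(1.0, values))  # 6 * 1.0 - 3.85 ≈ 2.15
```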
Exploration allows the agent to discover new actions, while exploitation focuses on known high-reward actions.
Regret reflects the cost of exploration — early mistakes increase regret, but learning reduces it over time.
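To make that cost concrete, here is a toy two-armed bandit with a decaying ε-greedy rule. The arm probabilities and the ε schedule are illustrative assumptions; early random pulls add regret, and regret growth slows as the estimates improve:

```python
import random

probs = [0.3, 0.7]                # true success probability of each arm (assumed)
optimal = max(probs)
counts = [0, 0]
means = [0.0, 0.0]
regret = 0.0

for t in range(1, 10_001):
    eps = t ** -0.5                                    # decaying exploration rate
    if random.random() < eps:
        arm = random.randrange(2)                      # explore: random arm
    else:
        arm = max(range(2), key=lambda a: means[a])    # exploit: best estimate
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running average of payoffs
    regret += optimal - probs[arm]                     # expected regret of this pull

print(f"Cumulative regret after 10,000 pulls: {regret:.1f}")
```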
A regret bound provides an upper limit on how much regret an algorithm can accumulate. Many well-known algorithms achieve a bound of the form:
R(T) = O(√T)
Sub-linear regret means the average regret per step, R(T)/T, shrinks toward zero: the agent keeps improving and learns efficiently (see the numerical sketch after this list). In practice this brings:
- Faster convergence to optimal behavior
- Reduced opportunity cost during learning
- Better real-world decision-making
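The gap between linear and sub-linear regret is easy to see numerically. The constants below are arbitrary; the point is that √T regret per step vanishes while linear regret per step does not:

```python
import math

for T in (100, 10_000, 1_000_000):
    sublinear = math.sqrt(T)  # R(T) = sqrt(T)
    linear = 0.1 * T          # R(T) = 0.1 * T
    print(f"T={T:>9}:  sqrt(T)/T = {sublinear / T:.4f}   0.1*T/T = {linear / T:.4f}")
```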
Applications include:
- Autonomous driving
- Recommendation systems
- Financial trading strategies
Regret can be measured in two task settings:
- Episodic: regret accumulated across multiple episodes
- Continuing: regret accumulated over a single long, uninterrupted interaction
Continuing tasks are often more challenging due to non-stationary environments.
Several algorithms are designed with regret optimality in mind (a UCB sketch follows the list):
- Upper Confidence Bound (UCB): balances exploration and exploitation using confidence intervals
- Thompson Sampling: selects actions by sampling from a probabilistic belief over their values
- Q-Learning with exploration: combines value learning with exploration strategies such as ε-greedy
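As one concrete instance, here is a compact UCB1-style sketch for a multi-armed bandit. The arm probabilities are made up for illustration; the confidence-interval bonus shrinks for frequently pulled arms, so play concentrates on the best arm:

```python
import math
import random

probs = [0.2, 0.5, 0.8]       # true success probabilities (assumed)
n = len(probs)
counts = [0] * n
means = [0.0] * n

for t in range(1, 5_001):
    if t <= n:
        arm = t - 1           # play each arm once to initialize
    else:
        # estimated mean plus confidence-interval exploration bonus
        arm = max(range(n),
                  key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running average of payoffs

print("Pulls per arm:", counts)  # the 0.8 arm should dominate
```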
💡 Key Takeaways
- Regret measures lost reward due to learning
- Low regret = efficient learning
- Sub-linear regret indicates improvement over time
- Regret optimality is critical for real-world RL systems