🤖 Policy Search in Reinforcement Learning
Think of reinforcement learning (RL) as training a dog: rewards for good behavior, penalties for mistakes. In RL, a policy is the strategy an agent follows to decide its actions based on the current situation.
Policy search is the process of finding the best strategy that maximizes long-term rewards.
Table of Contents
- What is a Policy?
- Why Policy Search is Needed
- How Policy Search Works
- Policy Search Techniques
- Exploration vs Exploitation
- Applications
- Challenges
- Key Takeaways
1️⃣ What is a Policy?
A policy is essentially a “rule book” for decision-making. It tells the agent which action to take in every possible state of the environment.
For example, in a game where you choose moves, a policy is the set of instructions for each step to maximize your score.
Types of Policies
Deterministic Policy: Always selects the same action for a state.
Stochastic Policy: Chooses actions probabilistically, allowing exploration of multiple options.
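A quick sketch of the difference in Python (the state-to-action mappings and probabilities below are invented purely for illustration):

import numpy as np

# Hypothetical toy setup: two states, two actions (0=left, 1=right)
def deterministic_policy(state):
    """Always returns the same action for a given state"""
    return {0: 1, 1: 0}[state]  # fixed lookup table

def stochastic_policy(state, rng=np.random.default_rng()):
    """Samples an action from a per-state probability distribution"""
    action_probs = {0: [0.8, 0.2], 1: [0.3, 0.7]}[state]  # made-up probabilities
    return rng.choice([0, 1], p=action_probs)

print(deterministic_policy(0))  # always the same action for state 0
print(stochastic_policy(0))     # may vary from call to call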
2️⃣ Why Do We Need Policy Search?
In many RL problems, we don’t know the best strategy beforehand. A robot learning to walk initially tries random actions, gradually discovering sequences that prevent it from falling.
Policy search is the method to systematically discover the most effective strategies, especially in complex environments where the best action isn’t obvious.
3️⃣ How Policy Search Works
Policy search is like coaching an athlete: you adjust strategies based on performance feedback.
Main Approaches
Gradient-Free Search: Perturbs the policy directly and keeps changes that improve performance.
Gradient-Based Search (Policy Gradient): Uses gradients of the expected reward to mathematically adjust the policy in the direction that increases it.
4️⃣ Policy Search Techniques
a. Gradient-Based Methods
Calculate the slope of reward relative to policy parameters. The agent “climbs” uphill toward higher rewards.
Example: Policy Gradient
Policy parameters are updated in small steps along the gradient of expected reward to improve performance iteratively.
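As a minimal sketch, one such update step might look like this (the parameter values and gradient estimate are placeholders; in practice the gradient is estimated from sampled episodes, as shown in the math section below):

import numpy as np

theta = np.array([0.5, -0.5])          # made-up policy parameters
grad_estimate = np.array([0.2, -0.1])  # placeholder estimate of the gradient of J
alpha = 0.01                           # learning rate (step size)

theta = theta + alpha * grad_estimate  # small step "uphill" toward higher reward
print(theta)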
b. Gradient-Free Methods
Instead of computing gradients, the agent samples random policies, evaluates them, and selects the best performers.
Example: Evolution Strategies
Policies “evolve” like natural selection: best strategies survive and improve over generations.
🧮 The Math Behind Policy Search
Policy search is not just trial and error — it’s grounded in mathematics. The goal is to find a policy π that maximizes the expected cumulative reward over time. Let’s break it down.
1️⃣ Expected Reward
In reinforcement learning, the agent receives a reward R after taking an action in a state. The expected reward of a policy π is defined as:
J(π) = E[Σ_t γ^t * R_t]
Where:
- Σ_t – sum over all time steps
- γ – discount factor (0 ≤ γ ≤ 1) that prioritizes immediate rewards over distant rewards
- R_t – reward at time t
- E[ ] – expectation, averaging over all possible sequences of states and actions
Intuition: The agent wants a policy π that gives the highest sum of rewards in the long run.
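A small sketch of this quantity for a single sampled episode (the reward sequence is invented for illustration):

gamma = 0.9
rewards = [1, 0, 2, 1]  # R_0, R_1, R_2, R_3 from one hypothetical rollout

# Discounted return: sum of gamma^t * R_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349

# J(pi) is the average of such returns over many rollouts under policy pi.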
2️⃣ Policy Gradient (Direct Optimization)
Policy gradient methods adjust the policy in the direction that increases expected reward. The basic formula is:
∇_θ J(π_θ) = E[∇_θ log π_θ(a|s) * Q^π(s, a)]
Explanation:
- θ – parameters of the policy (think of weights in a neural network)
- π_θ(a|s) – probability of taking action a in state s
- Q^π(s, a) – expected cumulative reward from taking action a in state s and then following policy π
- The gradient ∇_θ J(π_θ) tells us how to change θ to improve expected reward
Intuition: If a certain action in a state gives high rewards, the policy adjusts to make that action more likely in the future.
3️⃣ Gradient-Free Optimization
Sometimes computing gradients is hard. Instead, gradient-free methods like Evolution Strategies treat policy parameters as a population:
θ_new = θ_old + α * Δθ
Where:
- Δθ is determined by sampling multiple perturbed policies and weighting those with higher rewards
- α is a learning rate controlling how much the policy changes
Intuition: Like natural selection, better-performing policies survive and gradually improve over generations without explicitly calculating derivatives.
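A minimal sketch of this idea in Python, assuming a stand-in reward function in place of real episode rollouts (this follows the common noise-perturbation form of evolution strategies):

import numpy as np

rng = np.random.default_rng(0)

def evaluate(theta):
    # Stand-in for running episodes with policy theta and averaging the returns
    return -np.sum((theta - 1.0) ** 2)  # hypothetical reward, peaked at theta = [1, 1]

theta = np.zeros(2)  # initial policy parameters
alpha = 0.1          # learning rate
sigma = 0.5          # scale of the exploration noise
population = 20      # perturbations sampled per generation

for generation in range(50):
    noise = rng.normal(size=(population, 2))                       # random perturbations
    rewards = np.array([evaluate(theta + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize rewards
    delta_theta = (rewards[:, None] * noise).mean(axis=0) / sigma  # reward-weighted average
    theta = theta + alpha * delta_theta                            # theta_new = theta_old + alpha * delta_theta

print(theta)  # moves toward [1, 1] without computing any derivatives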
Summary
- Expected reward defines what the agent is optimizing.
- Policy gradient uses calculus to climb toward better policies.
- Gradient-free methods rely on sampling and selection to improve policies.
Together, these mathematical tools allow RL agents to systematically improve their strategies rather than guessing randomly.
💻 Policy Search Code Example
Here’s a minimal Python example using a policy gradient approach in a simple environment. It shows how a policy is updated based on rewards.
import numpy as np

# Example: two states, two actions (0=left, 1=right)
states = [0, 1]   # two possible states
actions = [0, 1]  # two possible actions

theta = np.zeros((2, 2))  # policy parameters: one logit per (state, action) pair
learning_rate = 0.1

def policy(state):
    """Return action probabilities for a state using softmax"""
    exp_vals = np.exp(theta[state] - np.max(theta[state]))  # numerically stable softmax
    return exp_vals / np.sum(exp_vals)

def sample_action(state):
    """Sample an action according to the current policy"""
    return np.random.choice(actions, p=policy(state))

def compute_reward(state, action):
    # Example reward: +1 if action matches state, else 0
    return 1 if state == action else 0

# Training loop (REINFORCE with one-step episodes)
for episode in range(5):
    state = np.random.choice(states)
    action = sample_action(state)
    reward = compute_reward(state, action)
    # Policy gradient update: for a softmax policy,
    # the gradient of log pi(a|s) is one-hot(a) minus the action probabilities
    grad_log_pi = np.eye(len(actions))[action] - policy(state)
    theta[state] += learning_rate * reward * grad_log_pi
    print(f"Episode {episode}: State={state}, Action={action}, Reward={reward}, Theta={theta[state]}")
Explanation of the Code
- theta holds the policy parameters: one logit per (state, action) pair.
- policy(state) converts those logits into action probabilities using a softmax function.
- sample_action(state) selects an action based on those probabilities.
- compute_reward(state, action) defines the reward signal.
- The update is a simplified REINFORCE step: the gradient of log π(a|s) for a softmax policy is one-hot(a) minus the probabilities, so actions that earn rewards become more likely.
- This loop shows how the policy gradually improves over episodes.
5️⃣ Balancing Exploration and Exploitation
Exploration: Trying new actions to discover better policies.
Exploitation: Using known successful actions to maximize reward.
The challenge: too much exploitation risks missing better strategies, while too much exploration prevents convergence on an effective policy.
Real-World Analogy
Imagine choosing restaurants in a new city. Exploration = trying new places. Exploitation = sticking with a favorite. Policy search must balance the two.
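A common way to implement this balance is an ε-greedy rule: exploit the best-known action most of the time, explore at random otherwise. Here is a minimal sketch with made-up action-value estimates:

import numpy as np

rng = np.random.default_rng()
action_values = np.array([1.0, 0.5, 0.2])  # hypothetical estimates of each action's reward
epsilon = 0.1                              # fraction of the time we explore

def choose_action():
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))  # explore: pick a random action
    return int(np.argmax(action_values))              # exploit: pick the best-known action

print([choose_action() for _ in range(10)])  # mostly action 0, with occasional exploration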
6️⃣ Applications of Policy Search
Policy search is foundational in modern RL applications:
- Robotics: Walking, object manipulation, navigation.
- Video Games: Agents learn strong strategies, sometimes surpassing human players.
- Self-Driving Cars: Optimizes safe decision-making in unpredictable environments.
7️⃣ Challenges in Policy Search
Despite its power, policy search has hurdles:
- Complexity: Large action/state spaces make optimization slow.
- Local Optima: Policies may get stuck in suboptimal solutions.
- High Variance: Unstable rewards make learning noisy and inconsistent.
💡 Key Takeaways
Policy search is the backbone of teaching agents to succeed in complex tasks. It is fundamentally trial-and-error learning guided by rewards. Balancing exploration with exploitation and choosing the right optimization method are critical for success.