
Friday, October 25, 2024

A Beginner’s Guide to Policy Search in Reinforcement Learning

🤖 Policy Search in Reinforcement Learning

Think of reinforcement learning (RL) as training a dog: rewards for good behavior, penalties for mistakes. In RL, a policy is the strategy a computer follows to decide its actions based on the current situation.

Policy search is the process of finding the best strategy that maximizes long-term rewards.


📌 Table of Contents

1️⃣ What is a Policy?
2️⃣ Why Do We Need Policy Search?
3️⃣ How Policy Search Works
4️⃣ Policy Search Techniques
5️⃣ Balancing Exploration and Exploitation
6️⃣ Applications of Policy Search
7️⃣ Challenges in Policy Search

1️⃣ What is a Policy?

A policy is essentially a “rule book” for decision-making. It tells the agent which action to take in every possible state of the environment.

For example, in a game where you choose moves, a policy is the set of instructions for each step to maximize your score.

📖 Types of Policies

Deterministic Policy: Always selects the same action for a state.
Stochastic Policy: Chooses actions probabilistically, allowing exploration of multiple options.
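The difference is easy to see in code. Here is a minimal sketch with a hypothetical three-state environment and two actions (the tables below are made up for illustration):

```python
import numpy as np

np.random.seed(0)

def deterministic_policy(state):
    """Always returns the same action for a given state."""
    action_table = {0: 1, 1: 1, 2: 0}  # fixed rule book
    return action_table[state]

def stochastic_policy(state):
    """Samples an action from a per-state probability distribution."""
    action_probs = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}
    return np.random.choice([0, 1], p=action_probs[state])

print([deterministic_policy(0) for _ in range(3)])  # [1, 1, 1] - same every time
print([stochastic_policy(0) for _ in range(3)])     # varies from run to run
```

The stochastic policy still prefers certain actions, but its randomness is what lets the agent occasionally try something new.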


2️⃣ Why Do We Need Policy Search?

In many RL problems, we don’t know the best strategy beforehand. A robot learning to walk initially tries random actions, gradually discovering sequences that prevent it from falling.

Policy search is the method to systematically discover the most effective strategies, especially in complex environments where the best action isn’t obvious.


3️⃣ How Policy Search Works

Policy search is like coaching an athlete: you adjust strategies based on performance feedback.

📖 Main Approaches

Direct Policy Search: Tweaks the policy directly and retains changes that improve performance.
Indirect Policy Search (Policy Gradient): Uses gradients to mathematically adjust the policy in the direction that increases reward.


4️⃣ Policy Search Techniques

a. Gradient-Based Methods

Calculate the slope of reward relative to policy parameters. The agent “climbs” uphill toward higher rewards.

📖 Example: Policy Gradient

Policy parameters are updated in small steps along the gradient of expected reward to improve performance iteratively.

b. Gradient-Free Methods

Instead of computing gradients, the agent samples random policies, evaluates them, and selects the best performers.

📖 Example: Evolutionary Strategies

Policies “evolve” like natural selection: best strategies survive and improve over generations.


🧮 The Math Behind Policy Search

Policy search is not just trial and error; it's grounded in mathematics. The goal is to find a policy π that maximizes the expected cumulative reward over time. Let's break it down.

1️⃣ Expected Reward

In reinforcement learning, the agent receives a reward R after taking an action in a state. The expected reward of a policy π is defined as:

J(π) = E[Σ_t γ^t * R_t]

Where:

  • Σ_t – sum over all time steps
  • γ – discount factor (0 ≤ γ ≤ 1) that prioritizes immediate rewards over distant rewards
  • R_t – reward at time t
  • E[ ] – expectation, averaging over all possible sequences of states and actions

Intuition: The agent wants a policy π that gives the highest sum of rewards in the long run.
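The discounted sum above can be computed directly. A quick sketch with a made-up four-step reward sequence:

```python
gamma = 0.9              # discount factor
rewards = [1, 0, 2, 1]   # hypothetical rewards R_0..R_3 from one episode

# J = sum over t of gamma^t * R_t
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(G, 3))  # 1 + 0.9*0 + 0.81*2 + 0.729*1 = 3.349
```

Because γ < 1, the reward of 2 at step 2 contributes less (1.62) than it would undiscounted; rewards far in the future count for progressively less.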

2️⃣ Policy Gradient (Direct Optimization)

Policy gradient methods adjust the policy in the direction that increases expected reward. The basic formula is:

∇_θ J(π_θ) = E[∇_θ log π_θ(a|s) * Q^π(s, a)]

Explanation:

  • θ – parameters of the policy (think of weights in a neural network)
  • π_θ(a|s) – probability of taking action a in state s
  • Q^π(s, a) – expected cumulative reward from taking action a in state s and then following policy π
  • The gradient ∇_θ J(π_θ) tells us how to change θ to improve expected reward

Intuition: If a certain action in a state gives high rewards, the policy adjusts to make that action more likely in the future.
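For a softmax policy, ∇_θ log π_θ(a|s) has a simple closed form: a one-hot vector for the chosen action minus the probability vector. A quick numerical check of that identity (the preference values here are made up):

```python
import numpy as np

theta = np.array([0.5, -0.3, 1.2])  # hypothetical action preferences for one state

def log_prob(theta, action):
    """log pi(action) for a softmax policy over the preferences."""
    probs = np.exp(theta) / np.sum(np.exp(theta))
    return np.log(probs[action])

action = 2
# Analytic gradient: one-hot(action) - probabilities
probs = np.exp(theta) / np.sum(np.exp(theta))
analytic = np.eye(3)[action] - probs

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    up, down = theta.copy(), theta.copy()
    up[i] += eps
    down[i] -= eps
    numeric[i] = (log_prob(up, action) - log_prob(down, action)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

This identity is what makes softmax policies so convenient: the gradient is cheap to compute exactly, no finite differences needed.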

3️⃣ Gradient-Free Optimization

Sometimes computing gradients is hard. Gradient-free methods like Evolutionary Strategies instead maintain a population of candidate policies and update the parameters as:

θ_new = θ_old + α * Δθ

Where:

  • Δθ is determined by sampling multiple policies and favoring those with higher rewards
  • α is a learning rate controlling how much the policy changes

Intuition: Like natural selection, better-performing policies survive and gradually improve over generations without explicitly calculating derivatives.
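The update above can be sketched in a few lines. This toy version evolves a single parameter; the reward function is a made-up stand-in for an episode's return, with its best policy at θ = 3:

```python
import numpy as np

np.random.seed(0)

def episode_reward(theta):
    """Hypothetical stand-in for an episode's return; best policy at theta = 3."""
    return -(theta - 3.0) ** 2

theta = 0.0        # initial policy parameter
alpha = 0.1        # learning rate
sigma = 0.5        # std of the parameter perturbations
population = 20    # candidate policies per generation

for generation in range(200):
    # Sample a population of perturbed policies and score each one
    noise = np.random.randn(population)
    rewards = np.array([episode_reward(theta + sigma * n) for n in noise])

    # Normalize rewards, then take a reward-weighted average of the perturbations
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    delta_theta = np.mean(rewards * noise) / sigma

    theta += alpha * delta_theta  # theta_new = theta_old + alpha * delta_theta

print(round(theta, 1))  # moves from 0 toward the optimum near 3
```

Note that no derivative of `episode_reward` is ever taken: perturbations that happened to score well simply pull θ in their direction.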

📖 Summary

- Expected reward defines what the agent is optimizing.
- Policy gradient uses calculus to climb toward better policies.
- Gradient-free methods rely on sampling and selection to improve policies.

Together, these mathematical tools allow RL agents to systematically improve their strategies rather than guessing randomly.

💻 Policy Search Code Example

Here’s a minimal Python example using a policy gradient approach in a simple environment. It shows how a policy is updated based on rewards.

import numpy as np

np.random.seed(0)  # reproducible episodes

# Example: two states, two actions (0=left, 1=right)
states = [0, 1]
actions = [0, 1]
theta = np.zeros((2, 2))  # policy parameters: one preference per (state, action)
learning_rate = 0.1

def policy(state):
    """Return action probabilities for a state using softmax."""
    exp_vals = np.exp(theta[state])
    return exp_vals / np.sum(exp_vals)

def sample_action(state):
    return np.random.choice(actions, p=policy(state))

def compute_reward(state, action):
    # Example reward: +1 if the action matches the state, else 0
    return 1 if state == action else 0

# Training loop: REINFORCE-style update for one-step episodes
for episode in range(5):
    state = np.random.choice(states)
    action = sample_action(state)
    reward = compute_reward(state, action)

    # Gradient of log pi(a|s) for a softmax policy: one-hot(action) - probabilities
    grad = np.eye(2)[action] - policy(state)
    theta[state] += learning_rate * reward * grad

    print(f"Episode {episode}: State={state}, Action={action}, Reward={reward}, Theta={theta[state]}")

📖 Explanation of the Code

- theta represents the policy parameters.
- policy(state) calculates action probabilities with a softmax function.
- sample_action(state) selects an action based on those probabilities.
- compute_reward(state, action) defines the reward signal.
- The policy is updated with a gradient step: actions that earn higher rewards become more likely.
- The loop shows how the policy gradually improves over episodes.

5️⃣ Balancing Exploration and Exploitation

Exploration: Trying new actions to discover better policies.
Exploitation: Using known successful actions to maximize reward.

The challenge: too much exploitation risks missing better strategies, while too much exploration prevents convergence on an effective policy.

📖 Real-World Analogy

Imagine choosing restaurants in a new city. Exploration = trying new places. Exploitation = sticking with a favorite. Policy search must balance the two.
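One common way to strike this balance is ε-greedy action selection: exploit the best-known action most of the time, but explore at random with a small probability ε. A minimal sketch (the action-value estimates here are hypothetical):

```python
import numpy as np

np.random.seed(1)

q_values = np.array([0.2, 0.8, 0.5])  # hypothetical action-value estimates
epsilon = 0.1                          # fraction of steps spent exploring

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action

counts = np.zeros(len(q_values), dtype=int)
for _ in range(1000):
    counts[epsilon_greedy(q_values, epsilon)] += 1

print(counts)  # action 1 dominates, with occasional exploratory picks
```

In the restaurant analogy, ε is how often you try a new place instead of going back to the favorite; setting ε too high means never settling on what works, too low means possibly never finding something better.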


6️⃣ Applications of Policy Search

Policy search is foundational in modern RL applications:

  • Robotics: Walking, object manipulation, navigation.
  • Video Games: AI learns to play optimally against humans.
  • Self-Driving Cars: Optimizes safe decision-making in unpredictable environments.

7️⃣ Challenges in Policy Search

Despite its power, policy search has hurdles:

  • Complexity: Large action/state spaces make optimization slow.
  • Local Optima: Policies may get stuck in suboptimal solutions.
  • High Variance: Unstable rewards make learning noisy and inconsistent.

💡 Key Takeaways

Policy search is the backbone of teaching agents to succeed in complex tasks. It is fundamentally trial-and-error learning guided by rewards. Balancing exploration with exploitation and choosing the right optimization method are critical for success.

