
Thursday, October 24, 2024

What Is UCB1 Algorithm? Reinforcement Learning Explained Simply


UCB1 Algorithm

A practical and intuitive solution to the exploration vs. exploitation problem in reinforcement learning and multi-armed bandits.

🎰 The Exploration vs. Exploitation Problem

Imagine playing a slot machine with multiple levers. Each lever gives a different payout, but you don’t know which one is best.

Pulling a new lever helps you learn (exploration), but repeatedly pulling the best-known lever helps you earn (exploitation).

The core challenge: How do you explore enough to learn — without sacrificing too much reward?

📌 What Is UCB1?

UCB1 (Upper Confidence Bound) selects actions by computing an optimistic estimate of each arm’s reward.

  • Exploitation: Prefer arms with high average reward
  • Exploration: Prefer arms with high uncertainty

Arms that are under-explored receive a temporary boost, ensuring they aren’t ignored too early.

🧮 UCB1 Formula

arm_t = argmax (
  mean_reward
  + sqrt( (2 * log(total_pulls)) / pulls_for_this_arm )
)
      
  • mean_reward: Average reward observed from the arm
  • total_pulls: Total number of pulls across all arms
  • pulls_for_this_arm: Number of times this arm has been pulled
  • log: Natural logarithm
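
Below is a minimal Python sketch of this selection rule (function and variable names such as ucb1_select are illustrative, not from a specific library):

import math
import random

def ucb1_select(counts, rewards):
    # counts[i]: pulls of arm i, rewards[i]: total reward collected from arm i
    for i, c in enumerate(counts):
        if c == 0:
            return i  # pull every arm once before applying the formula
    total_pulls = sum(counts)
    scores = [
        rewards[i] / counts[i] + math.sqrt(2 * math.log(total_pulls) / counts[i])
        for i in range(len(counts))
    ]
    return scores.index(max(scores))

# Toy simulation: three arms with different true payout probabilities
true_probs = [0.3, 0.7, 0.5]
counts = [0, 0, 0]
rewards = [0.0, 0.0, 0.0]

for t in range(1000):
    arm = ucb1_select(counts, rewards)
    payout = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += payout

print("Pull counts:", counts)

Over many rounds the pull counts should concentrate on the best arm while the others are still sampled occasionally.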

💻 CLI Simulation Example

$ python ucb1_simulation.py

Initializing arms...
Pulling each arm once...

Round 10:
Arm 1 | mean=0.50 | UCB=0.91
Arm 2 | mean=0.70 | UCB=0.88
Arm 3 | mean=0.30 | UCB=0.85

Selected Arm → 1

Round 100:
Arm 2 dominates with highest UCB
Exploration bonus shrinking...
    

🚀 Why UCB1 Is Effective

  • No hyperparameters to tune
  • Strong theoretical regret guarantees
  • Simple and computationally efficient

📊 Real-World Use Cases

  • Online advertising (CTR optimization)
  • Clinical trials
  • Game AI and strategy optimization

⚠️ Limitations

  • Assumes stationary reward distributions
  • Does not incorporate contextual information

For changing environments, consider Thompson Sampling or Contextual Bandits.

💡 Key Takeaways

UCB1 offers a clean, mathematically grounded solution to exploration vs. exploitation — ideal when rewards are stable and simplicity matters.

Monday, October 21, 2024

Self-Play in Reinforcement Learning: How Agents Learn by Competing Against Themselves

Self-play is a fascinating concept in reinforcement learning (RL) that has gained widespread attention in recent years, especially with the success of algorithms in complex domains like Go, Chess, and video games. The idea is simple: an agent learns by playing against itself, improving over time without needing a human or external opponent. Let’s dive into the details of how this works and why it's so powerful.

#### What is Reinforcement Learning?

To understand self-play, it's essential to first grasp the basics of reinforcement learning. RL is a type of machine learning where an agent interacts with an environment, takes actions, and receives feedback in the form of rewards or penalties. The agent's goal is to learn a policy (a strategy or plan of action) that maximizes cumulative rewards over time.

The key components of RL are:
1. **Agent**: The learner or decision maker.
2. **Environment**: The world the agent interacts with.
3. **Actions**: Choices the agent can make.
4. **State**: The current situation the agent finds itself in.
5. **Reward**: Feedback from the environment that indicates success or failure of an action.

The agent explores different actions, learns from the results, and adjusts its policy to improve performance.

#### What is Self-Play?

Self-play is a method where the agent learns by competing or collaborating with itself. Instead of relying on external opponents or data, the agent plays against copies of itself or different versions of itself. Over time, it gets better as it encounters increasingly challenging situations. In some sense, self-play sets up a dynamic environment that evolves as the agent improves.

Imagine two copies of the same agent playing a game like Chess. At first, the moves might be random, and both agents play poorly. However, after multiple rounds, the agents start recognizing patterns, learning from their mistakes, and gradually improving their performance.

#### Why is Self-Play Effective?

There are a few reasons why self-play is such a powerful tool in RL:

1. **Infinite Opponents**: Self-play provides an endless stream of opponents. The agent can always play against itself, creating a diverse set of experiences. This is crucial in games like Go or Chess, where mastering all potential situations would require an enormous amount of external data and human opponents.

2. **No Need for Labels**: In supervised learning, you need labeled data to train a model. In contrast, self-play in RL doesn’t require explicit labels. The only feedback comes from the game outcomes (win, loss, draw), and the agent learns to adjust its actions to achieve better outcomes over time.

3. **Learning from Mistakes**: Because the agent plays against itself, it learns directly from its mistakes. If it loses in one round, it adjusts its strategy and tries to avoid similar mistakes in the future.

4. **Balancing Exploration and Exploitation**: Self-play naturally encourages the agent to explore new strategies and exploit learned knowledge. As one version of the agent improves, its opponent (also itself) gets better as well. This forces both versions to continually explore new strategies to stay competitive.

5. **Dynamic Difficulty**: One of the biggest challenges in traditional RL is maintaining an appropriate level of difficulty for the agent. If the environment is too easy, the agent doesn’t learn effectively. If it’s too hard, the agent gets stuck. In self-play, the difficulty adjusts automatically as the agent improves. As one version of the agent gets better, so does its opponent, maintaining a constant challenge.

#### How Does Self-Play Work?

Here’s a simplified overview of how self-play works in reinforcement learning:

1. **Initialization**: The agent starts with a random or naive strategy. This can be as simple as random moves in a game like Chess.
   
2. **Training**: The agent plays against itself. During each game, it takes actions, receives feedback, and updates its policy. The feedback typically comes from the outcome of the game (e.g., a win, loss, or draw). This feedback is used to update the agent’s internal parameters to improve its future performance.

   Mathematically, the agent learns a policy `pi` that maximizes expected reward. In the simplest tabular setting, the value estimates behind that policy are nudged toward each new outcome:

   Value (new) = Value (old) + learning rate * (Reward - Value (old))

   The learning rate controls how much the agent shifts its estimates based on new experience.

3. **Iteration**: The agent repeats this process, continuously playing against itself. Each iteration leads to slight improvements in the agent’s performance, and over time, the agent becomes increasingly skilled.

4. **Evaluation**: Periodically, the agent is evaluated against human players or a fixed version of itself. This helps track progress and determine if the learning process is effective.
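
As a rough sketch (not drawn from any particular paper or library), the loop below captures steps 1–3; the Game and Agent interfaces and their methods are hypothetical placeholders:

import copy

def self_play_training(agent, game, num_games=10000, snapshot_every=500):
    # Train an agent by repeatedly playing full games against a frozen copy of itself
    opponent = copy.deepcopy(agent)              # both sides start from the same naive policy
    for g in range(num_games):
        trajectory = game.play(agent, opponent)  # one full game; records states, actions, outcome
        agent.update(trajectory)                 # adjust the policy from the result (win/loss/draw)
        if (g + 1) % snapshot_every == 0:
            opponent = copy.deepcopy(agent)      # refresh the opponent so the challenge keeps rising
    return agent

Refreshing the opponent only periodically is one simple way to keep the two sides roughly matched, which relates to the imbalance problem discussed under the challenges below.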

#### Self-Play in Action: AlphaGo

One of the most famous examples of self-play in action is **AlphaGo**, developed by DeepMind. AlphaGo became the first AI to beat a professional human player in the game of Go, which is known for its enormous complexity and number of possible moves.

AlphaGo used a combination of deep learning and self-play to achieve superhuman performance. It started by training on a dataset of human expert games but quickly transitioned to self-play to refine its skills. During self-play, AlphaGo played millions of games against itself, exploring various strategies and continuously improving its policy.

The outcome was remarkable—AlphaGo not only surpassed human players but also discovered strategies that were previously unknown to the Go community.

#### Challenges of Self-Play

While self-play is powerful, it’s not without challenges:

1. **Stagnation**: If both versions of the agent learn similar strategies, they can get stuck in a local optimum, where they don’t discover new, better strategies. This is known as the "self-play trap," where the agent stops making meaningful progress.

2. **Imbalance**: If one version of the agent gets too strong too quickly, it can dominate the other version, leading to poor learning outcomes. Techniques like dynamic opponent selection (where the agent plays against different versions of itself) help address this.

3. **Computation Costs**: Self-play requires a significant amount of computational power, especially when dealing with complex environments or large action spaces. AlphaGo, for example, required vast computational resources to simulate millions of games.

#### Self-Play Beyond Games

While self-play has been most prominently used in board games like Go and Chess, it has broader applications. For instance, it’s used in training agents for robotic control, autonomous driving, and even negotiations. In these contexts, the agent learns by interacting with different versions of itself or by simulating future scenarios, allowing it to handle real-world tasks more effectively.

#### Conclusion

Self-play is a groundbreaking concept in reinforcement learning that allows agents to learn complex strategies without needing external opponents or labeled data. It has been responsible for some of the most impressive advances in AI, including AlphaGo’s success. By constantly challenging itself, an agent can continuously improve, adapt to new situations, and discover innovative strategies. While there are challenges in its implementation, the potential of self-play extends far beyond just games and could drive the next wave of advancements in AI applications across diverse fields.

Wednesday, October 16, 2024

Q-Learning Implementation for Rock, Paper, Scissors with Custom Rewards and Strategy Analysis



Implementing Q-Learning for Rock Paper Scissors

This article explains how to train a Reinforcement Learning agent using Q-learning to play the classic game Rock Paper Scissors.

Instead of manually programming strategies, the agent learns through trial and error by observing rewards from its actions.


Introduction to Reinforcement Learning

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns by interacting with an environment and receiving rewards or penalties.

Instead of learning from labeled datasets, the agent learns through experience.

  • Agent takes an action
  • Environment returns a reward
  • Agent updates its knowledge

Why Reinforcement Learning Matters

Reinforcement Learning powers many modern technologies such as:

  • Game-playing AI systems
  • Autonomous robotics
  • Recommendation engines
  • Financial trading algorithms

Game Mechanics

The Rock Paper Scissors game contains three actions:

  • Rock
  • Paper
  • Scissors

Each action has a deterministic outcome against another action.

Action     Beats
Rock       Scissors
Paper      Rock
Scissors   Paper
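
In code, this cycle reduces to a small lookup table (a sketch using the action names above):

beats = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}

def outcome(player, opponent):
    # Result from the player's point of view
    if player == opponent:
        return "Tie"
    return "Win" if beats[player] == opponent else "Loss"

The reward matrix in the next section is simply the numeric version of this function.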

Reward Matrix Design

To train a reinforcement learning agent, we convert game outcomes into numerical rewards.

Outcome    Reward
Win        +1
Loss       -1
Tie         0

These rewards guide the learning algorithm toward optimal strategies.


Understanding Q-Learning

Q-learning is a reinforcement learning algorithm that learns the value of taking an action in a specific state.

The algorithm maintains a table called the Q-table.

The Q-table stores expected rewards for each state-action pair.

Q-Learning Formula


Q(s,a) = Q(s,a) + α [ R + γ max(Q(s',a')) - Q(s,a) ]

  • s = current state
  • a = action taken
  • s' = next state
  • a' = candidate action in the next state
  • α = learning rate
  • γ = discount factor
  • R = reward received

Intuition Behind Q-Learning

The algorithm updates knowledge using:

  • Immediate reward
  • Best possible future reward

Over many iterations the values converge toward optimal behavior.


Python Implementation

Initialize Q-table


import numpy as np
import random

actions = ["Rock", "Paper", "Scissors"]

# Q-table: one row per state, one column per action, initialized to zero
Q = np.zeros((3, 3))

alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate

# reward_matrix[agent_action][opponent_action]: +1 win, -1 loss, 0 tie
reward_matrix = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0]
]

The Q-table starts with zeros, meaning the agent initially has no knowledge.


Training the Agent


for episode in range(10000):
    state = random.randint(0, 2)

    # Epsilon-greedy selection: occasionally explore, otherwise exploit
    if random.random() < epsilon:
        action = random.randint(0, 2)
    else:
        action = np.argmax(Q[state])

    opponent = random.randint(0, 2)
    reward = reward_matrix[action][opponent]

    # Q-learning update (the agent's own action is used as the next state)
    Q[state][action] = Q[state][action] + alpha * (
        reward + gamma * np.max(Q[action]) - Q[state][action]
    )

During training the agent sometimes explores random actions to discover better strategies.


CLI Output Example


$ python rps_qlearning.py

Training started...
Episode 1000 complete
Episode 5000 complete
Episode 10000 complete

Final Q Table:
[[ 0.12  0.88 -0.44]
 [-0.32  0.21  0.92]
 [ 0.71 -0.51  0.08]]

Optimal Strategy Learned:
Rock -> Paper
Paper -> Scissors
Scissors -> Rock


Understanding the Q-Table

Each row of the Q-table corresponds to a state and each column to an action; every entry is the expected reward the agent has learned for that state-action pair.

State      Rock    Paper   Scissors
Rock        0.12    0.88   -0.44
Paper      -0.32    0.21    0.92
Scissors    0.71   -0.51    0.08
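
Assuming the Q array, actions list, and np import from the training script above, a small helper (hypothetical, and reading the state index as the opponent's previous move) turns the table into a playable counter-strategy:

def best_response(state_index):
    # Pick the action with the highest learned value for this state
    return actions[int(np.argmax(Q[state_index]))]

for i, name in enumerate(actions):
    print(f"Against {name}: play {best_response(i)}")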


💡 Key Insights

  • Reinforcement Learning learns through rewards
  • Q-learning uses a table of expected action rewards
  • Exploration allows discovery of better strategies
  • Rock Paper Scissors demonstrates RL concepts clearly
  • Q-tables help interpret the learning process


Author: Subham

Saturday, August 3, 2024

Challenges with Alpha-Beta Pruning in Real-Life Chess Scenarios




🧠 Introduction

Alpha-beta pruning is a widely used optimization technique in artificial intelligence, especially in game-playing systems like chess engines.

While it dramatically reduces computation, real-world chess scenarios introduce complexities that challenge its effectiveness.

💡 Core Idea: Alpha-beta pruning speeds up decision-making, but may miss deeper strategic opportunities if not implemented carefully.

♟️ What is Alpha-Beta Pruning?

Alpha-beta pruning is an enhancement of the minimax algorithm that eliminates branches of the game tree that do not need to be explored.

  • Alpha: Best already explored option for maximizer
  • Beta: Best already explored option for minimizer
📖 A Simple Explanation

Imagine evaluating moves in chess. If you already found a strong move, you can skip exploring weaker alternatives.


⚠️ Key Challenges in Real Chess

1. Initial Promising Moves

A move that looks strong initially might actually be inferior in deeper analysis.

2. Delayed Tactical Benefits

Some sacrifices only pay off after multiple moves — pruning may remove them early.

3. Move Ordering Dependency

Efficiency heavily depends on evaluating the best moves first.

๐Ÿ” Why Move Ordering Matters

If bad moves are evaluated first, pruning becomes ineffective, leading to higher computation and missed opportunities.


♞ Real-Life Chess Scenario

Consider sacrificing a queen for long-term positional advantage.

  • Short-term: Material loss ❌
  • Long-term: Winning position ✅
💡 Engines without deep evaluation may reject such moves prematurely.

💻 Code Example (Python)

def alpha_beta(node, depth, alpha, beta, maximizing):
    if depth == 0 or node.is_terminal():
        return node.evaluate()

    if maximizing:
        value = float('-inf')
        for child in node.children():
            value = max(value, alpha_beta(child, depth-1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # Beta cut-off
        return value
    else:
        value = float('inf')
        for child in node.children():
            value = min(value, alpha_beta(child, depth-1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break  # Alpha cut-off
        return value
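
To see the pruning in action, the function can be exercised on a toy game tree; the Node class below is only a stand-in for a real chess position, not part of an engine:

class Node:
    # Minimal stand-in for a game position: leaf nodes carry a static score
    def __init__(self, score=None, children=None):
        self._score = score
        self._children = children or []

    def is_terminal(self):
        return not self._children

    def evaluate(self):
        return self._score

    def children(self):
        return self._children

# Two-ply toy tree: the maximizer picks a move, the minimizer replies
tree = Node(children=[
    Node(children=[Node(score=3), Node(score=5)]),
    Node(children=[Node(score=2), Node(score=9)]),
])

print(alpha_beta(tree, 2, float('-inf'), float('inf'), True))
# Prints 3; the branch containing the 9 is cut off once the 2 is seen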

🖥️ CLI Output Simulation

Evaluating Move: Qxh7+ (Sacrifice)
Depth 1 Score: -9
Depth 3 Score: +2
Depth 5 Score: +8 (Winning)

Alpha-Beta Pruning Skipped Branches: 42%
Nodes Evaluated: 1,230

🚀 Heuristics & Improvements

  • Better Evaluation Functions – Capture positional strength
  • Move Ordering – Evaluate captures and checks first (see the sketch after this list)
  • Iterative Deepening – Gradually increase search depth
  • Transposition Tables – Cache previously evaluated positions
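
One simple way to add move ordering (a sketch; static_eval stands for a cheap, hypothetical position scorer, not a specific engine API) is to sort child nodes before the loop inside alpha_beta:

def ordered_children(node, maximizing, static_eval):
    # Search the most promising moves first so cut-offs happen earlier
    return sorted(node.children(), key=static_eval, reverse=maximizing)

Replacing "for child in node.children():" with "for child in ordered_children(node, maximizing, static_eval):" in both branches leaves the result unchanged but typically prunes far more nodes.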

📊 Advanced Optimization Techniques

  • Killer Heuristic
  • History Heuristic
  • Principal Variation Search

🎯 Key Takeaways

  • Alpha-beta pruning is powerful but not foolproof
  • Move ordering directly impacts efficiency
  • Deep tactics can be missed without good heuristics
  • Real-world chess requires hybrid strategies

