Sunday, October 20, 2024

# Teaching an AI to Play Tic-Tac-Toe Using Reinforcement Learning

Reinforcement learning (RL) is an exciting branch of machine learning where an agent learns to make decisions by interacting with an environment. Unlike traditional supervised learning, the agent in RL is not explicitly told what action to take; instead, it explores, receives feedback, and improves over time based on rewards.

In this post, we will explore how reinforcement learning works by building a simple Tic-Tac-Toe game environment and training an AI agent to play it using a popular RL algorithm called **Q-Learning**.

### What is Reinforcement Learning?

At its core, reinforcement learning involves an agent that takes actions in an environment. The environment responds with a new state and a reward, and the agent's goal is to maximize the total reward it receives over time. This framework can be applied to many problems like robotics, gaming, and decision-making systems.

Here's a brief breakdown of the key components in RL:
- **Agent**: The learner or decision-maker (our AI player in this case).
- **Environment**: Where the agent acts (the Tic-Tac-Toe board).
- **Actions**: The possible moves the agent can take (placing its marker in one of the nine cells).
- **State**: A representation of the current situation in the environment (the current board configuration).
- **Reward**: Feedback from the environment after each action (positive for winning, negative for losing).
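To see how these components fit together, here is a minimal Python sketch of the generic agent-environment loop. The `env` and `agent` objects are placeholders for the pieces we build in the steps below:

```python
# A minimal sketch of the generic RL interaction loop.
# `env` and `agent` are placeholders for the pieces built below.
state = env.reset()                    # start a fresh episode
done = False
while not done:
    action = agent.choose_action(state)            # agent picks a move
    next_state, reward, done = env.step(action)    # environment responds
    agent.learn(state, action, reward, next_state) # agent updates itself
    state = next_state
```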

Now, let's dive into how we can train an AI agent to play Tic-Tac-Toe using these principles.

### Step 1: The Tic-Tac-Toe Environment

We first create the Tic-Tac-Toe game environment. This environment will be responsible for:
1. Managing the game board (a 3x3 grid).
2. Letting two players (Player 1 and Player 2) take turns making their moves.
3. Determining when a player wins or the game ends in a draw.

In the environment class, the `step()` method takes an action (the position where the player wants to place their marker) and returns:
- The new state of the board.
- A reward: +1 for winning, 0 for a draw, and -1 for an invalid move.
- Whether the game is finished or not.

We also have a `render()` method to visualize the board and a `reset()` method to reset the board for a new game.
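Here is one possible sketch of such an environment in Python. The class and method names mirror the description above, but the internal details (a NumPy board, ending the episode on an invalid move, returning the board as a hashable tuple) are assumptions made for illustration:

```python
import numpy as np

class TicTacToeEnv:
    """A sketch of the Tic-Tac-Toe environment described above.
    Board cells: 0 = empty, 1 = Player 1, 2 = Player 2."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.board = np.zeros((3, 3), dtype=int)
        self.current_player = 1
        return tuple(self.board.flatten())  # hashable state for the Q-table

    def step(self, action):
        row, col = divmod(action, 3)        # action is a cell index 0-8
        if self.board[row, col] != 0:
            return tuple(self.board.flatten()), -1, True   # invalid move
        self.board[row, col] = self.current_player
        if self._is_winner(self.current_player):
            return tuple(self.board.flatten()), 1, True    # win
        if not (self.board == 0).any():
            return tuple(self.board.flatten()), 0, True    # draw
        self.current_player = 3 - self.current_player      # switch player
        return tuple(self.board.flatten()), 0, False

    def _is_winner(self, p):
        b = self.board
        return (any((b[i, :] == p).all() for i in range(3)) or
                any((b[:, j] == p).all() for j in range(3)) or
                (np.diag(b) == p).all() or
                (np.diag(np.fliplr(b)) == p).all())

    def render(self):
        symbols = {0: '.', 1: 'X', 2: 'O'}
        for row in self.board:
            print(' '.join(symbols[v] for v in row))
        print()
```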

### Step 2: The Q-Learning Agent

Q-Learning is a popular RL algorithm that allows the agent to learn which actions to take based on its experiences. The agent uses a **Q-table** to store the expected rewards for different actions in different states.

#### How Does the Q-Table Work?

The Q-table is a mapping between **states** and **actions**. For each state (the board configuration), the table stores a value (Q-value) for each possible action (placing a marker in any of the 9 cells). These Q-values represent how "good" it is to take a particular action in a particular state.
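One simple way to represent this in Python, as a sketch, is a dictionary that maps each board state (a hashable tuple of nine cells) to an array of nine Q-values:

```python
from collections import defaultdict
import numpy as np

# Each unseen state starts with all-zero Q-values, one per cell.
q_table = defaultdict(lambda: np.zeros(9))

state = (0,) * 9                              # the empty board
best_action = int(np.argmax(q_table[state]))  # greedy action for this state
```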

Initially, the Q-table is empty or filled with random values. Over time, the agent learns to update these values based on its experiences using the following rule:


```
Q(state, action) = Q(state, action)
                 + learning_rate * (reward + discount_factor * max(Q(next_state)) - Q(state, action))
```


Let’s break this down:
- `Q(state, action)` is the current Q-value for a particular state-action pair.
- `learning_rate` controls how much new information overrides old information.
- `reward` is the feedback received after taking the action.
- `discount_factor` determines how much future rewards are considered (a value close to 1 means future rewards are highly valued).
- `max(Q(next_state))` is the highest Q-value for the next state, which represents the best future action the agent can take.
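As a sketch, the update rule translates almost line for line into Python. This reuses the `q_table` from above; the default `learning_rate` and `discount_factor` values are illustrative:

```python
import numpy as np

def update_q(q_table, state, action, reward, next_state,
             learning_rate=0.1, discount_factor=0.9):
    # Temporal-difference target: immediate reward plus the discounted
    # value of the best action available in the next state.
    td_target = reward + discount_factor * np.max(q_table[next_state])
    # Move the current estimate a small step toward the target.
    q_table[state][action] += learning_rate * (td_target - q_table[state][action])
```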

#### Exploration vs Exploitation

During training, the agent has to decide whether to **explore** new actions (try something random) or **exploit** its current knowledge (pick the action with the highest Q-value). This balance is controlled by a probability `exploration_prob`, which starts high (the agent explores more) and decays over time as the agent learns.
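A typical epsilon-greedy implementation of this trade-off looks like the sketch below, where `exploration_prob` plays the role of epsilon and `valid_actions` (the empty cells) is an assumption about how legal moves are supplied:

```python
import random

def choose_action(q_table, state, valid_actions, exploration_prob):
    # Explore: with probability exploration_prob, pick a random valid cell.
    if random.random() < exploration_prob:
        return random.choice(valid_actions)
    # Exploit: otherwise pick the valid cell with the highest Q-value.
    return max(valid_actions, key=lambda a: q_table[state][a])
```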

### Step 3: Training the Agent

Training the agent involves playing thousands of games of Tic-Tac-Toe. Each game is an episode, and within each episode:
1. The agent resets the board.
2. The agent selects an action based on the current state using the Q-table.
3. The environment updates the board, returns the reward, and moves to the next state.
4. The agent updates its Q-table based on the reward and the next state.

Over time, the agent becomes better at picking actions that lead to wins, and its Q-table is filled with useful values that help it make optimal decisions.
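Putting the pieces above together, a training loop might look like the following sketch. Note the simplifying assumption: both players share one Q-table and learn from the same updates (self-play), and the episode count is illustrative.

```python
env = TicTacToeEnv()
exploration_prob = 1.0            # start fully exploratory

for episode in range(50_000):     # illustrative number of games
    state = env.reset()
    done = False
    while not done:
        valid = [a for a in range(9) if state[a] == 0]  # empty cells
        action = choose_action(q_table, state, valid, exploration_prob)
        next_state, reward, done = env.step(action)
        update_q(q_table, state, action, reward, next_state)
        state = next_state
    # exploration_prob is decayed here; see "Epsilon Decay" below.
```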

#### Epsilon Decay

In the beginning, the agent often chooses random actions to explore the environment. However, as it learns, it should start relying more on its knowledge. This is achieved by gradually shrinking the exploration probability by a decay factor (`epsilon_decay`) after each episode, so the agent exploits its learned knowledge more as training progresses.
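A common way to implement this, as a sketch with illustrative constants, is a single line applied at the end of each episode:

```python
# Shrink epsilon geometrically after each episode, with a small
# floor so the agent never stops exploring entirely.
exploration_prob = max(0.01, exploration_prob * 0.9999)
```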

### Step 4: Testing the Agent

After training, we can test the agent by playing a few games and observing how it performs. The agent now uses its learned Q-table to make decisions, and we can see it choosing moves that give it the best chance of winning.

During testing, we visualize the game board after each move, allowing us to observe the agent’s strategies. The agent can either win, draw, or lose based on its learned experience.
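For testing, a sketch like the one below turns exploration off (`exploration_prob=0.0`) so the agent plays greedily from its learned Q-table, rendering the board after every move:

```python
state = env.reset()
done = False
env.render()
while not done:
    valid = [a for a in range(9) if state[a] == 0]
    # Greedy play: with exploration off, the agent always exploits.
    action = choose_action(q_table, state, valid, exploration_prob=0.0)
    state, reward, done = env.step(action)
    env.render()
print("Final reward:", reward)
```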

### Conclusion

In this post, we've walked through how reinforcement learning can be applied to teach an AI agent to play Tic-Tac-Toe. We used a simple Q-learning algorithm, which allowed the agent to improve its decision-making through trial and error over thousands of games. The key takeaway is that RL allows an agent to learn from experience and adjust its actions based on the rewards it receives.

This simple example demonstrates the power of reinforcement learning in problem-solving. By training on a game like Tic-Tac-Toe, we’ve laid the groundwork for applying these techniques to more complex problems in areas like robotics, finance, and autonomous systems.

If you want to try this out for yourself, you can experiment with different parameters like the learning rate, discount factor, and exploration probability to see how they affect the agent’s performance. Happy coding!
