
Sunday, October 20, 2024

Teaching an AI to Play Tic-Tac-Toe Using Reinforcement Learning

Reinforcement learning (RL) is an exciting branch of machine learning where an agent learns to make decisions by interacting with an environment. Unlike traditional supervised learning, the agent in RL is not explicitly told what action to take; instead, it explores, receives feedback, and improves over time based on rewards.

In this blog, we will explore how reinforcement learning works by building a simple Tic-Tac-Toe game environment and training an AI agent to play the game using a popular RL algorithm called **Q-Learning**.

### What is Reinforcement Learning?

At its core, reinforcement learning involves an agent that takes actions in an environment. The environment responds with a new state and a reward, and the agent's goal is to maximize the total reward it receives over time. This framework can be applied to many problems like robotics, gaming, and decision-making systems.

Here's a brief breakdown of the key components in RL:
- **Agent**: The learner or decision-maker (our AI player in this case).
- **Environment**: Where the agent acts (the Tic-Tac-Toe board).
- **Actions**: The possible moves the agent can take (placing its marker in one of the empty cells).
- **State**: A representation of the current situation in the environment (the current board configuration).
- **Reward**: Feedback from the environment after each action (positive for winning, negative for losing).

Now, let's dive into how we can train an AI agent to play Tic-Tac-Toe using these principles.

### Step 1: The Tic-Tac-Toe Environment

We first create the Tic-Tac-Toe game environment. This environment will be responsible for:
1. Managing the game board (a 3x3 grid).
2. Letting two players (Player 1 and Player 2) take turns making their moves.
3. Determining when a player wins or the game ends in a draw.

In the environment class, the `step()` method takes an action (the position where the player wants to place their marker) and returns:
- The new state of the board.
- A reward: +1 for a win, 0 for a draw or an ordinary (non-terminal) move, and -1 for an invalid move.
- A flag indicating whether the game is finished.

We also have a `render()` method to visualize the board and a `reset()` method to reset the board for a new game.
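A minimal sketch of such an environment is shown below. The class and method names (`TicTacToeEnv`, `step()`, `reset()`, `render()`) follow the description above, but the internal representation (a flat NumPy array of 9 cells, with 0 = empty, 1 = 'X', 2 = 'O') is one possible choice, not the only one:

```python
import numpy as np

class TicTacToeEnv:
    """Minimal Tic-Tac-Toe environment: 9 cells, 0 = empty, 1 = 'X', 2 = 'O'."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear the board and return the initial state."""
        self.board = np.zeros(9, dtype=int)
        self.current_player = 1
        return tuple(self.board)

    def step(self, action):
        """Place the current player's marker; return (state, reward, done)."""
        if self.board[action] != 0:              # invalid move: cell occupied
            return tuple(self.board), -1, True
        self.board[action] = self.current_player
        if self._winner(self.current_player):    # current player just won
            return tuple(self.board), +1, True
        if not (self.board == 0).any():          # board full: draw
            return tuple(self.board), 0, True
        self.current_player = 3 - self.current_player  # switch 1 <-> 2
        return tuple(self.board), 0, False

    def _winner(self, p):
        """Check all 8 winning lines for player p."""
        lines = [(0,1,2), (3,4,5), (6,7,8),      # rows
                 (0,3,6), (1,4,7), (2,5,8),      # columns
                 (0,4,8), (2,4,6)]               # diagonals
        return any(all(self.board[i] == p for i in line) for line in lines)

    def render(self):
        """Print the board using '.' for empty cells."""
        symbols = {0: '.', 1: 'X', 2: 'O'}
        for row in range(3):
            print(' '.join(symbols[c] for c in self.board[3*row : 3*row + 3]))
```

Note that states are returned as tuples so they can later serve as dictionary keys in the Q-table.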

### Step 2: The Q-Learning Agent

Q-Learning is a popular RL algorithm that allows the agent to learn which actions to take based on its experiences. The agent uses a **Q-table** to store the expected rewards for different actions in different states.

#### How Does the Q-Table Work?

The Q-table is a mapping between **states** and **actions**. For each state (the board configuration), the table stores a value (Q-value) for each possible action (placing a marker in any of the 9 cells). These Q-values represent how "good" it is to take a particular action in a particular state.

Initially, the Q-table is empty or filled with random values. Over time, the agent learns to update these values based on its experiences using the following rule:

```
Q(state, action) ← Q(state, action) + learning_rate * (reward + discount_factor * max(Q(next_state)) - Q(state, action))
```
Let’s break this down:
- `Q(state, action)` is the current Q-value for a particular state-action pair.
- `learning_rate` controls how much new information overrides old information.
- `reward` is the feedback received after taking the action.
- `discount_factor` determines how much future rewards are considered (a value close to 1 means future rewards are highly valued).
- `max(Q(next_state))` is the highest Q-value for the next state, which represents the best future action the agent can take.
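The update rule translates almost directly into code. The sketch below assumes the Q-table is stored as a dictionary mapping states to per-action value dictionaries (one reasonable representation; a nested array would work too):

```python
def q_update(q_table, state, action, reward, next_state,
             learning_rate=0.1, discount_factor=0.9):
    """Apply one Q-Learning update to a dict-of-dicts Q-table in place."""
    current = q_table.get(state, {}).get(action, 0.0)
    # Best Q-value reachable from the next state (0 if unseen or terminal).
    best_next = max(q_table.get(next_state, {}).values(), default=0.0)
    # Target = immediate reward plus discounted best future value.
    target = reward + discount_factor * best_next
    q_table.setdefault(state, {})[action] = current + learning_rate * (target - current)
```

Unseen state-action pairs default to a Q-value of 0.0, which plays the role of the "empty" initial table described above.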

#### Exploration vs Exploitation

During training, the agent has to decide whether to **explore** new actions (try something random) or **exploit** its current knowledge (pick the action with the highest Q-value). This balance is controlled by a probability `exploration_prob`, which starts high (the agent explores more) and decays over time as the agent learns.
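This epsilon-greedy choice can be sketched in a few lines. The function below assumes the same dict-of-dicts Q-table as before and that the caller supplies the list of currently legal moves:

```python
import random

def choose_action(q_table, state, valid_actions, exploration_prob):
    """Epsilon-greedy: random move with probability exploration_prob, else best known."""
    if random.random() < exploration_prob:
        return random.choice(valid_actions)              # explore
    q_values = q_table.get(state, {})
    # Exploit: pick the valid action with the highest known Q-value (default 0).
    return max(valid_actions, key=lambda a: q_values.get(a, 0.0))
```

Restricting the choice to `valid_actions` means the agent never even considers occupied cells once it is exploiting, which speeds up learning.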

### Step 3: Training the Agent

Training the agent involves playing thousands of games of Tic-Tac-Toe. Each game is an episode, and within each episode:
1. The agent resets the board.
2. The agent selects an action based on the current state using the Q-table.
3. The environment updates the board, returns the reward, and moves to the next state.
4. The agent updates its Q-table based on the reward and the next state.

Over time, the agent becomes better at picking actions that lead to wins, and its Q-table is filled with useful values that help it make optimal decisions.

#### Epsilon Decay

In the beginning, the agent often chooses random actions to explore the environment. However, as it learns, it should start relying more on its knowledge. This is achieved by multiplying the exploration probability by a decay factor (`epsilon_decay`) after each episode, allowing the agent to exploit its learned knowledge more as training progresses.
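The decay schedule can be sketched as follows. The specific constants (initial probability, decay factor, floor) are illustrative choices, not the only valid ones:

```python
exploration_prob = 1.0    # start fully exploratory
epsilon_decay = 0.999     # multiplicative decay applied once per episode
min_epsilon = 0.01        # floor so the agent never stops exploring entirely

for episode in range(10_000):
    # ... play one full game, updating the Q-table after each step ...
    exploration_prob = max(min_epsilon, exploration_prob * epsilon_decay)
```

Keeping a small floor on epsilon is a common safeguard: without it the agent can lock in early mistakes because it never revisits alternatives.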

### Step 4: Testing the Agent

After training, we can test the agent by playing a few games and observing how it performs. The agent now uses its learned Q-table to make decisions, and we can see it choosing moves that give it the best chance of winning.

During testing, we visualize the game board after each move, allowing us to observe the agent’s strategies. The agent can either win, draw, or lose based on its learned experience.
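At test time the agent simply plays greedily. A minimal helper for that, again assuming the dict-of-dicts Q-table and the 0/1/2 board encoding used earlier:

```python
def best_move(q_table, board):
    """Greedy move for testing: no exploration, highest-Q empty cell."""
    valid = [i for i, cell in enumerate(board) if cell == 0]
    q_values = q_table.get(tuple(board), {})
    return max(valid, key=lambda a: q_values.get(a, 0.0))
```

This is equivalent to calling the epsilon-greedy policy with the exploration probability set to zero.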

### Conclusion

In this blog, we’ve walked through how reinforcement learning can be applied to teach an AI agent to play Tic-Tac-Toe. We used a simple Q-learning algorithm, which allowed the agent to improve its decision-making through trial and error over thousands of games. The key takeaway is that RL allows an agent to learn from experience and adjust its actions based on the rewards it receives.

This simple example demonstrates the power of reinforcement learning in problem-solving. By training on a game like Tic-Tac-Toe, we’ve laid the groundwork for applying these techniques to more complex problems in areas like robotics, finance, and autonomous systems.

If you want to try this out for yourself, you can experiment with different parameters like the learning rate, discount factor, and exploration probability to see how they affect the agent’s performance. Happy coding!

Reinforcement Learning for Tic-Tac-Toe Using Q-Learning

You want to create and train an artificial agent to play **Tic-Tac-Toe** using **Reinforcement Learning**. Specifically, you will use a Q-Learning algorithm, where the agent learns to make optimal decisions by exploring different game states and learning from the rewards it gets. The game environment will provide feedback to the agent by updating the board with each move and indicating when a player has won or the game has ended in a draw. The agent will learn over many rounds (episodes) and gradually improve its performance in selecting the best moves.

### Solution:
1. **Environment Setup**: 
   - A custom Tic-Tac-Toe game environment is created. The game board is a 3x3 grid, initially empty. Two players take turns placing their respective markers ('X' and 'O').
   - The game tracks the current player and provides actions (placing a marker in an empty cell). The possible actions (placing in 9 different cells) are represented by numbers 0 to 8.
   - The environment checks for a winner after each move, and if no winner exists and the board is full, the game ends in a draw.
   - Each game step returns an observation (the board's state), a reward (positive for a win, negative for an invalid move), and a signal if the game is over (done).

2. **Q-Learning Agent**:
   - The agent’s job is to learn the optimal strategy for playing Tic-Tac-Toe. It does this by using a **Q-table**, where each possible board configuration (state) is mapped to a value representing the expected reward for taking each action (placing a marker in one of the 9 cells).
   - At the start, the agent explores different actions to learn the effects. As it gains experience, it balances between exploration (trying new actions) and exploitation (selecting actions based on what it has learned).
   - The Q-table is updated using the **Q-Learning update rule**, which uses feedback from each step (reward and next state) to adjust the action values.

3. **Training the Agent**:
   - The agent is trained over 10,000 episodes. In each episode, it plays a game of Tic-Tac-Toe by selecting actions (moves) based on the current state of the board.
   - At each step, the agent takes an action, receives feedback from the environment, and updates its Q-table based on the rewards.
   - As training progresses, the agent becomes better at identifying the best moves by using the Q-values stored in the Q-table. The exploration rate (chance of taking a random action) gradually decreases, meaning the agent increasingly exploits its learned knowledge to win games.

4. **Testing the Agent**:
   - After training, the agent is tested over a set number of games. During testing, the agent plays without much exploration, meaning it mainly uses the strategies it learned during training to win.
   - During the test games, the board is displayed after each move, and the result (win or loss) is printed at the end of the game.

### Key Concepts:
- **Exploration vs Exploitation**: In the beginning, the agent explores by making random moves to gather information. Over time, it starts exploiting its knowledge by choosing the best possible move based on what it has learned.
- **Q-Learning**: The Q-learning algorithm updates the value for each state-action pair based on the rewards received and the estimated value of future states. This helps the agent learn an optimal strategy for playing Tic-Tac-Toe.
- **Game Feedback**: Each game gives feedback in the form of rewards (positive for winning, negative for invalid moves) and the game status (ongoing, won, or draw), which the agent uses to adjust its strategy.

### Final Outcome:
After training, the agent has learned a strong Tic-Tac-Toe strategy (approaching optimal play given enough episodes). In the testing phase, it plays with markedly improved decision-making, greatly increasing its chances of winning or drawing against its opponent.
