This blog explores data science and networking, combining theoretical concepts with practical implementations. Topics include routing protocols, network operations, and data-driven problem solving, presented with clarity and reproducibility in mind.
Wednesday, December 11, 2024
How DQN and Fitted Q Iteration Work in Reinforcement Learning
Tuesday, December 10, 2024
A Beginner’s Guide to LSPI and Fitted Q Iteration in Reinforcement Learning
🧠 LSPI vs Fitted Q Iteration (FQI)
Reinforcement learning (RL) teaches an agent to make decisions that maximize reward. When data is limited, Least-Squares Policy Iteration (LSPI) and Fitted Q Iteration (FQI) are two powerful, data-efficient approaches.
- Policy: A rule mapping states to actions
- Q-Function: Expected long-term reward of taking an action in a state
Q(state, action) → expected future reward
LSPI improves a policy by estimating the Q-function using least-squares regression over a fixed dataset.
How LSPI Works
- Collect experience data (S, A, R, S')
- Represent states/actions with features
- Solve Q-function using least-squares
- Update policy greedily
Dataset → Feature Matrix → Least-Squares Q → Greedy Policy Update
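To make the least-squares step concrete, here is a minimal sketch of the solve at the heart of LSPI (the LSTDQ step). The feature matrices and rewards below are randomly generated placeholders, not a real dataset:

```
import numpy as np

# Placeholder batch: n transitions, d features (illustrative values only)
rng = np.random.default_rng(0)
n, d = 200, 4
phi = rng.normal(size=(n, d))       # features of (s, a)
phi_next = rng.normal(size=(n, d))  # features of (s', pi(s'))
rewards = rng.normal(size=n)
gamma = 0.9

# LSTDQ solve: A w = b with A = phi^T (phi - gamma * phi_next), b = phi^T r
A = phi.T @ (phi - gamma * phi_next)
b = phi.T @ rewards
w = np.linalg.solve(A, b)           # Q(s, a) ≈ phi(s, a) · w
print(w)
```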
Strengths of LSPI:
- Data efficient
- Offline learning
- Handles continuous state/action spaces
- Interpretable linear models
FQI learns the Q-function by repeatedly fitting it to Bellman updates using powerful function approximators.
Q(s, a) = r + γ · max_a' Q(s', a')
FQI Process
- Initialize Q-function
- Apply Bellman update to dataset
- Fit a model (NN, tree, etc.)
- Repeat until convergence (a minimal sketch follows below)
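As a sketch of this loop, the following uses a scikit-learn regressor on a toy one-dimensional problem; the environment, reward signal, and hyperparameters are all illustrative assumptions, not from the article:

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy batch of transitions (s, a, r, s'): 1-D states, 2 discrete actions
rng = np.random.default_rng(0)
S = rng.uniform(0, 1, size=(500, 1))
A = rng.integers(0, 2, size=500)
R = S[:, 0] * (A == 1)                      # hypothetical reward signal
S2 = np.clip(S + rng.normal(0, 0.1, size=S.shape), 0, 1)
gamma, n_actions = 0.9, 2

X = np.column_stack([S, A])
model = None
for _ in range(20):                          # fixed iterations as a stand-in for convergence
    if model is None:
        targets = R                          # first pass: fit immediate reward
    else:
        q_next = np.column_stack([
            model.predict(np.column_stack([S2, np.full(len(S2), a)]))
            for a in range(n_actions)
        ])
        targets = R + gamma * q_next.max(axis=1)   # Bellman update
    model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, targets)
```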
| Aspect | LSPI | FQI |
|---|---|---|
| Main Focus | Policy improvement | Q-function approximation |
| Function Approximation | Linear features | Neural nets / trees |
| Data Size | Small to medium | Medium to large |
| Interpretability | High | Lower |
Use LSPI if:
- Limited data
- Simple features
- Need interpretability
Use FQI if:
- Complex environments
- Large datasets
- Non-linear value functions
💡 Key Takeaways
- Both LSPI and FQI are data-efficient RL methods
- LSPI is simple, linear, and interpretable
- FQI is powerful and scales to complex problems
- Choice depends on data size and environment complexity
A Beginner’s Guide to LSTD and LSTDQ in Reinforcement Learning
Wednesday, October 23, 2024
Regret Optimality Explained in Reinforcement Learning (Simple Guide)
🎯 Regret & Regret Optimality in Reinforcement Learning
In reinforcement learning (RL), one of the key objectives is for an agent to learn how to maximize cumulative rewards while interacting with an environment. However, achieving this is not always straightforward. This is where the concept of regret comes into play.
Regret measures how much reward an agent could have earned if it had followed the optimal policy from the very beginning.
It represents the opportunity cost of learning — the gap between ideal performance and actual performance.
$ Optimal policy reward: 1000
$ Agent collected reward: 850
$ Regret = 1000 - 850
$ Regret = 150
The regret after T time steps is defined as:
R(T) = T · V*(s₀) − Σ_{t=1}^{T} V(π, s₀, t)
- T: Total time steps
- V*(s₀): Optimal value from the initial state
- π: Agent's learned policy
Exploration allows the agent to discover new actions, while exploitation focuses on known high-reward actions.
Regret reflects the cost of exploration — early mistakes increase regret, but learning reduces it over time.
A regret bound provides an upper limit on how much regret an algorithm accumulates.
R(T) = O(√T)
Sub-linear regret means the agent improves over time and learns efficiently.
- Faster convergence to optimal behavior
- Reduced opportunity cost during learning
- Better real-world decision-making
Applications include:
- Autonomous driving
- Recommendation systems
- Financial trading strategies
- Episodic: Regret measured across multiple episodes
- Continuing: Regret measured over long, uninterrupted interaction
Continuing tasks are often more challenging due to non-stationary environments.
Upper Confidence Bound (UCB)
Balances exploration and exploitation using confidence intervals.
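Here is a minimal UCB1 sketch on a hypothetical 3-armed Bernoulli bandit; the arm means are made up for illustration, and the cumulative regret it accumulates grows roughly like √T:

```
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])   # hypothetical true arm means
T = 5000
counts = np.ones(3)                  # play each arm once to initialize
values = (rng.random(3) < means).astype(float)
regret = 0.0

for t in range(3, T):
    ucb = values + np.sqrt(2 * np.log(t) / counts)   # mean + confidence bonus
    arm = int(np.argmax(ucb))
    r = float(rng.random() < means[arm])
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]   # incremental mean update
    regret += means.max() - means[arm]               # expected per-step regret

print(f"Cumulative regret after {T} steps: {regret:.1f}")
```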
Thompson Sampling
Uses probabilistic belief sampling to select actions.
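A minimal Thompson Sampling sketch for the same kind of Bernoulli bandit, assuming Beta posteriors over each arm's success probability:

```
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])   # hypothetical true arm means
successes = np.ones(3)               # Beta posterior parameter (alpha)
failures = np.ones(3)                # Beta posterior parameter (beta)

for t in range(2000):
    samples = rng.beta(successes, failures)   # sample one belief per arm
    arm = int(np.argmax(samples))             # act greedily on the sample
    reward = float(rng.random() < means[arm])
    successes[arm] += reward
    failures[arm] += 1.0 - reward

print(successes / (successes + failures))     # posterior means favor the 0.7 arm
```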
Q-Learning with Exploration
Combines value learning with strategies like ε-greedy.
💡 Key Takeaways
- Regret measures lost reward due to learning
- Low regret = efficient learning
- Sub-linear regret indicates improvement over time
- Regret optimality is critical for real-world RL systems
What Is Asymptotic Correctness? A Simple Guide for RL Beginners
Monday, October 21, 2024
Why Exploration Matters in Reinforcement Learning: Beyond Stored Knowledge
Sunday, October 20, 2024
Teaching an AI to Play Tic-Tac-Toe Using Reinforcement Learning
Reinforcement Learning for Tic-Tac-Toe Using Q-Learning
Saturday, October 19, 2024
Reinforcement Learning Basics: Snack Selection with Q-Learning
🥤 The Vending Machine That Learned Your Favorite Snack (A Story About AI)
Imagine this…
You walk into a quiet room. In the corner stands a vending machine glowing softly. It offers three choices:
- 🍟 Chips
- 🍫 Candy
- 🥤 Soda
But here’s the twist… this isn’t a normal vending machine.
It doesn’t know which snack is best—but it wants to figure it out.
Table of Contents
- The Story Begins
- Exploration vs Exploitation
- The Math Made Simple
- 10-Round Simulation
- Code Example
- CLI Output
- What the Machine Learns
- Key Takeaways
- Related Articles
The Story Begins
On Day 1, the vending machine has no idea what tastes good.
So it assigns random scores:
| Snack | Estimated Quality |
|---|---|
| Chips | 0.5 |
| Candy | 0.3 |
| Soda | 0.7 |
But these are just guesses.
The real journey begins when people start using it.
⚖️ Exploration vs Exploitation
Every time someone presses a button, the machine must decide:
- Explore (30%) → Try something random
- Exploit (70%) → Pick the best-known snack
If it only explores, it never settles.
If it only exploits, it might miss something better.
The Math (Super Simple)
1. Updating the Quality Score
\[ Q_{\text{new}} = Q_{\text{old}} + \frac{1}{N}(\text{Reward} - Q_{\text{old}}) \]
What does this mean?
- Q = estimated quality
- Reward = how good the snack actually was
- N = number of times tried
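For example, suppose Candy's current estimate is Q_old = 0.3, it has now been tried N = 2 times, and the latest reward was 0.6 (illustrative numbers). Then:

Q_new = 0.3 + (1/2) × (0.6 − 0.3) = 0.45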
🎮 10 Rounds of Learning
Round 1: Explore → Candy → Reward: 0.6
Round 2: Exploit → Soda → Reward: 0.8
Round 3: Explore → Chips → Reward: 0.7
Round 4: Exploit → Soda → Reward: 0.9
Round 5: Exploit → Soda → Reward: 0.85
Round 6: Explore → Candy → Reward: 0.65
Round 7: Exploit → Soda → Reward: 0.88
Round 8: Exploit → Soda → Reward: 0.9
Round 9: Explore → Chips → Reward: 0.75
Round 10: Exploit → Soda → Reward: 0.92
💻 Code Example
import random

snacks = ["Chips", "Candy", "Soda"]
values = [0.5, 0.3, 0.7]   # initial quality guesses
counts = [1, 1, 1]         # times each snack has been tried
epsilon = 0.3              # exploration rate

for i in range(10):
    if random.random() < epsilon:
        choice = random.randint(0, 2)          # explore: pick a random snack
    else:
        choice = values.index(max(values))     # exploit: pick the best-known snack
    reward = random.uniform(0.6, 1.0)          # simulated customer satisfaction
    counts[choice] += 1
    values[choice] += (reward - values[choice]) / counts[choice]

print(values)
🖥️ CLI Output
Final Estimated Values:
Chips: 0.72
Candy: 0.63
Soda: 0.88
Best Snack: Soda 🥤
🧠 What the Machine Learned
After 10 rounds, something interesting happens…
- Soda consistently gave higher rewards
- The machine started choosing Soda more often
- Exploration helped confirm Soda was truly best
💡 Key Takeaways
- Exploration helps discover new possibilities
- Exploitation maximizes current knowledge
- Balance is crucial
- This is the foundation of Q-learning
🎯 Final Scene
Next time you grab a snack, imagine this vending machine quietly learning… adapting… improving.
Because in the world of AI, even choosing a snack can become a smart decision.
Thursday, October 17, 2024
Turn-Based Game Simulation Using Q-Learning for AI Decision Making
🎮 Learning Q-Learning Through a Game
Let’s move away from formulas for a moment and think in terms of a game.
Two numbers exist: A = 12 and B = 51. Two players take turns — a human and an AI.
On each turn, a player chooses a number k and applies a move:
new_value = old_value - k × other_value
The objective is simple: force either A or B to become zero.
But beneath this simple rule lies a powerful idea — this game is a playground for reinforcement learning.
Table of Contents
- Game Intuition
- How the Game Progresses
- How the AI Learns
- Understanding the Q-Table
- Code Example
- Game Output
- Key Takeaways
🧠 Game Intuition: More Than Just Numbers
At first glance, this looks like a mathematical game. But in reality, it is a decision-making problem under uncertainty.
Every move changes the state of the system. Every decision affects future possibilities.
The AI does not know the best move at the beginning. It learns through experience — by playing, failing, and improving.
Think Deeper
This is exactly how humans learn strategy games. We don’t start with perfect knowledge — we experiment, observe outcomes, and adjust.
How the Game Actually Works
The game unfolds in rounds. Each round begins with the same initial values of A and B.
Players take turns. On each turn:
The player chooses:
1. A value of k
2. Whether to reduce A or B
Then the formula is applied, changing the state.
The moment either value drops to zero or below, the game ends.
What makes this interesting is that every move is not just a step — it is a strategic decision that shapes the entire future of the game.
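In code, a single move can be captured by a small helper like this (a hypothetical sketch; the function name and signature are not from the original game code):

```
def apply_move(a, b, k, reduce_a):
    """Apply new_value = old_value - k * other_value to A or B."""
    if reduce_a:
        return a - k * b, b
    return a, b - k * a

# Example: from A=12, B=51, reducing B with k=2 gives B = 51 - 2*12 = 27
print(apply_move(12, 51, 2, reduce_a=False))   # (12, 27)
```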
🤖 How the AI Learns Over Time
The AI does not start intelligent. Initially, it behaves almost randomly.
Sometimes it explores — trying random values of k. Sometimes it exploits — using what it has learned so far.
This balance between exploration and exploitation is the core of Q-learning.
Over time, the AI begins to notice patterns:
“Certain moves lead to winning more often.” “Certain states are dangerous.”
And slowly, it becomes strategic.
Why Exploration Matters
If the AI only used known strategies, it would never discover better ones. Exploration allows it to improve beyond its current knowledge.
Understanding the Q-Table (The AI's Memory)
The Q-table is where the AI stores its experience.
Each entry answers a question:
"If I am in this state, and I take this action, how good is it?"
The state is defined by the current values of A and B. The action is the chosen k and the variable being reduced.
After every move, the AI updates this table.
If a move leads to winning, it becomes more valuable. If it leads to losing, its value decreases.
Over many games, this table transforms from random guesses into a decision guide.
💻 Code Example
import random

A, B = 12, 51
exploration_prob = 0.3

def choose_action(state, q_table):
    if random.random() < exploration_prob:
        return random.randint(1, 5)         # explore: try a random k
    known = q_table.get(state, {1: 0})      # fall back if the state is unseen
    return max(known, key=known.get)        # exploit: best-known k
This snippet shows how the AI decides between exploring and exploiting.
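The matching update step is not shown above. A minimal sketch, assuming a nested-dict q_table keyed by state and then by the chosen k, and assumed values for the learning rate and discount factor:

```
alpha, gamma = 0.1, 0.9   # assumed learning rate and discount factor

def update_q(q_table, state, action, reward, next_state):
    q_sa = q_table.setdefault(state, {}).setdefault(action, 0.0)
    best_next = max(q_table.get(next_state, {0: 0.0}).values())
    q_table[state][action] = q_sa + alpha * (reward + gamma * best_next - q_sa)
```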
🖥️ Sample Game Output
Game Start: A=12, B=51
AI chooses k=2 → Reduces B → New B=27
Human chooses k=1 → Reduces A → New A=-15
Game Ends
Winner: AI
Each move updates the state — and the AI learns from the result.
💡 Key Takeaways
This simple game reveals a powerful truth:
Learning is not about knowing the answer — it is about improving decisions over time.
Q-learning allows machines to:
- Understand consequences
- Adapt strategies
- Improve through experience
And most importantly, learn without being explicitly told what is correct.
Related Articles
- How Thresholds Shape Decisions
- Hierarchy in Reinforcement Learning
- NLP with Reinforcement Learning
- Decision Making Strategies
- Pruning Decision Trees
Final Thought
What looks like a small game is actually a model of intelligence.
The AI is not just playing — it is learning how to think.
Wednesday, October 16, 2024
Q-Learning Implementation for Rock, Paper, Scissors with Custom Rewards and Strategy Analysis
Implementing Q-Learning for Rock Paper Scissors
This article explains how to train a Reinforcement Learning agent using Q-learning to play the classic game Rock Paper Scissors.
Instead of manually programming strategies, the agent learns through trial and error by observing rewards from its actions.
Table of Contents
- Introduction to Reinforcement Learning
- Game Mechanics
- Reward Matrix Design
- Understanding Q-Learning
- Python Implementation
- Training the Agent
- CLI Training Output
- Understanding the Q-Table
- Interactive Demo
- Key Insights
- Related Articles
Introduction to Reinforcement Learning
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns by interacting with an environment and receiving rewards or penalties.
Instead of learning from labeled datasets, the agent learns through experience.
- Agent takes an action
- Environment returns a reward
- Agent updates its knowledge
Why Reinforcement Learning Matters
Reinforcement Learning powers many modern technologies such as:
- Game-playing AI systems
- Autonomous robotics
- Recommendation engines
- Financial trading algorithms
Game Mechanics
The Rock Paper Scissors game contains three actions:
- Rock
- Paper
- Scissors
Each action has a deterministic outcome against another action.
| Action | Beats |
|---|---|
| Rock | Scissors |
| Paper | Rock |
| Scissors | Paper |
Reward Matrix Design
To train a reinforcement learning agent, we convert game outcomes into numerical rewards.
| Outcome | Reward |
|---|---|
| Win | +1 |
| Loss | -1 |
| Tie | 0 |
These rewards guide the learning algorithm toward optimal strategies.
Understanding Q-Learning
Q-learning is a reinforcement learning algorithm that learns the value of taking an action in a specific state.
The algorithm maintains a table called the Q-table.
The Q-table stores expected rewards for each state-action pair.
Q-Learning Formula
Q(s,a) = Q(s,a) + α [R + γ max(Q(s',a')) − Q(s,a)]
- s = current state
- a = action
- α = learning rate
- γ = discount factor
- R = reward
Intuition Behind Q-Learning
The algorithm updates knowledge using:
- Immediate reward
- Best possible future reward
Over many iterations the values converge toward optimal behavior.
Python Implementation
Initialize Q-table
import numpy as np
import random

actions = ["Rock", "Paper", "Scissors"]
Q = np.zeros((3, 3))
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate

# reward_matrix[agent_action][opponent_action]: +1 win, -1 loss, 0 tie
reward_matrix = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
The Q-table starts with zeros, meaning the agent initially has no knowledge.
Training the Agent
for episode in range(10000):
    state = random.randint(0, 2)       # random starting "state" each episode
    if random.random() < epsilon:
        action = random.randint(0, 2)  # explore
    else:
        action = np.argmax(Q[state])   # exploit the best-known action
    opponent = random.randint(0, 2)    # opponent plays uniformly at random
    reward = reward_matrix[action][opponent]
    Q[state][action] = Q[state][action] + alpha * (
        reward + gamma * np.max(Q[action]) - Q[state][action]
    )
During training the agent sometimes explores random actions to discover better strategies.
CLI Output Example
$ python rps_qlearning.py
Training started...
Episode 1000 complete
Episode 5000 complete
Episode 10000 complete
Final Q Table:
[[ 0.12  0.88 -0.44]
 [-0.32  0.21  0.92]
 [ 0.71 -0.51  0.08]]
Optimal Strategy Learned:
Rock -> Paper
Paper -> Scissors
Scissors -> Rock
Understanding the Q-Table
The Q-table stores expected rewards for each action.
| State | Rock | Paper | Scissors |
|---|---|---|---|
| Rock | 0.12 | 0.88 | -0.44 |
| Paper | -0.32 | 0.21 | 0.92 |
| Scissors | 0.71 | -0.51 | 0.08 |
Interactive Demo
Play against a simple agent:
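The original interactive widget is not reproduced here. A minimal command-line stand-in, assuming the Q table, actions list, reward_matrix, and numpy import from the training code above:

```
def agent_move(player_last_move):
    # Treat the player's previous move as the state and pick the best response.
    return int(np.argmax(Q[player_last_move]))

player = int(input("Your move (0=Rock, 1=Paper, 2=Scissors): "))
ai = agent_move(player)
print(f"You played {actions[player]}, the agent played {actions[ai]}")
print("Reward for agent:", reward_matrix[ai][player])
```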
💡 Key Insights
- Reinforcement Learning learns through rewards
- Q-learning uses a table of expected action rewards
- Exploration allows discovery of better strategies
- Rock Paper Scissors demonstrates RL concepts clearly
- Q-tables help interpret the learning process
Related Articles
- Natural Language Generation with Reinforcement Learning
- Scalar Rewards in Reinforcement Learning
- Hammer Lifestyle Shark Tank Case Study
Author: Subham