# The Vending Machine That Learned Your Favorite Snack (A Story About AI)
Imagine this…
You walk into a quiet room. In the corner stands a vending machine glowing softly. It offers three choices:
- Chips
- Candy
- Soda
But here’s the twist… this isn’t a normal vending machine.
It doesn’t know which snack is best—but it wants to figure it out.
## Table of Contents
- The Story Begins
- Exploration vs Exploitation
- The Math Made Simple
- 10-Round Simulation
- Code Example
- CLI Output
- What the Machine Learns
- Key Takeaways
- Final Scene
## The Story Begins
On Day 1, the vending machine has no idea what tastes good.
So it assigns random scores:
| Snack | Estimated Quality |
|---|---|
| Chips | 0.5 |
| Candy | 0.3 |
| Soda | 0.7 |
But these are just guesses.
The real journey begins when people start using it.
## ⚖️ Exploration vs Exploitation
Every time someone presses a button, the machine must decide:
- Explore (30%) → Try something random
- Exploit (70%) → Pick the best-known snack
If it only explores, it never settles.
If it only exploits, it might miss something better.
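This decision rule is usually called epsilon-greedy. A minimal sketch (the function name and the 30% exploration rate mirror the story; nothing here is a specific library API):

```python
import random

def choose_snack(values, epsilon=0.3):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(values))  # explore: any snack at random
    return values.index(max(values))          # exploit: highest estimate wins

# With epsilon = 0 the machine always exploits, so it picks Soda (index 2,
# the highest Day 1 estimate):
print(choose_snack([0.5, 0.3, 0.7], epsilon=0.0))  # → 2
```

Setting `epsilon` higher makes the machine more curious; setting it to zero freezes it on whatever currently looks best.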
## The Math (Super Simple)
### 1. Updating the Quality Score
\[ Q_{\text{new}} = Q_{\text{old}} + \frac{1}{N}\left(\text{Reward} - Q_{\text{old}}\right) \]
What does this mean?
- Q = estimated quality
- Reward = how good the snack actually was
- N = number of times tried
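Plugging numbers in makes the rule concrete. Say Soda's current estimate is 0.7, this is its 2nd try (N = 2), and the reward comes back as 0.9:

```python
Q_old, reward, N = 0.7, 0.9, 2

# One step of the update rule: move 1/N of the way toward the new reward
Q_new = Q_old + (reward - Q_old) / N
print(round(Q_new, 2))  # → 0.8
```

The 1/N factor means early rewards move the estimate a lot, while later rewards only nudge it, so the score stabilizes as evidence accumulates.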
## 10 Rounds of Learning
```
Round 1:  Explore → Candy  → Reward: 0.6
Round 2:  Exploit → Soda   → Reward: 0.8
Round 3:  Explore → Chips  → Reward: 0.7
Round 4:  Exploit → Soda   → Reward: 0.9
Round 5:  Exploit → Soda   → Reward: 0.85
Round 6:  Explore → Candy  → Reward: 0.65
Round 7:  Exploit → Soda   → Reward: 0.88
Round 8:  Exploit → Soda   → Reward: 0.9
Round 9:  Explore → Chips  → Reward: 0.75
Round 10: Exploit → Soda   → Reward: 0.92
```
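The log can be replayed deterministically to watch the estimates evolve; the starting values come from the Day 1 table, and the (snack, reward) pairs are copied straight from the rounds listed above:

```python
snacks = ["Chips", "Candy", "Soda"]
values = {"Chips": 0.5, "Candy": 0.3, "Soda": 0.7}  # Day 1 estimates
counts = {s: 1 for s in snacks}

# (snack, reward) pairs copied from the simulation log
log = [("Candy", 0.6), ("Soda", 0.8), ("Chips", 0.7), ("Soda", 0.9),
       ("Soda", 0.85), ("Candy", 0.65), ("Soda", 0.88), ("Soda", 0.9),
       ("Chips", 0.75), ("Soda", 0.92)]

for snack, reward in log:
    counts[snack] += 1
    values[snack] += (reward - values[snack]) / counts[snack]

for s in snacks:
    print(f"{s}: {values[s]:.2f}")
```

Replayed this way, Soda climbs to about 0.85, the highest estimate, which is exactly why the exploit rounds keep picking it.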
## Code Example
```python
import random

snacks = ["Chips", "Candy", "Soda"]
values = [0.5, 0.3, 0.7]   # initial quality estimates (Day 1 guesses)
counts = [1, 1, 1]         # number of times each snack has been tried
epsilon = 0.3              # probability of exploring

for _ in range(10):
    if random.random() < epsilon:
        choice = random.randint(0, 2)       # explore: pick a random snack
    else:
        choice = values.index(max(values))  # exploit: pick the best-known snack

    reward = random.uniform(0.6, 1.0)       # simulated customer satisfaction
    counts[choice] += 1
    values[choice] += (reward - values[choice]) / counts[choice]

print(values)
```
## CLI Output
```
Final Estimated Values:
  Chips: 0.72
  Candy: 0.63
  Soda:  0.88

Best Snack: Soda
```
## What the Machine Learned
After 10 rounds, something interesting happens…
- Soda consistently gave higher rewards
- The machine started choosing Soda more often
- Exploration helped confirm Soda was truly best
## Key Takeaways
- Exploration helps discover new possibilities
- Exploitation maximizes current knowledge
- Balance is crucial
- This is the foundation of Q-learning
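That last point deserves one formula. The full Q-learning update (standard textbook form, more general than the vending machine needs) replaces \( \frac{1}{N} \) with a learning rate \( \alpha \) and adds a discounted look-ahead toward future rewards:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

With \( \gamma = 0 \) (no future to plan for) and \( \alpha = \frac{1}{N} \), this reduces exactly to the snack-machine update above.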
## Final Scene
Next time you grab a snack, imagine this vending machine quietly learning… adapting… improving.
Because in the world of AI, even choosing a snack can become a smart decision.