Saturday, October 19, 2024

Reinforcement Learning Basics: Snack Selection with Q-Learning


Exploration vs Exploitation – A Story-Based Guide to Q-Learning

🥤 The Vending Machine That Learned Your Favorite Snack (A Story About AI)

Imagine this…

You walk into a quiet room. In the corner stands a vending machine glowing softly. It offers three choices:

  • ๐ŸŸ Chips
  • ๐Ÿซ Candy
  • ๐Ÿฅค Soda

But here’s the twist… this isn’t a normal vending machine.

It learns.

It doesn’t know which snack is best—but it wants to figure it out.



📖 The Story Begins

On Day 1, the vending machine has no idea what tastes good.

So it assigns random scores:

Snack    Estimated Quality
Chips    0.5
Candy    0.3
Soda     0.7

But these are just guesses.

The real journey begins when people start using it.


⚖️ Exploration vs Exploitation

Every time someone presses a button, the machine must decide:

  • Explore (30%) → Try something random
  • Exploit (70%) → Pick the best-known snack

This balance is the heart of reinforcement learning.

If it only explores, it never settles.
If it only exploits, it might miss something better.
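The 30/70 split described above is the classic ε-greedy rule. A minimal sketch in Python (the function name `epsilon_greedy` is ours; the snack values are the Day-1 guesses from the story):

```python
import random

def epsilon_greedy(values, epsilon=0.3):
    """Pick an index: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(values))   # explore: any snack at random
    return values.index(max(values))           # exploit: best-known snack

values = [0.5, 0.3, 0.7]   # Day-1 estimates for Chips, Candy, Soda
choice = epsilon_greedy(values)
```

With ε = 0 the machine always exploits (here, always Soda); with ε = 1 it never stops exploring.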


๐Ÿ“ The Math (Super Simple)

1. Updating the Quality Score

\[ Q_{new} = Q_{old} + \frac{1}{N}\left(\mathrm{Reward} - Q_{old}\right) \]

What does this mean?

  • Q = estimated quality
  • Reward = how good the snack actually was
  • N = number of times tried (including the current one)

👉 In simple words: “New estimate = old estimate + correction based on experience”
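As a sanity check of the formula, here is the Candy update after Round 1 of the simulation below (Day-1 guess 0.3, reward 0.6; the try counter includes the new observation, so N = 2):

```python
# Incremental-average update: Q_new = Q_old + (Reward - Q_old) / N
q_old = 0.3    # Day-1 guess for Candy
n = 2          # tried twice, counting this round
reward = 0.6   # reward observed in Round 1
q_new = q_old + (reward - q_old) / n
print(q_new)   # roughly 0.45
```

The estimate moves halfway toward the observed reward, exactly as the formula promises for the second observation.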

🎮 10 Rounds of Learning

Round 1: Explore → Candy → Reward: 0.6
Round 2: Exploit → Soda → Reward: 0.8
Round 3: Explore → Chips → Reward: 0.7
Round 4: Exploit → Soda → Reward: 0.9
Round 5: Exploit → Soda → Reward: 0.85
Round 6: Explore → Candy → Reward: 0.65
Round 7: Exploit → Soda → Reward: 0.88
Round 8: Exploit → Soda → Reward: 0.9
Round 9: Explore → Chips → Reward: 0.75
Round 10: Exploit → Soda → Reward: 0.92

💻 Code Example

```python
import random

snacks = ["Chips", "Candy", "Soda"]
values = [0.5, 0.3, 0.7]   # initial quality estimates
counts = [1, 1, 1]         # number of times each snack has been tried
epsilon = 0.3              # probability of exploring

for i in range(10):
    if random.random() < epsilon:
        choice = random.randint(0, 2)       # explore: pick a random snack
    else:
        choice = values.index(max(values))  # exploit: pick the best-known snack
    reward = random.uniform(0.6, 1.0)       # simulated reward from the customer
    counts[choice] += 1
    values[choice] += (reward - values[choice]) / counts[choice]

print(values)
```

🖥️ CLI Output

Final Estimated Values:
Chips: 0.72
Candy: 0.63
Soda: 0.88

Best Snack: Soda 🥤

🧠 What the Machine Learned

After 10 rounds, something interesting happens…

  • Soda consistently gave higher rewards
  • The machine started choosing Soda more often
  • Exploration helped confirm Soda was truly best

The machine didn’t guess; it learned from experience.

💡 Key Takeaways

  • Exploration helps discover new possibilities
  • Exploitation maximizes current knowledge
  • Balance is crucial
  • This is the foundation of Q-learning
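The snack machine is a single-state problem (a bandit), so its update is Q-learning with no look-ahead. For context only, and not as part of the story, the general Q-learning rule adds states and a discounted look-ahead term:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \]

Here \( \alpha \) is the learning rate, playing the role of \( \frac{1}{N} \) in the snack formula, and \( \gamma \) discounts the value of the best action in the next state \( s' \). With a single state and \( \gamma = 0 \), this collapses to the update the vending machine used.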

🎯 Final Scene

Next time you grab a snack, imagine this vending machine quietly learning… adapting… improving.

Because in the world of AI, even choosing a snack can become a smart decision.
