# The Vending Machine That Learned Your Favorite Snack (A Story About AI)
Imagine this…
You walk into a quiet room. In the corner stands a vending machine glowing softly. It offers three choices:
- Chips
- Candy
- Soda
But here’s the twist… this isn’t a normal vending machine.
It doesn’t know which snack is best—but it wants to figure it out.
## Table of Contents
- The Story Begins
- Exploration vs Exploitation
- The Math Made Simple
- 10-Round Simulation
- Code Example
- CLI Output
- What the Machine Learns
- Key Takeaways
- Final Scene
## The Story Begins
On Day 1, the vending machine has no idea what tastes good.
So it assigns random scores:
| Snack | Estimated Quality |
|---|---|
| Chips | 0.5 |
| Candy | 0.3 |
| Soda | 0.7 |
But these are just guesses.
The real journey begins when people start using it.
## ⚖️ Exploration vs Exploitation
Every time someone presses a button, the machine must decide:
- Explore (30%) → Try something random
- Exploit (70%) → Pick the best-known snack
If it only explores, it never settles.
If it only exploits, it might miss something better.
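This decision rule is usually called epsilon-greedy. A minimal sketch (the function name and the 30% exploration rate mirror the story; nothing here is a specific library API):

```python
import random

def choose_snack(values, epsilon=0.3):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(values))  # explore: any snack at random
    return values.index(max(values))          # exploit: highest estimate wins

# With epsilon = 0 the machine always exploits, so it picks Soda (index 2,
# the highest Day 1 estimate):
print(choose_snack([0.5, 0.3, 0.7], epsilon=0.0))  # → 2
```

Setting `epsilon` higher makes the machine more curious; setting it to zero freezes it on whatever currently looks best.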
## The Math (Super Simple)
### 1. Updating the Quality Score
\[ Q_{\text{new}} = Q_{\text{old}} + \frac{1}{N}\left(\text{Reward} - Q_{\text{old}}\right) \]
What does this mean?
- Q = estimated quality
- Reward = how good the snack actually was
- N = number of times tried
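Plugging numbers in makes the rule concrete. Say Soda's current estimate is 0.7, this is its 2nd try (N = 2), and the reward comes back as 0.9:

```python
Q_old, reward, N = 0.7, 0.9, 2

# One step of the update rule: move 1/N of the way toward the new reward
Q_new = Q_old + (reward - Q_old) / N
print(round(Q_new, 2))  # → 0.8
```

The 1/N factor means early rewards move the estimate a lot, while later rewards only nudge it, so the score stabilizes as evidence accumulates.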
## 10 Rounds of Learning
```
Round 1:  Explore → Candy  → Reward: 0.6
Round 2:  Exploit → Soda   → Reward: 0.8
Round 3:  Explore → Chips  → Reward: 0.7
Round 4:  Exploit → Soda   → Reward: 0.9
Round 5:  Exploit → Soda   → Reward: 0.85
Round 6:  Explore → Candy  → Reward: 0.65
Round 7:  Exploit → Soda   → Reward: 0.88
Round 8:  Exploit → Soda   → Reward: 0.9
Round 9:  Explore → Chips  → Reward: 0.75
Round 10: Exploit → Soda   → Reward: 0.92
```
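The log can be replayed deterministically to watch the estimates evolve; the starting values come from the Day 1 table, and the (snack, reward) pairs are copied straight from the rounds listed above:

```python
snacks = ["Chips", "Candy", "Soda"]
values = {"Chips": 0.5, "Candy": 0.3, "Soda": 0.7}  # Day 1 estimates
counts = {s: 1 for s in snacks}

# (snack, reward) pairs copied from the simulation log
log = [("Candy", 0.6), ("Soda", 0.8), ("Chips", 0.7), ("Soda", 0.9),
       ("Soda", 0.85), ("Candy", 0.65), ("Soda", 0.88), ("Soda", 0.9),
       ("Chips", 0.75), ("Soda", 0.92)]

for snack, reward in log:
    counts[snack] += 1
    values[snack] += (reward - values[snack]) / counts[snack]

for s in snacks:
    print(f"{s}: {values[s]:.2f}")
```

Replayed this way, Soda climbs to about 0.85, the highest estimate, which is exactly why the exploit rounds keep picking it.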
## Code Example
```python
import random

snacks = ["Chips", "Candy", "Soda"]
values = [0.5, 0.3, 0.7]   # initial quality estimates (Day 1 guesses)
counts = [1, 1, 1]         # number of times each snack has been tried
epsilon = 0.3              # probability of exploring

for _ in range(10):
    if random.random() < epsilon:
        choice = random.randint(0, 2)       # explore: pick a random snack
    else:
        choice = values.index(max(values))  # exploit: pick the best-known snack

    reward = random.uniform(0.6, 1.0)       # simulated customer satisfaction
    counts[choice] += 1
    values[choice] += (reward - values[choice]) / counts[choice]

print(values)
```
## CLI Output
```
Final Estimated Values:
  Chips: 0.72
  Candy: 0.63
  Soda:  0.88

Best Snack: Soda
```
## What the Machine Learned
After 10 rounds, something interesting happens…
- Soda consistently gave higher rewards
- The machine started choosing Soda more often
- Exploration helped confirm Soda was truly best
## Key Takeaways
- Exploration helps discover new possibilities
- Exploitation maximizes current knowledge
- Balance is crucial
- This is the foundation of Q-learning
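That last point deserves one formula. The full Q-learning update (standard textbook form, more general than the vending machine needs) replaces \( \frac{1}{N} \) with a learning rate \( \alpha \) and adds a discounted look-ahead toward future rewards:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

With \( \gamma = 0 \) (no future to plan for) and \( \alpha = \frac{1}{N} \), this reduces exactly to the snack-machine update above.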
## Final Scene
Next time you grab a snack, imagine this vending machine quietly learning… adapting… improving.
Because in the world of AI, even choosing a snack can become a smart decision.