Pandas Data Cleaning: where vs Boolean Indexing
Data cleaning is one of the most important steps in any data science workflow. In this guide, we’ll explore two powerful techniques in Pandas:
- where()
- Boolean Indexing
You’ll learn when to use each, with real-world examples, interactive sections, and best practices.
Table of Contents
- Using where()
- Boolean Indexing
- Real-World Scenario
- Comparison Table
- CLI Outputs
- Key Takeaways
- Related Articles
1. Using where()
Code Example
import pandas as pd
import numpy as np
data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
result = data.where(data > 2, np.nan)
print(result)
Explanation
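where() keeps values where the condition is true and replaces the rest with the second argument (NaN by default). It returns a new DataFrame and leaves data unchanged. Because NaN is a float, the integer column is converted to float in the result.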
CLI Output
A
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
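As a minimal sketch of the behaviour described above, the snippet below passes a replacement other than NaN and checks that the original frame is left untouched. The fill value 0 and the name `filled` are illustrative choices, not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Keep values greater than 2; replace the rest with 0 instead of NaN.
# (0 is an arbitrary fill value chosen for illustration.)
filled = data.where(data > 2, 0)

print(filled['A'].tolist())  # [0, 0, 3, 4, 5]
print(data['A'].tolist())    # [1, 2, 3, 4, 5] (original unchanged)
```

With an integer fill value the column can stay integer, whereas filling with NaN forces a conversion to float.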
✅ Pros
- Clean and readable
- Single-step replacement
- Great for chaining
❌ Cons
- Less flexible for complex logic
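To illustrate the chaining point from the pros list, here is a small sketch that combines two conditions and uses where() inside a method chain. The range 2 to 4 and the names `in_range` and `result` are assumptions made for this example.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Combine conditions with & (and) or | (or); each side needs parentheses.
in_range = (data > 1) & (data < 5)

result = (
    data
    .where(in_range)   # values outside 2..4 become NaN
    .dropna()          # drop the rows that were masked
    .astype(int)       # restore the integer dtype lost to NaN
)

print(result['A'].tolist())  # [2, 3, 4]
```

When different conditions need different replacement values, this style takes one where() call per condition, which is where boolean indexing may be the more convenient tool.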
2. Boolean Indexing
Code Example
import pandas as pd
import numpy as np
data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
data[data <= 2] = np.nan
print(data)
Explanation
This method assigns to the selected cells directly, so it modifies data in place rather than returning a new DataFrame.
CLI Output
A
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
✅ Pros
- Very flexible
- Precise control
❌ Cons
- Can overwrite data accidentally
- Slightly more verbose
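One way to reduce the risk of overwriting data by accident is to assign into an explicit copy. The sketch below assumes a float column, so that writing NaN needs no dtype change. The names `raw` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Float values are used here so that NaN fits the column's dtype.
raw = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Work on an explicit copy so the raw data stays available.
cleaned = raw.copy()

# .loc with a row mask and a column label limits the write to one column.
cleaned.loc[cleaned['A'] <= 2, 'A'] = np.nan

print(raw['A'].tolist())              # [1.0, 2.0, 3.0, 4.0, 5.0]
print(cleaned['A'].isna().tolist())   # [True, True, False, False, False]
```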
Real-World Scenario: Sensor Data Cleaning
Imagine a temperature sensor. Values below 0°C are invalid.
Dataset
import pandas as pd
import numpy as np
data = pd.DataFrame({
'Temperature': [22.5, -300.0, 23.7, 25.1, -400.0, 26.8]
})
print(data)
CLI Output
Temperature
0 22.5
1 -300.0
2 23.7
3 25.1
4 -400.0
5 26.8
✔ Using where()
result = data.where(data['Temperature'] >= 0, np.nan)
print(result)
Output
Temperature
0 22.5
1 NaN
2 23.7
3 25.1
4 NaN
5 26.8
✔ Using Boolean Indexing
data.loc[data['Temperature'] < 0, 'Temperature'] = np.nan
print(data)
Output
Temperature
0 22.5
1 NaN
2 23.7
3 25.1
4 NaN
5 26.8
✔ Using np.where()
data['Temperature'] = np.where(
    data['Temperature'] >= 0,
    data['Temperature'],
    np.nan
)
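As a quick sanity check, the following sketch applies all three approaches to the same sensor data and confirms that they give the same result. The names `valid`, `via_where`, `via_loc`, and `via_np` are introduced only for this comparison.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Temperature': [22.5, -300.0, 23.7, 25.1, -400.0, 26.8]
})
valid = data['Temperature'] >= 0

# 1. DataFrame.where: returns a new frame.
via_where = data.where(valid)

# 2. Boolean indexing with .loc: applied to a copy to keep `data` intact.
via_loc = data.copy()
via_loc.loc[~valid, 'Temperature'] = np.nan

# 3. np.where: returns a NumPy array, so wrap it back into a frame.
via_np = pd.DataFrame(
    {'Temperature': np.where(valid, data['Temperature'], np.nan)}
)

# Both calls raise AssertionError if the frames differ.
pd.testing.assert_frame_equal(via_where, via_loc)
pd.testing.assert_frame_equal(via_where, via_np)
print(via_where['Temperature'].isna().sum())  # 2
```

The main practical difference is what each call returns: where() gives back a pandas object, boolean assignment changes the frame it is applied to, and np.where() gives back a plain NumPy array without the index.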
Comparison Table
| Method | Readability | Flexibility | Safety |
|---|---|---|---|
| where() | High | Medium | High |
| Boolean Indexing | Medium | High | Medium |
| np.where() | Low | Medium | Medium |
Interactive Practice
Try modifying conditions like:
- Replace values greater than 25
- Replace values between 10 and 20
- Apply multiple conditions
Example Challenge
Replace temperatures > 25 with NaN
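One possible solution to the challenge, together with a multiple-condition variant, is sketched below using the sensor data from the scenario. The upper bound of 25 in the second case and the names `temps`, `capped`, and `in_range` are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Temperature': [22.5, -300.0, 23.7, 25.1, -400.0, 26.8]
})
temps = data['Temperature']

# Challenge: replace temperatures above 25 with NaN (keep values <= 25).
capped = data.where(temps <= 25)

# Multiple conditions: keep only readings in the range 0..25.
in_range = data.where((temps >= 0) & (temps <= 25))

print(capped['Temperature'].isna().tolist())
# [False, False, False, True, False, True]
print(in_range['Temperature'].isna().tolist())
# [False, True, False, True, True, True]
```

Note that the condition passed to where() describes the values to keep, so "replace values greater than 25" is written as `temps <= 25`.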
Key Takeaways
- where() is best for clean, readable replacements
- Boolean indexing is more powerful but riskier
- np.where() is useful but less intuitive
- Always prioritize readability in data cleaning
Final Thoughts
Choosing the right method depends on your goal. For most cleaning tasks, where() provides the best balance between simplicity and safety.
Master these techniques, and your data preprocessing skills will improve significantly.