Calculating Percentiles, Interquartile Range (IQR), and Outlier Detection
Percentiles, quartiles, and the Interquartile Range (IQR) are core tools in statistics and data analysis. They describe how data is distributed and help detect unusual values, known as outliers.
In this guide, you will learn how to calculate the IQR, detect outliers, and apply the technique to grouped datasets.
Table of Contents
- Understanding Interquartile Range (IQR)
- Step-by-Step IQR Calculation
- Identifying Outliers Using IQR
- Replacing Outliers on a Groupby Basis
- Python Code Example
- CLI Output Example
- Key Takeaways
Understanding Interquartile Range (IQR)
The Interquartile Range (IQR) measures the spread of the middle 50% of a dataset. It is widely used in statistics because it is resistant to extreme values.
The IQR is calculated using two quartiles:
- Q1 (First Quartile) – 25th percentile
- Q3 (Third Quartile) – 75th percentile
The formula is:
IQR = Q3 − Q1
This means the IQR represents the range where the middle half of the data lies.
Step-by-Step IQR Calculation Example
Consider the dataset:
3, 7, 8, 5, 12, 7, 9, 15
Step 1 — Sort the numbers
3, 5, 7, 7, 8, 9, 12, 15
Sorting the numbers helps locate the quartiles accurately.
Step 2 — Find the Median
Since there are 8 numbers, the median is the average of the 4th and 5th values.
Median = (7 + 8) / 2
Median = 7.5
Step 3 — Divide the Data
Lower Half:
3, 5, 7, 7
Upper Half:
8, 9, 12, 15
Step 4 — Calculate Q1
Median of lower half:
Q1 = (5 + 7) / 2
Q1 = 6
Step 5 — Calculate Q3
Median of upper half:
Q3 = (9 + 12) / 2
Q3 = 10.5
Step 6 — Calculate IQR
IQR = Q3 − Q1
IQR = 10.5 − 6
IQR = 4.5
So the IQR is 4.5.
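The steps above can be sketched in Python using only the standard library. The function name `quartiles_by_halves` is an illustrative choice, not a library function. The sketch assumes the median-of-halves convention used in this walkthrough, and for an odd number of values it leaves the middle value out of both halves, which is one of several conventions in use.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    data = sorted(values)            # Step 1: sort the numbers
    n = len(data)
    half = n // 2
    lower = data[:half]              # Step 3: lower half
    # For an odd count, skip the middle element (assumed convention)
    upper = data[half + (n % 2):]    # Step 3: upper half
    q1 = median(lower)               # Step 4: median of the lower half
    q3 = median(upper)               # Step 5: median of the upper half
    return q1, q3, q3 - q1           # Step 6: IQR = Q3 - Q1

q1, q3, iqr = quartiles_by_halves([3, 7, 8, 5, 12, 7, 9, 15])
print(q1, q3, iqr)  # 6.0 10.5 4.5
```

Library functions may use a different quartile definition. For example, the default linear interpolation in pandas' `quantile` gives Q1 = 6.5 and Q3 = 9.75 for this dataset, so computed results can differ slightly from the hand calculation.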
Identifying Outliers Using IQR
Outliers are data points that are significantly different from other observations.
To detect them using IQR:
- Lower Bound = Q1 − 1.5 × IQR
- Upper Bound = Q3 + 1.5 × IQR
Example Calculation
Q1 = 6
Q3 = 10.5
IQR = 4.5
Lower Bound = 6 − 1.5 × 4.5
Lower Bound = −0.75
Upper Bound = 10.5 + 1.5 × 4.5
Upper Bound = 17.25
Any value outside this range is considered an outlier.
For the dataset:
3, 5, 7, 7, 8, 9, 12, 15
All values fall between −0.75 and 17.25, so no outliers exist in this dataset.
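As a minimal sketch of the same check in Python, the helpers below compute the bounds from Q1 and Q3 and list any values outside them. The names `iqr_bounds` and `find_outliers` are hypothetical, and the extra reading of 30 is an invented value added only to show what a flagged outlier looks like.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, lower, upper):
    """Return the values that fall outside [lower, upper]."""
    return [v for v in values if v < lower or v > upper]

data = [3, 5, 7, 7, 8, 9, 12, 15]
lower, upper = iqr_bounds(6, 10.5)   # Q1 and Q3 from the worked example
print(lower, upper)                  # -0.75 17.25
print(find_outliers(data, lower, upper))  # []

# Hypothetical extra reading, checked against the same bounds
print(find_outliers(data + [30], lower, upper))  # [30]
```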
Replacing Outliers on a Groupby Basis
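When working with grouped datasets (for example, data grouped by weather event), outliers should be handled within each group separately.
This way each value is judged against the quartiles of its own group, so a reading that is normal for one category is not flagged because of another category's distribution.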
Example scenario:
- Event = Snow
- Median Temperature = 20
- Upper Bound = 24
- Observed Value = 28
Replace 28 → 24
Clipping the value to the bound keeps the observation in the dataset while limiting the effect of the extreme reading.
Python Code Example
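import pandas as pd

def replace_outliers(group, column):
    # Work on a copy so the caller's data is not modified in place
    group = group.copy()
    Q1 = group[column].quantile(0.25)
    Q3 = group[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Values below `lower` become `lower`; values above `upper` become `upper`
    group[column] = group[column].clip(lower, upper)
    return group

# df is assumed to be an existing DataFrame with "event" and "temperature" columns
df = df.groupby("event", group_keys=False).apply(lambda g: replace_outliers(g, "temperature"))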
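To try the idea end to end, here is a small self-contained variant that runs on an invented sample. The DataFrame contents, the helper name `clip_to_iqr_bounds`, and the `temperature_clipped` column are assumptions made for illustration. This sketch uses `groupby(...).transform` on the single column rather than `apply`, which returns a result aligned with the original rows and sidesteps any version-dependent handling of the grouping column.

```python
import pandas as pd

def clip_to_iqr_bounds(values: pd.Series) -> pd.Series:
    """Clip a Series to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical sample data: one extreme Snow reading (28)
df = pd.DataFrame({
    "event": ["Snow"] * 5 + ["Rain"] * 5,
    "temperature": [18.0, 19.0, 20.0, 21.0, 28.0,
                    40.0, 41.0, 42.0, 43.0, 44.0],
})

# Bounds are computed separately for each event
df["temperature_clipped"] = (
    df.groupby("event")["temperature"].transform(clip_to_iqr_bounds)
)
print(df)
```

With this sample, the Snow group has Q1 = 19 and Q3 = 21 under pandas' default interpolation, so its upper bound is 24 and the reading of 28 is clipped to 24, while the Rain values are left unchanged.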
CLI Output Example
Example console output after applying the transformation:
$ python outlier_cleaning.py
Processing dataset...
Event: Snow
Original value: 28
Upper bound: 24
Replaced value: 24
Processing completed successfully.
Key Takeaways
- The IQR measures the spread of the middle 50% of the data.
- Q1 is the 25th percentile and Q3 is the 75th percentile.
- Outliers are values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
- Handling outliers per group judges each value against its own group's distribution.
- Replacing extreme values with the nearest bound limits their influence on the analysis.
By understanding IQR and outlier detection, analysts can ensure that datasets remain reliable, accurate, and ready for meaningful insights.