This blog explores data science and networking, combining theoretical concepts with practical implementations. Topics include routing protocols, network operations, and data-driven problem solving, presented with clarity and reproducibility in mind.
Sunday, December 8, 2024
How to Evaluate and Ensure Your Data Has No Outliers
Thursday, August 22, 2024
Limitations of Plotly for Outlier Detection in Data Analysis
Understanding Outliers: Visualization vs Mathematical Analysis
Outliers are values in a dataset that lie far from most other data points. Detecting outliers is essential because they can skew statistics, mislead machine learning models, and affect decision-making.
Table of Contents
- Plotly's Role
- Mathematical Foundations of Outliers
- IQR Method Explained
- Z-Score Method Explained
- Python Example with Pandas & NumPy
- Key Takeaways
- Related Articles
1️⃣ Plotly's Role
Plotly is excellent for visually identifying potential outliers in datasets through scatter plots, box plots, or violin plots.
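However, Plotly does not provide the statistical rigor needed to confirm that a point is actually an outlier.
While a point may look unusual on a plot, its statistical significance depends on its position relative to the dataset's distribution.
Plotly's box plots do draw quartiles and whisker fences, but the library is a charting tool: it does not compute Z-scores or hand back the flagged rows for further analysis. The numerical criteria that define outliers mathematically are better computed with a library such as pandas or NumPy.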
2️⃣ Mathematical Foundations of Outliers
To rigorously identify outliers, we rely on statistics:
Mean & Standard Deviation
For a dataset X = {x₁, x₂, ..., xₙ}, the mean μ is:
μ = (1/n) * Σ(xᵢ)
The standard deviation σ is:
σ = sqrt((1/n) * Σ(xᵢ - μ)²)
Points that are far from μ (typically more than 2σ or 3σ away) can be considered outliers.
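As a minimal sketch of the two formulas above, the snippet below computes the mean and standard deviation by hand and compares them with NumPy's built-ins. The sample values are borrowed from the Python example later in this post, and the variable names are illustrative. One assumption worth flagging: the formula here divides by n (the population standard deviation), which matches the default `ddof=0` of `np.std`, whereas `ddof=1` gives the sample standard deviation.

```python
import math
import numpy as np

values = [10, 12, 12, 13, 12, 100, 11, 13, 12, 14]
n = len(values)

# Mean: sum of the values divided by n
mu = sum(values) / n

# Population standard deviation: divide the squared deviations by n
sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)

# NumPy equivalents; np.std uses ddof=0 (divide by n) unless told otherwise
np_mu = np.mean(values)
np_sigma = np.std(values)

# Sample standard deviation (divide by n - 1) is always a little larger
sample_sigma = np.std(values, ddof=1)

print(mu, sigma, sample_sigma)
```

The single value 100 pulls the mean to 20.9 even though nine of the ten values lie between 10 and 14, which is a first hint of how strongly one extreme point can distort these statistics.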
Z-Score
The Z-score of a point xᵢ measures how many standard deviations it is from the mean:
Zᵢ = (xᵢ - μ) / σ
Common rule: |Z| > 3 → potential outlier.
Interquartile Range (IQR)
The IQR focuses on the middle 50% of the data:
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 − Q1
A value x is flagged as an outlier if either condition holds:
x < Q1 - 1.5 * IQR
x > Q3 + 1.5 * IQR
3️⃣ IQR Method: Step-by-Step
1. Sort the data.
2. Calculate Q1 (25th percentile) and Q3 (75th percentile).
3. Compute IQR = Q3 − Q1.
4. Any value less than Q1 − 1.5×IQR or greater than Q3 + 1.5×IQR is an outlier.
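The four steps above can be sketched as a small reusable function. This is an illustration rather than a canonical implementation: the name `iqr_outliers` and the parameter `k` are hypothetical, and the quartiles come from `np.percentile` with its default linear interpolation, so other quantile conventions may give slightly different thresholds.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the k * IQR rule."""
    # Step 1: sort the data (np.percentile does not need this,
    # but it mirrors the written steps and orders the result)
    data = np.sort(np.asarray(values, dtype=float))
    # Step 2: first and third quartiles
    q1, q3 = np.percentile(data, [25, 75])
    # Step 3: interquartile range
    iqr = q3 - q1
    # Step 4: thresholds and the values that fall outside them
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = data[(data < lower) | (data > upper)]
    return lower, upper, outliers.tolist()

lower, upper, flagged = iqr_outliers([10, 12, 12, 13, 12, 100, 11, 13, 12, 14])
print(lower, upper, flagged)
```

Exposing the multiplier as `k` makes it easy to try a stricter value such as 3.0 when 1.5 flags too many points.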
4️⃣ Z-Score Method: Step-by-Step
1. Compute the mean (μ) and standard deviation (σ) of the dataset.
2. For each value, compute Z = (x - μ) / σ.
3. Values with |Z| > 3 (or another threshold) are considered outliers.
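Here is a hedged sketch of these three steps with NumPy. The function name `zscore_outliers` is hypothetical, and the code assumes the population standard deviation (dividing by n), as in the formula given earlier.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    data = np.asarray(values, dtype=float)
    # Step 1: mean and population standard deviation (ddof=0)
    mu = data.mean()
    sigma = data.std()
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    # Step 2: Z-score of every value
    z = (data - mu) / sigma
    # Step 3: keep the values beyond the threshold
    return data[np.abs(z) > threshold].tolist()

sample = [10, 12, 12, 13, 12, 100, 11, 13, 12, 14]
flagged_at_3 = zscore_outliers(sample)
flagged_at_2 = zscore_outliers(sample, threshold=2.0)
print(flagged_at_3, flagged_at_2)
```

On this ten-value sample the threshold of 3 flags nothing, because the value 100 inflates both the mean and the standard deviation and ends up with a Z-score just under 3. With the population formula, no point in a sample of n values can have |Z| above sqrt(n - 1), which is exactly 3 for n = 10. Lowering the threshold to 2 flags 100, which is one reason the threshold is worth choosing with the sample size in mind.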
Why Z-Score Works
Z-scores standardize data to a common scale. A Z-score of 3 means the point lies 3 standard deviations from the mean, which is rare when the data are approximately normally distributed: only about 0.27% of values fall beyond ±3σ.
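To make the rarity concrete, the two-sided tail probability of a standard normal variable can be computed with the standard library alone. This is a small sketch that assumes the data really are normally distributed; the helper name `two_sided_tail` is illustrative.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))

for z in (1, 2, 3):
    print(z, round(two_sided_tail(z) * 100, 2), "%")
```

This prints roughly 31.73%, 4.55%, and 0.27% for thresholds of 1, 2, and 3. For skewed or heavy-tailed data these percentages no longer apply, so the threshold should be treated as a rule of thumb.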
5️⃣ Python Example Using IQR
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({'values': [10, 12, 12, 13, 12, 100, 11, 13, 12, 14]})
# Compute Q1, Q3, and IQR
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
# Identify outliers
outliers = df[(df['values'] < Q1 - 1.5 * IQR) | (df['values'] > Q3 + 1.5 * IQR)]
print(outliers)
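This prints every row whose value falls outside the IQR-based thresholds. For this sample, Q1 = 12, Q3 = 13, and IQR = 1, so the thresholds are 10.5 and 14.5, and both 100 and 10 are flagged. The second result shows why flagged points still need to be interpreted in context.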
Key Takeaways
- Plotly is excellent for visualization but cannot replace statistical rigor.
- Use Z-scores or IQR to identify outliers mathematically.
- Outliers should always be interpreted in context — not all extreme points are errors.
- Visualization + statistical analysis together provide the clearest understanding.
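One way to combine the two approaches, sketched under the assumption that the IQR rule is the chosen criterion, is to store the result as a boolean column so that a chart can highlight the flagged rows. The column name `is_outlier` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'values': [10, 12, 12, 13, 12, 100, 11, 13, 12, 14]})

# IQR thresholds, computed as in the example above
q1 = df['values'].quantile(0.25)
q3 = df['values'].quantile(0.75)
iqr = q3 - q1

# True for rows outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df['is_outlier'] = ~df['values'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df[df['is_outlier']])
```

A column like this can then be mapped to color or marker style in the plotting library of your choice, for example through the `color` argument of a Plotly Express scatter plot, so the figure reflects the numerical rule rather than visual impression alone.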
Related Articles
Saturday, August 3, 2024
Impact of Removing Outliers on Median: Practical Examples and Potential Pitfall