Principal Component Analysis (PCA): Complete Step-by-Step Guide
Principal Component Analysis (PCA) is one of the most important techniques in machine learning and statistics. It helps reduce the number of features in a dataset while preserving the most important information.
Table of Contents
- Introduction
- What is PCA?
- Mathematical Foundation
- Step-by-Step PCA Calculation
- Python Code Example
- Visualization
- Applications
- Limitations
- FAQ
1. Introduction
In real-world datasets, we often deal with many variables (dimensions). PCA helps simplify this complexity by reducing dimensions while keeping the important patterns.
2. What is PCA?
PCA finds new axes (principal components) where:
- PC1 → captures the maximum variance
- PC2 → captures the second-most variance (orthogonal to PC1)
💡 Intuition
Imagine rotating the axes of a dataset to find the angle at which the spread (variance) is largest. That direction is PC1.
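A minimal numpy sketch of this rotation idea (not part of the original walkthrough), using the small height/weight dataset from Section 4: project the centered data onto unit vectors at several angles and watch where the variance of the projection peaks.

```python
import numpy as np

# Height (cm) and weight (kg) from the dataset in Section 4
X = np.array([[150, 50], [160, 60], [170, 65], [180, 80], [190, 90]], float)
Xc = X - X.mean(axis=0)  # center the data

# Variance of the projection onto unit vectors at several angles;
# the angle with the largest variance points along PC1
for angle in np.linspace(0, np.pi, 7):
    u = np.array([np.cos(angle), np.sin(angle)])
    print(f"angle={np.degrees(angle):6.1f} deg  variance={np.var(Xc @ u):8.2f}")
```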
3. Mathematical Foundation
PCA relies on covariance and eigen decomposition.
Covariance Matrix:
$$ C = \frac{1}{n} Z^T Z $$
where \( Z \) is the \( n \times p \) matrix of standardized data (some texts divide by \( n-1 \) for the sample covariance).
Eigenvalue Equation:
$$ C v = \lambda v $$
- \( \lambda \) = eigenvalue (amount of variance explained)
- \( v \) = eigenvector (direction of a principal component)
Why Eigenvectors?
They give the directions where variance is maximum. Eigenvalues tell how much variance exists in those directions.
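As a quick sketch, we can verify the eigenvalue equation above with numpy on a symmetric 2×2 matrix (the same shape as the covariance matrix computed in Step 2 below):

```python
import numpy as np

# A symmetric 2x2 matrix, shaped like the covariance matrix in Step 2
C = np.array([[1.00, 0.99],
              [0.99, 1.00]])

# eigh is numpy's eigensolver for symmetric matrices;
# eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(C)
v, lam = eigenvectors[:, -1], eigenvalues[-1]  # largest-variance direction

print(C @ v)    # equals lam * v, confirming C v = lambda v
print(lam * v)
```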
4. Step-by-Step PCA Calculation
Dataset
| Individual | Height (cm) | Weight (kg) |
|---|---|---|
| 1 | 150 | 50 |
| 2 | 160 | 60 |
| 3 | 170 | 65 |
| 4 | 180 | 80 |
| 5 | 190 | 90 |
Step 1: Standardization
$$ Z = \frac{X - \mu}{\sigma} $$
Explanation
We standardize the data so each feature contributes equally, regardless of its original scale.
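As a sketch, the same standardization in plain numpy (using the population standard deviation, ddof=0, which matches scikit-learn's StandardScaler):

```python
import numpy as np

X = np.array([[150, 50], [160, 60], [170, 65], [180, 80], [190, 90]], float)

# Z = (X - mu) / sigma, applied column by column
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.round(2))
```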
Step 2: Covariance Matrix
For the standardized data above, the covariance matrix (equal to the correlation matrix here) is approximately:
|  | Height | Weight |
|---|---|---|
| Height | 1.00 | 0.99 |
| Weight | 0.99 | 1.00 |
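A short numpy sketch of this step, reusing the standardization from Step 1:

```python
import numpy as np

X = np.array([[150, 50], [160, 60], [170, 65], [180, 80], [190, 90]], float)
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # Step 1: standardize

# Covariance matrix with the 1/n convention from Section 3
C = (Z.T @ Z) / Z.shape[0]
print(C.round(2))  # ~[[1.0, 0.99], [0.99, 1.0]]
```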
Step 3: Eigenvalues & Eigenvectors
Eigenvalues (for a 2×2 correlation matrix of this form they are simply \( 1 \pm r \)):
- 1.99 → PC1
- 0.01 → PC2
Eigenvectors:
$$ v_1 = [0.707, 0.707] $$
$$ v_2 = [-0.707, 0.707] $$
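A numpy sketch of this step; note that np.linalg.eigh returns eigenvalues in ascending order, and eigenvector signs may be flipped relative to the ones written above:

```python
import numpy as np

C = np.array([[1.00, 0.99],
              [0.99, 1.00]])  # covariance matrix from Step 2

eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues.round(2))   # ~[0.01, 1.99], ascending
print(eigenvectors.round(3))  # columns pair with the eigenvalues above
```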
Step 4: Projection
$$ \text{scores} = Z \cdot V $$
where the columns of \( V \) are the eigenvectors from Step 3 (PC1 first).
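Putting Steps 1-4 together in a short numpy sketch (V's columns are the eigenvectors, PC1 first):

```python
import numpy as np

X = np.array([[150, 50], [160, 60], [170, 65], [180, 80], [190, 90]], float)
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # Step 1

# Step 4: project onto the eigenvectors from Step 3 (columns: PC1, PC2)
V = np.array([[0.707, -0.707],
              [0.707,  0.707]])
scores = Z @ V
print(scores.round(2))  # column 0 = PC1 scores, column 1 = PC2 scores
```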
5. Python Code Example
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Height (cm) and weight (kg) for five individuals
data = np.array([
    [150, 50],
    [160, 60],
    [170, 65],
    [180, 80],
    [190, 90],
])

# Step 1: standardize; PCA then handles Steps 2-4 internally
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2)
result = pca.fit_transform(scaled)
print(result)
```
Output (values rounded; the sign of each component may flip depending on the SVD solver and sklearn version):
```
[[-1.94  0.06]
 [-0.95  0.05]
 [-0.20 -0.20]
 [ 1.04  0.04]
 [ 2.04  0.04]]
```
6. Visualization
PCA transforms data into new axes:
- X-axis → PC1
- Y-axis → PC2
Interpretation
Points closer together are more similar. PCA helps reveal clusters and patterns.
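A minimal matplotlib sketch (assuming matplotlib is installed) that plots the sklearn result from Section 5:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([[150, 50], [160, 60], [170, 65], [180, 80], [190, 90]])
result = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(data))

plt.scatter(result[:, 0], result[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.axhline(0, color="gray", linewidth=0.5)  # new axes through the origin
plt.axvline(0, color="gray", linewidth=0.5)
plt.show()
```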
7. Applications
- Data compression (see the sketch after this list)
- Noise reduction
- Visualization of high-dimensional data
- Preprocessing for machine learning
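As a sketch of the data-compression use case: keep only PC1, then map back to the original space with inverse_transform and check how little is lost.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([[150, 50], [160, 60], [170, 65], [180, 80], [190, 90]])
scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=1)                       # keep PC1 only
compressed = pca.fit_transform(scaled)          # 5x1 instead of 5x2
reconstructed = pca.inverse_transform(compressed)

print(pca.explained_variance_ratio_)            # ~[0.995]
print(np.abs(scaled - reconstructed).max())     # small reconstruction error
```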
8. Limitations
⚠️ Key Limitations
- Linear method (cannot capture nonlinear patterns)
- Components are harder to interpret (each is a mix of the original features)
- Sensitive to feature scaling, so standardize first
9. FAQ
Is PCA supervised?
No, PCA is unsupervised.
How many components to choose?
A common rule of thumb is to keep enough components to explain about 95% of the total variance, as shown below.
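In scikit-learn you can pass a float to n_components and PCA will keep the fewest components whose cumulative explained variance reaches that fraction, for example:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([[150, 50], [160, 60], [170, 65], [180, 80], [190, 90]])
scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=0.95)  # keep enough components for ~95% variance
pca.fit(scaled)
print(pca.n_components_)              # 1 for this toy dataset
print(pca.explained_variance_ratio_)
```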
💡 Key Takeaways
- PCA reduces dimensions while preserving as much variance as possible
- PC1 captures the maximum variance
- Eigenvalues = how much variance each component explains (importance)
- Eigenvectors = the directions of the new axes