K-Means Clustering on the Breast Cancer Dataset
Clustering analysis with object-oriented design principles
This example demonstrates how the Breast Cancer dataset can be analyzed using K-means clustering to uncover patterns in tumor data. The solution also applies object-oriented programming concepts such as inheritance and composition to organize the code cleanly.
Problem Overview
The goal is to group breast cancer cell samples into two clusters, ideally corresponding to benign and malignant cases, using unsupervised learning.
- Load and preprocess the dataset
- Apply K-means clustering
- Visualize clustering results
- Use OOP for modular design
Dataset Description
🧬 Breast Cancer Dataset (sklearn)
The dataset contains 30 numerical features describing cell nucleus characteristics, such as:
- Mean radius
- Texture
- Smoothness
- Compactness
Each sample is labeled as either benign or malignant.
Code Walkthrough
1️⃣ Import Required Libraries
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from classInheritanceBreastCancer import Cell
from classCompositionBreastCancer import DataProcessor
This setup separates responsibilities:
- Cell: inheritance-based representation of cell data
- DataProcessor: composition-based preprocessing handler
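The two helper modules themselves are not shown in the post; the following is a minimal sketch of what an inheritance-based Cell and a composition-based DataProcessor might look like. The Sample base class and the exact behavior of preprocess_data are assumptions for illustration, not the post's actual code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler


class Sample:
    """Generic sample holding a feature vector (hypothetical base class)."""

    def __init__(self, features):
        self.features = np.asarray(features)


class Cell(Sample):
    """Inheritance: a Cell *is a* Sample with an optional diagnosis label."""

    def __init__(self, features, label=None):
        super().__init__(features)
        self.label = label  # benign / malignant, if known


class DataProcessor:
    """Composition: the processor *has a* StandardScaler rather than inheriting from it."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.scaler = StandardScaler()  # composed collaborator

    def preprocess_data(self):
        # Standardize features and return them with the labels
        X_scaled = self.scaler.fit_transform(self.dataset.data)
        return X_scaled, self.dataset.target


# Quick check of the composition-based processor
processor = DataProcessor(load_breast_cancer())
X_scaled, y = processor.preprocess_data()
print(X_scaled.shape)  # (569, 30)
```

Keeping the scaler as a member (composition) rather than subclassing StandardScaler makes it easy to swap in a different scaler later without touching the class hierarchy.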
2️⃣ Load the Dataset
breast_cancer = load_breast_cancer()
This loads the feature matrix and labels from sklearn.
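For orientation, the returned Bunch object can be inspected directly:

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

print(breast_cancer.data.shape)          # (569, 30): 569 samples, 30 features
print(list(breast_cancer.target_names))  # ['malignant', 'benign']
print(list(breast_cancer.feature_names[:3]))
```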
3️⃣ Preprocess the Data (Scaling)
processor = DataProcessor(breast_cancer)
X_scaled, y = processor.preprocess_data()
Scaling is critical because K-means is sensitive to feature magnitude. Standardization ensures equal contribution from each feature.
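The standardization inside preprocess_data presumably amounts to the z-score formula z = (x − mean) / std applied per feature. This sketch (illustrative, not the post's actual module) shows that sklearn's StandardScaler matches the manual computation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Standardize with sklearn
X_scaled = StandardScaler().fit_transform(X)

# Equivalent manual computation: z = (x - mean) / std, column by column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))  # True
```

Without this step, large-magnitude features such as mean area would dominate the Euclidean distances K-means relies on.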
4️⃣ Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
The algorithm assigns each data point to one of two clusters by minimizing within-cluster variance.
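To sanity-check the fit, the minimized within-cluster variance is available afterwards as the model's inertia_ attribute, and the cluster sizes can be counted directly (the explicit n_init=10 below pins the behavior sklearn used as its default before version 1.4):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

# Total within-cluster sum of squared distances (the minimized objective)
print(f"inertia: {kmeans.inertia_:.1f}")
# How many samples landed in each of the two clusters
print(np.bincount(cluster_labels))
```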
5️⃣ Visualize the Clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1],
c=cluster_labels, cmap='viridis')
plt.xlabel('Mean radius (scaled)')
plt.ylabel('Mean texture (scaled)')
plt.title('K-means Clustering of Cell Features')
plt.colorbar(label='Cluster')
plt.show()
Only the first two features are used for visualization, even though clustering uses all features.
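An alternative not used in the original code is to project all 30 scaled features into two dimensions with PCA before plotting, so the scatter reflects every feature rather than just the first two. A sketch:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
cluster_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)

# Project all 30 scaled features onto the two directions of greatest variance
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, cmap='viridis')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('K-means clusters in PCA space')
plt.colorbar(label='Cluster')
plt.show()
```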
Explanation of the Solution
📊 Data Preprocessing
Feature scaling ensures that no single feature dominates distance calculations in K-means clustering.
🧠 K-Means Clustering
K-means partitions data into clusters by minimizing the distance between data points and their respective cluster centroids.
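This alternating assignment/update procedure can be made concrete with a toy one-dimensional sketch (illustrative data, not the cancer dataset):

```python
import numpy as np

# Toy data: two obvious groups on a line
points = np.array([1.0, 1.2, 0.8, 8.0, 8.2, 7.8])
centroids = np.array([0.0, 10.0])  # initial centroid guesses

for _ in range(5):
    # Assignment step: each point joins its nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([points[labels == k].mean() for k in range(2)])

print(centroids)  # -> [1. 8.]
```

Real K-means repeats these two steps in 30-dimensional space until the assignments stop changing.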
📈 Visualization
The scatter plot provides intuitive insight into how the algorithm groups samples based on similarity.
Plot Interpretation
- Each point represents a breast cancer cell sample
- Colors indicate cluster assignment
- Clusters approximate benign vs malignant separation
- Perfect separation is not guaranteed in 2D views
💡 Key Takeaways
- K-means is useful for exploratory pattern discovery
- Scaling is essential for distance-based algorithms
- OOP improves modularity and readability
- Visualization helps interpret clustering quality
- Unsupervised results may not perfectly match labels
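As a concrete check of that last point, the cluster assignments can be compared against the true benign/malignant labels with the adjusted Rand index. This comparison is an addition, not part of the original walkthrough:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
cluster_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)

# 1.0 = perfect agreement with the diagnosis labels, 0.0 = chance level
ari = adjusted_rand_score(data.target, cluster_labels)
print(f"ARI vs. benign/malignant labels: {ari:.2f}")
```

A score well above zero but below one is typical here: the clusters track the diagnosis labels without reproducing them exactly.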