Thursday, December 19, 2024

K-Means Clustering Analysis on Breast Cancer Dataset Using Object-Oriented Design


Clustering analysis with object-oriented design principles

This example demonstrates how the Breast Cancer dataset can be analyzed using K-means clustering to uncover patterns in tumor data. The solution also applies object-oriented programming concepts such as inheritance and composition to organize the code cleanly.

Problem Overview

The goal is to group breast cancer cell samples into two clusters using unsupervised learning, with the expectation that the clusters will roughly correspond to benign and malignant tumors.

  • Load and preprocess the dataset
  • Apply K-means clustering
  • Visualize clustering results
  • Use OOP for modular design

Dataset Description

🧬 Breast Cancer Dataset (sklearn)

The dataset contains 30 numerical features describing cell nucleus characteristics, such as:

  • Mean radius
  • Texture
  • Smoothness
  • Compactness

Each sample is labeled as either benign or malignant.

Code Walkthrough

1️⃣ Import Required Libraries
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from classInheritanceBreastCancer import Cell
from classCompositionBreastCancer import DataProcessor

This setup separates responsibilities:

  • Cell: inheritance-based representation of cell data
  • DataProcessor: composition-based preprocessing handler
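The two imported modules are not shown in the post, so here is a hypothetical minimal sketch of how Cell (inheritance) and DataProcessor (composition) might look; the Sample base class and the use of StandardScaler are assumptions for illustration, not the author's actual code.

```python
from sklearn.preprocessing import StandardScaler

class Sample:
    """Assumed base class: anything with a feature vector."""
    def __init__(self, features):
        self.features = features

class Cell(Sample):
    """Inheritance: a Cell *is a* Sample, extended with a diagnosis label."""
    def __init__(self, features, label):
        super().__init__(features)
        self.label = label

class DataProcessor:
    """Composition: the processor *has a* scaler rather than inheriting one."""
    def __init__(self, dataset):
        self.dataset = dataset
        self.scaler = StandardScaler()

    def preprocess_data(self):
        # Standardize features to zero mean and unit variance.
        X_scaled = self.scaler.fit_transform(self.dataset.data)
        return X_scaled, self.dataset.target
```

Composition keeps DataProcessor decoupled from any particular scaler: swapping in, say, MinMaxScaler would only touch the constructor.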

2️⃣ Load the Dataset
breast_cancer = load_breast_cancer()

This loads the feature matrix and labels from sklearn.
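For orientation, the returned Bunch object can be inspected directly; it bundles 569 samples with 30 features each:

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

print(breast_cancer.data.shape)    # feature matrix: (569, 30)
print(breast_cancer.target.shape)  # labels: (569,), 0 = malignant, 1 = benign
print(breast_cancer.target_names)  # ['malignant' 'benign']
```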

3️⃣ Preprocess the Data (Scaling)
processor = DataProcessor(breast_cancer)
X_scaled, y = processor.preprocess_data()

Scaling is critical because K-means is sensitive to feature magnitude. Standardization ensures equal contribution from each feature.
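Assuming the preprocessing step uses sklearn's StandardScaler (a typical choice; the actual DataProcessor code is not shown in the post), its effect can be verified directly:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Raw feature scales differ by orders of magnitude ("mean area" is in the
# hundreds, "mean smoothness" below 0.2), so unscaled Euclidean distances
# would be dominated by the large-valued features.
print(X.max(axis=0).max() / X.max(axis=0).min())

X_scaled = StandardScaler().fit_transform(X)

# After standardization every column has mean ~0 and standard deviation 1.
print(np.allclose(X_scaled.mean(axis=0), 0, atol=1e-8))  # True
print(np.allclose(X_scaled.std(axis=0), 1))              # True
```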

4️⃣ Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

The algorithm assigns each data point to one of two clusters by minimizing within-cluster variance.
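The objective being minimized is exposed as the fitted model's inertia_ attribute (the sum of squared distances from each point to its assigned centroid); recomputing it by hand confirms the description above. Passing n_init=10 explicitly pins down behavior across sklearn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

# Within-cluster sum of squares, computed manually per cluster.
manual_inertia = sum(
    np.sum((X_scaled[cluster_labels == k] - kmeans.cluster_centers_[k]) ** 2)
    for k in range(2)
)

print(np.isclose(manual_inertia, kmeans.inertia_))  # True
```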

5️⃣ Visualize the Clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1],
            c=cluster_labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-means Clustering of Cell Features')
plt.colorbar(label='Cluster')
plt.show()

Only the first two features are used for visualization, even though clustering uses all features.
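A common refinement (not part of the original code) is to plot the clusters in PCA space instead of the first two raw features, so the 2D view captures as much of the 30-dimensional variance as possible:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
cluster_labels = KMeans(n_clusters=2, random_state=42,
                        n_init=10).fit_predict(X_scaled)

# Project all 30 standardized features onto the top two principal components.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-means Clusters in PCA Space')
plt.colorbar(label='Cluster')
plt.show()
```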

Explanation of the Solution

๐Ÿ” Data Preprocessing

Feature scaling ensures that no single feature dominates distance calculations in K-means clustering.

🧠 K-Means Clustering

K-means partitions data into clusters by minimizing the distance between data points and their respective cluster centroids.

📊 Visualization

The scatter plot provides intuitive insight into how the algorithm groups samples based on similarity.

Plot Interpretation

  • Each point represents a breast cancer cell sample
  • Colors indicate cluster assignment
  • Clusters approximate benign vs malignant separation
  • Perfect separation is not guaranteed in 2D views
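To quantify the approximate separation noted above (an extra check, not in the original post), the cluster assignments can be compared against the true labels with the adjusted Rand index, which ignores how the two clusters happen to be numbered:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
clusters = KMeans(n_clusters=2, random_state=42,
                  n_init=10).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 = perfect agreement, ~0 = random labeling.
ari = adjusted_rand_score(data.target, clusters)
print(f"Adjusted Rand index: {ari:.3f}")

# Accuracy under the better of the two possible cluster-to-label mappings,
# since cluster IDs carry no inherent meaning.
acc = max(np.mean(clusters == data.target), np.mean(clusters != data.target))
print(f"Best-mapping accuracy: {acc:.3f}")
```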

💡 Key Takeaways

  • K-means is useful for exploratory pattern discovery
  • Scaling is essential for distance-based algorithms
  • OOP improves modularity and readability
  • Visualization helps interpret clustering quality
  • Unsupervised results may not perfectly match labels
