Tuesday, August 20, 2024

Categorizing and Sampling Titanic Dataset by Age Group

Passenger Data Categorization & Sampling in Python – Complete Guide

🚢 Passenger Categorization & Sampling in Python


📘 Introduction

Data analysis often begins with structuring raw data into meaningful categories. In this guide, we categorize passengers by age and class, take the first few rows of each category as a sample, and combine those samples into a single DataFrame using pandas and NumPy.

💡 Goal: Transform raw passenger data into structured insights using categorization and sampling.

🧠 Step 1: Categorizing Passengers

We create a new column called age_group using conditional logic.

🎯 Rules

  • Age < 18 and Class = 1 → Young Rich
  • Age > 50 → Senior
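  • Age between 18 and 50 (inclusive) → Middle-aged
  • Others → Other (for example, passengers under 18 who are not in first class, or passengers whose age is missing)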

📖 Why Categorization Matters

Categorization helps simplify complex datasets by grouping similar records. This improves analysis, visualization, and decision-making.
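
As a sketch of the rules above, the same categories can also be written with np.select, which takes a list of conditions and a matching list of labels instead of nested calls. The rows below are hypothetical and invented for illustration; the column names Age and Pclass are assumed to match the dataset used later in this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented for illustration only (not real Titanic data).
passengers = pd.DataFrame({
    "Age": [15, 15, 60, 35, np.nan],
    "Pclass": [1, 3, 2, 3, 1],
})

# One condition per rule, in the same order as the rule list.
conditions = [
    (passengers["Age"] < 18) & (passengers["Pclass"] == 1),
    passengers["Age"] > 50,
    passengers["Age"].between(18, 50),  # inclusive on both ends
]
labels = ["Young Rich", "Senior", "Middle-aged"]

# Rows that match no condition (including missing ages) get the default label.
passengers["age_group"] = np.select(conditions, labels, default="Other")

print(passengers)
```

In this sketch, the 15-year-old in third class and the passenger with a missing age both end up in Other, because comparisons against a missing value evaluate to False.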


📊 Step 2: Extract Samples

Sampling allows us to examine a smaller subset of the dataset without processing everything.

  • Filter dataset by each age group
  • Select first 3 entries using .head(3)
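
💡 Insight: Taking the first three rows of each group keeps the output small and easy to read. Note that .head(3) returns rows in dataset order, so the result is a quick preview rather than a random or statistically representative sample.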
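
The difference between taking the first rows and drawing rows at random can be sketched as follows. The DataFrame is hypothetical and already categorized; the use of .sample and groupby here is an illustration of alternatives, not part of the original workflow.

```python
import pandas as pd

# Hypothetical, already-categorized rows, invented for illustration only.
demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "age_group": ["Senior", "Middle-aged", "Senior", "Senior", "Middle-aged", "Senior"],
})

seniors = demo[demo["age_group"] == "Senior"]

# First 3 rows in dataset order (what this guide uses).
first_three = seniors.head(3)

# 3 rows drawn at random; random_state makes the draw reproducible.
random_three = seniors.sample(n=3, random_state=0)

# First 3 rows of every group in a single call, kept in original row order.
per_group = demo.groupby("age_group").head(3)

print(first_three)
print(random_three)
print(per_group)
```

If the dataset is sorted in some meaningful way, .head(3) inherits that ordering, which is one reason a random draw may be preferable when representativeness matters.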

🔗 Step 3: Concatenate Data

We combine all sampled groups into a single DataFrame using:

pd.concat([group1, group2, group3])

This allows easy comparison across categories.
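
A minimal sketch of how pd.concat treats row labels, using two tiny hypothetical groups invented for illustration. By default the original index labels are kept; passing ignore_index=True renumbers the rows from zero.

```python
import pandas as pd

# Hypothetical sampled groups with their original row labels.
group1 = pd.DataFrame({"Name": ["John"], "age_group": ["Young Rich"]}, index=[4])
group2 = pd.DataFrame({"Name": ["Alice"], "age_group": ["Senior"]}, index=[1])

# Default: rows are stacked and keep their original index labels.
combined = pd.concat([group1, group2])

# ignore_index=True discards the old labels and renumbers 0..n-1.
renumbered = pd.concat([group1, group2], ignore_index=True)

print(combined)
print(renumbered)
```

Keeping the original labels makes it easy to trace a sampled row back to the full dataset, while renumbering gives a cleaner table for display.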


💻 Complete Code Example

import pandas as pd
import numpy as np

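# Sample dataset: illustrative stand-in rows (an assumption; swap in the real Titanic data)
df = pd.DataFrame(
    {'Name': ['John', 'Alice', 'Mark'], 'Age': [15, 60, 35], 'Pclass': [1, 2, 3]},
    index=[1, 2, 3],
)

# Categorization: conditions are checked in order; unmatched rows (including missing ages) become 'Other'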
df['age_group'] = np.where(
    (df['Age'] < 18) & (df['Pclass'] == 1), 'Young Rich',
    np.where(df['Age'] > 50, 'Senior',
    np.where((df['Age'] >= 18) & (df['Age'] <= 50), 'Middle-aged', 'Other'))
)

# Sampling
young_rich = df[df['age_group'] == 'Young Rich'].head(3)
senior = df[df['age_group'] == 'Senior'].head(3)
middle = df[df['age_group'] == 'Middle-aged'].head(3)
other = df[df['age_group'] == 'Other'].head(3)

# Concatenation
final_sample = pd.concat([young_rich, senior, middle, other])

print(final_sample)

🖥 CLI Output Sample

   Name      Age  Pclass  age_group
1  John      15      1    Young Rich
2  Alice     60      2    Senior
3  Mark      35      3    Middle-aged

📂 CLI Explanation

The output lists the sampled passengers with their original columns (Name, Age, Pclass) and the new age_group label. In this sample, each row comes from a different category.


📐 Mathematical Logic Behind Conditions

The categorization logic can be expressed mathematically:

YoungRich = (Age < 18) ∧ (Class = 1)
Senior = (Age > 50)
Middle = (18 ≤ Age ≤ 50)

📖 Mathematical Explanation

The conditions combine inequalities with the logical AND operator (∧). Because the three conditions cannot be true at the same time, and every remaining passenger is assigned to Other, each passenger falls into exactly one category.


🧮 Mathematical Explanation of Categorization Logic

The passenger categorization can be formally described using mathematical logic and piecewise functions. Each passenger is assigned to exactly one category based on their age (A) and class (C).

📌 Piecewise Function Representation

f(A, C) =
{
  "Young Rich"     if (A < 18) ∧ (C = 1)
  "Senior"         if (A > 50)
  "Middle-aged"    if (18 ≤ A ≤ 50)
  "Other"          otherwise
}
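
The piecewise function above maps directly onto a chain of if statements. The sketch below is a plain-Python illustration; the function name categorize is hypothetical and not part of the original code.

```python
def categorize(age, pclass):
    """Return the category for one passenger, following f(A, C) case by case."""
    if age < 18 and pclass == 1:
        return "Young Rich"
    if age > 50:
        return "Senior"
    if 18 <= age <= 50:
        return "Middle-aged"
    # Reached by minors outside first class and by missing ages (NaN),
    # since every comparison against NaN is False.
    return "Other"


print(categorize(15, 1))
print(categorize(15, 3))
print(categorize(60, 2))
print(categorize(50, 3))
print(categorize(float("nan"), 1))
```

A row-by-row function like this is easier to read than nested np.where calls, though the vectorized form is usually faster on large DataFrames.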

📊 Logical Breakdown

  • A < 18 → Young passengers
  • C = 1 → First-class passengers
  • A > 50 → Senior passengers
  • 18 ≤ A ≤ 50 → Middle-aged group
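
📖 Deep Explanation

This logic follows a hierarchical evaluation similar to nested conditional statements. The first condition is checked first, so a passenger who satisfies (A < 18 AND C = 1) is immediately classified as "Young Rich". Because the three explicit conditions cannot be true at the same time, the evaluation order does not change the result here; it would matter only if the conditions overlapped.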

Mathematically, this ensures:

  • Mutual Exclusivity → No passenger belongs to more than one group
  • Collective Exhaustiveness → Every passenger is categorized

In Boolean algebra terms:

YoungRich = (A < 18) ∧ (C = 1)
Senior = (A > 50)
Middle = (A ≥ 18) ∧ (A ≤ 50)
Other = NOT (YoungRich ∨ Senior ∨ Middle)

This guarantees a complete partitioning of the dataset.
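
The partition claim can be checked numerically. The sketch below builds the four Boolean masks for a handful of hypothetical passengers (values invented for illustration, including one missing age) and counts how many categories each passenger belongs to.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (A) and classes (C), invented for illustration only.
A = pd.Series([15, 15, 60, 35, 18, 50, np.nan])
C = pd.Series([1, 3, 2, 3, 1, 2, 1])

young_rich = (A < 18) & (C == 1)
senior = A > 50
middle = (A >= 18) & (A <= 50)
other = ~(young_rich | senior | middle)

# Number of categories each passenger belongs to (True counts as 1).
membership = pd.concat([young_rich, senior, middle, other], axis=1).sum(axis=1)

# A valid partition means every passenger is in exactly one category.
partition_holds = bool((membership == 1).all())

print(membership.tolist())
print(partition_holds)
```

Every count equals 1 in this sketch, including for the passenger with a missing age, who lands in Other because all three explicit comparisons evaluate to False.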

💡 Key Insight: This is a classic example of a piecewise function combined with Boolean logic used in data classification.

🎯 Key Takeaways

  • Use np.where for conditional categorization
  • Taking a few rows per group with .head(3) gives a quick, readable preview of each category
  • pd.concat stacks the sampled DataFrames into a single DataFrame
  • Structured data improves insights

📌 Final Thoughts

This workflow demonstrates how simple transformations can make datasets far more useful. By combining categorization, sampling, and concatenation, you gain better control over your data analysis process.
