
Wednesday, August 21, 2024

Using groupby and describe in Pandas for Effective Data Analysis

### Scenario

Imagine you have a small dataset of students' test scores from three different classes: A, B, and C. You want to analyze how the test scores vary across these classes.

### Data


```python
import pandas as pd

# Sample DataFrame
data = {
    'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Score': [85, 78, 90, 88, 92, 85, 80]
}

df = pd.DataFrame(data)
```


This is what the data looks like:

| Class | Score |
|-------|-------|
| A | 85 |
| A | 78 |
| B | 90 |
| B | 88 |
| C | 92 |
| C | 85 |
| C | 80 |

### Using `groupby` and `describe`

Now, you want to quickly summarize the test scores by class. Instead of calculating each statistic (like the mean or standard deviation) manually, you can use `groupby` and `describe` together:


```python
# Group by 'Class' and describe the 'Score' data
grouped_summary = df.groupby('Class')['Score'].describe()

print(grouped_summary)
```


### Output

```
       count       mean       std   min    25%   50%    75%   max
Class
A        2.0  81.500000  4.949747  78.0  79.75  81.5  83.25  85.0
B        2.0  89.000000  1.414214  88.0  88.50  89.0  89.50  90.0
C        3.0  85.666667  6.027714  80.0  82.50  85.0  88.50  92.0
```
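The summary returned by `describe()` on a grouped column is itself a DataFrame indexed by class, so individual statistics can be pulled out of it with ordinary indexing. The sketch below rebuilds the sample data so it runs on its own; the names `class_means`, `b_mean`, and `compact` are illustrative and not part of the original example.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Score': [85, 78, 90, 88, 92, 85, 80]
})

grouped_summary = df.groupby('Class')['Score'].describe()

# One statistic for every class: a Series indexed by 'Class'
class_means = grouped_summary['mean']

# One statistic for one class: a single float
b_mean = grouped_summary.loc['B', 'mean']   # 89.0

# A subset of the columns, rounded for easier reading
compact = grouped_summary[['count', 'mean', 'std']].round(2)
print(compact)
```

Selecting a few columns this way can be handy when the full eight-column table is more than you need for a report.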


### Why This is Effective

1. **Instant Summary**: In one step, you get a full statistical summary for each class, including the count of students, mean score, standard deviation, and the range (min to max).

2. **Comparison Across Groups**: You can immediately compare the average score, variability (standard deviation), and score distribution (min, 25%, 50%, 75%, max) across different classes. For example:
   - **Class A** has an average score of 81.5 with some variability (std = 4.95).
   - **Class B** has the highest average (89) with very low variability (std = 1.41).
   - **Class C** has the second-highest average (about 85.7) but the widest range of scores (80 to 92).

3. **Identify Insights Quickly**: You might notice that Class B is more consistent in scoring, while Class C has a wider spread. This kind of insight helps you understand the differences between groups at a glance, which would take much longer if you calculated each statistic separately.
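The comparisons in the list above can also be read off the summary programmatically. This is a minimal sketch, assuming the same sample data; the `range` column and the variable names are additions for illustration, since `describe()` itself does not report a range.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Score': [85, 78, 90, 88, 92, 85, 80]
})

summary = df.groupby('Class')['Score'].describe()

# describe() has no range column, so derive it from max and min
summary['range'] = summary['max'] - summary['min']

highest_mean = summary['mean'].idxmax()     # class with the highest average
most_consistent = summary['std'].idxmin()   # class with the smallest standard deviation
widest_range = summary['range'].idxmax()    # class with the largest max - min gap

print(f"Highest mean:    {highest_mean}")     # B
print(f"Most consistent: {most_consistent}")  # B
print(f"Widest range:    {widest_range}")     # C
```

With only two or three scores per class, these rankings are best treated as a quick first look rather than a firm conclusion.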

In summary, the `groupby` and `describe` combination is effective because it allows you to efficiently summarize and compare grouped data, making it easier to identify trends, patterns, or outliers within your dataset.
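To get a feel for what the one-liner saves, here is roughly the same table assembled one statistic at a time. It is a sketch for comparison only, assuming the same sample data; `by_class`, `manual`, and `matches` are names made up for this example.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Score': [85, 78, 90, 88, 92, 85, 80]
})

by_class = df.groupby('Class')['Score']

# The one-step version
described = by_class.describe()

# The step-by-step version: each statistic requested separately
manual = pd.DataFrame({
    'count': by_class.count(),
    'mean': by_class.mean(),
    'std': by_class.std(),
    'min': by_class.min(),
    '25%': by_class.quantile(0.25),
    '50%': by_class.median(),
    '75%': by_class.quantile(0.75),
    'max': by_class.max(),
})

# Compare the two tables cell by cell (count is an integer in the manual
# version and a float in describe(), so cast before subtracting)
matches = bool(((manual.astype(float) - described).abs() < 1e-9).all().all())
print(matches)
```

Eight separate calls collapse into a single `describe()`, which is the main convenience the combination offers.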
