
Tuesday, August 13, 2024

Biased and Unbiased Selection in Statistics: Concepts and Calculations


In statistics, the difference between biased and unbiased selection is about how representative a sample is of the entire population.

**Biased Selection:**
Imagine you want to understand the average height of all students in a school, but you only measure the height of the basketball team. Since the basketball players are generally taller than average, your sample won’t accurately represent the heights of all students.

**Unbiased Selection:**
Now, if you randomly select students from all grades and classes to measure their heights, you’re more likely to get a sample that represents the entire student body accurately. This method reduces the chance of over-representing any particular group.

In essence, a biased selection skews results because it doesn’t accurately reflect the entire population, while an unbiased selection gives a more accurate picture by representing the population fairly.
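As a rough sketch of the two selection methods, here is a few lines of Python. The roster of heights is a made-up illustration (not real data), with the last few entries standing in for the basketball team:

```python
import random

# Hypothetical roster of student heights in cm (illustrative assumption);
# the last few entries represent the basketball team.
heights = [150, 152, 155, 158, 160, 162, 165, 168, 170, 185, 190, 193]

# Biased selection: measure only the basketball team (the tallest students).
biased_sample = sorted(heights)[-3:]

# Unbiased selection: a simple random sample from the whole student body.
random.seed(1)  # seeded only so the sketch is reproducible
unbiased_sample = random.sample(heights, k=3)

population_mean = sum(heights) / len(heights)
print(population_mean)           # school-wide average
print(sum(biased_sample) / 3)    # overestimates the average
print(sum(unbiased_sample) / 3)  # varies by draw, but unbiased on average
```

Any single random sample can still miss the true average; the point is that random selection has no systematic tilt toward any group.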

The terms `n` and `n-1` come into play when calculating sample statistics, particularly when estimating the population variance or standard deviation from a sample.

**Sample Variance Calculation:**

- **Using `n` (Sample Size):** When calculating the variance of a sample, dividing the sum of squared deviations from the sample mean by `n` gives a *biased* estimate. This method tends to underestimate the population variance because the sample mean is itself an estimate, not the true population mean, so deviations measured from it are slightly too small on average.

- **Using `n-1` (Degrees of Freedom):** To correct for this underestimation, we divide by `n-1` instead. This adjustment is known as "Bessel's correction." The resulting value is the *sample variance*, which provides an unbiased estimate of the population variance.

**Example:**

Suppose you measure the heights of 4 students and get these values: 150 cm, 160 cm, 165 cm, and 170 cm.

1. Calculate the sample mean: `(150 + 160 + 165 + 170) / 4 = 161.25` cm.
2. Find the squared deviations from the mean and sum them up: `(150 - 161.25)^2 + (160 - 161.25)^2 + (165 - 161.25)^2 + (170 - 161.25)^2`.
3. The sum is `126.5625 + 1.5625 + 14.0625 + 76.5625 = 218.75`.

- **Using `n` (4):** Variance = `218.75 / 4 = 54.6875` (this tends to underestimate the true variance of the population).

- **Using `n-1` (3):** Variance = `218.75 / 3 = 72.9167` (this is an unbiased estimate of the population variance).

So, using `n-1` corrects for the bias in the sample variance estimation.
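The arithmetic above can be double-checked in a few lines of Python; the standard `statistics` module implements both divisors (`pvariance` divides by `n`, `variance` applies Bessel's correction and divides by `n-1`):

```python
import statistics

heights = [150, 160, 165, 170]          # the four students from the example
mean = sum(heights) / len(heights)      # 161.25
ss = sum((h - mean) ** 2 for h in heights)  # sum of squared deviations: 218.75

var_n = ss / len(heights)               # divide by n   -> 54.6875
var_n1 = ss / (len(heights) - 1)        # divide by n-1 -> ~72.9167

# statistics.pvariance divides by n; statistics.variance divides by n-1.
print(var_n, statistics.pvariance(heights))
print(var_n1, statistics.variance(heights))
```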

Monday, August 5, 2024

Calculating Sample Variance: Using n vs. n−1 & Variance Inflation Factor (VIF) Explained

This educational guide explains how to calculate sample variance using n and n-1, the differences in results, and real-world implications. Additionally, we explore the Variance Inflation Factor (VIF) for detecting multicollinearity.

📑 Table of Contents

  1. Sample Variance Basics
  2. Step-by-Step Variance Calculation Example
  3. Real-Life Example: Clinical Trials
  4. Detecting Multicollinearity Using Variance Inflation Factor (VIF)

1. Sample Variance Basics

Variance measures the spread of data points around the mean. There are two common formulas for sample variance:

  • Using n: Average of squared deviations from the mean.
  • Using n-1: Corrected version to estimate population variance from a sample, also called Bessel's correction.

Why n-1? Using n underestimates the true variance when working with a sample because it does not account for the fact that the mean is itself estimated from the sample.
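One way to see the underestimation is a small simulation: draw many samples from a synthetic population with known variance and average the two estimators. The population parameters (mean 100, standard deviation 15) and the sample size are arbitrary choices for illustration:

```python
import random

random.seed(0)
# Synthetic population with a known spread (mean 100, sd 15 are arbitrary).
population = [random.gauss(100, 15) for _ in range(50_000)]
mu = sum(population) / len(population)
true_var = sum((x - mu) ** 2 for x in population) / len(population)

n = 5
est_n, est_n1 = [], []
for _ in range(10_000):
    s = random.sample(population, n)
    m = sum(s) / n
    ss = sum((x - m) ** 2 for x in s)
    est_n.append(ss / n)         # divide by n
    est_n1.append(ss / (n - 1))  # divide by n-1 (Bessel's correction)

print(true_var)                   # roughly 225
print(sum(est_n) / len(est_n))    # systematically too low (about 4/5 of it)
print(sum(est_n1) / len(est_n1))  # close to true_var on average
```

With samples of size 5, the n-divided estimator averages about (n−1)/n = 4/5 of the true variance, exactly the bias Bessel's correction removes.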

2. Step-by-Step Variance Calculation Example

Let's calculate the variance for three student scores: 80, 85, 90 using n.

  1. Average = (80 + 85 + 90) / 3 = 85
  2. Squared deviations:
    • (80 - 85)^2 = 25
    • (85 - 85)^2 = 0
    • (90 - 85)^2 = 25
  3. Variance = (25 + 0 + 25) / 3 = 16.67

Now, using n-1 for the same data:

  1. Squared deviations remain 25, 0, 25
  2. Variance = (25 + 0 + 25) / (3-1) = 25

Comparison:

  • Variance using n = 16.67
  • Variance using n-1 = 25

CLI Simulation Example

$ python
>>> import numpy as np
>>> data = [80, 85, 90]
>>> np.var(data)       # Using n
16.666666666666668
>>> np.var(data, ddof=1) # Using n-1
25.0

3. Real-Life Example: Clinical Trials

Consider a study evaluating a new drug:

  • Two groups: Drug vs Placebo, 30 patients each
  • We calculate variance of blood pressure reduction
  1. Impact of using n vs n-1:
    • Variance with n = 25 mmHg²
    • Variance with n-1 = 25 × 30/29 ≈ 25.9 mmHg²
  2. Consequences:
    • Confidence intervals are narrower with n → overconfidence in effect.
    • Hypothesis tests may falsely indicate significance.
    • Decision-making may be flawed → regulatory or safety issues.
  3. Key Takeaway: Using n-1 ensures accurate estimates, maintaining reliability and public safety.
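Taking the figures above as illustrative inputs, a quick sketch shows how the divisor changes a confidence interval's half-width. The n-1 variance is derived from the same sum of squares (25 × 30/29), and the 1.96 critical value is a normal-approximation assumption for a 95% interval:

```python
import math

# Illustrative figures from the example: n = 30 patients per arm,
# variance 25 mmHg^2 when dividing by n.
n = 30
var_n = 25.0
var_n1 = var_n * n / (n - 1)  # same sum of squares, divided by n-1

z = 1.96  # assumed 95% normal critical value, for illustration
half_width_n = z * math.sqrt(var_n / n)
half_width_n1 = z * math.sqrt(var_n1 / n)

print(round(half_width_n, 3))   # narrower interval -> overstated precision
print(round(half_width_n1, 3))
```

The difference is small at n = 30, but with small samples the n-divided interval can be noticeably too narrow, which is where the overconfidence risk comes from.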

4. Detecting Multicollinearity Using Variance Inflation Factor (VIF)

VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity:

# Python example using statsmodels
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 10],  # Perfectly correlated with X1 (X2 = 2 * X1)
    'X3': [5, 3, 6, 2, 1]
})

# Add an intercept so each VIF is computed on the design matrix the
# regression would actually use; skip the constant's own VIF below.
X = sm.add_constant(data)

vif_data = pd.DataFrame()
vif_data["feature"] = data.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i + 1)
                   for i in range(data.shape[1])]
print(vif_data)  # X1 and X2 show an effectively infinite VIF

High VIF (>10) indicates multicollinearity, which can distort regression results.

💡 Key Takeaways

  • Use n-1 for sample variance to avoid underestimation.
  • Incorrect variance can mislead confidence intervals and hypothesis tests.
  • VIF helps detect multicollinearity in regression, ensuring robust model interpretation.
