1. **Create a Contingency Table:**
Organize your data into a table where rows represent gender and columns represent color choices. For example:
| | Green | Pink | Blue | Total |
|--------------|-------|------|------|-------|
| Boys | O1 | O2 | O3 | B |
| Girls | O4 | O5 | O6 | G |
| **Total** | T1 | T2 | T3 | N |
- O1, O2, O3, O4, O5, O6: Observed frequencies
- B: Total number of boys
- G: Total number of girls
- T1, T2, T3: Totals for each color
- N: Total number of participants
2. **Calculate Expected Frequencies:**
For each cell in the table, calculate the expected frequency using the formula:
E_ij = (Row Total_i * Column Total_j) / Grand Total
For instance, the expected frequency for boys choosing green would be:
E_B_Green = (B * T1) / N
3. **Compute the Chi-square Statistic:**
Use the formula:
χ² = Σ ( (O_ij - E_ij)² / E_ij )
where `O_ij` is the observed frequency and `E_ij` is the expected frequency for each cell.
4. **Determine Degrees of Freedom:**
Degrees of freedom `df` are calculated as:
df = (Number of Rows - 1) * (Number of Columns - 1)
For this table:
df = (2 - 1) * (3 - 1) = 2
5. **Compare with Critical Value or Compute p-value:**
Compare the Chi-square statistic with the critical value from the Chi-square distribution table for `df` at your chosen significance level (usually 0.05). Alternatively, compute the p-value.
6. **Interpret Results:**
- **If the Chi-square statistic exceeds the critical value**, or equivalently if the p-value is below your significance level (e.g., 0.05), **reject the null hypothesis**. This indicates a significant association between gender and color preference.
- **If not**, fail to reject the null hypothesis: there is insufficient evidence of an association between gender and color preference.
In summary, the Chi-square test helps assess whether the observed differences in color preferences between boys and girls are statistically significant or if they could have occurred by chance.
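To make these steps concrete, here is a minimal sketch using SciPy's `chi2_contingency`, which performs steps 2-5 in a single call; the observed counts below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts (rows: Boys, Girls; columns: Green, Pink, Blue)
observed = np.array([
    [30, 10, 20],   # Boys
    [15, 25, 20],   # Girls
])

# Expected frequencies, statistic, degrees of freedom, and p-value in one call
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Expected frequencies:\n", expected)
print(f"Chi-square = {chi2_stat:.4f}, df = {dof}, p-value = {p_value:.4f}")

# Step 6: compare against the critical value at the 0.05 significance level
critical_value = chi2.ppf(0.95, dof)
print(f"Critical value (alpha = 0.05): {critical_value:.4f}")
print("Reject H0" if chi2_stat > critical_value else "Fail to reject H0")
```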
### **Degrees of Freedom Calculation**
1. **Goodness-of-Fit Test:**
- **Purpose:** Tests if the observed frequencies match an expected distribution.
- **Degrees of Freedom Formula:**
df = Number of Categories - 1
- **Example:** If you’re testing preferences among 3 colors (Green, Pink, Blue),
df = 3 - 1 = 2
2. **Test of Independence (Contingency Table):**
- **Purpose:** Tests if two categorical variables are independent of each other.
- **Degrees of Freedom Formula:**
df = (Number of Rows - 1) * (Number of Columns - 1)
- **Example:** For a table with 2 rows (Boys, Girls) and 3 columns (Green, Pink, Blue),
df = (2 - 1) * (3 - 1) = 2
In summary, the degrees of freedom calculation method changes based on whether you're testing a single categorical variable (Goodness-of-Fit) or the relationship between two categorical variables (Independence). Each method uses degrees of freedom to determine the appropriate Chi-square distribution for assessing statistical significance.
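For the goodness-of-fit case, here is a minimal SciPy sketch using `chisquare` (the counts are again made up; the independence case was sketched above with `chi2_contingency`):

```python
from scipy.stats import chisquare

# Hypothetical counts for a single variable: Green, Pink, Blue
observed = [45, 30, 25]

# By default, chisquare tests against a uniform expected distribution,
# i.e. expected = sum(observed) / 3 = 33.33 per category here
stat, p_value = chisquare(observed)

# df = number of categories - 1 = 2
print(f"Goodness-of-fit: Chi-square = {stat:.4f}, p-value = {p_value:.4f}")
```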
### **Understanding the Chi-Square Statistic in Tobit Models**
In a Tobit regression, a reported Chi-Square statistic typically refers to one of the following:
1. **Wald Test**: Tests whether individual coefficients are significantly different from zero.
- Formula:
- Chi-Square = (Beta / Standard Error)²
- This is the squared z-statistic, used to check the significance of each coefficient.
2. **Likelihood Ratio Test (LRT)**: Compares the log-likelihood of the full model versus a restricted model (e.g., only intercept).
- Formula:
- Chi-Square = -2 * (Log-Likelihood of Restricted Model - Log-Likelihood of Full Model)
3. **p-value Calculation**:
- p-value = 1 - chi2.cdf(Chi-Square, Degrees of Freedom)
- Degrees of Freedom = Number of parameters tested.
---
### **Recovering Chi-Square in Python**
Assuming you are using the `tobit` package, here's how to compute both the **Wald test** and the **Likelihood Ratio Test**. Note that attribute names such as `params_`, `bse_`, and `llf_` vary between Tobit implementations, so adjust them to match your package.
#### **1. Wald Test for Each Coefficient**
```python
import numpy as np
import scipy.stats as stats
from tobit import TobitModel

# Fit the Tobit model (y: censored outcome array, X: design matrix)
model = TobitModel(y, X, left=0)  # assuming left-censoring at 0
results = model.fit()

# Extract coefficients and standard errors (assumed numpy arrays)
betas = results.params_
se = results.bse_

# Wald Chi-Square statistic: the squared z-statistic for each coefficient
wald_chi2 = (betas / se) ** 2

# p-values from the Chi-Square distribution (df = 1 for each single-coefficient test)
p_values = 1 - stats.chi2.cdf(wald_chi2, df=1)

# Display results
for i, (beta, chi2_val, p) in enumerate(zip(betas, wald_chi2, p_values)):
    print(f"Variable {i}: beta = {beta:.4f}, Chi-Square = {chi2_val:.4f}, p-value = {p:.4f}")
```
#### **2. Likelihood Ratio Test (LRT)**
This compares the full model to a restricted model (intercept only).
```python
# Fit a restricted model containing only the intercept
X_restricted = np.ones((X.shape[0], 1))
restricted_model = TobitModel(y, X_restricted, left=0)
restricted_results = restricted_model.fit()

# Likelihood Ratio Test statistic: -2 * (LL_restricted - LL_full)
ll_full = results.llf_              # log-likelihood of the full model
ll_restricted = restricted_results.llf_
lrt_chi2 = -2 * (ll_restricted - ll_full)

# Degrees of freedom = difference in the number of estimated coefficients
# (assumes X already includes an intercept column)
df = X.shape[1] - 1

# Compute p-value
lrt_p_value = 1 - stats.chi2.cdf(lrt_chi2, df=df)
print(f"Likelihood Ratio Test: Chi-Square = {lrt_chi2:.4f}, p-value = {lrt_p_value:.4f}")
```
---
### **Interpreting Results**
- **Wald Test Chi-Square**: A large Chi-Square for a variable (relative to the critical value) indicates its coefficient is significantly different from zero.
- **Pr > Chi-Square (p-value)**: If p < 0.05, the coefficient is statistically significant.
- **Likelihood Ratio Test**: If the p-value is small, the full model is significantly better than the restricted model.
