Tuesday, November 12, 2024

Chi-Square Test for Categorical Data

Suppose you want to know whether color preference (Green, Pink, or Blue) is associated with gender. A Chi-square test of independence answers this question in six steps:

1. **Create a Contingency Table:**
   Organize your data into a table where rows represent gender and columns represent color choices. For example:

   | | Green | Pink | Blue | Total |
   |--------------|-------|------|------|-------|
   | Boys | O1 | O2 | O3 | B |
   | Girls | O4 | O5 | O6 | G |
   | **Total** | T1 | T2 | T3 | N |

   - O1, O2, O3, O4, O5, O6: Observed frequencies
   - B: Total number of boys
   - G: Total number of girls
   - T1, T2, T3: Totals for each color
   - N: Total number of participants

2. **Calculate Expected Frequencies:**
   For each cell in the table, calculate the expected frequency using the formula:
   
   E_ij = (Row Total_i * Column Total_j) / Grand Total
   
   For instance, the expected frequency for boys choosing green would be:
   
   E_B_Green = (B * T1) / N
   

3. **Compute the Chi-square Statistic:**
   Use the formula:
   
   χ² = Σ ((O_ij - E_ij)² / E_ij)
   
   where `O_ij` is the observed frequency and `E_ij` is the expected frequency for each cell.

4. **Determine Degrees of Freedom:**
   Degrees of freedom `df` are calculated as:
   
   df = (Number of Rows - 1) * (Number of Columns - 1)
   
   For this table:
   
   df = (2 - 1) * (3 - 1) = 2
   

5. **Compare with Critical Value or Compute p-value:**
   Compare the Chi-square statistic with the critical value from the Chi-square distribution table for `df` at your chosen significance level (usually 0.05). Alternatively, compute the p-value.

6. **Interpret Results:**
   - **If the Chi-square statistic is higher than the critical value** or if the p-value is less than 0.05, **reject the null hypothesis**. This indicates a significant association between gender and color preference.
   - **If not**, there’s insufficient evidence to suggest a significant association.

In summary, the Chi-square test helps assess whether the observed differences in color preferences between boys and girls are statistically significant or if they could have occurred by chance.
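
To make the steps concrete, here is a minimal Python sketch using `scipy.stats.chi2_contingency`, which carries out steps 2 through 5 in a single call. The observed counts are hypothetical numbers for the boys/girls example above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies (hypothetical counts for illustration):
# rows = Boys, Girls; columns = Green, Pink, Blue
observed = np.array([
    [30, 10, 40],   # Boys:  O1, O2, O3
    [20, 35, 15],   # Girls: O4, O5, O6
])

# Returns the statistic, the p-value, the degrees of freedom,
# and the table of expected frequencies
chi2, p, dof, expected = chi2_contingency(observed)

print(f"Expected frequencies:\n{expected}")
print(f"Chi-square = {chi2:.4f}, df = {dof}, p-value = {p:.4f}")

if p < 0.05:
    print("Reject the null hypothesis: gender and color preference are associated.")
else:
    print("Insufficient evidence of an association.")
```

Here `dof` comes out to (2 - 1) * (3 - 1) = 2, matching the hand calculation in step 4.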


### **Degrees of Freedom Calculation**

How you compute the degrees of freedom depends on which Chi-square test you are running:

1. **Goodness-of-Fit Test:**
   - **Purpose:** Tests if the observed frequencies match an expected distribution.
   - **Degrees of Freedom Formula:** 
     
     df = Number of Categories - 1
     
   - **Example:** If you’re testing preferences among 3 colors (Green, Pink, Blue), 
     
     df = 3 - 1 = 2
     

2. **Test of Independence (Contingency Table):**
   - **Purpose:** Tests if two categorical variables are independent of each other.
   - **Degrees of Freedom Formula:** 
     
     df = (Number of Rows - 1) * (Number of Columns - 1)
     
   - **Example:** For a table with 2 rows (Boys, Girls) and 3 columns (Green, Pink, Blue),
     
     df = (2 - 1) * (3 - 1) = 2
     

In summary, the degrees of freedom calculation method changes based on whether you're testing a single categorical variable (Goodness-of-Fit) or the relationship between two categorical variables (Independence). Each method uses degrees of freedom to determine the appropriate Chi-square distribution for assessing statistical significance.
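
The snippet below sketches both cases with hypothetical counts, using `scipy.stats.chisquare` for the goodness-of-fit test and reading the degrees of freedom for the independence test from `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Goodness-of-fit: do 90 children split evenly across 3 colors?
observed = np.array([25, 40, 25])   # Green, Pink, Blue
expected = np.array([30, 30, 30])   # uniform expectation
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"Goodness-of-fit: chi2 = {stat:.4f}, p = {p:.4f}")  # df = 3 - 1 = 2

# Independence: df = (rows - 1) * (columns - 1)
table = np.array([[30, 10, 40],
                  [20, 35, 15]])    # 2 rows x 3 columns
_, _, dof, _ = chi2_contingency(table)
print(f"Independence test df = {dof}")  # (2 - 1) * (3 - 1) = 2
```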





### **Understanding the Chi-Square Statistic in Tobit Models**  
A Chi-square statistic reported for a Tobit regression typically refers to one of the following:

1. **Wald Test**: Tests whether individual coefficients are significantly different from zero.
   - Formula:  
     - Chi-Square = (Beta / Standard Error)²  
   - This is the squared z-statistic, used to check the significance of each coefficient.

2. **Likelihood Ratio Test (LRT)**: Compares the log-likelihood of the full model versus a restricted model (e.g., only intercept).  
   - Formula:  
     - Chi-Square = -2 * (Log-Likelihood of Restricted Model - Log-Likelihood of Full Model)  

3. **p-value Calculation**:  
   - p-value = 1 - chi2.cdf(Chi-Square, Degrees of Freedom)  
   - Degrees of Freedom = Number of parameters tested.

---

### **Recovering Chi-Square in Python**  

Assuming you are using the `tobit` package, here's how you might compute both the **Wald test** and the **Likelihood Ratio Test**. (Result attributes such as `params_`, `bse_`, and `llf_` follow that package's naming and may differ across versions.)

#### **1. Wald Test for Each Coefficient**  

```python
import numpy as np
import scipy.stats as stats
from tobit import TobitModel

# y: censored outcome vector, X: covariate matrix (defined elsewhere)
# Fit the Tobit model, assuming the data are left-censored at 0
model = TobitModel(y, X, left=0)
results = model.fit()

# Extract coefficients and their standard errors
betas = results.params_
se = results.bse_

# Wald Chi-square statistic: the squared z-statistic for each coefficient
wald_chi2 = (betas / se) ** 2

# p-values from the Chi-square distribution (df=1 for each single-coefficient test)
p_values = 1 - stats.chi2.cdf(wald_chi2, df=1)

# Display results
for i, (beta, chi2, p) in enumerate(zip(betas, wald_chi2, p_values)):
    print(f"Variable {i}: beta = {beta:.4f}, Chi-Square = {chi2:.4f}, p-value = {p:.4f}")
```


#### **2. Likelihood Ratio Test (LRT)**  
This compares the full model to a restricted model (intercept only).  


```python
# Fit a restricted model (intercept only)
X_restricted = np.ones((X.shape[0], 1))  # column of ones = intercept only
restricted_model = TobitModel(y, X_restricted, left=0)
restricted_results = restricted_model.fit()

# Likelihood Ratio Test statistic
ll_full = results.llf_              # log-likelihood of the full model
ll_restricted = restricted_results.llf_

lrt_chi2 = -2 * (ll_restricted - ll_full)
df = X.shape[1] - 1  # difference in parameter counts (assumes X includes an intercept column)

# p-value from the Chi-square distribution
lrt_p_value = 1 - stats.chi2.cdf(lrt_chi2, df=df)

print(f"Likelihood Ratio Test: Chi-Square = {lrt_chi2:.4f}, p-value = {lrt_p_value:.4f}")
```


---

### **Interpreting Results**  
- **Wald Test Chi-Square**: If a variable's Chi-square statistic exceeds the critical value, its coefficient is significantly different from zero.
- **Pr > Chi-Square (p-value)**: If p < 0.05, the coefficient is statistically significant.  
- **Likelihood Ratio Test**: If the p-value is small, the full model is significantly better than the restricted model.  

