Wednesday, April 05, 2023

Chi-square Test: Application with Examples

The Chi-square test is a statistical hypothesis test used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. It is used when the data are categorical and the observations are independent of one another.

Example-1:

Suppose you are interested in whether there is a relationship between gender and smoking behaviour. You randomly sample 100 people and record their gender and smoking status (smoker or non-smoker). The results are shown in the contingency table below:

          Smoker   Non-Smoker   Total
Male          20           30      50
Female        10           40      50
Total         30           70     100


To conduct a chi-square test, you first need to state your null and alternative hypotheses. In this case, the null hypothesis is that there is no relationship between gender and smoking behaviour, while the alternative hypothesis is that there is a relationship.

Next, you need to calculate the expected frequencies under the null hypothesis. This can be done by multiplying the row total and column total for each cell, and dividing by the total sample size. For example, the expected frequency for the cell in the first row and first column is (50 * 30) / 100 = 15.

          Smoker   Non-Smoker   Total
Male          15           35      50
Female        15           35      50
Total         30           70     100
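
The expected-frequency rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `expected_counts` is a hypothetical helper, not part of any library, and the table is entered as a plain list of rows (Male, Female) by columns (Smoker, Non-Smoker).

```python
def expected_counts(observed):
    """Expected cell counts under independence:
    (row total * column total) / grand total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts: rows = Male, Female; columns = Smoker, Non-Smoker
observed = [[20, 30], [10, 40]]
expected = expected_counts(observed)
print(expected)  # [[15.0, 35.0], [15.0, 35.0]]
```

The same helper works for any table size, because it only relies on the row and column totals.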


Once you have calculated the expected frequencies, you can calculate the chi-square statistic using the formula:

Chi-square statistic = Σ (Observed frequency - Expected frequency)^2 / Expected frequency [χ² = ∑(O - E)² / E]


where O is the observed frequency and E is the expected frequency.

In our example, the chi-square statistic is:


χ² = [(20-15)²/15] + [(30-35)²/35] + [(10-15)²/15] + [(40-35)²/35] = 1.67 + 0.71 + 1.67 + 0.71 = 4.76


To determine whether this value is significant, you need to compare it to the critical value of the chi-square distribution with (rows-1) * (columns-1) degrees of freedom. In this case, the degrees of freedom is (2-1) * (2-1) = 1. Looking up the critical value in a chi-square distribution table with 1 degree of freedom and a significance level of 0.05, we find that the critical value is 3.84.


Since the calculated chi-square value (4.76) is greater than the critical value (3.84), we can reject the null hypothesis and conclude that there is a significant relationship between gender and smoking behaviour.

In other words, the observed frequencies differ from those expected under independence by more than chance alone would readily explain: in this sample, 40% of males smoke (20 of 50) compared with 20% of females (10 of 50).
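
As a cross-check on the hand calculation, the statistic and a p-value for this 2 x 2 table can be computed with the Python standard library. This is a sketch under one stated assumption: the p-value shortcut `erfc(sqrt(x / 2))` is valid only for 1 degree of freedom, where a chi-square variable is the square of a standard normal variable. It is an alternative to the table lookup, not a replacement for it.

```python
import math

# Rows = Male, Female; columns = Smoker, Non-Smoker
observed = [[20, 30], [10, 40]]
expected = [[15, 35], [15, 35]]

# Sum of (O - E)^2 / E over all four cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2), df, round(p_value, 3))
print("reject H0 at 0.05:", p_value < 0.05)
```

A p-value below 0.05 leads to the same decision as comparing the statistic with the critical value of 3.84.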


Example-2: 

The chi-square test for categorical data is commonly used in social science research, public health, and business analysis. This example applies it to a table with more than two columns.


Suppose a market research firm wants to determine if there is a relationship between gender and preferred soda brand. They survey 500 people (250 males and 250 females), asking them which soda brand they prefer: Coke, Pepsi, or Sprite. The results are tabulated as follows:


                 Coke   Pepsi   Sprite   Row Total
Male (n=250)       70     120       60         250
Female (n=250)     80      80       90         250
Column Total      150     200      150         500


The null hypothesis is that there is no relationship between gender and preferred soda brand. The alternative hypothesis is that there is a relationship.


To calculate the expected frequencies, we first need to calculate the row and column totals. 

The expected frequency for each cell is calculated by multiplying the row total and column total for that cell, and then dividing by the total sample size:


Expected frequency for the male/Coke cell: (250 * 150) / 500 = 75

Expected frequency for the female/Coke cell: (250 * 150) / 500 = 75

Expected frequency for the male/Pepsi cell: (250 * 200) / 500 = 100

Expected frequency for the female/Pepsi cell: (250 * 200) / 500 = 100

Expected frequency for the male/Sprite cell: (250 * 150) / 500 = 75

Expected frequency for the female/Sprite cell: (250 * 150) / 500 = 75


                 Coke   Pepsi   Sprite   Row Total
Male (n=250)       75     100       75         250
Female (n=250)     75     100       75         250
Column Total      150     200      150         500


Next, we calculate the chi-square statistic using the formula:


χ² = ∑(O - E)² / E


where O is the observed frequency and E is the expected frequency. The observed and expected frequencies for each cell are shown in the following table:


                                  Coke   Pepsi   Sprite   Row Total
Male (n=250), observed              70     120       60         250
Female (n=250), observed            80      80       90         250
Column Total                       150     200      150         500
Expected frequency (each row)       75     100       75         250

Using the chi-square formula, we can calculate the chi-square statistic:


χ² = [(70 - 75)²/75] + [(120 - 100)²/100] + [(60 - 75)²/75] + [(80 - 75)²/75] + [(80 - 100)²/100] + [(90 - 75)²/75]


χ² = 1/3 + 4 + 3 + 1/3 + 4 + 3 = 14.67


The degrees of freedom for this test is equal to (r-1)(c-1) = (2-1)(3-1) = 2. 

We can use a chi-square distribution table or software to find the critical value for a given level of significance and degrees of freedom. Let's assume a significance level of 0.05.

Looking up the critical value in a chi-square distribution table with 2 degrees of freedom and a significance level of 0.05, we find the critical value to be 5.99.


Since our calculated chi-square value of 14.67 is greater than the critical value of 5.99, we can reject the null hypothesis that there is no relationship between gender and preferred soda brand. We can conclude that there is a significant relationship between gender and preferred soda brand.
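
The comparison with the critical value can also be reproduced without a printed table. The sketch below relies on a shortcut that holds only for 2 degrees of freedom, where the upper-tail probability is exp(-x / 2), so the critical value at level alpha is -2 * ln(alpha). For other degrees of freedom a table or statistical software is still needed.

```python
import math

# Rows = Male, Female; columns = Coke, Pepsi, Sprite
observed = [[70, 120, 60], [80, 80, 90]]
expected = [[75, 100, 75], [75, 100, 75]]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

alpha = 0.05
# Valid for df = 2 only: P(X > x) = exp(-x / 2)
critical_value = -2 * math.log(alpha)
p_value = math.exp(-chi_square / 2)

print(round(chi_square, 2), round(critical_value, 2), chi_square > critical_value)
# 14.67 5.99 True
```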


In this example, we found that there was a significant relationship between gender and preferred soda brand. However, the chi-square test does not tell us the strength or the direction of the relationship. To investigate this, we may want to perform additional analyses, such as calculating a measure of association like Cramér's V to gauge strength, or performing post-hoc tests to determine which cells are driving the significant result.
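
As an illustration of the measure of association mentioned above, Cramér's V can be computed directly from the chi-square statistic as sqrt(χ² / (n * min(r-1, c-1))). The snippet is a minimal sketch that recomputes the statistic from the observed and expected soda counts; the variable names are chosen here for illustration only.

```python
import math

# (O - E)^2 / E for Coke, Pepsi, Sprite; each term occurs once per gender row
chi_square = 2 * (25 / 75 + 400 / 100 + 225 / 75)
n = 500            # total sample size
rows, cols = 2, 3  # gender x soda brand

# Cramér's V ranges from 0 (no association) to 1 (perfect association)
cramers_v = math.sqrt(chi_square / (n * min(rows - 1, cols - 1)))
print(round(cramers_v, 3))
```

A value close to 0 indicates a weak association even when the test itself is significant, which can happen with large samples.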


Example-3:

The chi-square test can also be applied to 5 point Likert type data to analyse the relationship between two categorical variables, where one variable is ordinal with 5 levels (a 5 point Likert scale) and the other is nominal. Note that the test treats the Likert levels as unordered categories, so it does not use the ordering of the scale.


Suppose a company wants to determine if there is a relationship between customer satisfaction and the product category purchased. A survey is conducted among 735 customers, and they are asked to rate their satisfaction with the company's products on a 5 point Likert scale (1 = Very unsatisfied, 2 = Unsatisfied, 3 = Neutral, 4 = Satisfied, 5 = Very satisfied). The customers are also asked to indicate the product category they purchased (Category A, B, or C).


The data are tabulated as follows:


                     A     B     C   Total
Very unsatisfied    10    15    25      50
Unsatisfied         25    30    40      95
Neutral             40    60    50     150
Satisfied           60    90    80     230
Very satisfied      70    75    65     210
Total              205   270   260     735

To conduct a chi-square test, we first need to state our null and alternative hypotheses. The null hypothesis is that there is no relationship between customer satisfaction and product category. The alternative hypothesis is that there is a relationship.


We then calculate the expected frequencies for each cell using the formula:


Expected frequency = (Row total * Column total) / Grand total


For example, the expected frequency for the "Very unsatisfied" and "Category A" cell is (50*205)/735 = 13.95.


The expected frequencies for all cells (rounded to two decimal places) are shown in the following table:


                       A        B        C   Total
Very unsatisfied   13.95    18.37    17.69      50
Unsatisfied        26.50    34.90    33.61      95
Neutral            41.84    55.10    53.06     150
Satisfied          64.15    84.49    81.36     230
Very satisfied     58.57    77.14    74.29     210
Total                205      270      260     735
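
With fifteen cells, computing the expected frequencies by hand is error-prone, so a short script is a useful check. The sketch below applies the (Row total * Column total) / Grand total formula to the observed satisfaction table and verifies that each row of expected counts adds up to the observed row total.

```python
# Observed counts: rows = satisfaction levels, columns = Category A, B, C
observed = [
    [10, 15, 25],  # Very unsatisfied
    [25, 30, 40],  # Unsatisfied
    [40, 60, 50],  # Neutral
    [60, 90, 80],  # Satisfied
    [70, 75, 65],  # Very satisfied
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])

# Sanity check: expected counts reproduce the observed row totals
assert all(abs(sum(e_row) - r) < 1e-9 for e_row, r in zip(expected, row_totals))
```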


Next, we calculate the chi-square statistic using the formula:


 χ² = ∑(O - E)² / E


The chi-square statistic for this example is 11.54.


We then need to find the degrees of freedom (df) for the chi-square distribution. The df value for this test is calculated as (r - 1) * (c - 1), where r is the number of rows and c is the number of columns in the contingency table. In this case, we have 5 rows and 3 columns, so the df value is (5 - 1) * (3 - 1) = 8.


Finally, we use a chi-square distribution table or software to find the p-value associated with the calculated chi-square value and df value. Let's assume a significance level of 0.05.


Looking up the p-value for a chi-square distribution with 8 df and a chi-square value of 11.54, we find the p-value to be approximately 0.17. Equivalently, 11.54 is below the critical value of 15.51 for 8 df at the 0.05 level.


Since our p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis: these data do not provide sufficient evidence of a relationship between customer satisfaction and product category.
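
The statistic and p-value for this table can be checked with the Python standard library. This is a sketch, not a general-purpose routine: the helper `chi2_sf_even_df` is a hypothetical name introduced here, and it uses a closed-form tail probability that holds only when the degrees of freedom are even (as with df = 8 in this example).

```python
import math

observed = [
    [10, 15, 25],  # Very unsatisfied
    [25, 30, 40],  # Unsatisfied
    [40, 60, 50],  # Neutral
    [60, 90, 80],  # Satisfied
    [70, 75, 65],  # Very satisfied
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (O - E)^2 / E with E = row total * column total / grand total
chi_square = sum(
    (o - r * c / grand_total) ** 2 / (r * c / grand_total)
    for row, r in zip(observed, row_totals)
    for o, c in zip(row, col_totals)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable; valid for even df only:
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(chi_square, df)
print(round(chi_square, 2), df, round(p_value, 2))
print("reject H0 at 0.05:", p_value < 0.05)
```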


We can also examine the standardised residuals to see which cells deviate most from their expected frequencies. The standardised residual for each cell is calculated as (Observed frequency - Expected frequency) / sqrt(Expected frequency). As a rule of thumb, cells with standardised residuals greater than 2 or less than -2 are considered noticeably different from what would be expected by chance.


For example, the standardised residual for the "Very unsatisfied" and "Category A" cell is (10 - 13.95) / sqrt(13.95) = -1.06. The standardised residuals for all cells are shown in the following table:


                       A       B       C
Very unsatisfied   -1.06   -0.79    1.74
Unsatisfied        -0.29   -0.83    1.10
Neutral            -0.28    0.66   -0.42
Satisfied          -0.52    0.60   -0.15
Very satisfied      1.49   -0.24   -1.08


No cell has a standardised residual beyond 2 or -2, so no individual cell stands out as departing from independence. The largest residuals are for "Very unsatisfied" and "Category C" (1.74), "Very satisfied" and "Category A" (1.49), "Unsatisfied" and "Category C" (1.10), and "Very satisfied" and "Category C" (-1.08). This hints that Category C customers reported dissatisfaction somewhat more often than expected and Category A customers reported being very satisfied somewhat more often than expected, but none of these deviations reaches the rule-of-thumb threshold.
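
The residual calculation described above can be sketched as follows. The script recomputes the expected counts from the observed table, applies (O - E) / sqrt(E) to each cell, and lists any cell whose residual exceeds the rule-of-thumb threshold of 2 in absolute value. The label lists `levels` and `categories` are names chosen here for readability.

```python
import math

levels = ["Very unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very satisfied"]
categories = ["A", "B", "C"]
observed = [
    [10, 15, 25],
    [25, 30, 40],
    [40, 60, 50],
    [60, 90, 80],
    [70, 75, 65],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Standardised (Pearson) residual for each cell: (O - E) / sqrt(E)
residuals = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for level, row in zip(levels, residuals):
    print(level, [round(res, 2) for res in row])

# Cells beyond the rule-of-thumb threshold of |2|
flagged = [
    (levels[i], categories[j], round(res, 2))
    for i, row in enumerate(residuals)
    for j, res in enumerate(row)
    if abs(res) > 2
]
print(flagged)  # []
```

Squaring and summing these residuals gives back the chi-square statistic, which is why large residuals identify the cells that contribute most to it.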


In conclusion, the chi-square test can be used to analyse 5 point Likert type categorical data to determine if there is a significant relationship between two categorical variables. Examining the standardised residuals then shows which cells, if any, depart most from the expected frequencies.
