Chi-square Test: Application with Examples

The Chi-square test is a statistical hypothesis test used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. It is used when the data are categorical and the observations are independent of one another; in a two-way table, it tests whether the two variables are independent.

Example-1:

Suppose you are interested in whether there is a relationship between gender and smoking behaviour. You randomly sample 100 people and record their gender and smoking status (smoker or non-smoker). The results are shown in the contingency table below:

	Smoker	Non-Smoker	Total
Male	20	30	50
Female	10	40	50
Total	30	70	100

To conduct a chi-square test, you first need to state your null and alternative hypotheses. In this case, the null hypothesis is that there is no relationship between gender and smoking behaviour, while the alternative hypothesis is that there is a relationship.

Next, you need to calculate the expected frequencies under the null hypothesis. This can be done by multiplying the row total and column total for each cell, and dividing by the total sample size. For example, the expected frequency for the cell in the first row and first column is (50 * 30) / 100 = 15.

	Smoker	Non-Smoker	Total
Male	15	35	50
Female	15	35	50
Total	30	70	100
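The expected table can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `expected_counts` is an illustrative choice, not something defined in the text above.

```python
# Observed counts from the gender-by-smoking table
# (rows: Male, Female; columns: Smoker, Non-Smoker).
observed = [
    [20, 30],
    [10, 40],
]

def expected_counts(table):
    """Expected count per cell under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, row)
```

Both rows come out as 15 smokers and 35 non-smokers because the two row totals are equal.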

Once you have calculated the expected frequencies, you can calculate the chi-square statistic using the formula:

Chi-square statistic = Σ (Observed frequency - Expected frequency)^2 / Expected frequency [χ² = ∑(O - E)² / E]

where O is the observed frequency and E is the expected frequency.
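As a minimal sketch, the formula can be written as a short Python function. The name `chi_square_statistic` is an illustrative choice; the counts are the observed and expected frequencies from the two gender-by-smoking tables above.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of two equally shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total

# Rows: Male, Female; columns: Smoker, Non-Smoker.
observed = [[20, 30], [10, 40]]
expected = [[15, 35], [15, 35]]

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.2f}")  # prints "chi-square = 4.76"
```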

In our example, the chi-square statistic is:

χ² = [(20-15)²/15] + [(30-35)²/35] + [(10-15)²/15] + [(40-35)²/35] = 1.67 + 0.71 + 1.67 + 0.71 = 4.76

To determine whether this value is significant, you need to compare it to the critical value of the chi-square distribution with (rows-1) * (columns-1) degrees of freedom. In this case, there is (2-1) * (2-1) = 1 degree of freedom. Looking up the critical value in a chi-square distribution table with 1 degree of freedom and a significance level of 0.05, we find that the critical value is 3.84.

Since the calculated chi-square value (4.76) is greater than the critical value (3.84), we reject the null hypothesis and conclude that there is a statistically significant relationship between gender and smoking behaviour at the 0.05 level.

In other words, the observed frequencies differ from those expected under independence by more than chance alone would plausibly produce: in this sample, 40% of males smoke (20 of 50) compared with 20% of females (10 of 50).
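Instead of a printed table, the comparison can also be made through a p-value. The sketch below relies on a standard identity that holds only for 1 degree of freedom (a chi-square variable with 1 df is the square of a standard normal variable); the function name `p_value_df1` is hypothetical.

```python
import math

def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)) for standard normal Z.
    return math.erfc(math.sqrt(stat / 2))

# Observed and expected counts from the gender-by-smoking example.
observed = [20, 30, 10, 40]
expected = [15, 35, 15, 35]
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
print("reject H0 at 0.05" if p < 0.05 else "fail to reject H0 at 0.05")
```

The decision matches the critical-value approach: a p-value below 0.05 corresponds to a statistic above 3.84.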

Example-2:

As in the first example, the chi-square test here compares the observed frequencies in a contingency table with the frequencies expected if the two variables were unrelated. This time one variable has three categories. Tests of this kind are commonly used in social science research, public health, and business analysis.

Suppose a market research firm wants to determine if there is a relationship between gender and preferred soda brand. They conduct a survey of 500 people, 250 males, and 250 females, asking them which soda brand they prefer: Coke, Pepsi, or Sprite. The results are tabulated as follows:

	Coke	Pepsi	Sprite	Row Total
Male (n=250)	70	120	60	250
Female (n=250)	80	80	90	250
Column Total	150	200	150	500

The null hypothesis is that there is no relationship between gender and preferred soda brand. The alternative hypothesis is that there is a relationship.

To calculate the expected frequencies, we use the row and column totals shown in the table above.

The expected frequency for each cell is calculated by multiplying the row total and column total for that cell, and then dividing by the total sample size:

Expected frequency for the male/Coke cell: (250 * 150) / 500 = 75

Expected frequency for the female/Coke cell: (250 * 150) / 500 = 75

Expected frequency for the male/Pepsi cell: (250 * 200) / 500 = 100

Expected frequency for the female/Pepsi cell: (250 * 200) / 500 = 100

Expected frequency for the male/Sprite cell: (250 * 150) / 500 = 75

Expected frequency for the female/Sprite cell: (250 * 150) / 500 = 75

	Coke	Pepsi	Sprite	Row Total
Male (n=250)	75	100	75	250
Female (n=250)	75	100	75	250
Column Total	150	200	150	500

Next, we calculate the chi-square statistic using the formula:

χ² = ∑(O - E)² / E

where O is the observed frequency and E is the expected frequency. The observed and expected frequencies for each cell are shown in the following table:

	Coke	Pepsi	Sprite	Row Total
Male, observed	70	120	60	250
Male, expected	75	100	75	250
Female, observed	80	80	90	250
Female, expected	75	100	75	250
Column Total (observed)	150	200	150	500

Using the chi-square formula, we can calculate the chi-square statistic:

χ² = [(70 - 75)²/75] + [(120 - 100)²/100] + [(60 - 75)²/75] + [(80 - 75)²/75] + [(80 - 100)²/100] + [(90 - 75)²/75]

χ² = 0.333 + 4 + 3 + 0.333 + 4 + 3 ≈ 14.67

The degrees of freedom for this test are (r-1)(c-1) = (2-1)(3-1) = 2.

We can use a chi-square distribution table or software to find the critical value for a given level of significance and degrees of freedom. Let's assume a significance level of 0.05.

Looking up the critical value in a chi-square distribution table with 2 degrees of freedom and a significance level of 0.05, we find the critical value to be 5.99.

Since our calculated chi-square value of 14.67 is greater than the critical value of 5.99, we reject the null hypothesis that there is no relationship between gender and preferred soda brand. We conclude that there is a statistically significant relationship between gender and preferred soda brand at the 0.05 level.
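The whole calculation for the soda example can be cross-checked with a short script. This is a sketch using only the standard library; it uses the fact that for exactly 2 degrees of freedom the chi-square upper tail is exp(-x/2), so the closed forms below do not apply to other tables.

```python
import math

brands = ["Coke", "Pepsi", "Sprite"]
observed = {"Male": [70, 120, 60], "Female": [80, 80, 90]}

col_totals = [sum(counts) for counts in zip(*observed.values())]
n = sum(col_totals)

chi2 = 0.0
for gender, counts in observed.items():
    row_total = sum(counts)
    for brand, o, col_total in zip(brands, counts, col_totals):
        e = row_total * col_total / n   # expected count under independence
        term = (o - e) ** 2 / e         # this cell's contribution to chi-square
        chi2 += term
        print(f"{gender:6s} {brand:6s} O={o:3d} E={e:5.1f} term={term:.3f}")

# Closed forms valid for 2 degrees of freedom only.
alpha = 0.05
critical = -2 * math.log(alpha)
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, critical = {critical:.2f}, p = {p_value:.5f}")
```

The per-cell terms show that the Pepsi and Sprite columns account for almost all of the statistic, while the Coke cells contribute very little.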

In this example, we found that there was a significant relationship between gender and preferred soda brand. However, the chi-square test does not tell us how strong the relationship is or which categories produce it. To investigate this, we can calculate a measure of association such as Cramér's V, or perform post-hoc tests (for example, examining the residuals of individual cells) to determine which cells are driving the significant result.
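As an illustration of such a follow-up, Cramér's V rescales the chi-square statistic to a value between 0 and 1 using V = sqrt(χ² / (n * min(r-1, c-1))). The sketch below recomputes the statistic from the soda table; the function name `cramers_v` is an illustrative choice.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Rows: Male, Female; columns: Coke, Pepsi, Sprite.
observed = [[70, 120, 60], [80, 80, 90]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = sum(
    (o - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for o, c in zip(row, col_totals)
)

v = cramers_v(chi2, n, len(observed), len(observed[0]))
print(f"Cramér's V = {v:.2f}")
```

A value of roughly 0.17 is usually read as a fairly weak association, although such thresholds are conventions rather than rules. A significant test with a modest V is common in samples as large as 500.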

Example-3:

The chi-square test can also be applied to 5-point Likert-type data to analyse the relationship between two categorical variables, where one variable is ordinal with 5 levels (a 5-point Likert scale) and the other variable is nominal. Note that the test treats the five levels as unordered categories, so it does not use the ordering of the scale.

Suppose a company wants to determine if there is a relationship between customer satisfaction and product category. A survey is conducted among 735 customers, and they are asked to rate their satisfaction with the company's products on a 5-point Likert scale (1 = Very unsatisfied, 2 = Unsatisfied, 3 = Neutral, 4 = Satisfied, 5 = Very satisfied). The customers are also asked to indicate the product category they purchased (Category A, B, or C).

The data are tabulated as follows:

	A	B	C	Total
Very unsatisfied	10	15	25	50
Unsatisfied	25	30	40	95
Neutral	40	60	50	150
Satisfied	60	90	80	230
Very satisfied	70	75	65	210
Total	205	270	260	735

To conduct a chi-square test, we first need to state our null and alternative hypotheses. The null hypothesis is that there is no relationship between customer satisfaction and product category. The alternative hypothesis is that there is a relationship.

We then calculate the expected frequencies for each cell using the formula:

Expected frequency = (Row total * Column total) / Grand total

For example, the expected frequency for the "Very unsatisfied" and "Category A" cell is (50*205)/735 = 13.95.

The expected frequencies for all cells are shown in the following table:

	A	B	C	Total
Very unsatisfied	13.95	18.37	17.69	50
Unsatisfied	26.50	34.90	33.61	95
Neutral	41.84	55.10	53.06	150
Satisfied	64.15	84.49	81.36	230
Very satisfied	58.57	77.14	74.29	210
Total	205	270	260	735
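With fifteen cells, the expected frequencies are easier to compute in code than by hand. The following sketch applies (row total * column total) / grand total to every cell of the observed table; the variable names are illustrative.

```python
levels = ["Very unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very satisfied"]

# Observed counts; columns are product categories A, B, C.
observed = [
    [10, 15, 25],
    [25, 30, 40],
    [40, 60, 50],
    [60, 90, 80],
    [70, 75, 65],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]

for level, row in zip(levels, expected):
    print(f"{level:17s}" + "".join(f"{e:8.2f}" for e in row))
```

Each row of expected counts sums to the observed row total, which is a quick sanity check on any expected table. The smallest expected count here is about 13.95, well above the value of 5 that is commonly used as a rule of thumb for the chi-square approximation.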

Next, we calculate the chi-square statistic using the formula:

χ² = ∑(O - E)² / E

The chi-square statistic for this example is 11.54.

We then need to find the degrees of freedom (df) for the chi-square distribution. The df value for this test is calculated as (r - 1) * (c - 1), where r is the number of rows and c is the number of columns in the contingency table. In this case, we have 5 rows and 3 columns, so the df value is (5 - 1) * (3 - 1) = 8.

Finally, we use a chi-square distribution table or software to find the p-value associated with the calculated chi-square value and df value. Let's assume a significance level of 0.05.

Looking up the p-value for a chi-square distribution with 8 df and a chi-square value of 11.54, we find the p-value to be approximately 0.17. Equivalently, 11.54 is below the critical value of 15.51 for 8 df at the 0.05 level.

Since our p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis: these data do not provide sufficient evidence of a relationship between customer satisfaction and product category.
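As a cross-check, the statistic and its p-value can be recomputed from the observed counts alone. The sketch below uses only the standard library; `p_value_even_df` is a hypothetical helper based on the closed-form upper tail of the chi-square distribution, which exists only when the degrees of freedom are even (as they are here, with df = 8).

```python
import math

def p_value_even_df(stat, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form only covers positive even df")
    half = stat / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Rows: Very unsatisfied ... Very satisfied; columns: categories A, B, C.
observed = [
    [10, 15, 25],
    [25, 30, 40],
    [40, 60, 50],
    [60, 90, 80],
    [70, 75, 65],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = sum(
    (o - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for o, c in zip(row, col_totals)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p = p_value_even_df(chi2, df)

print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.3f}")
```

For this table the script gives a statistic of about 11.54 on 8 degrees of freedom and a p-value of about 0.17.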

We can also examine the standardised residuals to see which cells deviate most from their expected counts. The standardised residual for each cell is calculated as (Observed frequency - Expected frequency) / sqrt(Expected frequency). As a rule of thumb, cells with standardised residuals greater than 2 or less than -2 differ notably from what would be expected by chance.

For example, the standardised residual for the "Very unsatisfied" and "Category A" cell is (10 - 13.95) / sqrt(13.95) = -1.06. The standardised residuals for all cells are shown in the following table:

	A	B	C
Very unsatisfied	-1.06	-0.79	1.74
Unsatisfied	-0.29	-0.83	1.10
Neutral	-0.28	0.66	-0.42
Satisfied	-0.52	0.60	-0.15
Very satisfied	1.49	-0.24	-1.08

We can see that the cells with the largest standardised residuals are "Very unsatisfied" and "Category C" (1.74) and "Very satisfied" and "Category A" (1.49). This suggests that customers who purchased Category C products reported the lowest satisfaction level somewhat more often than expected, and customers who purchased Category A products the highest level somewhat more often. However, no residual exceeds 2 in absolute value, so no individual cell departs markedly from what independence predicts.
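The residuals can be computed and screened against the ±2 rule of thumb with a short script. This is a sketch under the definition used above, (O - E) / sqrt(E); some texts divide by an additional term to obtain "adjusted" residuals, which is not done here.

```python
import math

levels = ["Very unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very satisfied"]
categories = ["A", "B", "C"]
observed = [
    [10, 15, 25],
    [25, 30, 40],
    [40, 60, 50],
    [60, 90, 80],
    [70, 75, 65],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

residuals = []
flagged = []  # cells whose residual exceeds 2 in absolute value
for level, row, r in zip(levels, observed, row_totals):
    res_row = []
    for cat, o, c in zip(categories, row, col_totals):
        e = r * c / n
        res = (o - e) / math.sqrt(e)
        res_row.append(res)
        if abs(res) > 2:
            flagged.append((level, cat, res))
    residuals.append(res_row)
    print(f"{level:17s}" + "".join(f"{x:7.2f}" for x in res_row))

print("cells beyond +/-2:", flagged)
```

The sum of the squared residuals equals the chi-square statistic, so the table of residuals is a cell-by-cell breakdown of the overall test.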

In conclusion, the chi-square test can be used to analyse 5-point Likert-type categorical data to determine whether there is a significant relationship between two categorical variables. Examining the standardised residuals then shows which cells deviate most from their expected counts and, when the overall test is significant, which cells are driving that result.

Wednesday, April 05, 2023
