Tuesday, August 24, 2010

Chi-square Test of Independence

The chi-square goodness of fit test and test for independence are both available in SPSS. Recall that chi-square is useful for analyzing whether a frequency distribution for a categorical or nominal variable is consistent with expectations (a goodness of fit test), or whether two categorical or nominal variables are related or associated with each other (a test for independence). Categorical or nominal variables assign values by virtue of membership in a category. Sex is a nominal variable. It can take on two values, male and female, which are usually coded numerically as 1 or 2. These numerical codes do not give any information about how much of some characteristic the individual possesses. Instead, the numbers merely provide information about the category to which the individual belongs. Other examples of nominal or categorical variables include hair color, race, diagnosis (e.g., ADHD vs. anxiety vs. depression vs. chemically dependent), and type of treatment (e.g., medication vs. behavior management vs. none). Note that these are the same type of variables that can be used as independent variables in a t-test or ANOVA. In the latter analyses, the researcher is interested in the means of another variable measured on an interval or ratio scale. In chi-square, the interest is in the frequency with which individuals fall in the category or combination of categories.

The chi-square test for independence is a test of whether two categorical variables are associated with each other. For example, imagine that a survey of approximately 200 individuals has been conducted and that 120 of these people are females and 80 are males. Now, assume that the survey includes information about college major. To keep the example simple, assume that each person is either a psychology or a biology major. We might ask whether males and females choose these two majors at about the same rate, or whether one major has a different proportion of one sex than the other. The table below shows the case where males and females are about equally represented in the two majors. In this case college major is independent of sex. Note that the percentage of females in psychology and biology is 59.8 and 60.2, respectively. Another way to characterize these data is to say that sex and major are independent of each other because the proportion of males and females remains about the same for both majors.

The next example shows the same problem with a different result. In this example, the proportion of males and females depends upon the major. Females compose 79.6 percent of psychology majors and only 39.2 percent of biology majors. Clearly, the proportion of each sex is different for each major. Another way to state this is to say that choice of major is strongly related to sex, assuming that the example represents a statistically significant finding. It is possible to represent the strength of this relationship with a coefficient of association such as the contingency coefficient or Phi. These coefficients are similar to the Pearson correlation and interpreted in roughly the same way.

The method for obtaining a chi-square test for independence is a little tricky. Begin by clicking Analyze > Summarize > Crosstabs.... Transfer the variables to be analyzed to the Row(s) and Column(s) boxes. Then go to the Statistics... button and check the Chi-square box and anything that looks interesting in the Nominal Data box, followed by the Continue button. Next, click the Cells... button and check any needed descriptive information. Row, column, and total Percentages are particularly useful for interpreting the data. Finally, click OK and the output will quickly appear.
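For readers who want to see the arithmetic behind the Crosstabs output, here is a minimal sketch in Python of the Pearson chi-square statistic and the Phi coefficient. The cell counts are hypothetical: they were chosen only so that they reproduce the percentages quoted above (59.8 and 60.2 percent female in the first example, 79.6 and 39.2 percent in the second) with 120 females and 80 males. The function names are illustrative, and the p-value helper is valid only for a 2 x 2 table (one degree of freedom). No continuity correction is applied.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, n

def p_value_df1(chi2):
    """Upper-tail probability of chi-square; valid ONLY for df = 1."""
    return math.erfc(math.sqrt(chi2 / 2))

def phi(chi2, n):
    """Phi coefficient of association for a 2 x 2 table."""
    return math.sqrt(chi2 / n)

# Rows: female, male. Columns: psychology, biology. Hypothetical counts.
independent = [[61, 59], [41, 39]]
related = [[82, 38], [21, 59]]

for label, table in (("independent", independent), ("related", related)):
    chi2, df, n = chi_square_independence(table)
    print(label, "chi-square =", round(chi2, 3), "df =", df,
          "p =", round(p_value_df1(chi2), 4), "phi =", round(phi(chi2, n), 3))
```

With these illustrative counts the first table gives a chi-square close to zero, while the second gives a large chi-square and a Phi of roughly .41.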

Analysis of Variance (ANOVA)

In this section we have compared two groups (males and females). What if we wanted to compare more than two groups? For example, we might want to see if age at the birth of their first child (AGEKDBRN) varies by educational level. This time let's use the respondent's highest degree (DEGREE) as our measure of education. To do this we will use One-Way Analysis of Variance (often abbreviated ANOVA). Click on "Analyze", then point your mouse at "Compare Means", and then click on "Means". Click on "Reset" to get rid of what is already in the box. Click on AGEKDBRN to highlight it and then move it to the Dependent List box by clicking on the arrow to the left of the box. Then scroll down the list of variables on the left and find DEGREE. Click on it to highlight it and move it to the Independent List box by clicking on the arrow to the left of this box. Click on the "Options" button and this will open the Means: Options box. Click on the box labeled "Anova table and eta". This should put a check mark in this box indicating that we want SPSS to do a One-Way Analysis of Variance. Click on "Continue" and then on "OK" in the Means box.
 
In this example, the independent variable has five categories: less than high school, high school, junior college, bachelor, and graduate. The output shows the mean age at birth of first child for each of these groups and their standard deviations, as well as the Analysis of Variance table including the sum of squares, degrees of freedom, mean squares, the F-value and the observed significance value. The significance value for this example is the probability of getting an F-value of 68.266 or higher if the null hypothesis is true. Here the null hypothesis is that the mean age at birth of first child is the same for all five population groups. In other words, the null hypothesis states that the mean age at birth of first child for all people with less than a high school degree is equal to the mean age for all those with a high school degree, all those with a junior college degree, all those with a bachelor's degree, and all those with a graduate degree. Since this probability is so low (<.0005 or less than 5 out of 10,000), we would reject the null hypothesis and conclude that these population means are probably not all the same.
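The quantities in the Analysis of Variance table can be computed by hand. The following is a rough sketch in Python, not the SPSS implementation; the ages listed for each degree group are made up for illustration, and the function name is hypothetical. It returns the sums of squares, degrees of freedom, and F-value, but not the significance value, which requires the F distribution.

```python
from statistics import mean

def one_way_anova(groups):
    """Return the main entries of a one-way ANOVA table for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-groups: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each case lies from its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f_value}

# Hypothetical ages at birth of first child, grouped by highest degree.
ages_by_degree = {
    "less than high school": [19, 20, 21, 22],
    "high school": [21, 22, 23, 25],
    "junior college": [22, 24, 25, 26],
    "bachelor": [25, 26, 28, 29],
    "graduate": [27, 28, 30, 31],
}

result = one_way_anova(list(ages_by_degree.values()))
print(result)
```

A large F-value means the group means differ by more than the spread within groups would lead us to expect.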

There is another procedure in SPSS that does One-Way Analysis of Variance and this is called One-Way ANOVA.  This procedure allows you to use several multiple comparison procedures that can be used to determine which groups have means that are significantly different. 

Paired-Samples t-Test


We said we would look at an example where the samples are not independent.  (SPSS calls these paired samples.  Sometimes they are called matched samples.)  Let's say we wanted to compare the educational level of the respondent's father and mother.  PAEDUC is the years of school completed by the father and MAEDUC is years of school for the mother.  Clearly our samples of fathers and mothers are not independent of each other.  If the respondent's father is in one sample, then his or her mother will be in the other sample.  One sample determines the other sample.  Another example of paired samples is before and after measurements.  We might have a person's weight before they started to exercise and their weight after exercising for two months.  Since both measures are for the same person we clearly do not have independent samples.  This requires a different type of t test for paired samples.

Click on "Analyze", then point your mouse at "Compare Means", and then click on "Paired-Samples T Test". Scroll down to MAEDUC in the list of variables on the left and click on it to move it to the Current Selections box as Variable 1. Now click on PAEDUC to move it to the Current Selections box as Variable 2. Click on the arrow to the left of the Paired Variables box to move this pair of variables into the box in the middle of the window. Click on "OK". The output table shows the mean years of school completed by mothers (11.47) and by fathers (11.33), as well as the standard deviations. The t-value for the paired-samples t test is 1.822 and the 2-tailed significance value is 0.069. (We may have to scroll down to see these values.) This is the probability of getting a t-value this large or larger just by chance if the null hypothesis is true. Since this probability is greater than .05, we won't reject the null hypothesis. There is no statistical basis for saying that the respondents' fathers and mothers have different educational levels. However, notice that if we were using a one-tailed test, then we would divide the two-tailed significance value of .069 by 2, which would be .0345. For a one-tailed test, provided the difference is in the direction we predicted, we would reject the null hypothesis since the one-tailed significance value is less than .05.
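The paired-samples t-value is just a one-sample t test on the differences within each pair. Here is a small sketch in Python; the years-of-schooling values are invented for illustration and are not the GSS data, and the function name is hypothetical. The last lines repeat the one-tailed arithmetic from the text using the reported two-tailed value of .069.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """t-value and degrees of freedom for paired samples of equal length."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical years of school for each respondent's mother and father.
maeduc = [12, 10, 16, 12, 8, 14, 12, 11]
paeduc = [12, 8, 16, 11, 8, 12, 13, 10]

t_value, df = paired_t(maeduc, paeduc)
print("t =", round(t_value, 3), "df =", df)

# One-tailed value from the two-tailed value reported in the text.
two_tailed = 0.069
one_tailed = two_tailed / 2
print("one-tailed significance =", round(one_tailed, 4))  # 0.0345
```

Halving the two-tailed value is only appropriate when the observed difference lies in the direction that was predicted before looking at the data.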

Independent-Samples t-Test


If married women are, on the average, two years younger than men at birth of first child, can we conclude that this is also true in our population?  Can we make an inference about the population (all people) from our sample (about 2,800 people selected from the population)?  To answer this question we need to perform a t-test.  This will test the null hypothesis that men and women in the population do not differ in terms of their mean age at the birth of their first child.  The particular version of the t test that we will be using is called the independent-samples t test since our two samples are completely independent of each other.  In other words, the selection of cases in one of the samples does not influence the selection of cases in the other sample.  We'll look later at a situation where this is not true.

We want to compare our sample of men with our sample of women and then use this information to make an inference about the population. Click on "Analyze", then point your mouse at "Compare Means" and then click on "Independent-Samples T Test". Find AGEKDBRN in the list of variables on the left and click on it to highlight it, and then click on the arrow to the left of the Test Variable box. This is the variable we want to test so it will go in the Test Variable box. Now click on the list of variables on the left and use the scroll bar to find the variable SEX. Click on it to highlight it and then click on the arrow to the left of the Grouping Variable box. SEX defines the two groups we want to compare so it will go in the Grouping Variable box. Now we want to define the groups so click on the "Define Groups" button. This will open the Define Groups box. Since males are coded 1 and females 2, type 1 in the Group 1 box and 2 in the Group 2 box. (You will have to click in each box before typing the value.) This tells SPSS which two groups we want to compare. (If you don't know how males and females are coded, click on "Utilities", then on "Variables" and scroll down until you find the variable SEX and click on it. The box to the right will tell you the values for males and females. Be sure to close this box.) Now click on "Continue" and on "OK" in the Independent-Samples T Test box.


The output table shows the mean age at birth of first child for men (25.17) and women (22.58), and reports a mean difference of 2.41. It also shows us the results of two t tests. Remember that these test the null hypothesis that men and women have the same mean age at the birth of their first child in the population. There are two versions of this test. One assumes that the populations of men and women have equal variances (for AGEKDBRN), while the other doesn't make any assumption about the variances of the populations. The table also gives you the values for the degrees of freedom and the observed significance level. The significance value is .000 for both versions of the t test. Actually, this means less than .0005 since SPSS rounds to three decimal places. This significance value is the probability that the t value would be this big or bigger simply by chance if the null hypothesis were true. Since this probability is so small (less than five in 10,000), we will reject the null hypothesis and conclude that there probably is a difference between men and women in terms of average age at the birth of their first child in the population. Notice that this is a two-tailed significance value. If we want the one-tailed significance value, and the difference is in the direction we predicted, we can just divide the two-tailed value in half.
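The two versions of the t test differ only in how the standard error and degrees of freedom are computed. The sketch below, in Python, shows both calculations side by side. The ages are hypothetical and the function name is illustrative; the unequal-variances version uses the Welch-Satterthwaite approximation for the degrees of freedom, which is the standard formula for that case.

```python
import math
from statistics import mean, variance

def independent_t(a, b):
    """t-values and degrees of freedom with and without equal variances assumed."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)
    diff = mean(a) - mean(b)

    # Equal variances assumed: pool the two sample variances.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    df_pooled = n1 + n2 - 2

    # Equal variances not assumed: separate variances, adjusted df.
    t_welch = diff / math.sqrt(v1 / n1 + v2 / n2)
    df_welch = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

    return {"t_pooled": t_pooled, "df_pooled": df_pooled,
            "t_welch": t_welch, "df_welch": df_welch}

# Hypothetical ages at birth of first child.
men = [24, 27, 25, 30, 22, 26, 28]
women = [21, 23, 22, 26, 20, 24, 19, 25]

print(independent_t(men, women))
```

When the two groups have similar variances and similar sizes, the two versions give nearly the same t-value, which is why the SPSS output often shows the same conclusion on both rows.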



Let's work another example.  This time we will compare males and females in terms of average years of school completed (EDUC).  Click on "Analyze", point your mouse at "Compare Means", and click on "Independent-Samples T Test".  Click on "Reset" to get rid of the information you entered previously. Move EDUC into the Test Variable box and SEX into the Grouping Variable box.  Click on "Define Groups" and define males and females as you did before.  Click on "Continue" and then on "OK" to get the output window.  There is not much of a difference between men and women in terms of years of school completed, but we still reject the null hypothesis since the observed significance level is less than .05.  By the way, this is because we have such large samples.  When the samples are large, it is easier to reject the null hypothesis.

Comparing Means


Cross tabulation is a useful way of exploring the relationship between variables that contain only a few categories. For example, in GSS2000, we could compare how men and women feel about abortion. Here our dependent variable (abortion) consists of only two categories: approve or disapprove. But what if we wanted to find out if the average age at birth of first child is younger for women than for men? Here our dependent variable is a continuous variable consisting of many values. We could recode it so that it only had a few categories (e.g., under 20, 20 to 24, 25 to 29, 30 to 34, 35 to 39, 40 and older), but that would result in the loss of a lot of information. A better way to do this would be to compare the mean age at birth of first child for men and women.

We're going to use the subset from the 2000 General Social Survey to answer this question. Click on "Analyze", point your mouse at "Compare Means", and then click on "Means". We want to put age at birth of first child (AGEKDBRN) in the Dependent List and SEX in the Independent List. Highlight AGEKDBRN in the list of variables on the left of your screen, and then click on the arrow next to the Dependent List box. Now click on the list of variables on the left and use the scroll bar to find the variable SEX. Click on it to highlight it and then click on the arrow next to the Independent List box. Click on "OK". On the average, women are a little more than two years younger than men at the birth of first child.
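The Means procedure amounts to splitting the cases by the independent variable and averaging the dependent variable within each group. A minimal sketch in Python follows; the records are hypothetical, the function name is illustrative, and the coding of SEX (1 = male, 2 = female) follows the coding described in this guide. Cases with a missing value are skipped, as an assumption about how missing data should be handled.

```python
from collections import defaultdict
from statistics import mean

def means_by_group(records, group_key, value_key):
    """Mean of value_key within each category of group_key, skipping missing values."""
    groups = defaultdict(list)
    for record in records:
        value = record[value_key]
        if value is not None:
            groups[record[group_key]].append(value)
    return {group: mean(values) for group, values in groups.items()}

# Hypothetical respondents: SEX coded 1 = male, 2 = female.
respondents = [
    {"SEX": 1, "AGEKDBRN": 24},
    {"SEX": 1, "AGEKDBRN": 26},
    {"SEX": 2, "AGEKDBRN": 22},
    {"SEX": 2, "AGEKDBRN": 23},
    {"SEX": 2, "AGEKDBRN": None},  # no children, so no age is recorded
]

print(means_by_group(respondents, "SEX", "AGEKDBRN"))
```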

Regression Analysis

We can also analyze the relationship between education and occupational prestige using regression analysis.  But first, let’s look at the relationship graphically by creating a scatterplot.  Click on "Graphs," "Scatter" and "Define" (we will use the default format, “Simple”).  This will open up the dialog box.  In the box on the left, click on EDUC then on the arrow key that is pointing toward the box labeled "X Axis" (because it is the independent variable in our hypothesis).  Next, click on PRESTG80 and move it into the box labeled "Y Axis" (because it is the dependent variable).  Then, click OK.

What we see is a plot of the number of years of education by the occupational prestige score for persons in the data set who have a job. 

We can edit our graph to make it easier to interpret.  First, double-click anywhere in the graph.  This will cause the graph to open in its own window.  Then, double-click on the X-axis.  A dialog box will open.  In the Range section of the box, change the Minimum to 0.  In the Major and Minor Divisions sections, change the Increments to 2.  Then, click OK.

Now, on the Menu Bar, click on “Chart,” then “Options.”  In the Fit Line section, click in the box next to Total.  Then, click on the Fit Options button, and click in the box next to “Display R-square in legend.”  Click Continue, then OK.

Notice the Fit Line that is now drawn on the graph. Regression (and correlation) analyze linear relationships between variables, finding the line that “best fits” the data (i.e. it keeps the errors, distances of points from the line, to a minimum).  The Fit Line shows you the line that describes the linear relationship.  Also notice the R-square statistic listed to the right of the graph.  Multiplied by 100, this statistic tells us the percentage of the variation in the dependent variable (PRESTG80, on the Y-axis) that is explained by the scores on the independent variable (EDUC, on the X-axis).  Thus, years of education predict 27.03% of the variation in occupational prestige in our sample.  Recall that the Pearson coefficient was .520.  If you square the Pearson coefficient (.520 x .520), you get .2704 – the same as the R-square (give or take some rounding)!  Thus, by knowing the correlation coefficient, you can also know the amount of variance in one variable (dependent) that is explained by the other variable (independent) in a bivariate analysis. 


We can get more information about the regression line.  Minimize the SPSS Chart Editor.  Click on "Analyze," "Regression," and "Linear."  This opens up the dialog box shown in Figure 7-11.  Move PRESTG80 to the "Dependent" box, and EDUC to the "Independent(s)" box.  Click OK. 

The first table just shows the variables that have been included in the analysis.  The second table, “Model Summary,” shows the R-square statistic, which is .270. 

The third table, ANOVA, gives the information about the model as a whole.  ANOVA is discussed briefly in chapter 6.  The final table, Coefficients, gives results of the regression analysis that are not available using only correlation techniques.  Look at the “Unstandardized Coefficients” column.  Two statistics are reported: B, which is the regression coefficient, and the standard error.  Notice that there are two statistics reported under B:  one labeled as (Constant), the other labeled as EDUC.  The statistic labeled as EDUC is the regression coefficient, which is the slope of the line that you saw on the scatterplot (note that in scholarly reports, it is conventional to refer to the regression coefficient using the lower case, b).  The one labeled as (Constant) is not actually a regression coefficient, but is the Y-intercept (SPSS reports it in this column for convenience only).


Together, these two statistics define the equation of the regression line:

Y = a + bX

Y refers to the value of the dependent variable for a given case, a is the Y-intercept (the point where the line crosses the Y-axis, listed as Constant on your output), b is the slope of the line which describes the relationship between the independent and dependent variables (B for EDUC), and X is the value of the independent variable for a given case.
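To make the roles of a, b, and R-square concrete, here is a short sketch in Python of a bivariate least-squares fit. The data points are hypothetical (they are not the GSS values) and the function names are illustrative. The sketch also checks that squaring the correlation coefficient gives the same number as the proportion of variation explained by the line.

```python
import math
from statistics import mean

def least_squares(x, y):
    """Return intercept a, slope b, and Pearson r for a bivariate fit."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx              # slope of the line
    a = my - b * mx            # Y-intercept (the "Constant")
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

def r_square_from_errors(x, y, a, b):
    """Proportion of variation in y explained by the line a + b*x."""
    my = mean(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

# Hypothetical years of education and prestige scores.
educ = [8, 10, 12, 12, 14, 16, 16, 18]
prestige = [30, 28, 40, 36, 45, 50, 62, 58]

a, b, r = least_squares(educ, prestige)
print("a =", round(a, 3), "b =", round(b, 3), "r =", round(r, 3))
print("r squared =", round(r ** 2, 4))
print("R-square  =", round(r_square_from_errors(educ, prestige, a, b), 4))

# The arithmetic from the text: .520 squared.
print(round(0.520 ** 2, 4))  # 0.2704
```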

We know that the linear relationship between X and Y (EDUC and PRESTG80) is not perfect.  The correlation coefficient was not 1 (or –1), and the scatterplot showed plenty of cases that did not fall directly on the line.  Thus, it is clear to us that knowing someone’s education will not tell us without fail what their occupational prestige is, and furthermore, we are only analyzing a sample of cases and not the whole population to which we want to generalize our findings.  It is clear that there is some error built into our findings (the reason that the Fit Line is usually called the “Best Fit Line”).  For these reasons, it is conventional to write the formula for the line as

Y = a + bX + e, where e refers to error.

What can we do with this formula?  One thing we can do is make predictions about particular values of the independent variable, using just a little arithmetic.  All we have to do is plug the values from our output into the formula for a line (for our purposes, we will ignore the “e”):

Y = 9.84 +  2.565X

9.84, the Y-intercept (or Constant), is the predicted occupational prestige score (our dependent, or Y variable) for a person with zero years of education (our independent, or X variable); it is the point where the regression line crosses the Y-axis. 2.565 is the slope of the line. That is, if you refer back to the scatterplot, if you start on the line and move one unit to the right on the X-axis, then move 2.565 units upward, you will intersect with the regression line again. (It is possible to have a negative coefficient. In that case, to intersect with the line, you would move one unit to the right, and then B units downward.)

What occupational prestige score would our results predict for a person who completed high school, but no higher education? All we have to do is enter 12 (as in twelve years of education) into our equation:

Y  = 9.84 +  2.565(12)
Y  = 40.62

We find that having 12 years of education is associated with an occupational prestige score of 40.62. But what of the error? We know that not every high school graduate has this exact prestige score. We acknowledge this when we discuss results by stating that on average, those with 12 years of education will have occupations with prestige scores of 40.62. This language points out to our readers that it is likely that some of those respondents scored higher and some lower, but that 40.62 represents a central point. In sum, the error tells us about the distance between actual values of Y (the answers that the GSS survey respondents gave) and predicted values of Y (the ones you calculate based on each GSS respondent’s value on the X variable). Thus, the error is the difference between a predicted value of Y for a given case and the actual value of Y for a given case (Ŷ − Y).
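The prediction above is easy to reproduce. The sketch below, in Python, plugs the intercept (9.84) and slope (2.565) reported in the output into the formula for the line. The function names are illustrative, and the "actual" prestige score of 44 is a hypothetical respondent used only to show how an error is computed. The error is taken as predicted minus actual, following the wording of the text; many textbooks define it with the opposite sign.

```python
INTERCEPT = 9.84   # (Constant) from the Coefficients table
SLOPE = 2.565      # B for EDUC from the Coefficients table

def predict_prestige(years_of_education):
    """Predicted prestige score from the fitted line Y = a + bX."""
    return INTERCEPT + SLOPE * years_of_education

def prediction_error(years_of_education, actual_prestige):
    """Predicted value minus actual value for one case."""
    return predict_prestige(years_of_education) - actual_prestige

predicted = predict_prestige(12)
print(round(predicted, 2))  # 40.62

# Hypothetical respondent with 12 years of education and a prestige score of 44.
print(round(prediction_error(12, 44), 2))

# Each additional year of education adds the slope to the prediction.
print(round(predict_prestige(13) - predict_prestige(12), 3))
```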

More generally, though, when we discuss regression results, we rarely compute predicted scores for particular values of the independent variable. Instead, in scholarly reports, we usually point out the general process at work. In our case, we would say that “each additional year of education is associated with a 2.565 increase on the occupational prestige scale.” Note that we refer to “an additional year of education” because our independent variable was measured as years of school completed. Thus, the “unit” of measurement is years. We say there was a 2.565 increase in prestige with a unit increase in education, because that is the distance we have to move along the Y-axis, which represents occupational prestige, to intersect with the regression line again.

Correlation Analysis






Correlation and regression analysis (also called "least squares" analysis) helps us examine relationships among interval or ratio variables. As you will see, results of these two tests tell us slightly different things about the relationship between two variables. In this section, we will explore techniques for doing correlation and bivariate regression.





Correlation

How does education influence the types of occupations that people enter? One way to think about occupations is in terms of “occupational prestige.” Your data set includes a variable, PRESTG80, in which a prestige score was assigned to respondents’ occupations, where higher numbers indicate greater prestige.



Let’s hypothesize that as education increases, the level of prestige of one’s occupation also increases.  To test this hypothesis, click on "Analyze," "Correlate," and "Bivariate."  Click on EDUC, and then click the arrow to move it into the box.  Do the same with PRESTG80.


The most widely used bivariate test is the Pearson correlation. It is intended to be used when both variables are measured at either the interval or ratio level, and each variable is normally distributed. However, sometimes we do violate these assumptions. If we do a histogram of both EDUC and PRESTG80, we will notice that neither is actually normally distributed. Furthermore, if we noted that PRESTG80 is really an ordinal measure, not an interval one, we would be correct. Nevertheless, most analysts would use the Pearson correlation because the variables are close to being normally distributed, the ordinal variable has many ranks, and because the Pearson correlation is the one they are used to. SPSS includes another correlation test, Spearman’s rho, that is designed to analyze variables that are not normally distributed, or are ranked, as in PRESTG80. We will conduct both tests to see if our hypothesis is supported, and also to see how much the results differ depending on the test used – in other words, whether those who use the Pearson correlation on these types of variables are seriously off base.



In the dialog box, the box next to Pearson is already checked, as this is the default. Click in the box next to Spearman as well so that both tests are run, and then click OK.

The correlation coefficient may range from –1 to 1, where –1 or 1 indicates a “perfect” relationship.  The further the coefficient is from 0, regardless of whether it is positive or negative, the stronger the relationship between the two variables.  Thus, a coefficient of .453 is exactly as strong as a coefficient of -.453.  Positive coefficients tell us there is a direct relationship:  when one variable increases, the other increases.  Negative coefficients tell us that there is an inverse relationship: when one variable increases, the other one decreases.  Notice that the Pearson coefficient for the relationship between education and occupational prestige is .520, and it is positive.  This tells us that, just as we predicted, as education increases, occupational prestige increases.  But should we consider the relationship strong?  At .520, the coefficient is only about half as large as is possible.  It should not surprise us, however, that the relationship is not “perfect” (a coefficient of 1).  Education appears to be an important predictor of occupational prestige, but no doubt you can think of other reasons why people might enter a particular occupation. For example, someone with a college degree may decide that they really wanted to be a cheese-maker, which has an occupational prestige score of only 29, while a high-school dropout may one day become an owner of a bowling alley, which has a prestige score of 44.  Given the variety of factors that may influence one’s occupational choice, a coefficient of .520 suggests that the relationship between education and occupational prestige is actually quite strong.

The correlation matrix also gives the significance level (labeled as Sig. (2-tailed)). This is the probability of finding a relationship at least this strong in a sample if there were no relationship between education and occupational prestige in the total population from which the sample was drawn. The probability value is .000 (remember that the value is rounded to three digits, so this means less than .0005), which is well below the conventional threshold of .05. Thus, our hypothesis is supported. There is a relationship (the coefficient is not 0), it is in the predicted direction (positive), and we can generalize the results to the population (p < .05).

Recall that we had some concerns about using the Pearson coefficient, given that PRESTG80 is measured as an ordinal variable. Notice that the Spearman coefficient, .523, is nearly identical to the coefficient obtained using the Pearson correlation.
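The link between the two coefficients is easier to see in code: Spearman's rho is the Pearson correlation computed on ranks rather than on the raw scores. The following sketch in Python uses hypothetical data, not the GSS values, and the function names are illustrative. Tied values receive the average of the ranks they span, which is the usual convention.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical years of education and prestige scores.
educ = [8, 10, 12, 12, 14, 16, 16, 18]
prestige = [30, 28, 40, 36, 45, 50, 62, 58]

print("Pearson  =", round(pearson(educ, prestige), 3))
print("Spearman =", round(spearman(educ, prestige), 3))
```

When the relationship is roughly linear and there are no extreme outliers, the two coefficients tend to be close, as they are in the output discussed above.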