Tuesday, August 24, 2010

Regression Analysis

We can also analyze the relationship between education and occupational prestige using regression analysis.  But first, let’s look at the relationship graphically by creating a scatterplot.  Click on "Graphs," "Scatter" and "Define" (we will use the default format, “Simple”).  This will open up the dialog box.  In the box on the left, click on EDUC then on the arrow key that is pointing toward the box labeled "X Axis" (because it is the independent variable in our hypothesis).  Next, click on PRESTG80 and move it into the box labeled "Y Axis" (because it is the dependent variable).  Then, click OK.

What we see is a plot of the number of years of education by the occupational prestige score for persons in the data set who have a job. 

We can edit our graph to make it easier to interpret.  First, double-click anywhere in the graph.  This will cause the graph to open in its own window.  Then, double-click on the X-axis.  A dialog box will open.  In the Range section of the box, change the Minimum to 0.  In the Major and Minor Divisions sections, change the Increments to 2.  Then, click OK.

Now, on the Menu Bar, click on “Chart,” then “Options.”  In the Fit Line section, click in the box next to Total.  Then, click on the Fit Options button, and click in the box next to “Display R-square in legend.”  Click Continue, then OK.

Notice the Fit Line that is now drawn on the graph. Regression (and correlation) analyze linear relationships between variables, finding the line that “best fits” the data (i.e. it keeps the errors, distances of points from the line, to a minimum).  The Fit Line shows you the line that describes the linear relationship.  Also notice the R-square statistic listed to the right of the graph.  Multiplied by 100, this statistic tells us the percentage of the variation in the dependent variable (PRESTG80, on the Y-axis) that is explained by the scores on the independent variable (EDUC, on the X-axis).  Thus, years of education predict 27.03% of the variation in occupational prestige in our sample.  Recall that the Pearson coefficient was .520.  If you square the Pearson coefficient (.520 x .520), you get .2704 – the same as the R-square (give or take some rounding)!  Thus, by knowing the correlation coefficient, you can also know the amount of variance in one variable (dependent) that is explained by the other variable (independent) in a bivariate analysis. 
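The link between the Pearson coefficient and R-square can be checked with a small sketch.  The data below are made up for illustration only (they are not the GSS sample), and the function names are assumptions for this example.  The sketch computes the Pearson coefficient, fits a least-squares line, and compares the squared coefficient with the share of variation explained by the line.

```python
import math

# Hypothetical illustrative data (NOT the GSS sample).
educ = [8, 10, 12, 12, 14, 16, 16, 18]          # years of education (X)
prestige = [25, 38, 30, 45, 41, 60, 48, 66]     # prestige scores (Y)


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_square(xs, ys):
    """Share of the variation in ys explained by a least-squares line on xs."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    sst = sum((y - my) ** 2 for y in ys)                       # total variation
    return 1 - sse / sst


r = pearson_r(educ, prestige)
# The two printed numbers agree: squaring r gives R-square in a bivariate analysis.
print(round(r ** 2, 4), round(r_square(educ, prestige), 4))

# The same arithmetic applied to the coefficient reported in the text.
print(round(0.520 ** 2, 4))  # 0.2704
```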


We can get more information about the regression line.  Minimize the SPSS Chart Editor.  Click on "Analyze," "Regression," and "Linear."  This opens up the dialog box shown in Figure 7-11.  Move PRESTG80 to the "Dependent" box, and EDUC to the "Independent(s)" box.  Click OK. 

The first table just shows the variables that have been included in the analysis.  The second table, “Model Summary,” shows the R-square statistic, which is .270. 

The third table, ANOVA, gives the information about the model as a whole.  ANOVA is discussed briefly in chapter 6.  The final table, Coefficients, gives results of the regression analysis that are not available using only correlation techniques.  Look at the “Unstandardized Coefficients” column.  Two statistics are reported: B, which is the regression coefficient, and the standard error.  Notice that there are two statistics reported under B:  one labeled as (Constant), the other labeled as EDUC.  The statistic labeled as EDUC is the regression coefficient, which is the slope of the line that you saw on the scatterplot (note that in scholarly reports, it is conventional to refer to the regression coefficient using the lowercase b).  The one labeled as (Constant) is not actually a regression coefficient, but is the Y-intercept (SPSS reports it in this column for convenience only).  These two statistics fit into the formula for a line:


Y = a + bX

Y refers to the value of the dependent variable for a given case, a is the Y-intercept (the point where the line crosses the Y-axis, listed as Constant on your output), b is the slope of the line which describes the relationship between the independent and dependent variables (B for EDUC), and X is the value of the independent variable for a given case.

We know that the linear relationship between X and Y (EDUC and PRESTG80) is not perfect.  The correlation coefficient was not 1 (or –1), and the scatterplot showed plenty of cases that did not fall directly on the line.  Thus, it is clear to us that knowing someone’s education will not tell us without fail what their occupational prestige is, and furthermore, we are only analyzing a sample of cases and not the whole population to which we want to generalize our findings.  It is clear that there is some error built into our findings (the reason that the Fit Line is usually called the “Best Fit Line”).  For these reasons, it is conventional to write the formula for the line as

Y = a + bX + e, where e refers to error.

What can we do with this formula?  One thing we can do is make predictions about particular values of the independent variable, using just a little arithmetic.  All we have to do is plug the values from our output into the formula for a line (for our purposes, we will ignore the “e”):

Y = 9.84 +  2.565X

9.84, the Y-intercept (or Constant), is interpreted as the predicted occupational prestige score (our dependent, or Y variable) for a person with zero years of education (our independent, or X variable).  It is the point where the regression line crosses the Y-axis.  2.565 is the slope of the line.  That is, if you refer back to the scatterplot and start at any point on the regression line, moving one unit to the right on the X-axis and then 2.565 units upward brings you back to the regression line.  (It is possible to have a negative coefficient.  In that case, to intersect with the line, you would move one unit to the right, and then B units downward.)

What occupational prestige score would our results predict for a person who completed high school, but no higher education?  All we have to do is enter 12 (as in twelve years of education) into our equation:

Y  = 9.84 +  2.565(12)
Y  = 40.62
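The same arithmetic can be written as a short sketch.  The constant and slope are the values quoted from the Coefficients table above; the function name is just an assumption for this example.

```python
# Values quoted from the Coefficients table in the text.
CONSTANT = 9.84   # a, the Y-intercept
B_EDUC = 2.565    # b, the slope for EDUC


def predict_prestige(years_of_education):
    """Predicted prestige score Y = a + bX (the error term is ignored)."""
    return CONSTANT + B_EDUC * years_of_education


print(round(predict_prestige(12), 2))  # 40.62
```

Changing the argument gives the predicted score for any other number of years of education.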

We find that having 12 years of education is associated with an occupational prestige score of 40.62.  But what of the error?  We know that not every high school graduate has this exact prestige score.  We acknowledge this when we discuss results by stating that on average, those with 12 years of education will have occupations with prestige scores of 40.62.  This language points out to our readers that it is likely that some of those respondents scored higher and some lower, but that 40.62 represents a central point.  In sum, the error tells us about the distance between actual values of Y (the answers that the GSS survey respondents gave) and predicted values of Y (the ones you calculate based on the GSS respondent’s information in the “X” variable).  Thus, the error is the difference between the actual value of Y for a given case and the predicted value of Y for that case (Y - Ŷ, where Ŷ is the predicted value), which matches the formula Y = a + bX + e.
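To make the error concrete, here is a minimal sketch.  It takes the error as the actual value minus the predicted value, which is the convention implied by Y = a + bX + e.  The two "actual" prestige scores are invented for illustration and do not come from the GSS data; the function name is also an assumption.

```python
def residual(actual_y, x, a=9.84, b=2.565):
    """Error for one case: actual Y minus the Y predicted from X."""
    predicted_y = a + b * x
    return actual_y - predicted_y


# Two hypothetical respondents, each with 12 years of education (made-up scores).
for actual in (45, 36):
    print(actual, round(residual(actual, 12), 2))
# One respondent sits above the line (positive error), the other below (negative error).
```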

More generally, though, when we discuss regression results, we rarely compute predicted scores for particular values of the independent variable.  Instead, in scholarly reports, we usually point out the general process at work.  In our case, we would say that “each additional year of education is associated with a 2.565-point increase on the occupational prestige scale.”  Note that we refer to “an additional year of education” because our independent variable was measured as years of school completed.  Thus, the “unit” of measurement is years.  We say there was a 2.565 increase in prestige with a unit increase in education, because that is the distance we have to move along the Y-axis, which represents occupational prestige, to intersect with the regression line again.
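A quick sketch shows why the slope can be read as the change per additional year: whatever starting point is chosen, the predicted score for one more year of education differs by the same amount.  The coefficients are those quoted in the text; the function name and the starting values are arbitrary choices for this example.

```python
A, B = 9.84, 2.565  # constant and slope quoted in the text


def predict(x):
    """Predicted prestige score for x years of education."""
    return A + B * x


# The gain from one extra year is the same at every starting point.
for years in (8, 12, 16):
    print(years, round(predict(years + 1) - predict(years), 3))
```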
