University of Exeter

DEPARTMENT OF PSYCHOLOGY


PSY6003 Advanced statistics: Multivariate analysis II: Manifest variables analyses

Topic 1: Multiple regression: Revision/Introduction


Contents of this handout: What is multiple regression, where does it fit in, and what is it good for? The idea of a regression equation; From simple regression to multiple regression; Interpreting and reporting multiple regression results; Carrying out multiple regression; Exercises; Worked examples using Minitab and SPSS

These notes cover the material of the first lecture, which is designed to remind you briefly of the main ideas in multiple regression. They are not full explanations; they assume you have at least met multiple regression before. If you haven't, you will probably need to read Bryman & Cramer, pp. 177-186 and pp. 235-246. The words and phrases printed in bold type are all things which you should understand by the end of the course. Many of them you will already know; some will be explained in the course of this lecture. In some cases we will explain them later in the course. Some of the material in these notes will not be gone through in the lecture, and you should make sure to read it over and ask us for explanations if you don't understand it.

What is multiple regression, where does it fit in, and what is it good for?

Multiple regression is the simplest of all the multivariate statistical techniques. Mathematically, multiple regression is a straightforward generalisation of simple regression, the process of fitting the best straight line through the dots on an x-y plot or scattergram. We will discuss what "best" means later in the lecture.

Regression (simple and multiple) techniques are closely related to the analysis of variance (anova). Both are special cases of the General Linear Model, or GLM, and you can in fact do an anova using the regression commands in statistical packages (though the process is clumsy). You can also combine the two; what you then have is an analysis of covariance (ancova), which we will discuss briefly later in this course.

What distinguishes multiple regression from other techniques? The main points are these: there is a single dependent variable; there are several independent variables; and the independent variables may be correlated with one another, because they are observed rather than manipulated.

This means that multiple regression is useful in the following general class of situations. We observe one dependent variable, whose variation we want to explain in terms of a number of other independent variables, which we can also observe. These other variables are not under experimental control - we just have to accept the variations in them that happen to occur in the sample of people or situations we can observe. We want to know which if any of these independent variables is significantly correlated with the dependent variable, taking into account the various correlations that may exist between the independent variables. So typically we use multiple regression to analyse data that come from "natural" rather than experimental situations. This makes it very useful in social psychology, and social science generally. Note, however, that it is inherently a correlational technique; it cannot of itself tell us anything about the causalities that may underlie the relationships it describes.

There are also some additional rules that have to be obeyed if multiple regression is to be useful.

The idea of a regression equation

Like many statistical procedures, multiple regression has two functions: to summarise some data, and to examine them for (statistically) significant trends. The first of these is part of descriptive statistics, the second of inferential statistics. We spend most of our time in elementary statistics courses thinking about inferential statistics, because at that level they are usually more difficult. But at any level, descriptive statistics are more important. In this section, we concentrate on how multiple regression describes a set of data.

How do we choose a descriptive statistic?

Any number we use to summarise a set of numbers is called a descriptive statistic. Many different descriptive statistics can be calculated for a given set of numbers, and different ones are useful for different purposes. In many cases, a descriptive statistic is chosen because it is in some sense the best summary of a particular type. But what do we mean by "best"?

Consider the best known of all descriptive statistics, the arithmetic mean - what lay people call the average. Why is this the best summary of a set of numbers? There is an answer, but it isn't obvious. The mean is the value from which the numbers in the set have the minimum sum of squared deviations. For the meaning of this, see Figure 1.


Figure 1

Consider observation 1. Its y value is y1. If we consider an "average" value ȳ, we define the deviation from the average as y1 - ȳ, the squared deviation from the average as (y1 - ȳ)2, and the sum of squared deviations as Σi(yi - ȳ)2. The arithmetic mean turns out to be the value of ȳ that makes this sum smallest. It also, of course, has the property that Σi(yi - ȳ) = 0; that, indeed, is its definition.
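
If you want to check this numerically, here is a minimal sketch in Python (Python is not part of the Minitab/SPSS materials for this course, and the y values are invented for illustration):

    # Check that the mean minimises the sum of squared deviations.
    y = [3.0, 7.0, 8.0, 12.0, 15.0]   # invented observations
    mean = sum(y) / len(y)

    def ssq(values, centre):
        """Sum of squared deviations of the values about a candidate centre."""
        return sum((v - centre) ** 2 for v in values)

    print(ssq(y, mean))        # the smallest achievable sum
    print(ssq(y, mean + 1))    # any other centre gives a larger sum
    print(ssq(y, mean - 2))    # ... in whichever direction we move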

Describing data with a simple regression equation

If we look at Figure 1, it's obvious that we could summarise the data better if we could find some way of representing the fact that the observations with high y values tend to be those with high x values. Graphically, we can do this by drawing a straight line on the graph so it passes through the cluster of points, as in Figure 2. Simple regression is a way of choosing the best straight line for this job.

Figure 2

This raises two problems: what is the best straight line, and how can we describe it when we have found it?
Let's deal first with describing a straight line. This is GCSE maths. Any straight line can be described by an equation relating the y values to the x values. In general, we usually write,

y = mx + c

Here m and c are constants whose values tell us which of the infinite number of possible straight lines we are looking at. m (from French monter) tells us about the slope or gradient of the line. Positive m means the line slopes upwards to the right; negative m that it slopes downwards. High m values mean a steep slope, low values a shallow one. c (from French couper) tells us about the intercept, i.e. where the line cuts the y axis: positive c means that when x is zero, y has a positive value, negative c means that when x is zero, y has a negative value. But for regression purposes, it's more convenient to use different symbols. We usually write:

y = a + bx

This is just the same equation with different names for the constants: a is the intercept, b is the gradient.

The problem of choosing the best straight line then comes down to finding the best values of a and b. We define "best" in the same way as we did when we explained why the mean is the best summary: we choose the a and b values that give us the line such that the sum of squared deviations from the line is minimised. This is illustrated in Figure 3. The best line is called the regression line, and the equation describing it is called the regression equation. The deviations from the line are also called residuals.
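
For the record, the least-squares values of a and b have simple closed forms and can be computed directly from the data. A minimal Python sketch (the x and y values are invented; in practice any statistics package will do this for you):

    # Least-squares estimates for the line y = a + bx.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n

    # b = (sum of cross-deviations) / (sum of squared x-deviations)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar   # the best line always passes through (xbar, ybar)
    print(a, b)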

Figure 3

Goodness of fit

Having found the best straight line, the next question is how well it describes the data. We measure this by the fraction

    (sum of squared deviations from the line)
1 - -----------------------------------------
    (sum of squared deviations from the mean)

This is called the variance accounted for, symbolised by VAC or R2. Its square root is the Pearson correlation coefficient. R2 can vary from 0 (the points are completely random) to 1 (all the points lie exactly on the regression line); quite often it is reported as a percentage (e.g. 73% instead of 0.73). The Pearson correlation coefficient (usually symbolised by r) is always reported as a decimal value. It can take values from -1 to +1; if the value of b is negative, the value of r will also be negative.

Note that two sets of data can have identical a and b values and very different R2 values, or vice versa. Correlation measures the strength of a linear relationship: it tells you how much scatter there is about the best fitting straight line through a scattergram. a and b, on the other hand, tell you what the line is. The values of a and b will depend on the units of measurement used, but the value of r is independent of units. If we transform y and x to z-scores, which involves rescaling them so they have means of zero and standard deviations of 1, b will equal r.
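
Both quantities are easy to compute once the line has been fitted. A self-contained Python sketch (invented data again) confirming that R2 is 1 minus the ratio of the two sums of squares, and that it equals the square of the Pearson r:

    import math

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    b = sxy / sxx
    a = ybar - b * xbar

    ss_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_line / syy      # variance accounted for
    r = sxy / math.sqrt(sxx * syy)     # Pearson correlation coefficient
    print(r_squared, r ** 2)           # the two agree (up to rounding)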

Note carefully that a, b, R2 and r are all descriptive statistics. We have not said anything about significance tests. Given a set of paired x and y values, we can use virtually any statistics package to find the corresponding values of a, b and R2. It will also do some significance tests for us. The way to do this is described later. All the calculations can also be done by hand, or on a pocket calculator that has statistical functions.

From simple regression to multiple regression

What happens if we have more than one independent variable? In most cases, we can't draw graphs to illustrate the relationship between them all. But we can still represent the relationship by an equation. This is what multiple regression does. It's a straightforward extension of simple regression. If there are n independent variables, we call them x1, x2, x3 and so on up to xn. Multiple regression then finds values of a, b1, b2, b3 and so on up to bn which give the best fitting equation of the form

y = a + b1x1 + b2x2 + b3x3 + ... + bnxn

b1 is called the coefficient of x1, b2 is the coefficient of x2, and so forth. The equation is exactly like the one for simple regression, except that it is very laborious to work out the values of a, b1 etc by hand. Most statistics packages, however, do it with exactly the same command as for simple regression.

What do the regression coefficients mean? The coefficient of each independent variable tells us what relation that variable has with y, the dependent variable, when all the other independent variables are held constant. So, if b1 is high and positive, that means that if x2, x3 and so on up to xn do not change, then increases in x1 will correspond to large increases in y.
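
Computationally, finding a, b1, ... bn is just a larger least-squares problem. A minimal sketch using Python with numpy (invented data; in this course you would use Minitab's REGRESS command or SPSS's regression procedure instead):

    import numpy as np

    # Invented data: a dependent variable y and two regressors x1, x2
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y  = np.array([3.1, 3.9, 7.2, 7.8, 11.9, 12.1])

    # Design matrix: a column of 1s (for the intercept a), then the regressors
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coefs
    print(a, b1, b2)   # the best-fitting intercept and coefficients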

Goodness of fit in multiple regression

In multiple regression, as in simple regression, we can work out a value for R2. However, every time we add another independent variable, R2 can only go up, never down (you can get a feel for how this happens if you compare Fig 3 with Fig 1). Therefore, in assessing the goodness of fit of a regression equation, we usually work in terms of a slightly different statistic, called R2-adjusted or R2adj. This is calculated as

R2adj = 1 - (1-R2)(N-1)/(N-n-1)

where N is the number of observations in the data set (usually the number of people) and n the number of independent variables or regressors. This allows for the extra regressors. You can see that R2adj will always be lower than R2 (except when R2 is exactly 1). There is also another way of assessing goodness of fit in multiple regression, using the F statistic which is discussed below. It is possible in principle to take the square root of R2 or R2adj to get what is called the multiple correlation coefficient, but we don't usually bother.
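
The adjustment is simple arithmetic once R2 is known, as this Python sketch shows (the figures come from the Minitab worked example at the end of this handout: 16 observations, 3 regressors, R2 = 52%):

    def adjusted_r_squared(r2, n_obs, n_regressors):
        """R2 adjusted for the number of regressors."""
        return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

    print(adjusted_r_squared(0.52, 16, 3))   # about 0.40, i.e. the 39.9%
                                             # reported by Minitab (the small
                                             # difference is rounding of R2)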

Prediction

Regression equations can also be used to obtain predicted or fitted values of the dependent variable for given values of the independent variables. If we know the values of x1, x2, ... xn, it is obviously a simple matter to calculate the value of y which, according to the equation, should correspond to them: we just multiply x1 by b1, x2 by b2, and so on, and add all the products to a. We can do this for combinations of independent variable values that are represented in the data, and also for new combinations. We need to be careful, though, about extending the independent variable values far outside the range we have observed (extrapolating), as there is no guarantee that the regression equation will still hold accurately.
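
In code, obtaining a fitted value is just a matter of evaluating the equation. A Python sketch, using the coefficients from the Minitab worked example at the end of this handout (income = 120 pounds/week, male, age 33):

    def predict(a, bs, xs):
        """Fitted y for regressor values xs, given intercept a and coefficients bs."""
        return a + sum(b * x for b, x in zip(bs, xs))

    # Coefficients from the depression example at the end of this handout
    print(predict(68.28, [-0.09336, 3.306, -0.1617], [120.0, 0.0, 33.0]))
    # about 51.7: the predicted depression score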

Interpreting and reporting multiple regression results

The main questions multiple regression answers

Multiple regression enables us to answer five main questions about a set of data, in which n independent variables (regressors), x1 to xn, are being used to explain the variation in a single dependent variable, y.

  1. How well do the regressors, taken together, explain the variation in the dependent variable? This is assessed by the value of R2adj. As a very rough guide, in psychological applications we would usually reckon an R2adj of above 75% as very good; 50-75% as good; 25-50% as fair; and below 25% as poor and perhaps unacceptable. Alas, R2adj values above 90% are rare in psychological data, and if you get one, you should wonder whether there is some artefact in your data.
  2. Are the regressors, taken together, significantly associated with the dependent variable? This is assessed by the statistic F in the "Analysis of Variance" or anova part of the regression output from a statistics package. This is the Fisher F as used in the ordinary anova, so its significance depends on its degrees of freedom, which in turn depend on the number of observations and the number of regressors. As in anova, F has two degrees of freedom associated with it. In general they are referred to as the numerator and denominator degrees of freedom (because F is actually a ratio). In regression, the numerator degrees of freedom are associated with the regression (and equal the number of regressors used), and the denominator degrees of freedom with the residual or error; you can find them in the Regression and Error rows of the anova table in the output from a statistics package. If you were finding the significance of an F value by looking it up in a book of tables, you would need the degrees of freedom to do it. Statistics packages normally work out significances for you, and you will find them in the anova table next to the F value; but you need to use the degrees of freedom when reporting the results (see below). It is useful to remember that the higher the value of F, the more significant it will be for given degrees of freedom.
  3. What relationship does each regressor have with the dependent variable when all other regressors are held constant? This is answered by looking at the regression coefficients. Some statistics packages (e.g. Minitab) report these twice, once in the form of a regression equation and again (to an extra decimal place) in a table of regression coefficients and associated statistics. Note that regression coefficients have units. So if the dependent variable is number of cigarettes smoked per week, and one of the regressors is annual income, the coefficient for that regressor would have units of (cigarettes per week) per (pound of income per year). That means that if we changed the units of one of the variables, the regression coefficient would change - but the relationship it is describing, and what it is saying about it, would not. So the size of a regression coefficient doesn't tell us anything about the strength of the relationship it describes until we have taken the units into account. The fact that regression coefficients have units also means that we can give a precise interpretation to each coefficient. So, staying with smoking and income, a coefficient of 0.062 in this case would mean that, with all other variables held constant, increasing someone's income by one pound per year is associated with an increase of cigarette consumption of 0.062 cigarettes per week (we might want to make this easier to grasp by saying that an increase in income of 1 pound per week would be associated with an increase in cigarette consumption of 52 * 0.062 = 3.2 cigarettes per week). Negative coefficients mean that when the regressor increases, the dependent variable decreases. If the regressor is a dichotomous variable (e.g. gender), the size of the coefficient tells us the size of the difference between the two classes of individual (again, with all other variables held constant). So a gender coefficient of 2.6, with women coded 0 and men coded 1, would mean that with all other variables held constant, men's dependent variable scores would average 2.6 units higher than women's.
  4. Which independent variable has most effect on the dependent variable? It is not possible to give a fully satisfactory answer to this question, for a number of reasons. The chief one is that we are always looking at the effect of each variable in the presence of all the others; since the independent variables need not be independent of one another, it is hard to be sure which of them is contributing to a joint relationship (or even to be sure that that means anything). However, the usual way of addressing the question is to look at the standardised regression coefficients or beta weights for each variable; these are the regression coefficients we would get if we converted all variables (independent and dependent) to z-scores before doing the regression. SPSS reports beta weights for each independent variable in its regression output; Minitab, unfortunately, does not. (For a computational sketch of the F statistic, beta weights and t values, see the example after this list.)
  5. Are the relationships of each regressor with the dependent variable statistically significant, with all other regressors taken into account? This is answered by looking at the t values in the table of regression coefficients. The degrees of freedom for t are those for the residual in the anova table, but statistics packages work out significances for us, so we need to know the degrees of freedom only when it comes to reporting results. Note that if a regression coefficient is negative, most packages will report the corresponding t value as negative, but if you were looking it up in tables, you would use the absolute (unsigned) value, and the sign should be dropped when reporting results.
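
For the curious, the F statistic (question 2), the beta weights (question 4) and the coefficient t values (question 5) can all be computed from first principles. The following Python/numpy sketch, on invented data, is not how you would work in this course - Minitab and SPSS report all of these figures for you - but it shows where the numbers come from:

    import numpy as np

    # Invented data: a dependent variable y and two regressors x1, x2
    rng = np.random.default_rng(1)
    N, n = 30, 2
    x1 = rng.normal(size=N)
    x2 = rng.normal(size=N)
    y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=N)

    X = np.column_stack([np.ones(N), x1, x2])      # design matrix with intercept
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)  # a, b1, b2

    # F statistic: regression mean square / residual (error) mean square
    resid = y - X @ coefs
    ss_res = resid @ resid
    ss_reg = ((X @ coefs - y.mean()) ** 2).sum()
    F = (ss_reg / n) / (ss_res / (N - n - 1))

    # Beta weights: the coefficients we get after converting all
    # variables (independent and dependent) to z-scores
    z = lambda v: (v - v.mean()) / v.std(ddof=1)
    Z = np.column_stack([np.ones(N), z(x1), z(x2)])
    betas, *_ = np.linalg.lstsq(Z, z(y), rcond=None)

    # t value for each coefficient: the estimate divided by its standard error
    s2 = ss_res / (N - n - 1)                      # residual variance estimate
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    t = coefs / se

    print(F, betas[1:], t)   # betas[0] is the (near-zero) standardised intercept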

Further questions to ask

Either the nature of the data, or the regression results, may suggest further questions. For example, you may want to obtain means and standard deviations or histograms of variables to check on their distributions; or plot one variable against another, or obtain a matrix of correlations, to check on first order relationships. You should also check for unusual observations or "outliers": these will be discussed in the next session.

Reporting regression results

Research articles frequently report the results of several different regressions done on a single data set. In this case, it is best to present the results in a table. Where a single regression is done, however, that is unnecessary, and the results can be reported in text. The wording should be something like the following - this is for the depression vs age, income and gender example used as a Minitab example below:

   The three regressors together accounted for a substantial part of the variance in depression scores (R2adj = 39.9%; F(3,12) = 4.32, p < .05). Depression was significantly negatively associated with income (b = -0.093, t(12) = 3.18, p < .01), but neither gender nor age made a significant contribution with the other regressors held constant.

Normally you will need to go on to discuss the meaning of the trends you have described.

Note that there are a number of pitfalls for the unwary in reporting regression results.

Carrying out multiple regression

Minitab

At the end of the handout there is a complete worked example on some made-up data, in which we attempt to predict scores on a paper-and-pencil test of depression (running from 0 to 100) from income (in pounds/week), gender (coded 0 for men and 1 for women) and age. Note that the REGRESS command, which actually carries out the regression, needs us to tell it how many independent variables there are. It is very important to make sure that we then provide the corresponding number of columns - if we provide too many, Minitab will not warn us of the error, but will write some detailed results into the extra columns, thus overwriting any data we might have in them, and producing mystifying errors later in our analysis.

SPSS

The SPSS example uses a set of data on the psychology of tax avoidance. An appropriate command file would be as follows:

   title test regression
   get file='/singer1/eps/psybin/stats/tax.sys'
   regression variables=index free1 to law5
     /statistics=defaults
     /missing=meansubstitution
     /dependent=index
     /method=enter
   finish

Output from this file is given at the end of this handout. It shows that the 15 questionnaire items do quite a good job of predicting tax avoidance.


Exercises

1. The following are the IQ scores on the Verbal and Numerical scales of a certain test for a group of students:

                 Verbal: 98 120  85  97 100 132 124  88  91 144
              Numerical: 92 105 100  92  93 144 143  75  85 121

Use Minitab to calculate the mean and standard deviation of the scores on each scale. Use LET to work out the difference between them and put it in a new column. Use TTEST on this column to see whether there is a significant difference between the verbal and numerical scores.

2. Using the data from the previous example, work out the regression line for predicting Numerical scores (dependent variable) from Verbal scores (independent variable).

3. A social psychologist observes the scores achieved on a video game in a pub by the first new (previously unobserved) player to use the machine after each half hour through the evening. They are as follows:

   Time:     6pm 6.30  7pm 7.30  8pm 8.30  9pm 9.30 10pm 10.30
   Score:   1760  995 2130  770 1535 3975 2120 5660 3341  4995

Use SPSS to investigate whether the data support the psychologist's hypothesis that more expert players use the machine later in the evening. What would be the most likely score to observe at 9.45pm?

4. The following data show the levels of anxiety recorded by a paper-and-pencil test just before a group of students took an examination, together with the exam marks obtained. Use Minitab's PLOT command to decide whether it would be appropriate to use linear regression to summarise these data.

   Anxiety score: 5 17 10 12  3 19  2 11  9  8 13 18  4  7
   Exam mark:    45 20 55 72 45 39 50 75 60 57 58 52 43 57

5. The Singer file /singer1/eps/psybin/stats/teengamb.DAT contains, for each of 47 teenagers, the following information:

  1. subject number
  2. gender (0=male, 1=female)
  3. status (arbitrary scale based on parents' occupation. Higher numbers => higher status)
  4. income (pocket money+earnings) in pounds/wk
  5. verbal intelligence (number of words out of 12 correctly defined)
  6. estimate (from questionnaire answers) of expenditure on all forms of gambling, in pounds/yr

Each line of the file contains all 6 data items for a single person. These are real data, collected during an undergraduate project a few years ago, and since published (Ide-Smith & Lea, 1988, Journal of Gambling Behavior, 4, 110-118). Note, though, that you won't get quite the same results as in the published article, because I've cut out the data from some subjects whose data would have given you problems.

Set up a Minitab worksheet with columns with appropriate names, and read these data into it using READ. Note that you don't need to type the file extension (.DAT) because this is the default for READ, but if you do type it, you must use CAPITALS. The rest of the filename must be typed in lower case.


Worked example of an elementary multiple regression using Minitab

MTB > set c1
DATA> 74 82 15 23 35 54 12 28 66 43 55 31 83 29 53 32
DATA> end
MTB > set c2
DATA> 120 55 350 210 185 110 730 150 61 175 121 225 45 325 171 103
DATA> end
MTB > set c3
DATA> 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 1
DATA> end
MTB > set c4
DATA> 33 28 47 55 32 63 59 68 27 32 42 51 47 33 51 20
DATA> end
MTB > name c1 'depress'
MTB > name c2 'income'
MTB > name c3 'm0f1'
MTB > name c4 'age'
MTB > regress c1 3 c2-c4

The regression equation is
depress = 68.3 - 0.0934 income + 3.31 m0f1 - 0.162 age

Predictor       Coef       Stdev    t-ratio        p
Constant       68.28       15.44       4.42    0.001
income      -0.09336     0.02937      -3.18    0.008
m0f1           3.306       8.942       0.37    0.718
age          -0.1617      0.3436      -0.47    0.646

s = 17.70       R-sq = 52.0%     R-sq(adj) = 39.9%

Analysis of Variance

SOURCE       DF          SS          MS         F        p
Regression    3      4065.4      1355.1      4.32    0.028
Error        12      3760.0       313.3
Total        15      7825.4

SOURCE       DF      SEQ SS
income        1      3940.5
m0f1          1        55.5
age           1        69.4

Continue? y
Unusual Observations
Obs.  income   depress       Fit Stdev.Fit  Residual   St.Resid
  7      730     12.00     -6.10     15.57     18.10      2.15RX

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.



Output from the SPSS regression on the tax data set

(some blank lines have been removed)

           * * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Mean Substituted for Missing Data

Equation Number 1    Dependent Variable..   INDEX   Evasion measure

Block Number  1.  Method:  Enter
Variable(s) Entered on Step Number
   1..    LAW5
   2..    FREE2
   3..    ALIEN2
   4..    LAW3
   5..    LAW4
   6..    LAW2
   7..    LAW1
   8..    ALIEN4
   9..    ALIEN1
  10..    ALIEN5
  11..    FREE5
  12..    ALIEN3
  13..    FREE3
  14..    FREE4
  15..    FREE1

Multiple R           .93111
R Square             .86696
Adjusted R Square    .80460
Standard Error      1.93857

Analysis of Variance
                    DF      Sum of Squares      Mean Square
Regression          15           783.65847         52.24390
Residual            32           120.25820          3.75807
F =      13.90179       Signif F =  .0000

           * * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Equation Number 1    Dependent Variable..   INDEX   Evasion measure

------------------ Variables in the Equation ------------------

Variable              B        SE B       Beta         T  Sig T
LAW5            .103593     .172898    .046600      .599  .5533
FREE2         -1.278802     .641764   -.399007    -1.993  .0549
ALIEN2         -.177951     .513296   -.064325     -.347  .7311
LAW3           -.269736     .224503   -.117014    -1.201  .2384
LAW4            .294076     .286945    .112165     1.025  .3131
LAW2            .224659     .350312    .084970      .641  .5259
LAW1            .106083     .234746    .037769      .452  .6544
ALIEN4          .353269     .339096    .130039     1.042  .3053
ALIEN1         1.227092     .252610    .567362     4.858  .0000
ALIEN5          .272150     .293067    .125177      .929  .3600
FREE5          -.833464     .339398   -.340064    -2.456  .0197
ALIEN3         -.059760     .345620   -.025101     -.173  .8638
FREE3         -1.531610     .668878   -.485904    -2.290  .0288
FREE4         -2.142148     .657277   -.670692    -3.259  .0027
FREE1          4.401877     .721189   1.828551     6.104  .0000
(Constant)     -.845601    4.572997                -.185  .8545


Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623


Document revised 10th February 1997