# Multiple regression: basic concepts and procedures

Contents of this handout: From simple regression to multiple regression; Goodness of fit in multiple regression; The main questions multiple regression answers; Further questions to ask about your data and results; How to report regression results

### From simple regression to multiple regression

What happens if we have more than two independent variables? In most cases, we can't draw graphs to illustrate the relationship between them all. But we can still represent the relationship by an equation. This is what multiple regression does. It's a straightforward extension of simple regression. If there are n independent variables, we call them x1, x2, x3 and so on up to xn. Multiple regression then finds values of a, b1, b2, b3 and so on up to bn which give the best fitting equation of the form

y = a + b1x1 + b2x2 + b3x3 + ... + bnxn

b1 is called the coefficient of x1, b2 is the coefficient of x2, and so forth. The equation is exactly like the one for simple regression, except that it is very laborious to work out the values of a, b1 etc by hand. Minitab, however, does it with exactly the same command as for simple regression.
What do the regression coefficients mean? The coefficient of each independent variable tells us what relation that variable has with y, the dependent variable, with all the other independent variables held constant. So, if b1 is high and positive, that means that if x2, x3 and so on up to xn do not change, then increases in x1 will correspond to large increases in y.
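
To make the idea of a "best fitting equation" concrete, here is a minimal sketch in Python of how the values of a, b1 ... bn can be found by least squares. It is not how Minitab is implemented; the function name `fit_multiple_regression` and the small data set are invented for illustration, and the method shown (solving the normal equations by elimination) is just one simple way of doing the calculation.

```python
def fit_multiple_regression(xs, y):
    """Return [a, b1, ..., bn] minimising the sum of squared residuals.

    xs: one row per observation, each row holding the n regressor values.
    y:  the matching values of the dependent variable.
    """
    # Design matrix: a leading 1 in every row carries the intercept a.
    rows = [[1.0] + [float(v) for v in row] for row in xs]
    k = len(rows[0])
    # Normal equations (X'X) coef = X'y, held as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("regressors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Invented data that follow y = 1 + 2*x1 - 3*x2 exactly,
# so the fitted values should be approximately a = 1, b1 = 2, b2 = -3.
example_xs = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
example_y = [1 + 2 * x1 - 3 * x2 for x1, x2 in example_xs]
a, b1, b2 = fit_multiple_regression(example_xs, example_y)
```

Here b1 is the change in y for a one-unit increase in x1 with x2 held constant, which is exactly the interpretation given above.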

### Goodness of fit in multiple regression

In multiple regression, as in simple regression, we can work out a value for R2. However, every time we add another independent variable, the value of R2 almost inevitably goes up, and it can never go down (you can get an idea of why this happens if you compare Fig 3 with Fig 1 in the handout on "The idea of a regression equation"). Therefore, in assessing the goodness of fit of a regression equation, we usually work in terms of a slightly different statistic, called adjusted R2 or R2adj. This is calculated as

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - n - 1}$$

where N is the number of observations in the data set (usually the number of people) and n the number of independent variables or regressors. This allows for the extra regressors. Check that you can see from the formula that R2adj will always be lower than R2 if there is more than one regressor. There is also another way of assessing goodness of fit in multiple regression, using the F statistic which we will meet in a moment.
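
As a quick check on how the adjustment behaves, the sketch below computes R2adj from R2, N and n. It assumes the usual textbook form of the adjustment, 1 - (1 - R2)(N - 1)/(N - n - 1); the function name and the input value of 0.52 are illustrative only.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared, assuming the usual textbook formula."""
    df_error = n_obs - n_regressors - 1
    if df_error <= 0:
        raise ValueError("need more observations than regressors plus one")
    # The penalty factor (N - 1) / (N - n - 1) grows as regressors are added.
    return 1.0 - (1.0 - r2) * (n_obs - 1) / df_error


# Illustrative values: 16 observations, 3 regressors, R2 of 0.52.
example = adjusted_r2(0.52, 16, 3)
```

With these illustrative inputs the adjustment brings R2 down from 52% to about 40%, which shows how large the penalty can be when the sample is small.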

### The main questions multiple regression answers

Multiple regression enables us to answer five main questions about a set of data, in which n independent variables (regressors), x1 to xn, are being used to explain the variation in a single dependent variable, y.

1. How well do the regressors, taken together, explain the variation in the dependent variable? This is assessed by the value of R2adj. As a very rough guide, in psychological applications we would usually reckon an R2adj of above 75% as very good; 50% to 75% as good; 25% to 50% as poor but acceptable; and below 25% as very poor and perhaps unacceptable. Alas, an R2adj value above 90% is very rare in psychology, and should make you wonder whether there is some artefact in your data.
2. Are the regressors, taken together, significantly associated with the dependent variable? This is assessed by the statistic F in the "Analysis of Variance" or anova part of the regression output. F is like some other statistics (e.g. t, chi2) in that its significance depends on its degrees of freedom, which in turn depend on sample sizes and/or the nature of the test used. Unlike t, though, F has two degrees of freedom associated with it. In general they are referred to as the numerator and denominator degrees of freedom (because F is actually a ratio). In regression, the numerator degrees of freedom are associated with the regression, and the denominator degrees of freedom with the residual or error; you can find them in the Regression and Error rows of the anova table in the Minitab output. If you were finding the significance of an F value by looking it up in a book of tables, you would need the degrees of freedom to do it. Minitab works out significances for you, and you will find them in the anova table next to the F value; but you need to use the degrees of freedom when reporting the results (see below). Note that the higher the value of F, the more significant it will be for given degrees of freedom.
3. What relationship does each regressor have with the dependent variable when all other regressors are held constant? This is answered by looking at the regression coefficients. Minitab reports these twice, once in the regression equation and again (to an extra decimal place) in the table of regression coefficients and associated statistics. Note that regression coefficients have units. So if the dependent variable is score on a psychometric test of depression, and one of the regressors is monthly income, the coefficient for that regressor would have units of (scale points) per (£ income per month). That means that if we changed the units of one of the variables, the regression coefficient would change but the relationship it is describing, and what it is saying about it, would not. So the size of a regression coefficient doesn't tell us anything about the strength of the relationship it describes until we have taken the units into account. The fact that regression coefficients have units also means that we can give a precise interpretation to each coefficient. So, staying with depression score and income, a coefficient of -0.0934 (as in the worked example on the next sheet) would mean that, with all other variables held constant, increasing someone's income by £1 per month is associated with a decrease in depression score of 0.0934 points (we might want to make this more meaningful by saying that an increase in income of £100 per month would be associated with a decrease in depression score of 100 * 0.0934 = 9.34 scale units). As in this example, negative coefficients mean that when the regressor increases, the dependent variable decreases. If the regressor is a dichotomous variable (e.g. gender), the size of the coefficient tells us the size of the difference between the two classes of individual (again, with all other variables held constant). So a gender coefficient of 3.3, with men coded 0 and women coded 1, would mean that with all other variables held constant, women's dependent variable scores would average 3.3 units higher than men's.
4. Which regressor has most effect on the dependent variable? It is not possible to give a fully satisfactory answer to this question, for a number of reasons. The chief one is that we are always looking at the effect of each variable in the presence of all the others; since the independent variables need not be independent of one another, it is hard to be sure which one is contributing to a joint relationship (or even to be sure that that means anything). However, the usual way of addressing the question is to look at the standardised regression coefficients or beta weights for each variable; these are the regression coefficients we would get if we converted all variables (independent and dependent) to z-scores before doing the regression. Minitab, unfortunately, does not report beta weights for the independent variables in its regression output, though it is possible to calculate them; SPSS, which you will learn about later in the course, does give them directly.
5. Are the relationships of each regressor with the dependent variable statistically significant, with all other regressors taken into account? This is answered by looking at the t values in the table of regression coefficients. The degrees of freedom for t are those for the residual in the anova table, but Minitab works out significances for us, so we need to know the degrees of freedom only when it comes to reporting results. Note that if a regression coefficient is negative, Minitab will report the corresponding t value as negative, but if you were looking it up in tables, you would use the absolute (unsigned) value.
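
Two of the quantities mentioned in the list above can be worked out by hand from values a package already reports. The sketch below is offered under stated assumptions rather than as Minitab's own procedure: `overall_f` uses the standard identity relating F to the unadjusted R2, with n and N - n - 1 degrees of freedom, and `beta_weight` uses the standard conversion of a raw coefficient to a standardised one (multiply by the standard deviation of the regressor and divide by that of the dependent variable). Function names and input values are invented for illustration.

```python
import statistics


def overall_f(r2, n_obs, n_regressors):
    """Return (F, numerator df, denominator df) for the whole regression."""
    df_num = n_regressors              # degrees of freedom for the regression
    df_den = n_obs - n_regressors - 1  # degrees of freedom for the residual
    f_value = (r2 / df_num) / ((1.0 - r2) / df_den)
    return f_value, df_num, df_den


def beta_weight(b, x_values, y_values):
    """Standardised coefficient: the coefficient we would get from z-scores."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)


# Illustrative values: R2 of 0.52 from 16 observations and 3 regressors.
f_value, df_num, df_den = overall_f(0.52, 16, 3)
```

The denominator degrees of freedom returned here are also the degrees of freedom for each t value in the table of regression coefficients.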

### Further questions to ask about your data and results

Either the nature of the data, or the regression results, may suggest further questions. For example, you may want to obtain means and standard deviations or histograms of variables to check on their distributions; or plot one variable against another, or obtain a matrix of correlations, to check on first order relationships. Minitab does some checking for you automatically, and reports if it finds "unusual observations". If there are unusual observations, PLOT or HISTOGRAM may tell you what the possible problems are. The usual kinds of unusual observations are "outliers": points which lie far from the main distributions or the main trends of one or more variables. Serious outliers should be dealt with as follows:
1. temporarily remove the observations from the data set. In Minitab, this can be done by using the LET command to set the outlier value to "missing", indicated by an asterisk instead of a numerical value. For example, if item 37 in the variable held in C1 looks like an outlier, we could type:

       LET C1(37)='*'

   Note the single quotes round the asterisk.
2. repeat the regression and see whether the same qualitative results are obtained (the quantitative results will inevitably be different).
3. if the same general results are obtained, we can conclude that the outliers are not distorting the results. Report the results of the original regression, adding a note that removal of outliers did not greatly affect them.
4. if different general results are obtained, accurate interpretation will require more data to be collected. Report the results of both regressions, and note that the interpretation of the data is uncertain. The outliers may be due to errors of observation, data coding, etc, and in this case they should be corrected or discarded. However, they may also represent a subpopulation for which the effects of interest are different from those in the main population. If they are not due to error, the group of data contributing to outliers will need to be identified, and if possible a reasonably sized sample collected from it so that it can be compared with the main population. This is a scientific rather than a statistical problem.
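
The remove-and-repeat check in steps 1 to 3 can be sketched in a few lines of Python. To keep the example short it uses simple regression with one regressor; the same logic applies to multiple regression. The function names and data are invented, and `None` stands in for Minitab's missing-value asterisk.

```python
def slope_and_intercept(x, y):
    """Least-squares line through the pairs where both values are present."""
    pairs = [(xi, yi) for xi, yi in zip(x, y)
             if xi is not None and yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    b = sxy / sxx
    return b, mean_y - b * mean_x


def same_direction(x, y, outlier_index):
    """Refit with one observation set to missing and compare slope signs."""
    b_all, _ = slope_and_intercept(x, y)
    y_without = list(y)
    y_without[outlier_index] = None   # the analogue of LET C1(37)='*'
    b_trimmed, _ = slope_and_intercept(x, y_without)
    return (b_all > 0) == (b_trimmed > 0), b_all, b_trimmed


# Invented data: the last observation is far from the trend of the others.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, -40]
agrees, b_all, b_trimmed = same_direction(x, y, outlier_index=5)
```

In this invented case the slope changes sign when the last point is removed, so the qualitative result is not stable and both regressions would need to be reported, as in step 4.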

### Reporting regression results

Research articles sometimes report the results of several different regressions done on a single data set. In this case, it is best to present the results in a table. Where a single regression is done, however, that is unnecessary, and the results can be reported in text. The wording should be something like the following (this is for the depression vs age, income and gender example used in the class):
The data were analysed by multiple regression, using as regressors age, income and gender. The regression was a rather poor fit (R2adj = 40%), but the overall relationship was significant (F3,12 = 4.32, p < 0.05). With other variables held constant, depression scores were negatively related to age and income, decreasing by 0.16 for every extra year of age, and by 0.09 for every extra pound per week income. Women tended to have higher scores than men, by 3.3 units. Only the effect of income was significant (t12 = 3.18, p < 0.01).

Note the following:

• The above brief paragraph does not exhaust what you can say about a set of regression results. There may be features of the data you should look at: "Unusual observations", for example. Normally you will need to go on to discuss the meaning of the trends you have described.
• Always report what happened before moving on to its significance: so R2adj values before F values, regression coefficients before t values. Remember, descriptive statistics are more important than significance tests.
• Although Minitab will give a negative t value if the corresponding regression coefficient is negative, you should drop the negative sign when reporting the results.
• Degrees of freedom for both F and t values must be given. Usually they are written as subscripts. For F the numerator degrees of freedom are given first. You can also put degrees of freedom in parentheses, or report them explicitly, e.g.: "F(3,12) = 4.32" or "F = 4.32, d. of f. = 3, 12".
• Significance levels can either be reported exactly (e.g. p = 0.032) or in terms of conventional levels (e.g. p < 0.05). There are arguments in favour of either, so it doesn't much matter which you do. But you should be consistent in any one report.
• Beware of highly significant F or t values, whose significance levels will be reported by statistics packages as, for example, 0.0000. It is an act of statistical illiteracy to write p = 0.0000; significance levels can never be exactly zero, because there is always some probability that the observed data could arise if the null hypothesis were true. What the package means is that this probability is so low it can't be represented with the number of columns available. We should write it as p < 0.00005.
• Beware of spurious precision, i.e. reporting coefficients etc to huge numbers of significant figures when, on the basis of the sample you have, you couldn't possibly expect them to replicate to anything like that degree of precision if someone repeated the study. F and t values are conventionally reported to two decimal places, and R2adj values to the nearest percentage point (sometimes to one additional decimal place). For coefficients, you should be guided by the sample size: with a sample size of 16, as in the example used above, two significant figures is plenty, but even with more realistic samples, in the range of 100 to 1000, three significant figures is usually as far as you should go. This means that you will usually have to round off the numbers that Minitab will give you.
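
Several of the conventions above are mechanical enough to sketch as small helper functions. The following is one possible set, written in Python; the function names, and the choice of three decimal places for exact p values, are assumptions made for illustration rather than fixed rules.

```python
import math


def format_p(p, package_decimals=4, report_decimals=3):
    """Report a p value, never as exactly zero."""
    if round(p, package_decimals) == 0:
        # A package showing 0.0000 means p is below half the last digit.
        bound = 0.5 * 10 ** (-package_decimals)
        return f"p < {bound:.{package_decimals + 1}f}"
    return f"p = {p:.{report_decimals}f}"


def format_f(f_value, df_num, df_den):
    """F to two decimal places, numerator degrees of freedom first."""
    return f"F({df_num},{df_den}) = {f_value:.2f}"


def format_t(t_value, df):
    """t to two decimal places, dropping any negative sign."""
    return f"t({df}) = {abs(t_value):.2f}"


def round_sig(x, figures=2):
    """Round a coefficient to a given number of significant figures."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, figures - 1 - exponent)
```

For example, `format_f(4.32, 3, 12)` gives the parenthesised form of the F value reported above, and `round_sig(-0.0934)` rounds the income coefficient to the two significant figures appropriate for a sample of 16.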

Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623