Contents of this handout: From simple regression to multiple regression; Goodness of fit in multiple regression; The main questions multiple regression answers; Further questions to ask about your data and results; How to report regression results

What happens if we have more than two independent variables? In most
cases, we can't draw graphs to illustrate the relationship between them
all. But we can still represent the relationship by an equation. This is
what multiple regression does. It's a straightforward extension of simple
regression. If there are *n* independent variables, we call them
*x*_{1}, *x*_{2}, *x*_{3} and so on
up to *x*_{n}. Multiple regression then finds values of *a*,
*b*_{1}, *b*_{2}, *b*_{3} and so on
up to *b*_{n} which give the best fitting equation of the
form

*y* = *a* + *b*_{1}*x*_{1} + *b*_{2}*x*_{2} + *b*_{3}*x*_{3} + ... + *b*_{n}*x*_{n}

*b*_{1} is called the **coefficient** of *x*_{1},
*b*_{2} is the coefficient of *x*_{2}, and so forth.
The equation is exactly like the one for simple regression, except that
it is very laborious to work out the values of *a*, *b*_{1}
etc. by hand. Minitab, however, does it with exactly the same command as
for simple regression.
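
For readers who want to see what a package is doing when it finds the best fitting equation, here is a minimal sketch in Python. It is not Minitab's own method: it assumes ordinary least squares solved through the normal equations, and the function names and the small data set are hypothetical, invented purely for illustration. The data are constructed so that *y* = 2 + 3*x*_{1} - *x*_{2} exactly, so the fit should recover those values.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [A | rhs], so row operations act on both sides at once.
    M = [list(map(float, A[i])) + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("regressors are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_multiple_regression(xs, y):
    """Return [a, b1, ..., bn] minimising the sum of squared residuals.

    xs holds one row per observation, each row holding the n regressor values.
    """
    X = [[1.0] + list(row) for row in xs]  # the leading 1 carries the constant a
    k = len(X[0])
    # Normal equations: (X'X) coef = X'y
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


# Hypothetical data, built so that y = 2 + 3*x1 - 1*x2 with no error term.
xs = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - x2 for x1, x2 in xs]

a, b1, b2 = fit_multiple_regression(xs, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

With real data the points will not lie exactly on the fitted equation, and the same procedure then gives the values of *a* and the *b*s that make the squared errors as small as possible.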

What do the regression coefficients mean? The coefficient of each independent
variable tells us what relation that variable has with *y*, the dependent
variable, with all the other independent variables held constant. So, if
*b*_{1} is high and positive, that means that if *x*_{2},
*x*_{3} and so on up to *x*_{n} do not change, then
increases in *x*_{1} will correspond to large increases in
*y*.

In multiple regression, as in simple regression, we can work out a value
for *R*^{2}. However, every time we add another independent
variable, the value of *R*^{2} cannot fall and will almost always rise (you
can get an idea of why this happens if you compare Fig 3 with Fig 1 in
the handout on "The idea of a regression equation"). Therefore,
in assessing the goodness of fit of a regression equation, we usually work
in terms of a slightly different statistic, called adjusted *R*^{2} or
*R*^{2}_{adj}. This is calculated as

*R*^{2}_{adj} = 1 - (1 - *R*^{2})(*N* - 1)/(*N* - *n* - 1)

where *N* is the number of observations in the data set (usually
the number of people) and *n* the number of independent variables
or **regressors**. This allows for the extra regressors. Check that
you can see from the formula that *R*^{2}_{adj} will
always be lower than *R*^{2} if there is more than one regressor.
There is also another way of assessing goodness of fit in multiple regression,
using the *F* statistic which we will meet in a moment.
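
The adjustment is easy to compute by hand or in a few lines of code. The sketch below assumes the standard definition, *R*^{2}_{adj} = 1 - (1 - *R*^{2})(*N* - 1)/(*N* - *n* - 1); the function name and the illustrative figures are hypothetical.

```python
def adjusted_r_squared(r_squared, n_observations, n_regressors):
    """Adjusted R-squared: 1 - (1 - R^2)(N - 1)/(N - n - 1)."""
    residual_df = n_observations - n_regressors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than regressors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / residual_df


# Hypothetical figures: R^2 = 0.50 from 16 observations.
with_three = adjusted_r_squared(0.50, 16, 3)   # 3 regressors
with_six = adjusted_r_squared(0.50, 16, 6)     # same R^2, 6 regressors
print(with_three, with_six)
```

The same raw *R*^{2} is penalised more heavily when it has been achieved with more regressors, which is the point of the adjustment.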

Multiple regression enables us to answer five main questions about a
set of data, in which *n* independent variables (regressors), *x*_{1}
to *x*_{n}, are being used to explain the variation in a single
dependent variable, *y*.

- How well do the regressors, taken together, explain the variation in
the dependent variable? This is assessed by the value of
*R*^{2}_{adj}. As a very rough guide, in psychological applications we would usually reckon an *R*^{2}_{adj} of above 75% as very good; 50% to 75% as good; 25% to 50% as poor but acceptable; and below 25% as very poor and perhaps unacceptable. Alas, an *R*^{2}_{adj} value above 90% is very rare in psychology, and should make you wonder whether there is some artefact in your data.
- Are the regressors, taken together, significantly associated with the
dependent variable? This is assessed by the statistic
*F* in the "Analysis of Variance" or anova part of the regression output. *F* is like some other statistics (e.g. *t*, chi^{2}) in that its significance depends on its **degrees of freedom**, which in turn depend on sample sizes and/or the nature of the test used. Unlike *t*, though, *F* has two degrees of freedom associated with it. In general they are referred to as the **numerator** and **denominator** degrees of freedom (because *F* is actually a ratio). In regression, the numerator degrees of freedom are associated with the regression, and the denominator degrees of freedom with the **residual** or **error**; you can find them in the Regression and Error rows of the anova table in the Minitab output. If you were finding the significance of an *F* value by looking it up in a book of tables, you would need the degrees of freedom to do it. Minitab works out significances for you, and you will find them in the anova table next to the *F* value; but you need to use the degrees of freedom when reporting the results (see below). Note that the higher the value of *F*, the more significant it will be for given degrees of freedom.
- What relationship does each regressor have with the dependent variable
when all other regressors are held constant? This is answered by looking
at the regression coefficients. Minitab reports these twice, once in the
regression equation and again (to an extra decimal place) in the table
of regression coefficients and associated statistics. Note that regression
coefficients have units. So if the dependent variable is score on a psychometric
test of depression, and one of the regressors is monthly income, the coefficient
for that regressor would have units of (scale points) per (£ income per
month). That means that if we changed the units of one of the variables,
the regression coefficient would change but the relationship it is describing,
and what it is saying about it, would not. So the size of a regression
coefficient doesn't tell us anything about the strength of the relationship
it describes until we have taken the units into account. The fact that
regression coefficients have units also means that we can give a precise
interpretation to each coefficient. So, staying with depression score and
income, a coefficient of -0.0934 (as in the worked example on the next
sheet) would mean that, with all other variables held constant, increasing
someone's income by £1 per month is associated with a decrease of depression
score of 0.0934 points (we might want to make this more meaningful by saying
that an increase in income of £100 per month would be associated with a
decrease in depression score of 100 * 0.0934 = 9.34 scale units). As in
this example, negative coefficients mean that when the regressor increases,
the dependent variable decreases. If the regressor is a
**dichotomous** variable (e.g. gender), the size of the coefficient tells us the size of the difference between the two classes of individual (again, with all other variables held constant). So a gender coefficient of 3.3, with men coded 0 and women coded 1, would mean that with all other variables held constant, women's dependent variable scores would average 3.3 units higher than men's.
- Which regressor has most effect on the dependent variable? It is not
possible to give a fully satisfactory answer to this question, for a number
of reasons. The chief one is that we are always looking at the effect of
each variable in the presence of all the others; since the regressors
need not be independent of one another, it is hard to be sure which one is contributing
to a joint relationship (or even to be sure that that means anything).
However, the usual way of addressing the question is to look at the
**standardised regression coefficients** or **beta weights** for each variable; these are the regression coefficients we would get if we converted all variables (independent and dependent) to **z-scores** before doing the regression. Minitab, unfortunately, does not report beta weights for the independent variables in its regression output, though it is possible to calculate them; SPSS, which you will learn about later in the course, does give them directly.
- Are the relationships of each regressor with the dependent variable
statistically significant, with all other regressors taken into account?
This is answered by looking at the
*t* values in the table of regression coefficients. The degrees of freedom for *t* are those for the residual in the anova table, but Minitab works out significances for us, so we need to know the degrees of freedom only when it comes to reporting results. Note that if a regression coefficient is negative, Minitab will report the corresponding *t* value as negative, but if you were looking it up in tables, you would use the **absolute** (unsigned) value.
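
Beta weights can be worked out from the ordinary coefficients without re-running the regression on z-scores, because each beta weight equals the raw coefficient multiplied by the standard deviation of its regressor and divided by the standard deviation of the dependent variable. The sketch below assumes that identity; the function name and the data are hypothetical. It also illustrates the point about units: expressing income in pounds or in hundreds of pounds changes the raw coefficient by a factor of 100 but leaves the beta weight unchanged.

```python
from statistics import stdev


def beta_weights(coefficients, regressor_columns, y):
    """Convert raw coefficients b_j to standardised beta weights.

    beta_j = b_j * sd(x_j) / sd(y), the coefficient we would get if every
    variable were converted to z-scores before doing the regression.
    """
    sd_y = stdev(y)
    return [b * stdev(col) / sd_y
            for b, col in zip(coefficients, regressor_columns)]


# Hypothetical single-regressor data lying exactly on a line:
# score = 50 - 0.05 * income, with income in pounds.
income_pounds = [100, 200, 300, 400, 500]
score = [45, 40, 35, 30, 25]

# The same incomes in hundreds of pounds; the raw coefficient becomes -5.
income_hundreds = [1, 2, 3, 4, 5]

beta_pounds = beta_weights([-0.05], [income_pounds], score)[0]
beta_hundreds = beta_weights([-5.0], [income_hundreds], score)[0]
print(beta_pounds, beta_hundreds)
```

With several regressors the function is used in the same way, passing one coefficient and one column of data per regressor.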

Either the nature of the data, or the regression results, may suggest
further questions. For example, you may want to obtain means and standard
deviations or histograms of variables to check on their distributions;
or plot one variable against another, or obtain a matrix of correlations,
to check on first order relationships. Minitab does some checking for you
automatically, and reports if it finds "unusual observations".
If there are unusual observations, PLOT or HISTOGRAM may tell you what
the possible problems are. The usual kinds of unusual observations are
"**outliers**", points which lie far from the main distributions
or the main trends of one or more variables. Serious outliers should be
dealt with as follows:

1. temporarily remove the observations from the data set. In Minitab, this
can be done by using the LET command to set the outlier value to "**missing**",
indicated by an asterisk instead of a numerical value. For example, if
item 37 in the variable held in C1 looks like an outlier, we could type:

LET C1(37)='*'

note the single quotes round the asterisk.

2. repeat the regression and see whether the same qualitative results are
obtained (the quantitative results will inevitably be different).

3. if the same general results are obtained, we can conclude that the outliers
are not distorting the results. Report the results of the original regression,
adding a note that removal of outliers did not greatly affect them.

4. if different general results are obtained, accurate interpretation will
require more data to be collected. Report the results of both regressions,
and note that the interpretation of the data is uncertain. The outliers
may be due to errors of observation, data coding, etc, and in this case
they should be corrected or discarded. However, they may also represent
a subpopulation for which the effects of interest are different from those
in the main population. If they are not due to error, the group of data
contributing to outliers will need to be identified, and if possible a
reasonably sized sample collected from it so that it can be compared with
the main population. This is a scientific rather than a statistical problem.
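
The remove-and-refit procedure above can be sketched in a few lines of Python. This is an illustration under several assumptions, not a substitute for the Minitab steps: the function names and data are hypothetical, a one-regressor fit stands in for the full multiple regression to keep the example short, and "same qualitative results" is reduced to the crude check of whether the slope keeps the same sign. Note also that Python counts observations from 0, whereas Minitab counts from 1.

```python
def fit_simple(x, y):
    """Least-squares fit of y = a + b*x (a one-regressor stand-in)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def refit_without(x, y, suspect_rows):
    """Mark suspect observations as missing (None), then fit the complete cases."""
    x_marked = [None if i in suspect_rows else xi for i, xi in enumerate(x)]
    complete = [(xi, yi) for xi, yi in zip(x_marked, y) if xi is not None]
    xs, ys = zip(*complete)
    return fit_simple(xs, ys)


# Hypothetical data: the last observation lies far from the trend of the rest.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, -30]

a_all, b_all = fit_simple(x, y)
a_trim, b_trim = refit_without(x, y, suspect_rows={5})

# Crude qualitative comparison: does the slope point the same way?
same_direction = (b_all > 0) == (b_trim > 0)
print(b_all, b_trim, same_direction)
```

Here the two fits disagree about the direction of the relationship, which is the situation described in step 4: both sets of results would need to be reported.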

Research articles sometimes report the results of several different
regressions done on a single data set. In this case, it is best to present
the results in a table. Where a single regression is done, however, that
is unnecessary, and the results can be reported in text. The wording should
be something like the following (this is for the depression vs age, income
and gender example used in the class):

The data were analysed by multiple regression, using as regressors age,
income and gender. The regression was a rather poor fit (R^{2}_{adj}
= 40%), but the overall relationship was significant (F_{3,12}
= 4.32, p < 0.05). With other variables held constant, depression scores
were negatively related to age and income, decreasing by 0.16 for every
extra year of age, and by 0.09 for every extra pound per week income. Women
tended to have higher scores than men, by 3.3 units. Only the effect of
income was significant (t_{12} = 3.18, p < 0.01).

Note the following:

- The above brief paragraph does not exhaust what you can say about a set of regression results. There may be features of the data you should look at: "unusual observations", for example. Normally you will need to go on to discuss the meaning of the trends you have described.
- Always report what happened before moving on to its significance, so *R*^{2}_{adj} values before *F* values, regression coefficients before *t* values. Remember, descriptive statistics are more important than significance tests.
- Although Minitab will give a negative *t* value if the corresponding regression coefficient is negative, you should drop the negative sign when reporting the results.
- Degrees of freedom for both *F* and *t* values must be given. Usually they are written as subscripts. For *F* the numerator degrees of freedom are given first. You can also put degrees of freedom in parentheses, or report them explicitly, e.g.: "F(3,12) = 4.32" or "F = 4.32, d. of f. = 3, 12".
- Significance levels can either be reported exactly (e.g. *p* = 0.032) or in terms of conventional levels (e.g. *p* < 0.05). There are arguments in favour of either, so it doesn't much matter which you do. But you should be consistent in any one report.
- Beware of highly significant *F* or *t* values, whose significance levels will be reported by statistics packages as, for example, 0.0000. It is an act of statistical illiteracy to write *p* = 0.0000; significance levels can never be exactly zero: there is always some probability that the observed data could arise if the **null hypothesis** was true. What the package means is that this probability is so low it can't be represented with the number of columns available. We should write it as *p* < 0.00005.
- Beware of **spurious precision**, i.e. reporting coefficients etc. to huge numbers of **significant figures** when, on the basis of the sample you have, you couldn't possibly expect them to replicate to anything like that degree of precision if someone repeated the study. *F* and *t* values are conventionally reported to two decimal places, and *R*^{2}_{adj} values to the nearest percentage point (sometimes to one additional decimal place). For coefficients, you should be guided by the sample size: with a sample size of 16, as in the example used above, two significant figures is plenty, but even with more realistic samples, in the range of 100 to 1000, three significant figures is usually as far as you should go. This means that you will usually have to round off the numbers that Minitab will give you.
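
If you prepare results with a script rather than by hand, the reporting conventions in the list above can be applied mechanically. The helper functions below are a hypothetical sketch, not part of Minitab or any other package: they turn a significance level that prints as zero into an inequality, round a coefficient to a chosen number of significant figures, and write an *F* value with its degrees of freedom in parentheses.

```python
import math


def format_p(p, decimals=4):
    """Report p exactly, unless it would print as zero at this precision."""
    if round(p, decimals) == 0:
        # A printed 0.0000 only tells us p is below half a unit in the last place.
        return f"p < {0.5 * 10 ** -decimals:.{decimals + 1}f}"
    return f"p = {p:.{decimals}f}"


def round_sig(value, figures):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, figures - 1 - exponent)


def format_f(f_value, df_numerator, df_denominator):
    """Write F to two decimal places, numerator degrees of freedom first."""
    return f"F({df_numerator},{df_denominator}) = {f_value:.2f}"


tiny = format_p(0.00001)              # would print as 0.0000 in a package
exact = format_p(0.032, decimals=3)
coefficient = round_sig(-0.0934, 2)   # two significant figures
f_text = format_f(4.32, 3, 12)
print(tiny, exact, coefficient, f_text, sep="; ")
```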

Stephen Lea

University of Exeter

Department of Psychology

Washington Singer Laboratories

Exeter EX4 4QG

United Kingdom

Tel +44 1392 264626

Fax +44 1392 264623
