Multiple regression: basic concepts and procedures
Contents of this handout: From
simple regression to multiple regression; Goodness
of fit in multiple regression; The main questions
multiple regression answers; Further questions
to ask about your data and results; How to report
regression results
From simple regression to multiple regression
What happens if we have more than two independent variables? In most
cases, we can't draw graphs to illustrate the relationship between them
all. But we can still represent the relationship by an equation. This is
what multiple regression does. It's a straightforward extension of simple
regression. If there are n independent variables, we call them x1, x2, x3
and so on up to xn. Multiple regression then finds values of a, b1, b2, b3
and so on up to bn which give the best fitting equation of the form
y = a + b1x1 + b2x2 + b3x3 + ... + bnxn
b1 is called the coefficient of x1,
b2 is the coefficient of x2, and so forth.
The equation is exactly like the one for simple regression, except that
it is very laborious to work out the values of a, b1
etc by hand. Minitab, however, does it with exactly the same command as
for simple regression.
What do the regression coefficients mean? The coefficient of each independent
variable tells us what relation that variable has with y, the dependent
variable, with all the other independent variables held constant. So, if
b1 is high and positive, that means that if x2,
x3 and so on up to xn do not change, then
increases in x1 will correspond to large increases in
y.
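As a rough illustration of what a statistics package does when it finds the
best fitting equation, here is a minimal Python sketch. It assumes the NumPy
library is available; the function name fit_multiple_regression and the data
are invented for this illustration and are not taken from the class examples.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Return (a, b) for the least-squares fit y = a + b1*x1 + ... + bn*xn.

    X has one row per observation and one column per independent variable.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones lets the constant a be estimated as well.
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[0], coefs[1:]

# Made-up data: y is constructed to be exactly 2 + 3*x1 - 1*x2,
# so the fit should recover a = 2, b1 = 3 and b2 = -1.
X = [[1, 4], [2, 1], [3, 5], [4, 2], [5, 7], [6, 3]]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]

a, b = fit_multiple_regression(X, y)
print("a =", round(float(a), 3), "b =", np.round(b, 3))
```

Real data will not lie exactly on the fitted equation, of course; the
coefficients are then the values that make the squared errors as small as
possible.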
Goodness of fit in multiple regression
In multiple regression, as in simple regression, we can work out a value
for R2. However, every time we add another independent
variable, we necessarily increase the value of R2 (you
can get an idea of why this happens if you compare Fig 3 with Fig 1 in
the handout on "The idea of a regression equation"). Therefore,
in assessing the goodness of fit of a regression equation, we usually work
in terms of a slightly different statistic, called adjusted R2 or
R2adj. This is calculated as
R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1)
where N is the number of observations in the data set (usually
the number of people) and n the number of independent variables
or regressors. This allows for the extra regressors. Check that
you can see from the formula that R2adj will
always be lower than R2 when there is at least one regressor, unless the fit is perfect.
There is also another way of assessing goodness of fit in multiple regression,
using the F statistic which we will meet in a moment.
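The adjustment can be checked with a few lines of Python. This is only a
sketch: it takes the adjusted statistic to be 1 - (1 - R2)(N - 1)/(N - n - 1),
with R2 expressed as a proportion rather than a percentage, and the R2 value
of 0.52 is a hypothetical figure chosen for illustration.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R2 = 1 - (1 - R2)(N - 1)/(N - n - 1), with R2 as a proportion."""
    if n_obs - n_regressors - 1 <= 0:
        raise ValueError("need more observations than regressors plus one")
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Hypothetical illustration: R2 = 0.52 from 16 observations and 3 regressors.
r2 = 0.52
r2_adj = adjusted_r2(r2, n_obs=16, n_regressors=3)
print(f"R2 = {r2:.0%}, adjusted R2 = {r2_adj:.0%}")
# prints: R2 = 52%, adjusted R2 = 40%
```

With a small sample and several regressors the adjustment is substantial;
with hundreds of observations it makes very little difference.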
The main questions multiple regression answers
Multiple regression enables us to answer five main questions about a
set of data, in which n independent variables (regressors), x1
to xn, are being used to explain the variation in a single
dependent variable, y.
- How well do the regressors, taken together, explain the variation in
the dependent variable? This is assessed by the value of R2adj.
As a very rough guide, in psychological applications we would usually reckon
an R2adj of above 75% as very good; 50% to
75% as good; 25% to 50% as poor but acceptable; and below 25% as very poor
and perhaps unacceptable. Alas, an R2adj value
above 90% is very rare in psychology, and should make you wonder whether
there is some artefact in your data.
- Are the regressors, taken together, significantly associated with the
dependent variable? This is assessed by the statistic F in the "Analysis
of Variance" or anova part of the regression output. F is like
some other statistics (e.g. t, chi-squared) in that its significance
depends on its degrees of freedom, which in turn depend on sample
sizes and/or the nature of the test used. Unlike t, though, F
has two degrees of freedom associated with it. In general they are
referred to as the numerator and denominator degrees of freedom
(because F is actually a ratio). In regression, the numerator degrees
of freedom are associated with the regression, and the denominator degrees
of freedom with the residual or error; you can find them
in the Regression and Error rows of the anova table in the Minitab output.
If you were finding the significance of an F value by looking it
up in a book of tables, you would need the degrees of freedom to do it.
Minitab works out significances for you, and you will find them in the
anova table next to the F value; but you need to use the degrees
of freedom when reporting the results (see below). Note that the higher
the value of F, the more significant it will be for given degrees
of freedom.
- What relationship does each regressor have with the dependent variable
when all other regressors are held constant? This is answered by looking
at the regression coefficients. Minitab reports these twice, once in the
regression equation and again (to an extra decimal place) in the table
of regression coefficients and associated statistics. Note that regression
coefficients have units. So if the dependent variable is score on a psychometric
test of depression, and one of the regressors is monthly income, the coefficient
for that regressor would have units of (scale points) per (pound of income per
month). That means that if we changed the units of one of the variables,
the regression coefficient would change but the relationship it is describing,
and what it is saying about it, would not. So the size of a regression
coefficient doesn't tell us anything about the strength of the relationship
it describes until we have taken the units into account. The fact that
regression coefficients have units also means that we can give a precise
interpretation to each coefficient. So, staying with depression score and
income, a coefficient of -0.0934 (as in the worked example on the next
sheet) would mean that, with all other variables held constant, increasing
someone's income by £1 per month is associated with a decrease of depression
score of 0.0934 points (we might want to make this more meaningful by saying
that an increase in income of £100 per month would be associated with a
decrease in depression score of 100 * 0.0934 = 9.34 scale units). As in
this example, negative coefficients mean that when the regressor increases,
the dependent variable decreases. If the regressor is a dichotomous
variable (e.g. gender), the size of the coefficient tells us the size of
the difference between the two classes of individual (again, with all other
variables held constant). So a gender coefficient of 3.3, with men coded
0 and women coded 1, would mean that with all other variables held constant,
women's dependent variable scores would average 3.3 units higher than men's.
- Which regressor has most effect on the dependent variable? It is not
possible to give a fully satisfactory answer to this question, for a number
of reasons. The chief one is that we are always looking at the effect of
each variable in the presence of all the others; since the regressors
need not be independent of one another, it is hard to be sure which one is contributing
to a joint relationship (or even to be sure that that means anything).
However, the usual way of addressing the question is to look at the standardised
regression coefficients or beta weights for each variable; these
are the regression coefficients we would get if we converted all variables
(independent and dependent) to z-scores before doing the regression.
Minitab, unfortunately, does not report beta weights for the independent
variables in its regression output, though it is possible to calculate them;
SPSS, which you will learn about later in the course, does give them directly.
- Are the relationships of each regressor with the dependent variable
statistically significant, with all other regressors taken into account?
This is answered by looking at the t values in the table of regression
coefficients. The degrees of freedom for t are those for the residual
in the anova table, but Minitab works out significances for us, so we need
to know the degrees of freedom only when it comes to reporting results.
Note that if a regression coefficient is negative, Minitab will report
the corresponding t value as negative, but if you were looking it
up in tables, you would use the absolute (unsigned) value.
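Two of the points above, that coefficients depend on units and that beta
weights can be calculated when a package does not report them, can be
illustrated with a short Python sketch. It assumes NumPy is available and
uses the standard relation that a beta weight equals the raw coefficient
multiplied by the standard deviation of its regressor and divided by the
standard deviation of the dependent variable. The function names and the
data (age, monthly income and a depression score for eight imaginary
people) are invented for illustration.

```python
import numpy as np

def ols(X, y):
    """Least-squares constant and coefficients for y = a + b1*x1 + ... + bn*xn."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[0], coefs[1:]

def beta_weights(X, y):
    """Standardised coefficients: b_j * sd(x_j) / sd(y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    _, b = ols(X, y)
    return b * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Made-up data: columns are age (years) and income (pounds per month).
X = np.array([[23, 1200], [35, 1500], [41, 900], [52, 2100],
              [29, 1750], [60, 1300], [47, 1900], [33, 1100]], dtype=float)
y = np.array([14, 11, 16, 5, 9, 10, 6, 15], dtype=float)

betas = beta_weights(X, y)

# Check 1: regressing z-scores on z-scores gives the same numbers.
zX = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
_, betas_from_z = ols(zX, zy)

# Check 2: measuring income in hundreds of pounds multiplies its raw
# coefficient by 100 but leaves the beta weights unchanged.
X_hundreds = X / np.array([1.0, 100.0])
_, b_pounds = ols(X, y)
_, b_hundreds = ols(X_hundreds, y)
betas_rescaled = beta_weights(X_hundreds, y)

print("raw coefficients (pounds):  ", np.round(b_pounds, 4))
print("raw coefficients (hundreds):", np.round(b_hundreds, 4))
print("beta weights:               ", np.round(betas, 3))
```

Because beta weights are free of units they can be compared with one
another, subject to the reservations given above about regressors that are
related to each other.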
Further questions to ask
Either the nature of the data, or the regression results, may suggest
further questions. For example, you may want to obtain means and standard
deviations or histograms of variables to check on their distributions;
or plot one variable against another, or obtain a matrix of correlations,
to check on first order relationships. Minitab does some checking for you
automatically, and reports if it finds "unusual observations".
If there are unusual observations, PLOT or HISTOGRAM may tell you what
the possible problems are. The usual kinds of unusual observations are
"outliers": points which lie far from the main distributions
or the main trends of one or more variables. Serious outliers should be
dealt with as follows:
1. temporarily remove the observations from the data set. In Minitab, this
can be done by using the LET command to set the outlier value to "missing",
indicated by an asterisk instead of a numerical value. For example, if
item 37 in the variable held in C1 looks like an outlier, we could type:
LET C1(37)='*'
Note the single quotes round the asterisk.
2. repeat the regression and see whether the same qualitative results are
obtained (the quantitative results will inevitably be different).
3. if the same general results are obtained, we can conclude that the outliers
are not distorting the results. Report the results of the original regression,
adding a note that removal of outliers did not greatly affect them.
4. if different general results are obtained, accurate interpretation will
require more data to be collected. Report the results of both regressions,
and note that the interpretation of the data is uncertain. The outliers
may be due to errors of observation, data coding, etc, and in this case
they should be corrected or discarded. However, they may also represent
a subpopulation for which the effects of interest are different from those
in the main population. If they are not due to error, the group of data
contributing to outliers will need to be identified, and if possible a
reasonably sized sample collected from it so that it can be compared with
the main population. This is a scientific rather than a statistical problem.
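The first two steps of this procedure can be sketched in Python as follows,
for readers working outside Minitab. This is an illustration under stated
assumptions, not a standard routine: it assumes NumPy is available, the
function names and data are invented, and "the same qualitative results" is
taken, for the purposes of the sketch only, to mean that each coefficient
keeps its sign. In practice you would also compare significance and the
rough sizes of the coefficients.

```python
import numpy as np

def ols(X, y):
    """Least-squares estimates [a, b1, ..., bn] for y = a + b1*x1 + ... + bn*xn."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

def compare_without(X, y, suspect_rows):
    """Fit with all observations, then again with the suspect rows left out."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(y), dtype=bool)
    keep[list(suspect_rows)] = False  # plays the role of Minitab's '*'
    full = ols(X, y)
    reduced = ols(X[keep], y[keep])
    # Assumption for this sketch: "same qualitative result" = same slope signs.
    same_signs = bool(np.all(np.sign(full[1:]) == np.sign(reduced[1:])))
    return full, reduced, same_signs

# Made-up data: the first eight observations follow y = 1 + 2*x1 - x2 exactly;
# the ninth (row index 8, counting from zero) is a deliberate outlier.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5], [7, 8], [8, 7], [9, 1]]
y = [1 + 2 * x1 - x2 for x1, x2 in X[:8]] + [40]

full, reduced, same_signs = compare_without(X, y, suspect_rows=[8])
print("all observations:", np.round(full, 3))
print("outlier removed: ", np.round(reduced, 3))
print("slope signs agree:", same_signs)
```

Note that Python counts rows from zero, so the ninth observation is row 8,
whereas Minitab counts from one.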
Reporting regression results
Research articles sometimes report the results of several different
regressions done on a single data set. In this case, it is best to present
the results in a table. Where a single regression is done, however, that
is unnecessary, and the results can be reported in text. The wording should
be something like the following (this is for the depression vs age, income
and gender example used in the class):
The data were analysed by multiple regression, using as regressors age,
income and gender. The regression was a rather poor fit (R2adj
= 40%), but the overall relationship was significant (F(3,12)
= 4.32, p < 0.05). With other variables held constant, depression scores
were negatively related to age and income, decreasing by 0.16 for every
extra year of age, and by 0.09 for every extra pound per week of income. Women
tended to have higher scores than men, by 3.3 units. Only the effect of
income was significant (t(12) = 3.18, p < 0.01).
Note the following:
- The above brief paragraph does not exhaust what you can say about a
set of regression results. There may be features of the data you should
look at: "unusual observations", for example. Normally you will
need to go on to discuss the meaning of the trends you have described.
- Always report what happened before moving on to its significance: so
R2adj values before F values, regression
coefficients before t values. Remember, descriptive statistics are
more important than significance tests.
- Although Minitab will give a negative t value if the corresponding
regression coefficient is negative, you should drop the negative sign when
reporting the results.
- Degrees of freedom for both F and t values must be given.
Usually they are written as subscripts. For F the numerator degrees
of freedom are given first. You can also put degrees of freedom in parentheses,
or report them explicitly, e.g.: "F(3,12) = 4.32" or "F
= 4.32, d. of f. = 3, 12".
- Significance levels can either be reported exactly (e.g. p = 0.032)
or in terms of conventional levels (e.g. p < 0.05). There are
arguments in favour of either, so it doesn't much matter which you do.
But you should be consistent in any one report.
- Beware of highly significant F or t values, whose significance
levels will be reported by statistics packages as, for example, 0.0000.
It is an act of statistical illiteracy to write p = 0.0000; significance
levels can never be exactly zero, since there is always some probability that
the observed data could arise if the null hypothesis were true. What
the package means is that this probability is so low it can't be represented
with the number of columns available. We should write it as p <
0.00005.
- Beware of spurious precision, i.e. reporting coefficients etc
to huge numbers of significant figures when, on the basis of the
sample you have, you couldn't possibly expect them to replicate to anything
like that degree of precision if someone repeated the study. F and
t values are conventionally reported to two decimal places, and
R2adj values to the nearest percentage point
(sometimes to one additional decimal place). For coefficients, you should
be guided by the sample size: with a sample size of 16, as in the example
used above, two significant figures is plenty, but even with more realistic
samples, in the range of 100 to 1000, three significant figures is usually
as far as you should go. This means that you will usually have to round
off the numbers that Minitab will give you.
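If you prepare reports with the help of a script, the conventions in the
notes above can be captured in a few small helper functions. The following
Python sketch is one possible way of doing so; the function names are
invented for illustration, and the degrees of freedom are taken to be n for
the regression and N - n - 1 for the residual, where N is the number of
observations and n the number of regressors. The values passed in are those
of the class example, with the t value given its negative sign as a package
would print it for a negative coefficient.

```python
def regression_dfs(n_obs, n_regressors):
    """Numerator and denominator degrees of freedom for the overall F test."""
    return n_regressors, n_obs - n_regressors - 1

def format_f(f, df_num, df_den):
    """F to two decimal places, numerator degrees of freedom first."""
    return f"F({df_num},{df_den}) = {f:.2f}"

def format_t(t, df):
    """t to two decimal places, with any negative sign dropped."""
    return f"t({df}) = {abs(t):.2f}"

def format_p(p, decimals=4):
    """Report p exactly, unless it would print as zero at this precision."""
    smallest = 0.5 * 10 ** -decimals  # anything below this prints as 0.000...
    if p < smallest:
        return f"p < {smallest:.{decimals + 1}f}"
    return f"p = {p:.{decimals}f}"

def round_sig(x, figures=2):
    """Round x to the given number of significant figures."""
    return float(f"{x:.{figures}g}")

df_num, df_den = regression_dfs(n_obs=16, n_regressors=3)
print(format_f(4.32, df_num, df_den))   # prints: F(3,12) = 4.32
print(format_t(-3.18, df_den))          # prints: t(12) = 3.18
print(format_p(0.00001))                # prints: p < 0.00005
print(format_p(0.032, decimals=3))      # prints: p = 0.032
print(round_sig(-0.0934, 2))            # prints: -0.093
```

The helpers only tidy up the numbers; deciding what to report, and in what
order, remains your job.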
Stephen Lea
University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623
Document revised 10th January 1997