Contents of this handout:

- Categorical independent variables: background and procedures
- Exploratory regression analysis, and procedures for carrying it out
- Moving data between packages
- Further problems with multiple regression (outliers, multicollinearity, heteroscedasticity, ratio of number of cases to number of regressors)
- Examples

The most natural use of multiple regression is when all the variables
concerned are continuous. It has been shown by "Monte Carlo"
methods (i.e. trying it out with random numbers) that it is not badly affected
if either independent or dependent variables are only **ordinally**
measured (e.g. **rating scales**), so long as the number of categories
is not too small. What happens, though, when we have to deal with **nominal**
measurement, i.e. where the numbers we have are merely labels for categories?
If the dependent variable is categorical, we can't use multiple regression
at all. However, unordered categories as independent variables can be accommodated.

When an independent variable consists of just two categories, e.g. male
vs female, there is no problem; we have already seen how to include such
**dichotomous** variables in a regression analysis. Where we have three
or more categories in a variable, however, it would clearly be a serious
error just to include such a variable in a regression analysis directly.
Instead we have to use what are called **dummy variables**.

Suppose that the variable includes *m* categories (*m*>2).
What we have to do is to break down this multi-way category into *m*
2-way categories, each indicating whether or not the observation belongs
to a particular one of the *m* original categories. So suppose a variable
called 'faculty' contains codes 1-6 indicating which of the 6 faculties
an Exeter student belongs to, assigning the codes in alphabetical order
of faculty name. We have to create 6 new variables, one for each faculty,
which will indicate whether the student belongs to that faculty. So the
values of the variables for the first few cases might look like this:

| faculty | arts | educatio | engineer | law | science | socstuds |
|---------|------|----------|----------|-----|---------|----------|
| 2       | 0    | 1        | 0        | 0   | 0       | 0        |
| 3       | 0    | 0        | 1        | 0   | 0       | 0        |
| 1       | 1    | 0        | 0        | 0   | 0       | 0        |
| 1       | 1    | 0        | 0        | 0   | 0       | 0        |
| 4       | 0    | 0        | 0        | 1   | 0       | 0        |
| 6       | 0    | 0        | 0        | 0   | 0       | 1        |

The new variables are called dummy variables. In effect we have replaced the one 6-way categorical variable 'faculty' with six dichotomous variables, 'arts' to 'socstuds'. We have already seen that we can include dichotomous variables in a regression analysis, so our problem is essentially solved. There are however a couple of loose ends to tie up before we consider how we can make the transformation in practice.

- If we include all the dummy variables derived from a single categorical independent variable in an analysis, our statistics package would protest that it has detected **multicollinearity**, i.e. it has found that it can predict 'socstuds' from the variables 'arts' to 'science': if a student isn't a member of any other faculty, s/he must be a member of the last one to be tried. (We shall hear more about multicollinearity later in the handout.) To prevent this, we must eliminate one of the categories from the analysis, and this means we must choose which one to eliminate. There are two possible grounds for choice. Both depend on the fact that when we put the remaining categories into the regression, the question we are really asking is whether members of those categories differ from the category we eliminated. So, if there is one category which is in some sense a control or "normal" condition, this would be the one to eliminate. If there isn't such a control condition, we usually eliminate the **modal** category, i.e. the one to which most observations belong. If there is no control condition and equal numbers in each category, then just eliminate one category at random. However we choose, the *R*^{2} and *F* values will not be affected by our choice of which category to eliminate.
- Logically, before asking whether each of the categories differs from the control condition, we should ask whether the categorical variable *as a whole* has a significant effect. This is analogous to looking at the overall *F* for an effect in analysis of variance before enquiring into contrasts, and in fact we use an *F* test in this case, too (its form is shown after this list). SPSS offers an easy way of carrying out the necessary test. There is a way of doing it in Minitab, but it is not easy and so far as I know not documented at all.
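Whichever package does the work, the test has the same form: the extra sum of squares accounted for by the *m*-1 dummy variables is divided by *m*-1 to give a mean square, which is then divided by the error mean square (the same calculation is spelt out step by step for Minitab later in this handout):

$$ F_{(m-1,\ df_{\mathrm{error}})} = \frac{\left(\sum \mathrm{SS}_{\mathrm{dummy}}\right)/(m-1)}{\mathrm{MS}_{\mathrm{error}}} $$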


Going from a categorical variable to the corresponding set of dummy
variables is easy in Minitab: there is a command, **INDICATOR**, which
simply takes a column and produces dummy variables for us. So if we have
a 6-way categorical variable stored in C1, we can simply type

MTB> INDICATOR C1 C11-C16

We shall have to give the new variables the appropriate names by hand. Things are not quite so easy in SPSS; here we have to write a short loop in our command file, for example:

do repeat x=arts educatio engineer law science socstuds / i=1 to 6.

compute x=0.

if (faculty=i) x=1.

if missing(faculty) x=9.

end repeat.

missing values arts to socstuds (9).

Note that we have to take explicit action to look after missing values.
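For comparison outside either package, here is a minimal sketch of the same transformation in Python with pandas (pandas is not covered in this handout, and the DataFrame `students` and its column names are hypothetical); note that, just as in SPSS, missing faculty codes need explicit handling:

```python
# Minimal sketch: turn a 6-way categorical 'faculty' code into six 0/1 dummy
# variables. `students` is an assumed pandas DataFrame with a 'faculty' column
# coded 1-6 (NaN where the code is missing).
import numpy as np
import pandas as pd

labels = {1: 'arts', 2: 'educatio', 3: 'engineer',
          4: 'law', 5: 'science', 6: 'socstuds'}
dummies = pd.get_dummies(students['faculty'].map(labels)).astype(float)
dummies[students['faculty'].isna()] = np.nan  # keep missing cases missing, as the SPSS loop does with code 9
students = students.join(dummies)
```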

Finding out which category is the modal one (so that we know which dummy variable to leave out) is straightforward in either Minitab (use the TALLY command, or TABLE with a single argument) or SPSS (use FREQUENCIES).

In either SPSS or Minitab, the dummy variables can simply be entered
in the regression command just like any other variable. *Note, though, that the original categorical variable should never appear in the regression command*.

Testing for the overall effect of the categorical variable is done by
entering the set of dummy variables (excluding, of course, the control
or modal category) *after* having entered the remaining independent
variables. Suppose we are assessing the effect of A-level score (in points),
gender, parental income and faculty on undergraduates' rated satisfaction
with their courses at Exeter. Then we might have an SPSS command file like
the following (assuming that the dummy variables have been formed up before
we saved the system file, and that Arts is the modal faculty):

title satisfaction regression.

get file=satrates.sys.

regression

variables=ALevels,gender,parincom,satirate,arts to socstuds

/statistics=defaults

/dependent=satirate

/method=enter ALevels,gender,parincom

/method=enter educatio to socstuds.

finish.

SPSS will give us an *F* value for the effects of the group of
variables added on each ENTER command, which is exactly what we want.

The principle is the same in Minitab, but it requires more work. We
have to be careful to specify the dummy variables *last* in the list
of regressors for the REGRESS command. Then we look at the column headed
SEQ SS that comes under the anova table in multiple regression output,
and add up the entries for all the *m*-1 categories. The total can
be described as the **sum of squares** for the categorical variable
as a whole. Divide this by *m*-1, and we shall have the **mean square**
for the categorical variable. This in turn can be divided by the **error
mean square** from the anova table, to give an *F* statistic which
will allow us to test the significance of the entire categorical variable.
Remember that *F* always has two values of degrees of freedom associated
with it: in this case, the numerator degrees of freedom are *m*-1
and the denominator degrees of freedom are the error degrees of freedom
from the anova table. Note that Minitab includes the LET command, which
allows us to do simple arithmetic, and also constants (called K1, K2 etc
to distinguish them from columns) in which we can store the results. This
means we can get Minitab to do the arithmetic, by typing lines like

LET K1=((269.0+38.7+9.3+236.6+155.3)/5)/193.5

PRINT K1

We can then use another Minitab command, CDF (which stands for Cumulative
Distribution Function) to look up the significance of the *F* value
we obtain (though, irritatingly, CDF reports the **complement** of a
significance level, i.e. 1-*p* rather than *p*). Use Minitab's
HELP facility to find out how to use CDF.
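For comparison, here is the same arithmetic in a minimal Python sketch, using scipy to do the table look-up that CDF performs. The SEQ SS values and error mean square are those from the LET example above; the error degrees of freedom are not given there, so the value used here (120) is purely an assumption, and you would read your own off the anova table:

```python
# Minimal sketch: F test for a categorical variable as a whole, built from the
# sequential sums of squares of its m-1 dummy variables.
from scipy.stats import f

seq_ss = [269.0, 38.7, 9.3, 236.6, 155.3]   # SEQ SS entries for the m-1 dummies
error_ms = 193.5                            # error mean square from the anova table
error_df = 120                              # ASSUMED error df; use the one from your anova table

f_value = (sum(seq_ss) / len(seq_ss)) / error_ms
p_value = f.sf(f_value, len(seq_ss), error_df)  # upper-tail p (CDF would report 1 - p)
print(f_value, p_value)
```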


Up to now, we have assumed that we have a fixed set of independent variables, whose effect on the dependent variable we want to describe and subject to significance tests. However, quite often we have a large set of possible independent variables, and we are interested in finding which subset of them are most useful for predicting the dependent variable. This question divides into two parts: deciding exactly what "most useful" means in this context, and having operationalised that into a precise criterion, finding the set of regressors which best meets this criterion.

A subset of the possible regressors is usually referred to as a "regression model", because in effect choosing a subset of regressors is specifying a hypothesis, or model, of the variables that are associated with the dependent variable.

If we are choosing between models which all involve the same number
of regressors, it is easy to say which is the best: it is the model with
the highest value of *R*^{2}_{adj}. However, more
usually, we are choosing between models with different numbers of regressors
- for example, we want to know whether it is worth adding an additional
regressor or group of regressors to a model which already does a reasonable
job at explaining the variation in the data. Here things become more complicated,
because, as we already know, adding additional variables means that we
are bound to account for more variance, but obviously makes for a less
economical model - the issue is whether the gain in prediction is worth
the loss in economy. Two criteria are commonly used to assess this issue.
We could choose the model with the highest value of *R*^{2}_{adj},
or we could choose the model with the highest value of *F*. There
are arguments in favour of either policy. The model with the highest *R*^{2}_{adj}
will almost always have more regressors in it than the model with the highest
*F*. Therefore, a reasonable rule is: If you want the most *complete*
description of the data, which is also reasonably efficient, choose the
model with the highest *R*^{2}_{adj} value. If you
want the most *efficient* description, which is also reasonably complete,
choose the model with the highest *F*. Different research projects
will call for different strategies.
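For reference, with *n* cases and a model containing *k* regressors, the two criteria are calculated as follows (these are the standard definitions, not anything specific to either package):

$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}, \qquad F = \frac{R^2/k}{(1-R^2)/(n-k-1)} $$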

For purely exploratory analyses, we find the best model by either **stepwise**
or **setwise** regression. A scenario intermediate between strict hypothesis
testing and exploration is that where we know we must consider certain
independent variables, but we want to know whether other variables have
any effect over and above these "compulsory" regressors. In this
case we will use some form of **hierarchical** regression. Hierarchical
and stepwise regression are available in all packages, and are widely used.
Setwise regression is in principle more satisfactory, but requires much
more computing power, and has only recently come on the scene. It is still
not available in SPSS; it is available in Minitab, but restricted to sets
of 20 or fewer independent variables.

More formally, what the three procedures do is the following:

- In hierarchical regression, we specify the order in which the independent variables are to be entered into the model. We can look at the effect that each new variable has, over and above the set previously used. Variables may be entered and considered in groups. The order in which variables are entered should be derived from the theory underlying the study. For example, in a study on the causation of debt, which is manifestly an economic phenomenon but less certainly a psychological one, we entered a group of economic variables, then a group of demographic variables, and only after these a group of psychological variables.
- In stepwise regression, at each stage the variable that has the highest **partial correlation** with the dependent variable (after taking all variables currently in the model into account) is added to the model. Variables are only added if they increase the *F* value for the regression by some specified amount. Variables can also be removed, if they reduce *F* by another specified threshold amount. The aim is to find the set of independent variables which maximises *F*. However, the procedure used, like any **hill-climbing algorithm**, is vulnerable to **local maxima** of *F*.
- In setwise regression, all possible combinations of the set of regressors are tried, starting with all the 1-regressor models, then trying all the 2-regressor models, and so on up to the size of the group. With *n* regressors, there are 2^{n} possible models, so the time required for this gets very large indeed. However, it is sure to find the best possible model. Minitab reports the best model for each number of regressors, so you can choose either the model with the highest *F* value or the one with the highest *R*^{2}_{adj} value. (A minimal sketch of this exhaustive search appears after this list.)
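To make the setwise idea concrete, here is a minimal Python sketch of the same exhaustive search. It is purely illustrative (in practice Minitab's BREG command, described below, does this job for you); the array `X`, vector `y` and list `names` are assumed inputs:

```python
# Minimal sketch of setwise ("all possible subsets") regression, scored by
# adjusted R-squared. X: numpy array (cases x regressors); y: numpy vector;
# names: list of regressor names. All three are assumptions of this sketch.
from itertools import combinations
import numpy as np

def adj_r2(X_sub, y):
    n, k = X_sub.shape
    A = np.column_stack([np.ones(n), X_sub])        # add an intercept term
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def best_subsets(X, y, names):
    best = {}
    for k in range(1, X.shape[1] + 1):              # every possible model size
        for cols in combinations(range(X.shape[1]), k):
            score = adj_r2(X[:, cols], y)
            if k not in best or score > best[k][0]:
                best[k] = (score, [names[c] for c in cols])
    return best  # best model (by adjusted R-squared) for each number of regressors
```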

If there were no correlation between the regressors, there would never
be a problem in deciding which of two regressors was the more important,
and therefore in settling on a single best regression model. Usually, though,
we are using multiple regression precisely because regressors are correlated,
and this leaves us with problems of interpretation. Two models using different
sets of regressors can have very similar *R*^{2}_{adj}
and/or *F* values. Which is really the best model?

Ultimately, if two independent variables are closely correlated (or
one is well predicted by two others), no statistical procedure will enable
us to say which predicts a dependent variable better - both will do an
equally good job. This is called the **identification problem**. It
is a scientific rather than a statistical problem, and it can only be resolved
by collecting more data. Usually we will have to change our sampling procedure
so as to break down the offending correlation. Suppose we find, in a study
of absenteeism among women workers, that parents of pre-school children
and part-time workers have high absence rates, but we cannot tell which
is the crucial variable, because most of our part-timers have pre-school
children and vice versa. We shall have to go out of our way to collect
some data from childless part-timers or full-timers who have pre-school
children, by oversampling those combinations.


Hierarchical regression involves no new techniques. The other two require illustration. You can do stepwise regression in Minitab or SPSS, but SPSS does it better. Setwise regression can only be done in Minitab.

Stepwise regression in SPSS is very straightforward. You just substitute

/method=step

for the /method=enter subcommand we have used before. The output will be long, but its interpretation is fairly obvious. One way of testing whether you are falling into local maximum problems is to follow up by doing the same regression but using

/method=back

This will do backwards stepwise regression, which starts from the model involving all the regressors and then tries to take them out one at a time. If forwards and backwards regression arrive at the same endpoint, it is unlikely to be a local maximum.

SPSS looks for the model with the best *F* value. However, because
it shows you a lot of its intermediate calculations, you can usually identify
the model with the best *R*^{2}_{adj} value, because
at each number of regressors, the model with the best *F* value is
also the model with the best *R*^{2}_{adj} value.
It is only when we are comparing models involving different numbers of
regressors that the two criteria diverge.

SPSS will report all the usual significance tests for the model it identifies
as best. But you should note that choosing a model in this way biases the
procedure in favour of producing large *R*^{2}_{adj}
values, and so undermines the logic of significance testing - so it isn't
very surprising if the best regression model is reported as significant.
Significance testing is only really appropriate when we have specified
exact null and alternative hypotheses in advance. So for the purposes of
statistical inference, any exploratory regression should be followed up
by a test of the chosen regression model on an independent set of data.
Some researchers would argue for doing the exploratory regression on a
randomly chosen half of the data you have collected, and then following
up with a test regression on the other half; others would argue that a
completely independent data set should be collected.
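If you take the split-half route, the split itself is trivial; a minimal Python sketch (again assuming a pandas DataFrame `data` holding all the cases):

```python
# Minimal sketch of the split-half strategy: explore on one random half of the
# cases, then test the chosen model on the other. `data` is an assumed DataFrame.
explore = data.sample(frac=0.5, random_state=1)   # random half for stepwise/setwise exploration
confirm = data.drop(explore.index)                # untouched half, reserved for the test regression
```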

Setwise regression in Minitab is also straightforward. It uses a new command, BREG (standing for Best Regression). Using BREG is easy. If the dependent variable was in C10, and you had 6 independent variables in C1-C6, you would just type

BREG C10 C1-C6

and Minitab would do the rest. Notice that you don't have to tell BREG
how many regressors there are (though it won't matter if you do).

The output comes in a nice compact form, though you will need to think
about it a bit to see how to interpret it. It tells you which regressors
are included in the best two models for each possible number of regressors.
It also tells you their *R*^{2}_{adj} values, so you
can pick out the best model of all in *R*^{2}_{adj}
terms directly. It doesn't tell you their *F* values, so if you want
the best model in *F* terms, you will have to use the REGRESS command
on the best-*R*^{2}_{adj} model for each number of
regressors, and look to see which one has the best *F*.

Even if you are interested in the best-*R*^{2}_{adj}
model, you should proceed to use REGRESS on the set of regressors it identifies,
so you can find out the values of the coefficients and their significance.
You'll also need the *F* value to assess the significance of the model
as a whole, though the same cautionary note applies here as to stepwise
regression.


It should be clear by now that there is no one best statistics package
for all purposes. This means that we often need to move a data set between
packages. In general packages will not read each other's private data files:
so Minitab cannot read SPSS **system files**, and SPSS cannot read Minitab
**worksheet files**, for example. The usual way to move data from one
package to another is via a **text** or **ascii** file.

A text file contains information in a very simple, standard code which
can be interpreted by a wide variety of programs; the code most often used
is called ASCII, but text files don't use the full ASCII character set.
The list of symbols allowed in text files varies a bit between programs,
but you can rely on being allowed the 26 letters of the English alphabet
in both capitals and lower case; the digits 0-9; some but not all punctuation
symbols; and some but not all mathematical symbols. In addition you will
always be allowed the control code ENTER (used to mark ends of lines).
The advantage of text files is that almost any program and almost any computer
can use them. So they are used for moving data between one computer and
another, as well as between one program and another. For example, we
can prepare a data file in a word processor on a Macintosh, output it as
a text file, transfer that to singer, display it on the screen, edit it,
or send it to a printer, and read it into Minitab or SPSS for statistical
work. Note, though, that both SPSS and Minitab can also produce what are
called **portable files**, which can be used for moving between versions
of the same package on different computers: so if you wanted to move data
from a Macintosh Minitab worksheet to singer Minitab, you would use a portable
worksheet file rather than a text file.

To produce a text file from Minitab, use command WRITE, for example

WRITE 'filename' C1-C10

WRITE without a filename writes to the screen so you can see what the layout looks like. The subcommand FORMAT can be used if you want to write more columns than will comfortably fit on a line, though it is not very easy to use unless you know the Fortran programming language. If you don't provide an extension as part of the filename, WRITE will add .DAT to the name.

The command to produce a text file from SPSS is also called WRITE. It must always be followed by the command EXECUTE; forgetting this is a common and very irritating error. The FORMATS command can be used to vary the output format of each column, and numbers can be given to space the columns out. Life will be much easier at the Minitab end if we use these facilities to make sure that all the variables have some blank space between them:

title writing a file to send to Minitab.

get file=tax.sys.

formats index (f3) free1 to law5 (f2).

write outfile=taxasci.DAT / index free1 to law5.

execute.

finish.

Notice that the output file name is specified by outfile=, not file=; and that the / between the outfile name and the list of variables is essential.


**Outliers** are points which lie far from the main distributions
or the main trends of one or more variables. They can be detected by plotting
variables against one another (e.g. using Minitab's PLOT command), or using
commands like FREQUENCIES in SPSS or TALLY in Minitab to examine the distributions
of key variables. Minitab does some checking for you automatically whenever
it does a regression, and reports if it finds "unusual observations",
which usually are outliers.

Serious outliers should be dealt with as follows:

- temporarily remove the observations from the data set (by setting the value of one variable to "**missing**")
- repeat the regression and see whether the same *qualitative* results are obtained (the quantitative results will inevitably be different); a minimal sketch of this check appears after this list
- if the same general results are obtained, we can conclude that the outliers are not distorting the results. Report the results of the *original* regression, adding a note that removal of outliers did not greatly affect them.
- if different general results are obtained, accurate interpretation will require more data to be collected. Report the results of *both* regressions, and note that the interpretation of the data is uncertain. The outliers may represent a subpopulation for which the effects of interest are different from those in the main population; this group will need to be identified, and if possible a reasonably sized sample collected from it so that it can be compared with the main population. This is a scientific rather than a statistical problem.
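As a minimal sketch of that remove-and-refit check in Python (statsmodels is an assumption, as are the DataFrame `data`, the dependent variable 'y' and the regressor list `xvars`):

```python
# Minimal sketch: refit the regression with outlying cases temporarily removed
# and compare the qualitative conclusions. All names here are hypothetical.
import numpy as np
import statsmodels.api as sm

def fit(df):
    X = sm.add_constant(df[xvars])
    return sm.OLS(df['y'], X).fit()

full = fit(data)
flagged = np.abs(full.resid / np.sqrt(full.mse_resid)) > 2.5  # crude standardized-residual rule
trimmed = fit(data[~flagged])                                 # temporarily drop the flagged cases

print(full.params)     # do the two sets of coefficients tell the
print(trimmed.params)  # same qualitative story?
```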

**Multicollinearity** refers to the situation where one or more of the independent variables can be predicted almost exactly from the remainder of the set. In this case, the independent variable set is obviously redundant in some sense.
If the independent variables are **multicollinear**, the regression
coefficients we calculate will be very unstable - they will vary markedly
from sample to sample - so it will be difficult to decide correctly which
are the important regressors.

If multicollinearity is extreme, Minitab will refuse to carry out the analysis, but this will only happen in situations far beyond the point at which we would be wise to drop some variables from the set. SPSS allows us to assess the degree of multicollinearity in the sample, by requesting a measure of **tolerance** in the statistics subcommand:

/statistics=defaults tol

This gives us 1-*R*^{2} for the regression of each independent
variable on all the others. If tolerance is low (below 0.1, say) for any
independent variable, it should be regarded as a problem.
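To make the definition concrete, here is a minimal sketch of computing tolerance by hand in Python (the DataFrame `X` of independent variables is an assumed input, and statsmodels an assumed library, as in the earlier sketches):

```python
# Minimal sketch: tolerance = 1 - R^2 from regressing each independent variable
# on all the others. `X` is an assumed pandas DataFrame of regressors.
import statsmodels.api as sm

for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    tolerance = 1 - sm.OLS(X[col], others).fit().rsquared
    print(col, tolerance)   # values below about 0.1 should be regarded as a problem
```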

Another approach to detecting multicollinearity is to run a **principal
components analysis** on the independent variables (first transforming
them to **z-scores**). If many of the **eigenvalues** are below 1.0,
the variable set is showing serious multicollinearity.
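Since a principal components analysis of z-scored variables works on their correlation matrix, a minimal sketch of this check is just the eigenvalues of that matrix (again assuming the DataFrame `X` of regressors used above):

```python
# Minimal sketch: eigenvalues of the correlation matrix of the regressors.
# Many eigenvalues below 1.0 indicate serious multicollinearity.
import numpy as np

eigenvalues = np.linalg.eigvalsh(X.corr().values)
print(np.sort(eigenvalues))
```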

For further discussion of multicollinearity, see Tabachnick & Fidell (1989), p. 130.

**Heteroscedasticity** is another long word (it means "different variabilities"). Regression assumes that the scatter of the points about the regression line is the same for all values of each independent variable. Quite often, the spread will increase steadily as one of the independent variables increases, so we get a fan-like scattergram if we plot the dependent variable against that independent variable. Another way of detecting heteroscedasticity (and also outlier problems) is to plot the **residuals** against the **fitted values** of the dependent variable; a **monotonic** relationship between the spread of the residuals and the fitted values (which is what produces the fan-like pattern in the original scattergram) suggests we may have problems with the data. We may be able to deal with this by transforming one or more variables.
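A minimal sketch of this diagnostic plot in Python, assuming the fitted results object `full` from the outlier sketch above and matplotlib for the plotting:

```python
# Minimal sketch: residuals against fitted values. A fan shape (spread growing
# with the fitted value) suggests heteroscedasticity; isolated extreme points
# suggest outliers.
import matplotlib.pyplot as plt

plt.scatter(full.fittedvalues, full.resid)
plt.axhline(0, linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```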

For further discussion of heteroscedasticity, see Tabachnick and Fidell (1989), pp. 131-133.

Obviously if you have as many independent variables as you have data points, you can predict the dependent variable value perfectly - but you are not explaining it at all. Economical explanation requires fewer independent variables than cases, but how many fewer? Various rules of thumb have been suggested, and obviously much depends on the amount of noise in the data, the nature of the phenomena being investigated, and the hypotheses being tested. In favourable circumstances (well-defined hypotheses, clean data), five times as many cases as regressors might be enough; even under bad circumstances, a 20:1 ratio ought to be adequate. If in doubt, collect a second sample of data and see whether the results replicate - good scientific advice regardless of the statistical method in use. "Cases" here means fully usable cases, i.e. ones with no missing values on any variable; apparently adequate samples are easily reduced to uselessness by a wide scatter of missing values. For further discussion see Tabachnick and Fidell (1989), pp. 128-192.


The data used in examples 1-3 are part of those collected in a questionnaire
study of neighbourly help (see Webley
& Lea 1993, *Human Relations* 46, 65-76). After the examples you
will find an extract from the questionnaire. People living in different
districts were sent different coloured questionnaires, so we knew when
the forms came back where they had come from. The corresponding data (including
a code for which of 4 districts people came from) are stored in the Singer
file **/singer1/eps/psybin/stats/neighbor.MTW**

- All the variables in this study are either categorical or ordinal. Which are which?
- Use multiple regression, including dummy variables where appropriate, to find out how people's ratings of the neighbourliness of the area where they live are related to all the other variables.
- With all the other variables taken into account, are the following variables significantly associated with rated neighbourliness?
  - age group
  - number of people known by name
  - the district where people now live
- Move the data to SPSS, and repeat the analysis of question (b) using stepwise regression. Do you get the same answers as before?
- Create an ASCII file from the tax data set used in last week's examples
- Read this file into Minitab, label the columns appropriately, and carry out a setwise regression to see how the index of tax avoidance can best be predicted from the other 15 variables.
- Choose the best regression model containing four variables, and carry out a standard regression using just these four variables. Put the residuals and fits from this regression into two new columns (Minitab will do this if you provide the column names or numbers at the end of the regression command, after the four regressors). Plot residuals against fits and look for problems.

NEIGHBOURLINESS SURVEY (Extract, reformatted)

About how long have you lived where you do now?

Less than 6 months / 6-12 months / 1-3 years / 3-10 years / Over 10 years

Where were you living before you moved to your present house?

In the same neighbourhood / Elsewhere in Exeter / Elsewhere in Devon / Elsewhere in Britain / Abroad

How neighbourly do you think the area where you now live is?

Very unfriendly / Not very friendly / About average / Fairly friendly / Very friendly

Roughly how many people in your street, or in the streets just near you, do you know the names of?

None / 1-5 / 6-20 / More than 20

How many of those people (not counting children) would you call by their first names?

None / 1-5 / 6-20 / More than 20

Your sex:

Male / Female

Your age:

Under 18 / 18-30 / 31-50 / 51-65 / Over 65


Stephen Lea

University of Exeter Department of Psychology

Washington Singer Laboratories

Exeter EX4 4QG

United Kingdom

Tel +44 1392 264626

Fax +44 1392 264623
