# Topic 2: Multiple regression: More advanced issues

Contents of this handout: Categorical independent variables: background and procedures; Exploratory regression analysis, and procedures for carrying it out; moving data between packages; further problems with multiple regression (outliers, multicollinearity, heteroscedasticity, ratio of number of cases to number of regressors); Examples

### Categorical independent variables: background

The most natural use of multiple regression is when all the variables concerned are continuous. It has been shown by "Monte Carlo" methods (i.e. trying it out with random numbers) that it is not badly affected if either independent or dependent variables are only ordinally measured (e.g. rating scales), so long as the number of categories is not too small. What happens, though, when we have to deal with nominal measurement, i.e. where the numbers we have are merely labels for categories? If the dependent variable is categorical, we can't use multiple regression at all. However, unordered categories as independent variables can be accommodated.

When an independent variable consists of just two categories, e.g. male vs female, there is no problem; we have already seen how to include such dichotomous variables in a regression analysis. Where we have three or more categories in a variable, however, it would clearly be a serious error just to include such a variable in a regression analysis directly. Instead we have to use what are called dummy variables.
Suppose that the variable includes m categories (m>2). What we have to do is to break down this multi-way category into m 2-way categories, each indicating whether or not the observation belongs to a particular one of the m original categories. So suppose a variable called 'faculty' contains codes 1-6 indicating which of the 6 faculties an Exeter student belongs to, assigning the codes in alphabetical order of faculty name. We have to create 6 new variables, one for each faculty, which will indicate whether the student belongs to that faculty. So the values of the variables for the first few cases might look like this:

```
faculty     arts educatio engineer      law  science socstuds
      2        0        1        0        0        0        0
      3        0        0        1        0        0        0
      1        1        0        0        0        0        0
      1        1        0        0        0        0        0
      4        0        0        0        1        0        0
      6        0        0        0        0        0        1
```

The new variables are called dummy variables. In effect we have replaced the one 6-way categorical variable 'faculty' with six dichotomous variables, 'arts' to 'socstuds'. We have already seen that we can include dichotomous variables in a regression analysis, so our problem is essentially solved. There are however a couple of loose ends to tie up before we consider how we can make the transformation in practice.
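For readers working in a modern environment, the same dummy-coding step is easy to sketch in Python; this is not what Minitab or SPSS do internally, just a minimal illustration using the hypothetical faculty codes from the table above:

```python
import numpy as np

# Hypothetical faculty codes for six students (1 = arts ... 6 = socstuds)
faculty = np.array([2, 3, 1, 1, 4, 6])
names = ["arts", "educatio", "engineer", "law", "science", "socstuds"]

# One dummy variable per category: 1 if the case belongs to it, else 0
dummies = {name: (faculty == code).astype(int)
           for code, name in enumerate(names, start=1)}

print(dummies["educatio"])  # first case is in faculty 2 -> [1 0 0 0 0 0]
```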

• If we include all the dummy variables derived from a single categorical independent variable in an analysis, our statistics package would protest that it has detected multicollinearity, i.e. it has found that it can predict 'socstuds' from the variables 'arts' to 'science': if a student isn't a member of any other faculty, s/he must be a member of the last one to be tried. (We shall hear more about multicollinearity later in the handout.) To prevent this, we must eliminate one of the categories from the analysis, and this means we must choose which one to eliminate. There are two possible grounds for choice. Both depend on the fact that when we put the remaining categories into the regression, the question we are really asking is whether members of those categories differ from the category we eliminated. So, if there is one category which is in some sense a control or "normal" condition, this would be the one to eliminate. If there isn't such a control condition, we usually eliminate the modal category, i.e. the one to which most observations belong. If there is no control condition and equal numbers in each category, then just eliminate one category at random. However we choose, the R2 and F values will not be affected by our choice of which category to eliminate.
• Logically, before asking whether each of the categories differs from the control condition, we should ask whether the categorical variable as a whole has a significant effect. This is analogous to looking at the overall F for an effect in analysis of variance before enquiring into contrasts, and in fact we use an F test in this case, too. SPSS offers an easy way of carrying out the necessary test. There is a way of doing it in Minitab, but it is not easy and so far as I know not documented at all.

### Categorical independent variables: procedures

#### Forming dummy variables

Going from a categorical variable to the corresponding set of dummy variables is easy in Minitab: there is a command, INDICATOR, which simply takes a column and produces dummy variables for us. So if we have a 6-way categorical variable stored in C1, we can simply type

MTB> INDICATOR C1 C11-C16

We shall have to give the new variables the appropriate names by hand. Things are not quite so easy in SPSS; here we have to write a short loop in our command file, for example:

do repeat x=arts educatio engineer law science socstuds / i=1 to 6.
compute x=0.
if (faculty=i) x=1.
if missing(faculty) x=9.
end repeat.
missing values arts to socstuds (9).

Note that we have to take explicit action to look after missing values.

#### Identifying the modal category

This is straightforward in either Minitab (use the TALLY command, or TABLE with a single argument) or SPSS (use FREQUENCIES).

#### Including the categorical variable in the regression, and testing for its overall effect

In either SPSS or Minitab, the dummy variables can simply be entered in the regression command just like any other variable. Note, though, that the original categorical variable should never appear in the regression command.

Testing for the overall effect of the categorical variable is done by entering the set of dummy variables (excluding, of course, the control or modal category) after having entered the remaining independent variables. Suppose we are assessing the effect of A-level score (in points), gender, parental income and faculty on undergraduates' rated satisfaction with their courses at Exeter. Then we might have an SPSS command file like the following (assuming that the dummy variables have been formed up before we saved the system file, and that Arts is the modal faculty):

title satisfaction regression.
get file=satrates.sys.
regression
variables=ALevels,gender,parincom,satirate,arts to socstuds
/statistics=defaults
/dependent=satirate
/method=enter ALevels,gender,parincom
/method=enter educatio to socstuds.
finish.

SPSS will give us an F value for the effects of the group of variables added on each ENTER command, which is exactly what we want.

The principle is the same in Minitab, but it requires more work. We have to be careful to specify the dummy variables last in the list of regressors for the REGRESS command. Then we look at the column headed SEQ SS that comes under the anova table in multiple regression output, and add up the entries for all the m-1 categories. The total can be described as the sum of squares for the categorical variable as a whole. Divide this by m-1, and we shall have the mean square for the categorical variable. This in turn can be divided by the error mean square from the anova table, to give an F statistic which will allow us to test the significance of the entire categorical variable. Remember that F always has two values of degrees of freedom associated with it: in this case, the numerator degrees of freedom are m-1 and the denominator degrees of freedom are the error degrees of freedom from the anova table.

Minitab provides the LET command, which allows us to do simple arithmetic, and constants (called K1, K2 etc to distinguish them from columns) in which we can store the results. This means we can get Minitab to do the arithmetic, by typing lines like

LET K1=((269.0+38.7+9.3+236.6+155.3)/5)/193.5
PRINT K1

We can then use another Minitab command, CDF (which stands for Cumulative Distribution Function) to look up the significance of the F value we obtain (though, irritatingly, CDF reports the complement of a significance level, i.e. 1-p rather than p). Use Minitab's HELP facility to find out how to use CDF.
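The arithmetic of this F test is easy to reproduce in any language. As a check on the worked figures above, here is a Python fragment using the same SEQ SS entries and error mean square (in a real analysis these numbers would come from your own anova table):

```python
# SEQ SS entries for the five dummy variables, and the error mean square,
# taken from the worked Minitab example above
seq_ss = [269.0, 38.7, 9.3, 236.6, 155.3]
ms_error = 193.5
df_num = len(seq_ss)                 # m - 1 = 5

F = (sum(seq_ss) / df_num) / ms_error
print(round(F, 3))                   # -> 0.733, matching the LET K1 line
```

The resulting F is then referred to the F distribution with m-1 and error degrees of freedom, just as with Minitab's CDF.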

### Exploratory regression analysis: background

Up to now, we have assumed that we have a fixed set of independent variables, whose effect on the dependent variable we want to describe and subject to significance tests. However, quite often we have a large set of possible independent variables, and we are interested in finding which subset of them is most useful for predicting the dependent variable. This question divides into two parts: deciding exactly what "most useful" means in this context, and having operationalised that into a precise criterion, finding the set of regressors which best meets this criterion.

#### What is the best regression model?

A subset of the possible regressors is usually referred to as a "regression model", because in effect choosing a subset of regressors is specifying a hypothesis, or model, of the variables that are associated with the dependent variable.

If we are choosing between models which all involve the same number of regressors, it is easy to say which is the best: it is the model with the highest value of R2adj. However, more usually, we are choosing between models with different numbers of regressors - for example, we want to know whether it is worth adding an additional regressor or group of regressors to a model which already does a reasonable job at explaining the variation in the data. Here things become more complicated, because, as we already know, adding additional variables means that we are bound to account for more variance, but obviously makes for a less economical model - the issue is whether the gain in prediction is worth the loss in economy. Two criteria are commonly used to assess this issue. We could choose the model with the highest value of R2adj, or we could choose the model with the highest value of F. There are arguments in favour of either policy. The model with the highest R2adj will almost always have more regressors in it than the model with the highest F. Therefore, a reasonable rule is: If you want the most complete description of the data, which is also reasonably efficient, choose the model with the highest R2adj value. If you want the most efficient description, which is also reasonably complete, choose the model with the highest F. Different research projects will call for different strategies.
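The two criteria are simple to compute from R2, the number of cases n, and the number of regressors k. This Python sketch (with made-up R2 values and sample size) shows the formulas, and how the criteria can disagree:

```python
def adj_r2_and_F(r2, n, k):
    """Adjusted R-squared and overall F for a model with k regressors and n cases."""
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    F = (r2 / k) / ((1 - r2) / (n - k - 1))
    return r2_adj, F

# Hypothetical comparison: does a third regressor earn its keep? (n = 50 cases)
print(adj_r2_and_F(0.40, 50, 2))   # smaller model
print(adj_r2_and_F(0.42, 50, 3))   # larger model
```

Here the three-regressor model has the higher R2adj, but the two-regressor model has the higher F: exactly the divergence between the "complete" and "efficient" criteria described above.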

#### Finding the best model

For purely exploratory analyses, we find the best model by either stepwise or setwise regression. A scenario intermediate between strict hypothesis testing and exploration is that where we know we must consider certain independent variables, but we want to know whether other variables have any effect over and above these "compulsory" regressors. In this case we will use some form of hierarchical regression. Hierarchical and stepwise regression are available in all packages, and are widely used. Setwise regression is in principle more satisfactory, but requires much more computing power, and has only recently come on the scene. It is still not available in SPSS; it is available in Minitab, but restricted to sets of 20 or fewer independent variables.

More formally, what the three procedures do is the following:

• In hierarchical regression, we specify the order in which the independent variables are to be entered into the model. We can look at the effect that each new variable has, over and above the set previously used. Variables may be entered and considered in groups. The order in which variables are entered should be derived from the theory underlying the study. For example, in a study on the causation of debt, which is manifestly an economic phenomenon but less certainly a psychological one, we entered a group of economic variables, then a group of demographic variables, and only after these a group of psychological variables.
• In stepwise regression, at each stage the variable that has the highest partial correlation with the dependent variable (after taking all variables currently in the model into account) is added to the model. Variables are only added if they increase the F value for the regression by some specified amount. Variables can also be removed, if they reduce F by another specified threshold amount. The aim is to find the set of independent variables which maximises F. However, the procedure used, like any hill-climbing algorithm, is vulnerable to local maxima of F.
• In setwise regression, all possible combinations of the set of regressors are tried, starting with all the 1-regressor models, then trying all the 2-regressor models, and so on up to the size of the group. With n regressors, there are 2^n possible models, so the time required for this gets very large indeed. However, it is sure to find the best possible model. Minitab reports the best model for each number of regressors, so you can choose either the model with the highest F value or the one with the highest R2adj value.
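The setwise search itself is easy to sketch in Python, though with none of BREG's efficiency. The data here are simulated, with only columns 0 and 2 genuinely related to y:

```python
import numpy as np
from itertools import combinations

def fit_r2(X, y):
    """R-squared for OLS regression of y on X (with intercept)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def best_subsets(X, y):
    """For each subset size, the regressor subset with the highest adjusted R2."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            r2 = fit_r2(X[:, cols], y)
            r2a = 1 - (1 - r2) * (n - 1) / (n - k - 1)
            if k not in best or r2a > best[k][1]:
                best[k] = (cols, r2a)
    return best

# Simulated data: y depends on columns 0 and 2 only
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 0] + X[:, 2] + rng.normal(size=100)
for k, (cols, r2a) in best_subsets(X, y).items():
    print(k, cols, round(r2a, 3))
```

Like BREG's output, this reports the best model at each size, leaving you to choose between sizes by the R2adj or F criterion.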

#### The identification problem

If there were no correlation between the regressors, there would never be a problem in deciding which of two regressors was the more important, and therefore in settling on a single best regression model. Usually, though, we are using multiple regression precisely because regressors are correlated, and this leaves us with problems of interpretation. Two models using different sets of regressors can have very similar R2adj and/or F values. Which is really the best model?

Ultimately, if two independent variables are closely correlated (or one is well predicted by two others), no statistical procedure will enable us to say which predicts a dependent variable better - both will do an equally good job. This is called the identification problem. It is a scientific rather than a statistical problem, and it can only be resolved by collecting more data. Usually we will have to change our sampling procedure so as to break down the offending correlation. Suppose we find, in a study of absenteeism among women workers, that parents of pre-school children and part-time workers have high absence rates, but we cannot tell which is the crucial variable, because most of our part-timers have pre-school children and vice versa. We shall have to go out of our way to collect some data from childless part-timers or full-timers who have pre-school children, by oversampling those combinations.

### Exploratory regression: Procedures

Hierarchical regression involves no new techniques. The other two require illustration. You can do stepwise regression in Minitab or SPSS, but SPSS does it better. Setwise regression can only be done in Minitab.

#### Stepwise regression using SPSS

This is very straightforward. You just substitute

/method=step

for the /method=enter subcommand we have used before. The output will be long, but its interpretation is fairly obvious. One way of testing whether you are falling into local maximum problems is to follow up by doing the same regression but using

/method=back

This will do backwards stepwise regression, which starts from the model involving all the regressors and then tries to take them out one at a time. If forwards and backwards regression arrive at the same endpoint, it is unlikely to be a local maximum.
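The forward half of the procedure can be sketched in Python. This is a simplified stand-in for what SPSS actually does: it uses gain in R2 as the entry criterion rather than an F-to-enter threshold, and the data are simulated:

```python
import numpy as np

def r2(X, y):
    """R-squared for OLS regression of y on X (with intercept)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(X, y, min_gain=0.01):
    """Greedy forward selection: at each step add the regressor giving the
    biggest R2 gain, stopping when no candidate gains at least min_gain.
    Like any hill-climbing algorithm, it can stop at a local maximum."""
    chosen, current = [], 0.0
    while True:
        gains = {j: r2(X[:, chosen + [j]], y) - current
                 for j in range(X.shape[1]) if j not in chosen}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        chosen.append(best)
        current += gains[best]
    return chosen

# Simulated data: only columns 1 and 3 are related to y
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X[:, 1] + 0.5 * X[:, 3] + rng.normal(size=200)
print(forward_stepwise(X, y))
```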

SPSS looks for the model with the best F value. However, because it shows you a lot of its intermediate calculations, you can usually identify the model with the best R2adj value, because at each number of regressors, the model with the best F value is also the model with the best R2adj value. It is only when we are comparing models involving different numbers of regressors that the two criteria diverge.

SPSS will report all the usual significance tests for the model it identifies as best. But you should note that choosing a model in this way biases the procedure in favour of producing large R2adj values, and so undermines the logic of significance testing - so it isn't very surprising if the best regression model is reported as significant. Significance testing is only really appropriate when we have specified exact null and alternative hypotheses in advance. So for the purposes of statistical inference, any exploratory regression should be followed up by a test of the chosen regression model on an independent set of data. Some researchers would argue for doing the exploratory regression on a randomly chosen half of the data you have collected, and then following up with a test regression on the other half; others would argue that a completely independent data set should be collected.
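The bias described above is easy to demonstrate by simulation. In this Python sketch the dependent variable is pure noise, yet the in-sample fit of a many-regressor model looks impressive until it is checked on the held-out half of the data:

```python
import numpy as np

def r2_on(X, y, beta):
    """R-squared of y given an already-fitted coefficient vector (intercept first)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    resid = y - Xc @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Simulated data: y is unrelated to any of the 20 candidate regressors
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 20))
y = rng.normal(size=60)

half = np.arange(60) < 30
Xa, ya = X[half], y[half]      # "exploration" half
Xb, yb = X[~half], y[~half]    # "confirmation" half

# Explore on half A: fit all 20 regressors at once
Xc = np.column_stack([np.ones(30), Xa])
beta, *_ = np.linalg.lstsq(Xc, ya, rcond=None)

r2_in = r2_on(Xa, ya, beta)    # flatteringly high, despite y being noise
r2_out = r2_on(Xb, yb, beta)   # collapses (often below zero) on fresh data
print(round(r2_in, 2), round(r2_out, 2))
```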

#### Setwise regression using Minitab

This is also straightforward. It uses a new command, BREG (standing for Best Regression). Using BREG is easy. If the dependent variable was in C10, and you had 6 independent variables in C1-C6, you would just type

BREG C10 C1-C6

and Minitab would do the rest. Notice that you don't have to tell BREG how many regressors there are (though it won't matter if you do).

The output comes in a nice compact form, though you will need to think about it a bit to see how to interpret it. It tells you which regressors are included in the best two models for each possible number of regressors. It also tells you their R2adj values, so you can pick out the best model of all in R2adj terms directly. It doesn't tell you their F values, so if you want the best model in F terms, you will have to use the REGRESS command on the best-R2adj model for each number of regressors, and look to see which one has the best F.

Even if you are interested in the best-R2adj model, you should proceed to use REGRESS on the set of regressors it identifies, so you can find out the values of the coefficients and their significance. You'll also need the F value to assess the significance of the model as a whole, though the same cautionary note applies here as to stepwise regression.

### Moving data between packages

It should be clear by now that there is no one best statistics package for all purposes. This means that we often need to move a data set between packages. In general packages will not read each other's private data files: so Minitab cannot read SPSS system files, and SPSS cannot read Minitab worksheet files, for example. The usual way to move data from one package to another is via a text or ascii file.

A text file contains information in a very simple, standard code which can be interpreted by a wide variety of programs; the code most often used is called ASCII, but text files don't use the full list of 256 ASCII codes. The list of symbols allowed in text files varies a bit between programs, but you can rely on being allowed the 26 letters of the English alphabet in both capitals and lower case; the digits 0-9; some but not all punctuation symbols; and some but not all mathematical symbols. In addition you will always be allowed the control code ENTER (used to mark ends of lines). The advantage of text files is that almost any program and almost any computer can use them. So they are used for moving data between one computer and another, as well as between one program and another. For example, we can prepare a data file in a word processor on a Macintosh, output it as a text file, transfer that to singer, display it on the screen, edit it, or send it to a printer, and read it into Minitab or SPSS for statistical work. Note, though, that both SPSS and Minitab can also produce what are called portable files, which can be used for moving between versions of the same package on different computers: so if you wanted to move data from a Macintosh Minitab worksheet to singer Minitab, you would use a portable worksheet file rather than a text file.

To produce a text file from Minitab, use command WRITE, for example

WRITE 'filename' C1-C10

WRITE without a filename writes to the screen so you can see what the layout looks like. The subcommand FORMAT can be used if you want to write more columns than will comfortably fit on a line, though it is not very easy to use unless you know the Fortran programming language. If you don't provide an extension as part of the filename, WRITE will add .DAT to the name.

The command to produce a text file from SPSS is also called WRITE. It must always be followed by the command EXECUTE; forgetting this is a common and very irritating error. The FORMATS command can be used to vary the output format of each column, and numbers can be given to space the columns out. Life will be much easier at the Minitab end if we use these facilities to make sure that all the variables have some blank space between them:

title writing a file to send to Minitab.
get file=tax.sys.
formats index (f3) free1 to law5 (f2).
write outfile=taxasci.DAT / index free1 to law5.
execute.
finish.

Notice that the output file name is specified by outfile=, not file=; and that the / between the outfile name and the list of variables is essential.
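The same move can be made from most modern environments. This Python sketch (the file name is hypothetical) plays the role that WRITE and READ play in the packages above: dump a data matrix to a plain text file, then read it back:

```python
import tempfile
from pathlib import Path
import numpy as np

# Hypothetical small data matrix: an index column and two variables
data = np.array([[101, 3.2, 1],
                 [102, 2.8, 0],
                 [103, 3.9, 1]])

# Fixed-width, blank-separated columns, as recommended above for easy re-reading
path = Path(tempfile.gettempdir()) / "taxasci.dat"
np.savetxt(path, data, fmt="%8.2f")

back = np.loadtxt(path)
print(np.allclose(data, back))   # prints True: nothing lost in transit
```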

### Some further problems with regression

#### Outliers

Outliers are points which lie far from the main distributions or the main trends of one or more variables. They can be detected by plotting variables against one another (e.g. using Minitab's PLOT command), or using commands like FREQUENCIES in SPSS or TALLY in Minitab to examine the distributions of key variables. Minitab does some checking for you automatically whenever it does a regression, and reports if it finds "unusual observations", which usually are outliers.

Serious outliers should be dealt with as follows:

• temporarily remove the observations from the data set (by setting the value of one variable to "missing")
• repeat the regression and see whether the same qualitative results are obtained (the quantitative results will inevitably be different).
• if the same general results are obtained, we can conclude that the outliers are not distorting the results. Report the results of the original regression, adding a note that removal of outliers did not greatly affect them.
• if different general results are obtained, accurate interpretation will require more data to be collected. Report the results of both regressions, and note that the interpretation of the data is uncertain. The outliers may represent a subpopulation for which the effects of interest are different from those in the main population; this group will need to be identified, and if possible a reasonably sized sample collected from it so that it can be compared with the main population. This is a scientific rather than a statistical problem.
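The remove-and-refit check can be sketched in Python. The data are simulated with one planted outlier, and the 3-standard-deviation flagging rule is just a hypothetical stand-in for inspecting plots or Minitab's "unusual observations" report:

```python
import numpy as np

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    Xc = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta[1]

# Simulated data with true slope 2, plus one wildly outlying case
rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 2 * x + 0.5 * rng.normal(size=50)
y[0] = 25.0

# Crude flag (hypothetical rule): far from the median in y
keep = np.abs(y - np.median(y)) < 3 * y.std()

print(round(slope(x, y), 2))            # with the outlier included
print(round(slope(x[keep], y[keep]), 2))  # outlier removed: close to 2
```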

#### Multicollinearity

This refers to the situation where one or more of the independent variables can be predicted almost exactly from the remainder of the set. In this case, the independent variable set is obviously redundant in some sense. If the independent variables are multicollinear, the regression coefficients we calculate will be very unstable - they will vary markedly from sample to sample - so it will be difficult to decide correctly which are the important regressors.

If multicollinearity is extreme, Minitab will refuse to carry out the analysis, but this only happens in situations far beyond the point where we would be wise to drop some variables from the set. SPSS allows us to assess the degree of multicollinearity in the sample, by requesting a measure of tolerance in the statistics subcommand:

/statistics default tol

This gives us 1-R2 for the regression of each independent variable on all the others. If tolerance is low (below 0.1, say) for any independent variable, it should be regarded as a problem.
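Tolerance is straightforward to compute from first principles. This Python sketch mirrors what SPSS reports, using simulated data in which column 2 nearly duplicates column 0:

```python
import numpy as np

def tolerances(X):
    """Tolerance (1 - R2) of each independent variable regressed on all the others."""
    tol = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        Xc = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(Xc, X[:, j], rcond=None)
        resid = X[:, j] - Xc @ beta
        tss = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        tol.append(resid @ resid / tss)   # RSS/TSS = 1 - R2
    return np.array(tol)

# Simulated regressors: column 2 is nearly a copy of column 0
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=100)
print(tolerances(X).round(3))   # columns 0 and 2 fall well below the 0.1 danger line
```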

Another approach to detecting multicollinearity is to run a principal components analysis on the independent variables (first transforming them to z-scores). If many of the eigenvalues are below 1.0, the variable set is showing serious multicollinearity.
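The eigenvalue check can likewise be sketched in Python (again with a simulated near-duplicate regressor); a near-zero eigenvalue marks a direction in which the regressor set carries almost no independent information:

```python
import numpy as np

# Simulated regressors: column 2 is nearly a copy of column 0
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=100)

Z = (X - X.mean(axis=0)) / X.std(axis=0)        # z-scores
eigvals = np.linalg.eigvalsh(Z.T @ Z / len(X))  # eigenvalues of correlation matrix
print(eigvals.round(3))   # one eigenvalue near zero: serious multicollinearity
```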

For further discussion of multicollinearity, see Tabachnick & Fidell (1989), p. 130.

#### Heteroscedasticity

Another long word (it means "different variabilities"). Regression assumes that the scatter of the points about the regression line is the same for all values of each independent variable. Quite often, the spread will increase steadily as one of the independent variables increases, so we get a fan-like scattergram if we plot the dependent variable against that independent variable. Another way of detecting heteroscedasticity (and also outlier problems) is to plot the residuals against the fitted values of the dependent variable; a systematic relationship between the spread of the residuals and the fitted values (which is what produces the fan-like pattern in the original scattergram) suggests we may have problems with the data. We may be able to deal with this by transforming one or more variables.
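A crude numerical version of the fan check can be sketched in Python, with simulated data whose noise grows with x; in practice the plot itself is more informative than any single number:

```python
import numpy as np

# Simulated fan-shaped data: noise grows in proportion to x
rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=200)
y = 3 * x + rng.normal(size=200) * x

# Fit the regression and form residuals and fitted values
Xc = np.column_stack([np.ones(200), x])
beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
fits = Xc @ beta
resid = y - fits

# Compare residual spread in the lower and upper halves of the fitted values
lo = np.abs(resid[fits < np.median(fits)]).mean()
hi = np.abs(resid[fits >= np.median(fits)]).mean()
print(round(hi / lo, 2))   # well above 1: the fan pattern
```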

For further discussion of heteroscedasticity, see Tabachnick and Fidell (1989), pp. 131-133.

#### Ratio of cases to independent variables

Obviously if you have as many independent variables as you have data points, you can predict the dependent variable value perfectly - but you are not explaining it at all. Economical explanation requires fewer independent variables than cases, but how many fewer? Various rules of thumb have been suggested, and obviously much depends on the amount of noise in the data, the nature of the phenomena being investigated, and the hypotheses being tested. In favourable circumstances (well-defined hypotheses, clean data), five times as many cases as regressors might be enough; even under bad circumstances, a 20:1 ratio ought to be adequate. If in doubt, collect a second sample of data and see whether the results replicate - good scientific advice regardless of the statistical method in use. "Cases" here means fully usable cases, i.e. ones with no missing values on any variable; apparently adequate samples are easily reduced to uselessness by a wide scatter of missing values. For further discussion see Tabachnick and Fidell (1989), pp. 128-192.

### Examples

The data used in examples 1-3 are part of those collected in a questionnaire study of neighbourly help (see Webley & Lea 1993, Human Relations 46, 65-76). After the examples you will find an extract from the questionnaire. People living in different districts were sent different coloured questionnaires, so we knew when the forms came back where they had come from. The corresponding data (including a code for which of 4 districts people came from) are stored in the Singer file /singer1/eps/psybin/stats/neighbor.MTW

1. All the variables in this study are either categorical or ordinal. Which are which?
2. Use multiple regression, including dummy variables where appropriate, to find out how people's ratings of the neighbourliness of the area where they live are related to all the other variables.
3. With all the other variables taken into account, are the following variables significantly associated with rated neighbourliness?
• age group:
• number of people known by name
• the district where people now live
4. Move the data to SPSS, and repeat the analysis of question 2 using stepwise regression. Do you get the same answers as before?
5. Create an ASCII file from the tax data set used in last week's examples
6. Read this file into Minitab, label the columns appropriately, and carry out a setwise regression to see how the index of tax avoidance can best be predicted from the other 15 variables.
7. Choose the best regression model containing four variables, and carry out a standard regression using just these four variables. Put the residuals and fits from this regression into two new columns (Minitab will do this if you provide the column names or numbers at the end of the regression command, after the four regressors). Plot residuals against fits and look for problems.

NEIGHBOURLINESS SURVEY (Extract, reformatted)

About how long have you lived where you do now?

Less than 6 months / 6-12 months / 1-3 years / 3-10 years / Over 10 years

Where were you living before you moved to your present house?

In the same neighbourhood / Elsewhere in Exeter / Elsewhere in Devon / Elsewhere in Britain / Abroad

How neighbourly do you think the area where you now live is?:

Very unfriendly / Not very friendly / About average / Fairly friendly / Very friendly

Roughly how many people in your street, or in the streets just near you, do you know the names of?

None / 1-5 / 6-20 / More than 20

How many of those people (not counting children) would you call by their first names?

None / 1-5 / 6-20 / More than 20

Male / Female

Under 18 / 18-30 / 31-50 / 51-65 / Over 65

Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623