Contents of this handout: The categorical variables problem; Constructing and using dummy variables; Testing the significance of a categorical variable; Doing arithmetic in Minitab; The significance of individual categories

The most natural use of multiple regression is when all the variables concerned are continuous. What happens, though, when the numbers we have are just labels for categories?

It has been shown by "Monte Carlo" methods (i.e. trying it
out with random numbers) that multiple regression is not badly affected if the categories
are in some way *ordered*, i.e. if the variables are measured on an
**ordinal scale**. This is very common in psychology, e.g. when we use
**rating scales**. What happens, though, when we have to deal with
**nominal** measurement, i.e. where the numbers we have are labels for
categories which have no natural ordering? This is called the **categorical
variables** problem.

If the unordered categorical variable is the *dependent* variable,
we cannot use multiple regression. We need to use a different though related
technique, usually **logistic regression** for 2-way categories and
**discriminant analysis** for *m*-way categories where *m*>2.
These are beyond the scope of this course, though if you need to use them
you should be able to learn how - try the notes for course PSY6003.

However, where the unordered categorical variable is one of the independent
variables, there are ways to include it in a multiple regression. When
there are only two categories, e.g. male vs female, there is no problem;
we have already seen how to include such **dichotomous** variables in
a regression analysis. Where we have three or more categories in a variable,
however, there is a problem. It would clearly be a serious error just to
include such a variable directly in a regression analysis (make sure you
can see why). We have to use what are called **dummy variables**.

Suppose that the variable includes *m* categories (*m*>2).
What we have to do is to break this multi-way variable down into *m* 2-way
categories, each indicating whether or not the observation belongs to a
particular one of the *m* original categories. So suppose a column
C2 contains codes 1 to 6 indicating which of the 6 faculties an Exeter
student belongs to. We have to create 6 new columns (C11 to C16, say),
one for each faculty, which will indicate whether or not the student belongs
to that faculty. So the first few rows of the columns might look like this:

C2       C11   C12      C13     C14  C15    C16
faculty  arts  science  soc.st  law  engin  educ
   2     0     1        0       0    0      0
   3     0     0        1       0    0      0
   1     1     0        0       0    0      0
   1     1     0        0       0    0      0
   4     0     0        0       1    0      0
   6     0     0        0       0    0      1
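The coding step can be sketched in ordinary Python (a hypothetical illustration of how dummy coding works, not the Minitab command itself); the faculty codes and column order are taken from the table above:

```python
# Dummy-code a column of faculty labels (codes 1-6) into six 0/1 columns.
# The code order and faculty names follow the table above.
faculty_names = ["arts", "science", "soc.st", "law", "engin", "educ"]
c2 = [2, 3, 1, 1, 4, 6]  # first few rows of the faculty column

dummies = {name: [1 if code == k else 0 for code in c2]
           for k, name in enumerate(faculty_names, start=1)}

# Every observation gets exactly one 1 across the six dummy columns.
```

Each dictionary entry plays the role of one of the new columns C11-C16.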

The new variables (in columns C11-C16) are called dummy variables. There
is a Minitab command which will take a column like C2 and produce dummy
variables for us. It is called **INDICATOR**. In the above example,
we would use it as follows:

MTB > INDICATOR C2 C11-C16

Unfortunately, INDICATOR cannot give the new variables names, so we must do this ourselves, in this case by the command

MTB > NAME C11 'arts' C12 'science' C13 'soc.st' C14 'law' C15 'engin' C16 'educ'

Once we have done this, we can include C11-C16 (but *not* C2) in
a regression analysis. However, we will find that if we do, Minitab will
throw out C16 before it starts, on grounds of **multicollinearity**,
which means that it has found that it can predict C16 from C11 to C15 (because
if a student isn't a member of any other faculty, s/he must be a member
of the last one to be tried), and these are standard grounds for eliminating
a variable from a regression analysis. It is better to choose one category
to eliminate for ourselves. If there is one category which is in some sense
a control or "normal" condition, this would be the one to eliminate.
If there isn't, we usually eliminate the **modal** category, i.e. the
one to which most observations belong. We can find out which is the modal
category by using the Minitab command:

TABLE C2

or the command

TALLY C2

either of which will give us a list of the numbers in each category
in C2.
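Finding the modal category is just a frequency count; a quick Python equivalent of TALLY (with made-up faculty codes, for illustration only) would be:

```python
from collections import Counter

c2 = [2, 3, 1, 1, 4, 6, 1, 2, 5, 1]  # hypothetical faculty codes
counts = Counter(c2)                  # frequency of each category
modal_category, n = counts.most_common(1)[0]
# Here category 1 is modal, with 4 observations.
```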

Whichever way we eliminate a category, the regression coefficients for
the remaining categories will give the extent to which members of those
categories differ (other regressors held constant) *from the category
we eliminated*. The *R*^{2} and *F* values will not
be affected by our choice of which category to eliminate.

So, suppose the data on students' faculties were being used along with
their gender (held in C3, say), their A-level points score (in C4),
and scores on an IQ test (C5) in order to predict their average marks in
first year university exams (held in C1). TALLY reveals that the modal
faculty is Arts, which corresponds to variable C11. We would end up giving
the following command to do the regression:

REGRESS C1 8 C3-C5 C12-C16

Make sure you understand why:

- C2 is not included in the regression command
- C11 is not included in the regression command
- we tell REGRESS that there are going to be 8 regressors

In the output from this regression, the coefficient of C12 (say) would tell us how, with other variables held constant, the average marks of a science student differed from those for an arts student, and so on for each of the other faculties.

Logically, before asking whether each of the categories differs from
the control condition, we should ask whether the categorical variable *as
a whole* has a significant effect. This is closely analogous to looking
at the overall *F* for an effect in analysis of variance before enquiring
into contrasts, and in fact we use an *F* test to answer this question
too. Unfortunately, Minitab does not make it very easy for us to get at
the answer. If we want to do it, we have to be careful to specify the dummy
variables *last* in the list of regressors for the REGRESS command.
For example, we might have Minitab output that looked like this:

MTB > REGRESS C1 8 C3-C5 C12-C16

The regression equation is
degrmarx = - 23.1 - 0.80 m0f1 + 0.900 ALpoints + 0.450 IQ + 8.17 science
           + 1.36 soc.st + 4.69 law + 7.28 engin + 5.85 educ

Predictor      Coef    Stdev  t-ratio      p
Constant     -23.08    13.38    -1.72  0.088
m0f1         -0.801    3.018    -0.27  0.791
ALpoints     0.9005   0.4610     1.95  0.054
IQ           0.4495   0.1329     3.38  0.001
science       8.167    4.984     1.64  0.105
soc.st        1.356    5.238     0.26  0.796
law           4.688    5.076     0.92  0.358
engin         7.283    5.131     1.42  0.159
educ          5.847    6.527     0.90  0.373

s = 13.91    R-sq = 32.1%    R-sq(adj) = 26.2%

Analysis of Variance
SOURCE       DF       SS      MS     F      p
Regression    8   8332.1  1041.5  5.38  0.000
Error        91  17607.7   193.5
Total        99  25939.8

SOURCE     DF   SEQ SS
m0f1        1    443.6
ALpoints    1   3980.9
IQ          1   3198.7
science     1    269.0
soc.st      1     38.7
law         1      9.3
engin       1    236.6
educ        1    155.3

To assess the effect of the faculty variable, we start by looking at
the column headed SEQ SS that comes under the anova table in multiple regression
output, and add up the entries for all the *m-*1 categories. So, in
this example, we start by adding up the SEQ SS values for the five faculties:

269.0 + 38.7 + 9.3 + 236.6 + 155.3 = 708.9

The total, 708.9 in this case, can be described as the **sum of squares**
for the categorical variable as a whole. Divide this by *m*-1, the
total number of degrees of freedom associated with the five dummy variables,
and we shall have the **mean square** for the categorical variable (708.9/5
= 141.8 in the example). This in turn can be divided by the **error mean
square** from the anova table, to give an *F* statistic which will
allow us to test the significance of the entire categorical variable (in
our example, *F* = 141.8/193.5 = 0.73). Remember that *F* always
has two values of degrees of freedom associated with it: in this case,
the numerator degrees of freedom are *m-*1 and the denominator degrees
of freedom are the error degrees of freedom from the anova table (in the
example, 5 and 91 respectively).
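The same arithmetic can be checked outside Minitab; a minimal Python sketch, using the SEQ SS values and error mean square from the output above:

```python
# Pool the SEQ SS entries for the m-1 = 5 dummy variables, form the
# mean square, and divide by the error mean square from the anova table.
seq_ss = [269.0, 38.7, 9.3, 236.6, 155.3]  # science, soc.st, law, engin, educ
error_ms = 193.5                           # error MS, on 91 df

ss_faculty = sum(seq_ss)               # sum of squares for faculty: 708.9
ms_faculty = ss_faculty / len(seq_ss)  # mean square: 141.78
F = ms_faculty / error_ms              # F(5, 91), about 0.73
```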

Does all this mean we have to go back to carrying around a calculator
and a book of statistics tables, to do the sums and find the significance
of the *F* value we derive? Fortunately, it does not. Remember that
Minitab includes the **LET** command, which does arithmetic. It also
includes **constants**, which are like columns except that they can
be used to hold single numbers; they are referred to as K1, K2, K3 etc.
So, staying with the example above, we could do all the arithmetic in a
single Minitab command, by typing:

MTB > LET K1=((269.0+38.7+9.3+236.6+155.3)/5)/193.5
MTB > PRINT K1

Make sure you get the brackets right! If you find it difficult to keep
track of them, you could do the arithmetic in smaller steps by using a
series of LET commands.

Having got the required *F* value, we can report it as usual (in the
present case, we'd write *F*_{5,91}=0.73). But how can we
find its significance? We do this by using the command **CDF** (this
stands for **cumulative distribution function**), which will find the
significance of most kinds of statistic. Because it works on lots of different
statistics, we have to tell it which one we are using, and we do this through
a Minitab **subcommand**. Many Minitab commands use subcommands. If
you follow the command by a semicolon, instead of executing the command
straight away, you will be given the prompt SUBC>, following which you
can give extra information. In the case of CDF, what you have to give is
the code for the kind of statistic we are interested in (*F* in this
case) and its degrees of freedom. You follow those by a full stop to indicate
that there are no more subcommands to come. So in our present example we
would type

MTB > CDF K1;
SUBC> F 5 91.

Minitab would reply,

0.7327 0.3992

The first number is just the calculated *F* value again. The second
value is not its significance, but the **complement** of its significance,
i.e. one minus its significance level. So the significance of the *F*
in this case is 0.60, i.e. it is a long way from being significant.
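Outside Minitab, the same tail probability can be obtained directly; a sketch assuming SciPy is available (its survival function `f.sf` returns 1 minus the CDF, so no complement step is needed):

```python
from scipy.stats import f

F_value = 0.7327          # the F statistic computed earlier
p = f.sf(F_value, 5, 91)  # upper-tail p, about 0.60 (= 1 - 0.3992)
```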

Note:

- we could have given CDF the numerical value of the *F* we were interested in, instead of the constant that contained it;
- the result you will get from all this does *not* depend on which category you eliminated from the analysis.

As well as being interested in the significance of the categorical variable
as a whole, we are likely to be interested in the significance of individual
categories. Note in the sample output above that the dummy variable for
each faculty has a *t* and a *p* value associated with it. It
is important to remember that these assess the significance of the difference
between this category *and the category that we eliminated from the analysis*;
that is why if one category is a control or "normal" condition
we choose that one to eliminate.

Stephen Lea

University of Exeter

Department of Psychology

Washington Singer Laboratories

Exeter EX4 4QG

United Kingdom

Tel +44 1392 264626

Fax +44 1392 264623
