Categorical variables; Minitab arithmetic; significance levels

Contents of this handout: The categorical variables problem; Constructing and using dummy variables; Testing the significance of a categorical variable; Doing arithmetic in Minitab; The significance of individual categories

The categorical variables problem

The most natural use of multiple regression is when all the variables concerned are continuous. What happens, though, when the numbers we have are just labels for categories?

It has been shown by "Monte Carlo" methods (i.e. trying the technique out with random numbers) that multiple regression is not badly affected if the categories are in some way ordered, i.e. if the variables are measured on an ordinal scale (this is very common in psychology, e.g. when we use rating scales). What happens, though, when we have to deal with nominal measurement, i.e. where the numbers we have are labels for categories which have no natural ordering? This is called the categorical variables problem.

If the unordered categorical variable is the dependent variable, we cannot use multiple regression. We need to use a different though related technique, usually logistic regression for 2-way categories and discriminant analysis for m-way categories where m>2. These are beyond the scope of this course, though if you need to use them you should be able to learn how - try the notes for course PSY6003.

However, where the unordered categorical variable is one of the independent variables, there are ways to include it in a multiple regression. When there are only two categories, e.g. male vs female, there is no problem; we have already seen how to include such dichotomous variables in a regression analysis. Where a variable has three or more categories, however, there is a problem. It would clearly be a serious error just to include such a variable directly in a regression analysis (make sure you can see why). We have to use what are called dummy variables.

Constructing and using dummy variables

Suppose that the variable includes m categories (m>2). What we have to do is to break this variable down into m two-way variables, each indicating whether or not the observation belongs to a particular one of the m original categories. So suppose a column C2 contains codes 1 to 6 indicating which of the 6 faculties an Exeter student belongs to. We have to create 6 new columns (C11 to C16, say), one for each faculty, which will indicate whether or not the student belongs to that faculty. So the first few rows of the columns might look like this:

```
C2       C11      C12      C13      C14      C15      C16
faculty  arts     science  soc.st   law      engin    educ
2        0        1        0        0        0        0
3        0        0        1        0        0        0
1        1        0        0        0        0        0
1        1        0        0        0        0        0
4        0        0        0        1        0        0
6        0        0        0        0        0        1
```

The new variables (in columns C11-C16) are called dummy variables. There is a Minitab command which will take a column like C2 and produce dummy variables for us. It is called INDICATOR. In the above example, we would use it as follows:

MTB > INDICATOR C2 C11-C16

Unfortunately, INDICATOR cannot give the new variables names, so we must do this ourselves, in this case by the command

NAME C11 'arts' C12 'science' C13 'soc.st' C14 'law' C15 'engin' C16 'educ'
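
The effect of INDICATOR can be sketched in Python terms. This is only an illustrative equivalent, not Minitab itself; the rows are the example codes from the table above.

```python
# A rough Python equivalent of what INDICATOR does: expand a column of
# category codes (1-6 for the six faculties) into six 0/1 dummy columns.
faculty_codes = [2, 3, 1, 1, 4, 6]
names = ['arts', 'science', 'soc.st', 'law', 'engin', 'educ']

# dummies[name][row] is 1 if that row's student belongs to that faculty.
dummies = {name: [1 if code == i + 1 else 0 for code in faculty_codes]
           for i, name in enumerate(names)}
```

Note that each row has exactly one 1 across the six dummy columns, which is exactly why one of the dummies is redundant in a regression.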

Once we have done this, we can include C11-C16 (but not C2) in a regression analysis. However, if we do, Minitab will throw out C16 before it starts, on grounds of multicollinearity. This means it has found that C16 can be predicted perfectly from C11 to C15 (if a student isn't a member of any of the first five faculties, s/he must be a member of the last one), and that is standard grounds for eliminating a variable from a regression analysis. It is better to choose for ourselves which category to eliminate. If there is one category which is in some sense a control or "normal" condition, that is the one to eliminate. If there isn't, we usually eliminate the modal category, i.e. the one to which most observations belong. We can find out which is the modal category by using the Minitab command:

TABLE C2

or the command

TALLY C2

either of which will give us a list of the numbers in each category in C2.
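
What TALLY gives us can be sketched in Python with the standard library's Counter. The faculty codes below are a made-up sample, used only to show the counting.

```python
# A sketch of what TALLY tells us: count how many observations fall in each
# category of C2 and pick the modal (most frequent) one.
from collections import Counter

faculty_codes = [2, 3, 1, 1, 4, 6, 1, 2]   # made-up sample of C2 values
counts = Counter(faculty_codes)
modal_category, modal_count = counts.most_common(1)[0]
```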
Whichever category we eliminate, the regression coefficients for the remaining categories will give the extent to which members of those categories differ (other regressors held constant) from the eliminated category. The R-sq and F values will not be affected by our choice of which category to eliminate.

So, suppose the data on students' faculties were being used along with their gender (held in C3, say), their A-level points score (in C4), and scores on an IQ test (C5) in order to predict their average marks in first year university exams (held in C1). TALLY reveals that the modal faculty is Arts, which corresponds to variable C11. We would end up giving the following command to do the regression:

REGRESS C1 8 C3-C5 C12-C16

Make sure you understand why:

• C2 is not included in the regression command
• C11 is not included in the regression command
• we tell REGRESS that there are going to be 8 regressors

In the output from this regression, the coefficient of C12 (say) would tell us how, with other variables held constant, the average marks of a science student differed from those for an arts student, and so on for each of the other faculties.
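
This interpretation is easiest to see in the simplest case. With no other regressors, the intercept of such a regression equals the eliminated category's mean, and each dummy coefficient equals that category's mean difference from it. A minimal pure-Python check, using made-up marks (not the handout's data):

```python
# Hypothetical illustration: with only the dummies in the regression, the
# intercept is the eliminated category's mean and each dummy coefficient is
# that category's mean minus the eliminated category's mean.
marks = {'arts': [52.0, 58.0, 61.0], 'science': [63.0, 67.0, 71.0]}

def mean(xs):
    return sum(xs) / len(xs)

intercept = mean(marks['arts'])                    # arts is the eliminated category
science_coef = mean(marks['science']) - intercept  # science-minus-arts difference
```

With other regressors present, the same difference is estimated with those regressors held constant, as the handout says.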

Testing the significance of a categorical variable

Logically, before asking whether each of the categories differs from the control condition, we should ask whether the categorical variable as a whole has a significant effect. This is closely analogous to looking at the overall F for an effect in analysis of variance before enquiring into contrasts, and in fact we use an F test to answer this question too. Unfortunately, Minitab does not make it very easy for us to get at the answer. If we want to do it, we have to be careful to specify the dummy variables last in the list of regressors for the REGRESS command. For example, we might have Minitab output that looked like this:

```
MTB > REGRESS C1 8 C3-C5 C12-C16
The regression equation is
degrmarx = - 23.1 - 0.80 m0f1 + 0.900 ALpoints + 0.450 IQ + 8.17 science
+ 1.36 soc.st + 4.69 law + 7.28 engin + 5.85 educ

Predictor       Coef       Stdev    t-ratio        p
Constant      -23.08       13.38      -1.72    0.088
m0f1          -0.801       3.018      -0.27    0.791
ALpoints      0.9005      0.4610       1.95    0.054
IQ            0.4495      0.1329       3.38    0.001
science        8.167       4.984       1.64    0.105
soc.st         1.356       5.238       0.26    0.796
law            4.688       5.076       0.92    0.358
engin          7.283       5.131       1.42    0.159
educ           5.847       6.527       0.90    0.373

s = 13.91       R-sq = 32.1%     R-sq-adj = 26.2%

Analysis of Variance

SOURCE       DF          SS          MS         F        p
Regression    8      8332.1      1041.5      5.38    0.000
Error        91     17607.7       193.5
Total        99     25939.8

SOURCE       DF      SEQ SS
m0f1          1       443.6
ALpoints      1      3980.9
IQ            1      3198.7
science       1       269.0
soc.st        1        38.7
law           1         9.3
engin         1       236.6
educ          1       155.3
```

To assess the effect of the faculty variable, we look at the column headed SEQ SS that comes under the anova table in multiple regression output, and add up the entries for all m-1 dummy variables. So, in this example, we add up the SEQ SS values for the five faculties:

269.0 + 38.7 + 9.3 + 236.6 + 155.3 = 708.9

The total, 708.9 in this case, can be described as the sum of squares for the categorical variable as a whole. Divide this by m-1, the total number of degrees of freedom associated with the five dummy variables, and we shall have the mean square for the categorical variable (708.9/5 = 141.8 in the example). This in turn can be divided by the error mean square from the anova table, to give an F statistic which will allow us to test the significance of the entire categorical variable (in our example, F = 141.8/193.5 = 0.73). Remember that F always has two values of degrees of freedom associated with it: in this case, the numerator degrees of freedom are m-1 and the denominator degrees of freedom are the error degrees of freedom from the anova table (in the example, 5 and 91 respectively).
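
As a cross-check on the hand arithmetic, here is the same calculation in Python, with the SEQ SS values and the error mean square copied from the example output above:

```python
# F-test arithmetic for the categorical variable as a whole, using the
# SEQ SS entries for the five dummies and the error MS from the anova table.
seq_ss = [269.0, 38.7, 9.3, 236.6, 155.3]  # SEQ SS for the five dummy variables
error_ms = 193.5                           # error mean square from the anova table

ss_faculty = sum(seq_ss)                   # 708.9
ms_faculty = ss_faculty / len(seq_ss)      # 708.9 / 5 = 141.78
f_value = ms_faculty / error_ms            # about 0.73, on 5 and 91 df
```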

Doing arithmetic in Minitab

Does all this mean we have to go back to carrying around a calculator and a book of statistics tables, to do the sums and find the significance of the F value we derive? Fortunately, it does not. Remember that Minitab includes the LET command, which does arithmetic. It also includes constants, which are like columns except that they can be used to hold single numbers; they are referred to as K1, K2, K3 etc. So, staying with the example above, we could do all the arithmetic in a single Minitab command, by typing:

```
LET K1=((269.0+38.7+9.3+236.6+155.3)/5)/193.5
PRINT K1
```

Make sure you get the brackets right! If you find it difficult to keep track of them, you can do the arithmetic in smaller steps, using a series of LET commands.

Having got the required F value, we can report it as usual (in the present case, we would write F(5,91) = 0.73). But how can we find its significance? We do this by using the command CDF (which stands for cumulative distribution function), which will find the significance of most kinds of statistic. Because it works on many different statistics, we have to tell it which one we are using, and we do this through a Minitab subcommand. Many Minitab commands use subcommands: if you end a command with a semicolon, instead of executing the command straight away Minitab gives you the prompt SUBC>, after which you can supply extra information. In the case of CDF, what we give is the code for the kind of statistic we are interested in (F in this case) and its degrees of freedom, followed by a full stop to indicate that there are no more subcommands to come. So in our present example we would type

```
MTB > CDF K1;
SUBC> F 5 91.

    0.7327    0.3992
```

The first number is just the calculated F value again. The second value is not its significance, but the complement of its significance, i.e. one minus its significance level. So the significance of the F in this case is 1 - 0.3992 = 0.60, i.e. it is a long way from being significant.

Note:

• we could have given CDF the numerical value of the F we were interested in, instead of the constant that contained it;
• the result you will get from all this does not depend on which category you eliminated from the analysis.
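
For readers without Minitab to hand, the CDF value can be approximated in standard-library Python by numerically integrating the F density. This is a sketch under that numerical approach, not Minitab's own algorithm:

```python
# Approximate the F(5, 91) cumulative distribution function by trapezoidal
# integration of the F density, using only the standard library.
from math import lgamma, exp, log

def f_cdf(x, d1, d2, steps=20000):
    """Integrate the F(d1, d2) density from 0 to x by the trapezoid rule."""
    # Log of the normalising constant, via lgamma to avoid overflow.
    log_norm = (lgamma((d1 + d2) / 2) - lgamma(d1 / 2) - lgamma(d2 / 2)
                + (d1 / 2) * log(d1 / d2))

    def pdf(t):
        if t <= 0:
            return 0.0
        return exp(log_norm + (d1 / 2 - 1) * log(t)
                   - ((d1 + d2) / 2) * log(1 + d1 * t / d2))

    h = x / steps
    total = (pdf(0) + pdf(x)) / 2 + sum(pdf(i * h) for i in range(1, steps))
    return total * h

F = 0.7327            # the F value computed above
cdf = f_cdf(F, 5, 91) # Minitab's CDF output, approximately 0.3992
p = 1 - cdf           # the significance level, approximately 0.60
```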

The significance of individual categories

As well as being interested in the significance of the categorical variable as a whole, we are likely to be interested in the significance of individual categories. Note in the sample output above that the dummy variable for each faculty has a t and a p value associated with it. It is important to remember that these assess the significance of the difference between this category and the category that we eliminated from the analysis; that is why if one category is a control or "normal" condition we choose that one to eliminate.

Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623