 Topic 4: Logistic regression and discriminant analysis

Contents of this handout: The problem of dichotomous dependent variables; Discriminant analysis; Logistic regression - theory; Logistic regression (and discriminant analysis) in practice; Interpreting and reporting logistic regression results; References and further reading; Examples.

The Problem: Categorical dependent variables

A limitation of ordinary linear models is the requirement that the dependent variable is numerical rather than categorical. But many interesting variables are categorical - patients may live or die, people may pass or fail MScs and so on. A range of techniques have been developed for analysing data with categorical dependent variables, including discriminant analysis, probit analysis, log-linear regression and logistic regression. To contrast it with these, the kind of regression we have used so far is usually referred to as linear regression.

The various techniques listed above are applicable in different situations: for example, log-linear regression requires all regressors to be categorical, whilst discriminant analysis strictly requires them all to be continuous (though dummy variables can be used, as for multiple regression). In SPSS at least, logistic regression is easier to use than discriminant analysis when we have a mixture of numerical and categorical regressors, because it includes procedures for generating the necessary dummy variables automatically.

Discriminant analysis

The major purpose of discriminant analysis is to predict membership in two or more mutually exclusive groups from a set of predictors, when there is no natural ordering on the groups. So we may ask whether we can predict whether people vote Labour or Conservative from a knowledge of their age, class, attitudes, values and so on.

Discriminant analysis is just the inverse of a one-way MANOVA, the multivariate analysis of variance. The levels of the independent variable (or factor) for MANOVA become the categories of the dependent variable for discriminant analysis, and the dependent variables of the MANOVA become the predictors for discriminant analysis. In MANOVA we ask whether group membership produces reliable differences on a combination of dependent variables. If the answer to that question is 'yes' then clearly that combination of variables can be used to predict group membership. Mathematically, MANOVA and discriminant analysis are the same; indeed, the SPSS MANOVA command can be used to print out the discriminant functions that are at the heart of discriminant analysis, though this is not usually the easiest way of obtaining them.

These discriminant functions are the linear combinations of the standardised independent variables which yield the biggest mean differences between the groups. If the dependent variable is a dichotomy, there is one discriminant function; if there are k levels of the dependent variable, up to k-1 discriminant functions can be extracted, and we can test how many it is worth extracting. Successive discriminant functions are orthogonal to one another, like principal components, but they are not the same as the principal components you would obtain if you just did a principal components analysis on the independent variables, because they are constructed to maximise the differences between the values of the dependent variable.

The commonest use of discriminant analysis is where there are just two categories in the dependent variable; but as we have seen, it can be used for multi-way categories (just as MANOVA can be used to test the significance of differences between several groups, not just two). This is an advantage over logistic regression, which is always described for the problem of a dichotomous dependent variable.

You will encounter discriminant analysis fairly often in journals. But it is now being replaced with logistic regression, as this approach requires fewer assumptions in theory, is more statistically robust in practice, and is easier to use and understand than discriminant analysis. So we will concentrate on logistic regression.

Logistic regression: theory

Just like linear regression, logistic regression gives each regressor a coefficient (b1, b2 and so on) which measures the regressor's independent contribution to variations in the dependent variable. But there are technical problems with dependent variables that can only take values of 0 and 1. What we want to predict from a knowledge of relevant independent variables is not a precise numerical value of a dependent variable, but rather the probability (p) that it is 1 rather than 0. We might think that we could use this probability as the dependent variable in an ordinary regression, i.e. as a simple linear function of regressors, but we cannot, for two reasons. First, numerical regressors may be unlimited in range. If we expressed p as a linear function of income, we might then find ourselves predicting that p is greater than 1 or less than 0 (which cannot be true, as probabilities can only take values between 0 and 1). Second, there is a problem of additivity. Imagine that we are trying to predict success at a task from two dichotomous variables, training and gender. Among untrained individuals, 50% of men succeed and 70% of women. Among trained men, 90% succeed, so training appears to add 40 percentage points. If we thought of p as a linear function of gender and training we would have to estimate the proportion of trained women who succeed as 70% plus 40% = 110% (which again cannot be true).

We get over these problems by making a logistic transformation of p, also called taking the logit of p. Logit(p) is the log (to base e) of the odds that the dependent variable is 1, that is, of the ratio of the probability that it is 1 to the probability that it is 0. In symbols it is defined as:

logit(p)=log(p/(1-p))

Whereas p can only range from 0 to 1, logit(p) ranges from negative infinity to positive infinity. The logit scale is symmetrical around the logit of 0.5 (which is zero), so the table below only includes a couple of negative values.

`Table 1. The relationship between probability of success (p) and logit(p)`
```
p         .3    .4    .5    .6    .7    .8    .9    .95   .99
logit(p) -.847 -.405 0.0    .405  .847 1.386 2.197 2.944 4.595
```

This table makes it clear that the differences between extreme probabilities are spread out; the difference of logits between success rates of .95 and .99 is much bigger than that between .5 and .7. In fact the logit scale is approximately linear in the middle range and logarithmic at extreme values.
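The values in Table 1 can be checked with a few lines of Python. This is only a sketch for illustration; the function name `logit` and the variable names are our own choices, not part of any statistics package.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

probabilities = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
table = [(p, round(logit(p), 3)) for p in probabilities]

for p, value in table:
    print(f"p = {p:<5} logit(p) = {value:+.3f}")

# Differences between probabilities are stretched out at the extremes.
gap_extreme = logit(0.99) - logit(0.95)   # about 1.65
gap_middle = logit(0.7) - logit(0.5)      # about 0.85
```

The gap between .95 and .99 comes out at roughly twice the gap between .5 and .7, even though the second pair is five times further apart on the probability scale.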

We do not know that the logit scale is the best possible scale but it does seem intuitively reasonable. If we consider the example of training and gender used above, we can see how it works. On the logit scale, for untrained individuals, the difference of logits between men (success rate 0.50, logit 0.0) and women (success rate 0.70, logit 0.847) is 0.847. The success rate for trained men is .9 (logit 2.197), so we conclude that training makes a difference of logits of 2.197 - 0.0 = 2.197. We therefore predict for trained women a logit of 2.197 + 0.847 = 3.044, which corresponds to a success probability of .955.
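The training and gender example can be worked through numerically. The sketch below uses only the three success rates given in the text; the helper names are ours. Working with unrounded logits gives about 3.0445 for trained women, whereas the 3.044 quoted above comes from adding the rounded table values; the predicted probability is the same to three decimal places.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Convert a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

untrained_men, untrained_women, trained_men = 0.50, 0.70, 0.90

# Additive on the probability scale: gives an impossible value above 1.
linear_guess = untrained_women + (trained_men - untrained_men)

# Additive on the logit scale: effects are summed, then converted back.
gender_effect = logit(untrained_women) - logit(untrained_men)
training_effect = logit(trained_men) - logit(untrained_men)
trained_women_logit = logit(untrained_men) + gender_effect + training_effect
trained_women = inverse_logit(trained_women_logit)

print(round(linear_guess, 2))     # 1.1
print(round(trained_women, 3))    # 0.955
```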

It follows that logistic regression involves fitting to the data an equation of the form:

logit(p)= a + b1x1 + b2x2 + b3x3 + ...

The meaning of the coefficients b1, b2, etc is discussed below.

Although logistic regression finds a "best fitting" equation just as linear regression does, the principles on which it does so are rather different. Instead of using a least-squares criterion for the best fit, it uses a maximum likelihood method, which finds the regression coefficients that maximise the probability of getting the observed results. A consequence of this is that the goodness of fit and overall significance statistics used in logistic regression are different from those used in linear regression.
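To make the maximum likelihood idea concrete, here is a minimal sketch that fits logit(p) = a + b*x for a single regressor by Newton-Raphson steps. It is not how SPSS is implemented (we make no claim about that), the data are made up, and the function names are our own. It assumes the two groups overlap on x, so that finite estimates exist.

```python
import math

def inverse_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, xs, ys):
    """Log of the probability of the observed 0/1 outcomes under logit(p) = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = inverse_logit(a + b * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_one_predictor(xs, ys, iterations=25):
    """Newton-Raphson search for the a and b that maximise the log likelihood."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Gradient (g) and curvature (h) of the log likelihood at the current a, b.
        ga = gb = haa = hab = hbb = 0.0
        for x, y in zip(xs, ys):
            p = inverse_logit(a + b * x)
            w = p * (1.0 - p)
            ga += y - p
            gb += (y - p) * x
            haa += w
            hab += w * x
            hbb += w * x * x
        det = haa * hbb - hab * hab
        # Solve the 2x2 system for the step and move uphill.
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b

# Made-up data: x might be hours of training, y success (1) or failure (0).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 1, 1]
a, b = fit_one_predictor(xs, ys)
print(a, b, log_likelihood(a, b, xs, ys))
```

At the fitted values the predicted probabilities sum to the observed number of successes, which is one way of checking that the search has reached the maximum.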

Logistic regression (and discriminant analysis) in practice

Logistic regression is not available in Minitab but is one of the features relatively recently added to SPSS. The advanced statistics manuals for SPSS versions 4 onwards describe it well. If you are already familiar with the REGRESSION command, LOGISTIC REGRESSION is fairly straightforward to use and we suggest that you browse through the menu version of SPSS to learn the details. A simple example will illustrate the parallels. Imagine that we had carried out a study of voting and wished to know how to best predict whether people had voted Conservative or Labour. The commands would be:

```
LOGISTIC REGRESSION /VARIABLES voting WITH age sex class
  att1 att2 att3 att4 extro psycho neuro
  /METHOD FSTEP(LR)
  /CLASSPLOT.
```

The dependent variable is separated from the independent variables by the term WITH. The METHOD subcommand uses the keyword FSTEP to specify a forward stepwise procedure; we could also use BSTEP, which does a backward stepwise, i.e. it starts by entering all the variables and then takes them out one at a time; or ENTER if we were engaged in hypothesis testing rather than exploratory analysis. If no METHOD subcommand is given, ENTER will be assumed. The (LR) term after FSTEP specifies that likelihood ratio considerations will be used in selecting variables to add to or delete from the model; this is preferable but can slow computation, so it may be necessary to omit it. The /CLASSPLOT line is not strictly necessary but aids interpretation - see below.

A useful property of the LOGISTIC REGRESSION command is that it can cope automatically with categorical independent variables; we don't have to write a loop to create dummy variables as we do for linear regression. All we have to do is declare any categorical variables on a /CATEGORICAL subcommand as well as on the /VARIABLES subcommand. The /CONTRAST subcommand should be used to control which category is dropped when the dummy variables are formed; if the control or modal category of, say, a variable DIAGNOST was its third value, we would use the subcommand /CONTRAST(DIAGNOST)=INDICATOR(3) to tell LOGISTIC REGRESSION to drop level 3 of the variable in forming dummy variables. Although this is an improvement over what we have to do when using SPSS to carry out linear regression, there is a snag. /CONTRAST likes its category levels specified in rather an odd way: the number in brackets is the position of the category in the ordered list of values, not the value used to code it. So in the example, 3 might not be the value used to code the modal category in DIAGNOST; if psychotic, neurotic and normal people were coded 0, 1 and 2, the correct entry in /CONTRAST to drop the normal category would be 3, not 2. Look, I didn't write this idiot system, I'm just trying to tell you about it.
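The distinction between a category's position and its coded value can be illustrated with a small Python sketch of indicator (dummy) coding. This only mimics the behaviour described above; it is not SPSS code, and the function name and data are hypothetical.

```python
def indicator_dummies(values, reference_position):
    """Indicator (dummy) coding that drops the category at a 1-based position.

    The position counts the sorted distinct codes, not the codes themselves.
    """
    categories = sorted(set(values))
    reference = categories[reference_position - 1]
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

# Hypothetical DIAGNOST codes: 0 = psychotic, 1 = neurotic, 2 = normal.
diagnost = [0, 1, 2, 2, 1, 0]
kept, rows = indicator_dummies(diagnost, reference_position=3)
print(kept)   # [0, 1]: the third category (code 2) is the one dropped
print(rows)   # cases coded 2 get zeros on both dummy variables
```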

As in linear regression, there is no need to declare dichotomous independent variables as categorical.

We can also use SPSS to carry out discriminant analysis. For the example just considered, the commands would be:

```
DISCRIMINANT GROUPS=voting(0,1)
  /VARIABLES = age sex class att1 to att4 extro psycho neuro
  /METHOD=MINRESID
  /STATISTICS=TABLE.
```

Note that we have to specify the two possible levels of the dependent variable (voting). We can use the /METHOD subcommand to request a variety of stepwise methods (RAO is another you might like to try), or to ENTER all or a subset of variables. The subcommand /STATISTICS=TABLE is required to obtain the classification table, which we need for assessing goodness of fit (see below).

Interpreting and reporting logistic regression results

• Log likelihoods
• A key concept for understanding the tests used in logistic regression (and many other procedures using maximum likelihood methods) is that of log likelihood. Likelihood just means probability, though it tends to be used by statisticians of a Bayesian orientation. It always means probability under a specified hypothesis. In thinking about logistic regression, two hypotheses are likely to be of interest: the null hypothesis, which is that all the coefficients in the regression equation take the value zero, and the hypothesis that the model currently under consideration is accurate. We then work out the likelihood of observing the exact data we actually did observe under each of these hypotheses. The result is nearly always a frighteningly small number, and to make it easier to handle, we take its natural logarithm (i.e. its log base e), giving us a log likelihood. Probabilities cannot exceed one, so log likelihoods are never positive (and in practice are always negative); often, we work with negative log likelihoods for convenience.
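As a rough illustration, the log likelihood of a small data set can be computed directly. The outcomes and the fitted probabilities below are invented for the purpose, and the function name is ours. The null hypothesis is taken literally as stated above: with every coefficient at zero, logit(p) = 0 and so p = 0.5 for every case.

```python
import math

def log_likelihood(observed, predicted):
    """Sum of log probabilities assigned to the outcomes that actually occurred."""
    return sum(math.log(p if y == 1 else 1.0 - p)
               for y, p in zip(observed, predicted))

observed = [1, 1, 1, 0, 1, 0, 0, 1]                      # made-up outcomes
null_probs = [0.5] * len(observed)                        # all coefficients zero
model_probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.4, 0.8]   # made-up fitted values

ll_null = log_likelihood(observed, null_probs)
ll_model = log_likelihood(observed, model_probs)
likelihood_null = math.exp(ll_null)   # 0.5 ** 8, already a small number with 8 cases

print(ll_null, ll_model, likelihood_null)
```

Both log likelihoods are negative, and the model's is closer to zero because it assigns higher probabilities to what was actually observed.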

• Goodness of fit
• Logistic regression does not give rise to an R2adj statistic. Darlington (1990, page 449) recommends the following statistic as a measure of goodness of fit:

```
        exp[(LLmodel - LL0)/N] - 1
LRFC1 = --------------------------
             exp(-LL0/N) - 1
```

where exp refers to the exponential function (the inverse of the natural log function), N as usual is sample size, and LLmodel and LL0 are the log likelihoods of the data under the model and the null hypothesis respectively. (Note that I have changed Darlington's notation a little to make it fit in with that used in the rest of these notes.) Darlington's statistic is useful because it takes values between 0 and 1 (or 0% and 100%) which have much the same interpretation as values of R2 or R2adj in a linear regression, although it looks from the formula that, of the two, it is more closely analogous to R2. Unfortunately SPSS does not report this statistic. However, it does report negative log likelihoods, multiplied by 2 (that is, -2LL), so with a little adjustment (dividing by -2 to recover each log likelihood) these can be inserted in the equation for LRFC1.
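The adjustment from -2LL values to LRFC1 is easily scripted. The following is a sketch of the formula as given above; the function name and the example figures are made up for illustration.

```python
import math

def lrfc1(neg2ll_model, neg2ll_null, n):
    """Darlington's LRFC1 computed from -2LL values for the model and the null hypothesis."""
    ll_model = -neg2ll_model / 2.0   # undo the multiplication by -2
    ll_null = -neg2ll_null / 2.0
    numerator = math.exp((ll_model - ll_null) / n) - 1.0
    denominator = math.exp(-ll_null / n) - 1.0
    return numerator / denominator

# Made-up values: 100 cases, -2LL of 138.63 under the null and 100.0 for the model.
example = lrfc1(neg2ll_model=100.0, neg2ll_null=138.63, n=100)
print(round(example, 3))
```

A model no better than the null hypothesis gives 0, and a model that predicted every case perfectly (log likelihood of zero) would give 1.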

Rather than using a goodness of fit statistic, though, we often want to look at the proportion of cases we have managed to classify correctly. For this we need to look at the classification table printed out by SPSS, which tells us how many of the cases where the observed value of the dependent variable was 1 have been predicted with a value 1, and so on. An advantage of the classification table is that we can get one out of either logistic regression or discriminant analysis, so we can use it to compare the two approaches. Statisticians claim that logistic regression tends to classify a higher proportion of cases correctly.

Another very useful piece of information for assessing goodness of fit can be gained by using the /CLASSPLOT subcommand. This causes SPSS to print distributions of predicted probabilities, distinguishing the observed category values. The resulting plot is very useful for spotting possible outliers. It will also tell you whether it might be better to separate the two predicted categories by some rule other than the simple one SPSS uses, which is to predict value 1 if logit(p) is greater than 0 (i.e. if p is greater than 0.5). A better separation of categories might result from using a different criterion. We might also want to use a different criterion if the a priori probabilities of the two categories were very different (one might be a rare disease, for example), or if the costs of mistakenly predicting someone into the two categories differ (suppose the categories were "found guilty of murder" and "not guilty", for example). The following is an example of such a CLASSPLOT:

```
      32 +                                                           f+
         |                                                           f|
         |                                                           f|
F        |                                                           f|
R     24 +                                                           f+
E        |                                                           f|
Q        |                                                           f|
U        |                                                           f|
E     16 +                                                           f+
N        |                                                           f|
C        |                                                           f|
Y        |                                                           f|
       8 +                                                           f+
         |                                                           f|
         |                  f f                  f           f  ffffff|
         |          n fnn nnnnnf nnfnn nnn  n fn nnffnff  f ff nfnffff|
Predicted --------------+--------------+--------------+---------------
  Prob:   0            .25            .5             .75             1
  Group:  nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnffffffffffffffffffffffffffffff

          Predicted Probability is of Membership for found guilty

          Symbols: n - not guilty
                   f - found guilty

          Each Symbol Represents 2 Cases.
```

If we were called as expert witnesses to advise the court about the probability that the person accused had committed murder, using the variables in this particular logistic regression model, we might want to set a predicted probability criterion of .9 rather than .5.
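The effect of moving the criterion can be seen by building the classification table by hand. The sketch below uses invented outcomes and predicted probabilities; the function name is our own and nothing here reproduces SPSS output.

```python
def classification_table(observed, predicted_probs, cutoff=0.5):
    """Count cases by observed value (0/1) and predicted value at a given cutoff."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(observed, predicted_probs):
        table[(y, 1 if p > cutoff else 0)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, correct / len(observed)

# Made-up data: 1 = found guilty, 0 = not guilty.
observed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.35, 0.55, 0.60, 0.80, 0.85, 0.92, 0.95, 0.99]

for cutoff in (0.5, 0.9):
    table, accuracy = classification_table(observed, probs, cutoff)
    wrongly_guilty = table[(0, 1)]   # observed not guilty, predicted guilty
    wrongly_cleared = table[(1, 0)]  # observed guilty, predicted not guilty
    print(cutoff, accuracy, wrongly_guilty, wrongly_cleared)
```

With these particular numbers both criteria classify 80% of cases correctly, but the .9 criterion makes no "wrongly guilty" predictions at the price of two "wrongly cleared" ones. Which trade-off is preferable depends on the relative costs of the two kinds of mistake.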

• Overall significance
• SPSS will offer you a variety of statistical tests. Usually, though, overall significance is tested using what SPSS calls the Model Chi-square, which is derived from the likelihood of observing the actual data under the assumption that the model that has been fitted is accurate. It is convenient to use -2 times the log (base e) of this likelihood; we call this -2LL. The difference between -2LL for the best-fitting model and -2LL for the null hypothesis model (in which all the b values are set to zero) is distributed like chi-squared, with degrees of freedom equal to the number of predictors; this difference is the Model chi-square that SPSS refers to. Very conveniently, the difference between -2LL values for models with successive terms added also has a chi-squared distribution, so when we use a stepwise procedure, we can use chi-squared tests to find out if adding one or more extra predictors significantly improves the fit of our model.
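The Model chi-square and its significance level can be reproduced from two -2LL values. The sketch below is self-contained: it computes the upper tail of the chi-squared distribution for whole-number degrees of freedom using a standard recurrence, so no statistics library is needed. The function names and the example -2LL values are invented.

```python
import math

def chi_squared_upper_tail(x, df):
    """P(chi-squared with integer df exceeds x), built up by a standard recurrence."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        k, tail = 1, math.erfc(math.sqrt(half))   # exact for df = 1
    else:
        k, tail = 2, math.exp(-half)              # exact for df = 2
    while k < df:
        # Step from k to k + 2 degrees of freedom.
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

def model_chi_square(neg2ll_null, neg2ll_model, n_predictors):
    """Difference of -2LL values and its significance level."""
    chi_square = neg2ll_null - neg2ll_model
    return chi_square, chi_squared_upper_tail(chi_square, n_predictors)

# Made-up example: three predictors reduce -2LL from 138.63 to 120.0.
chi_square, p_value = model_chi_square(138.63, 120.0, 3)
print(round(chi_square, 2), p_value)
```

The same function can be used for the step-by-step tests: pass the -2LL values for the models before and after a step, with degrees of freedom equal to the number of predictors added.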

• The interpretation of coefficients
• How can we describe the effect of a single regressor in logistic regression? The fundamental equation for logistic regression tells us that with all other variables held constant, there is a constant increase of b1 in logit(p) for every 1-unit increase in x1, and so on. But what does a constant increase in logit(p) mean? Because the logit transformation is non-linear, it does not mean a constant increase in p; so the increase in p associated with a 1-unit increase in x1 changes with the value of x1 you begin with.

It turns out that a constant increase in logit(p) does have a reasonably straightforward interpretation. It corresponds to a constant multiplication (by exp(b)) of the odds that the dependent variable takes the value 1 rather than 0. So, suppose b1 takes the value 2.30 - we choose this value as an example because exp(2.30) is very nearly 10, so the arithmetic will be easy. Then if x1 increases by 1, the odds that the dependent variable takes the value 1 increase tenfold. So, with this value of b1, let us suppose that with all other variables at their mean values, and x1 taking the value 0, we predict a logit(p) of 0; this means that there is an even chance of the dependent variable taking the value 1. Now suppose x1 increases to 1. The odds that the dependent variable takes the value 1 rise by a factor of ten, so they go from an even chance (1:1) to 10:1, i.e. p changes to 0.909. If x1 further increases to 2, then the odds will move to 100:1, a p value of 0.990; and so on. This leads to a convenient way of representing the results of logistic regression by a plot showing the odds change produced by unit changes in different independent variables. A good example is the figure in Johnson et al's (1992) report on risk factors for contracting AIDS.
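The arithmetic of this example can be followed in a short Python sketch. The names are our own, and the starting point (even odds when x1 is 0) is the one assumed in the text.

```python
import math

def probability_from_odds(odds):
    """Convert odds in favour (e.g. 10 for 10:1) to a probability."""
    return odds / (1.0 + odds)

b1 = 2.30
odds_multiplier = math.exp(b1)   # about 9.97, the "tenfold" change in the text

odds = 1.0  # even chance when x1 = 0
for x1 in range(3):
    print(x1, round(odds, 2), round(probability_from_odds(odds), 3))
    odds *= odds_multiplier
```

Each unit step in x1 multiplies the odds by the same factor, but the change in p shrinks from about .41 on the first step to about .08 on the second, which is the non-linearity discussed above.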

• Significance of individual regressors
• SPSS prints out the value of what it calls the Wald statistic for each regressor in each model, together with a corresponding significance level. The Wald statistic has a chi-squared distribution, but apart from that it is used in just the same way as the t values for individual regressors in linear regression. However, the Wald test can give misleading results when a coefficient is very large, because the estimated standard error becomes inflated and the statistic is then too small. If you encounter very large coefficients, you should use the difference of -2LL values for models with and without the predictor instead.
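For a single coefficient, the Wald statistic is the square of the coefficient divided by its standard error, referred to chi-squared with 1 degree of freedom. The sketch below assumes that form; the coefficient and standard error are made up, and the function name is ours.

```python
import math

def wald_test(b, standard_error):
    """Wald statistic for one coefficient and its chi-squared (1 df) significance."""
    wald = (b / standard_error) ** 2
    # Upper tail of chi-squared with 1 df, via the complementary error function.
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

wald, p_value = wald_test(b=0.8, standard_error=0.4)   # made-up values
print(round(wald, 2), round(p_value, 4))   # 4.0 0.0455
```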

• A reassuring coda
• Some parts of this section may have seemed rather complex, and you may be tempted to give up on logistic regression at this point, deciding that you will never understand it. Remember, though, that the quantitative interpretation of the coefficients does not matter too much if all you want to do is any or all of the following:

• see how well you can classify people into groups from a knowledge of independent variables; this is addressed by the classification table and the LRFC1 goodness of fit statistic discussed above;
• see whether the independent variables as a whole significantly affect the dependent variable; this is addressed by the Model Chi-square statistic;
• identify the best variables to use in prediction. This is more complex than with linear regression, because SPSS does not give you beta values directly in the logistic regression output. But you can if necessary compare regressors by multiplying each coefficient by the standard deviation of the corresponding variable. The results will not be beta values, but their ranking will reflect relative importance of the regressors in the same way as beta values do.
• determine whether particular independent variables have significant effects on the dependent variable; this can be done using the Wald statistics which SPSS produces, or by comparing the -2LL values for models with and without the variables concerned.
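The suggestion of multiplying each coefficient by the standard deviation of its variable can be sketched as follows. The coefficients and data are invented purely for illustration, and the function name is our own.

```python
import statistics

def scaled_coefficients(coefficients, data):
    """Multiply each logistic coefficient by its regressor's standard deviation."""
    return {name: b * statistics.stdev(data[name])
            for name, b in coefficients.items()}

# Made-up coefficients and data, purely for illustration.
coefficients = {"age": 0.04, "income": 0.0002}
data = {
    "age": [23, 35, 47, 52, 61, 29, 44],
    "income": [12000, 18000, 45000, 30000, 26000, 15000, 52000],
}

scaled = scaled_coefficients(coefficients, data)
ranking = sorted(scaled, key=lambda name: abs(scaled[name]), reverse=True)
print(scaled)
print(ranking)   # income ranks first despite its much smaller raw coefficient
```

The raw coefficients depend on the units of measurement (years against pounds), which is why they cannot be compared directly; scaling by the standard deviation removes that dependence.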

References and further reading

• SPSS Advanced Statistics manual, for versions 4 onwards.
• Darlington, R. B. (1990). Regression and linear models. New York: McGraw-Hill. Chapter 18.
• Johnson, A. M., Wadsworth, K., Bradshaw, S., & Field, J. (1992). Sexual lifestyles and HIV risk. Nature, 360, 410-412.
• Press, S. J., & Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, 73, 699-705. This paper sets out to show that logistic regression is better than discriminant analysis and ends up showing that at a qualitative level they are likely to lead to the same conclusions. But it is very useful for clarifying terms.

Examples

1. The SPSS system file /singer1/eps/psybin/stats/atmmini.sys is a cut-down version of a file obtained from the ESRC data archive in Essex. This file is also available (in portable format) on the PSYCHO fileserver, currently in directory \scratch\segl\stats. Copy the system file into your own file space.
2. The study examined the factors that influenced whether or not people had cash cards, and the original report consists of a long series of cross-tabulations. But obviously the data are ideally suited to logistic regression. The variables we have included are HAVECARD (the dependent variable) and age, sex, 10 attitude scales, judgements of how serious various problems were (prob1 to prob10), income (inc) and problems in use (useprob1 to useprob3). Carry out two logistic regressions, using first FSTEP and then BSTEP.
3. Carry out a couple of discriminant analyses on these data (using different methods). Compare the results with those of the logistic regression.

Remember not to take assumptions for granted.

Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623