Contents of this handout: The problem of dichotomous dependent variables; Discriminant analysis; Logistic regression - theory; Logistic regression (and discriminant analysis) in practice; Interpreting and reporting logistic regression results; References and further reading; Examples.

**The problem of dichotomous dependent variables**

A limitation of ordinary linear models is the requirement that the dependent
variable is numerical rather than categorical. But many interesting variables
are categorical - patients may live or die, people may pass or fail MScs,
and so on. A range of techniques has been developed for analysing data
with categorical dependent variables, including **discriminant analysis**,
**probit analysis**, **log-linear regression** and **logistic regression**.
To contrast it with these, the kind of regression we have used so far is
usually referred to as **linear regression**.

The various techniques listed above are applicable in different situations: for example, log-linear regression requires all regressors to be categorical, whilst discriminant analysis strictly requires them all to be continuous (though dummy variables can be used, as for multiple regression). In SPSS at least, logistic regression is easier to use than discriminant analysis when we have a mixture of numerical and categorical regressors, because it includes procedures for generating the necessary dummy variables automatically.

**Discriminant analysis**

The major purpose of discriminant analysis is to predict membership in two or more mutually exclusive groups from a set of predictors, when there is no natural ordering on the groups. So we may ask whether we can predict whether people vote Labour or Conservative from a knowledge of their age, their class, attitudes, values etc etc.

Discriminant analysis is just the inverse of a one-way **MANOVA**,
the multivariate analysis of variance. The levels of the independent variable
(or factor) for Manova become the categories of the dependent variable
for discriminant analysis, and the dependent variables of the Manova become
the predictors for discriminant analysis. In MANOVA we ask whether group
membership produces reliable differences on a combination of dependent
variables. If the answer to that question is 'yes' then clearly that combination
of variables can be used to predict group membership. Mathematically, MANOVA
and discriminant analysis are the same; indeed, the SPSS MANOVA command
can be used to print out the **discriminant functions** that
are at the heart of discriminant analysis, though this is not usually the
easiest way of obtaining them. These discriminant functions are the linear
combinations of the **standardised** independent variables which yield
the biggest mean differences between the groups. If the dependent variable
is a **dichotomy**, there is one discriminant function; if there are
*k* levels of the dependent variable, up to *k*-1 discriminant
functions can be extracted, and we can test how many it is worth extracting.
Successive discriminant functions are **orthogonal** to one another,
like **principal components**, but they are not the same as the
principal components you would obtain if you just did a principal components
analysis on the independent variables, because they are constructed to
maximise the differences between the values of the dependent variable.
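
To make the idea of a discriminant function concrete, here is a minimal sketch of a two-group discriminant analysis in Python (using scikit-learn, with invented data; the handout's own examples use SPSS). For two groups there is a single discriminant function, whose weights define the linear combination described above:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Invented data: two groups of 50 cases, three predictors, group 1 shifted up
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(0.8, 1.0, (50, 3))])
y = np.repeat([0, 1], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)        # one discriminant function for two groups
print(lda.scalings_.ravel())     # weights of the linear combination
print(lda.score(X, y))           # proportion of cases classified correctly
```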

The commonest use of discriminant analysis is where there are just two categories in the dependent variable; but as we have seen, it can be used for multi-way categories (just as MANOVA can be used to test the significance of differences between several groups, not just two). This is an advantage over logistic regression, which is always described for the problem of a dichotomous dependent variable.

You will encounter discriminant analysis fairly often in journals. But it is now being replaced by logistic regression, which requires fewer assumptions in theory, is more statistically robust in practice, and is easier to use and understand than discriminant analysis. So we will concentrate on logistic regression.

**Logistic regression - theory**

Just like linear regression, logistic regression gives each regressor
a coefficient *b* which measures the regressor's independent
contribution to variations in the dependent variable. But there are technical
problems with dependent variables that can only take values of 0 and 1.
What we want to predict from a knowledge of relevant independent variables
is not a precise numerical value of a dependent variable, but rather the
probability (*p*) that it is 1 rather than 0. We might think that
we could use this probability as the dependent variable in an ordinary
regression, i.e. as a simple linear function of regressors, but we cannot,
for two reasons. First, numerical regressors may be unlimited in range.
If we expressed *p* as a linear function of income, we might then
find ourselves predicting that *p* is greater than 1 (which cannot
be true, as probabilities can only take values between 0 and 1). Second,
there is a problem of additivity. Imagine that we are trying to predict
success at a task from two dichotomous variables, training and gender.
Among untrained individuals, 50% of men succeed and 70% of women. Among
trained men, 90% succeed. If we thought of *p* as a **linear**
function of gender and training we would have to estimate the proportion
of trained women as 70% plus 40% = 110% (which again cannot be true).

We get over this problem by making a **logistic** transformation
of *p*, also called taking the **logit** of *p*. Logit(*p*)
is the log (to base *e*) of the **odds** or **likelihood ratio**
that the dependent variable is 1. In symbols it is defined as:

logit(*p*)=log(*p*/(1-*p*))

Whereas *p* can only range from 0 to 1, logit(*p*) ranges
from negative infinity to positive infinity. The logit scale is symmetrical
around the logit of 0.5 (which is zero), so the table below only includes
a couple of negative values.

Table 1. The relationship between probability of success (p) and logit(p)

| *p* | .3 | .4 | .5 | .6 | .7 | .8 | .9 | .95 | .99 |
|---|---|---|---|---|---|---|---|---|---|
| logit(*p*) | -.847 | -.405 | 0.0 | .405 | .847 | 1.386 | 2.197 | 2.944 | 4.595 |

This table makes it clear that the logit scale spreads out the differences between extreme probabilities: the difference of logits between success rates of .95 and .99 is much bigger than that between .5 and .7. In fact the logit scale is approximately linear in the middle range and logarithmic at extreme values.

We do not know that the logit scale is the best possible scale but it does seem intuitively reasonable. If we consider the example of training and gender used above, we can see how it works. On the logit scale, for untrained individuals, the difference of logits between men (success rate 0.50, logit 0.0) and women (success rate 0.70, logit 0.847) is 0.847. The success rate for trained men is .9 (logit 2.197), so we conclude that training makes a difference of logits of 2.197. We therefore predict for trained women a logit of 2.197 + 0.847 = 3.044 - which corresponds to a success probability of .955.
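
The arithmetic of this worked example is easy to check. Here is a minimal sketch in Python (hypothetical code; the function names are our own) defining the logit and its inverse and reproducing the prediction for trained women:

```python
import math

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Convert a logit back into a probability."""
    return math.exp(x) / (1 + math.exp(x))

gender_effect = logit(0.70) - logit(0.50)    # 0.847: women vs men, untrained
training_effect = logit(0.90) - logit(0.50)  # 2.197: trained vs untrained men

# Predicted logit for trained women, assuming additivity on the logit scale
predicted = logit(0.50) + gender_effect + training_effect
print(inverse_logit(predicted))              # approximately 0.955
```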

It follows that logistic regression involves fitting to the data an equation of the form:

logit(*p*) = *a* + *b*_{1}*x*_{1} + *b*_{2}*x*_{2} + *b*_{3}*x*_{3} + ...

The meaning of the coefficients *b*_{1}, *b*_{2},
etc is discussed below.

Although logistic regression finds a "best fitting" equation
just as linear regression does, the principles on which it does so are
rather different. Instead of using a **least-squared deviations** criterion
for the best fit, it uses a **maximum likelihood** method, which maximises
the probability of getting the observed results given the fitted regression
coefficients. A consequence of this is that the goodness of fit and overall
significance statistics used in logistic regression are different from
those used in linear regression.
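
To make the maximum likelihood idea concrete, here is a minimal sketch in Python (assuming numpy, with invented function names; SPSS does all of this internally) that fits the coefficients by climbing the log likelihood:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Fit logit(p) = a + b1*x1 + ... by maximising the log likelihood.

    X: (n, k) array of regressors; y: (n,) array of 0/1 outcomes.
    Returns the intercept followed by the coefficients.
    """
    X = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))        # predicted probabilities
        beta += lr * X.T @ (y - p) / len(y)    # gradient of the log likelihood
    return beta

def log_likelihood(X, y, beta):
    """Log likelihood of the observed 0/1 data under the fitted model."""
    X = np.column_stack([np.ones(len(y)), X])
    p = 1 / (1 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```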

**Logistic regression (and discriminant analysis) in practice**

Logistic regression is not available in Minitab but is one of the features relatively recently added to SPSS. The advanced statistics manuals for SPSS versions 4 onwards describe it well. If you are already familiar with the REGRESSION command, LOGISTIC REGRESSION is fairly straightforward to use, and we suggest that you browse through the menu version of SPSS to learn the details. A simple example will illustrate the parallels. Imagine that we had carried out a study of voting and wished to know how best to predict whether people had voted Conservative or Labour. The commands would be:

LOGISTIC REGRESSION /VARIABLES voting WITH age sex class att1 att2 att3 att4 extro psycho neuro /METHOD FSTEP(LR) /CLASSPLOT.

The dependent variable is separated from the independent variables by
the term WITH. The METHOD subcommand uses the keyword FSTEP to specify
a **forward** **stepwise** procedure; we could also use BSTEP which
does a **backward stepwise**, i.e. it starts by entering all the variables
and then takes them out one at a time; or ENTER if we were engaged in hypothesis
testing rather than exploratory analysis. If no METHOD subcommand is given,
ENTER will be assumed. The (LR) term after FSTEP specifies that likelihood
ratio considerations will be used in selecting variables to add to or delete
from the model; this is preferable but can slow computation, so it may
be necessary to omit it. The /CLASSPLOT line is not strictly necessary
but aids interpretation - see below.

A useful property of the LOGISTIC REGRESSION command is that it can
cope automatically with categorical independent variables; we don't have
to write a loop as we do for linear regression. All we have to do is declare
any categorical variables on a /CATEGORICAL subcommand *as well as*
on the /VARIABLES subcommand. The /CONTRAST subcommand should be used to
control which category is dropped out when the dummy variables are formed;
if the control or modal category of, say, a variable DIAGNOST was its third
value, we would use the subcommand /CONTRAST(DIAGNOST)=INDICATOR(3) to
tell the LOGISTIC REGRESSION to drop level 3 of the variable in forming
dummy variables. Although this is an improvement over what we have to do
when using SPSS to carry out linear regression, there is a snag. /CONTRAST
likes its category levels specified in rather an odd way: INDICATOR refers
to the *position* of the category (first, second, third ...), not to the
value used to code it. So in the example, 3 might not be the value used to
code the modal category in DIAGNOST: if psychotic, neurotic and normal
people were coded 0, 1 and 2, the correct entry in /CONTRAST would be 3,
not 2, because normal people form the third category. Look, I didn't write
this idiot system, I'm just trying to tell you about it.
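
For anyone who wants to see what this indicator coding amounts to, here is a sketch in Python using pandas (the data are invented; SPSS does the equivalent automatically):

```python
import pandas as pd

# Hypothetical DIAGNOST variable: psychotic, neurotic, normal coded 0, 1, 2
diagnost = pd.Series([0, 1, 2, 2, 1, 0, 2], name="diagnost")

# One dummy (indicator) variable per category ...
dummies = pd.get_dummies(diagnost, prefix="diagnost")

# ... then drop the reference category. INDICATOR(3) means "drop the third
# category" - here the one coded 2 (normal) - so the remaining dummies
# contrast psychotic and neurotic people with normals.
dummies = dummies.drop(columns="diagnost_2")
print(dummies)
```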

As in linear regression, there is no need to declare dichotomous independent variables as categorical.

We can also use SPSS to carry out discriminant analysis. For the example just considered, the commands would be:

DISCRIMINANT GROUPS=voting(0,1) /VARIABLES = age sex class att1 to att4 extro psycho neuro /METHOD=MINRESID /STATISTICS=TABLE.

Note that we have to specify the two possible levels of the dependent variable (voting). We can use the /METHOD subcommand to request a variety of stepwise methods (RAO is another you might like to try), or to ENTER all or a subset of variables. The subcommand /STATISTICS=TABLE is needed to get the classification table used for assessing goodness of fit (see below).

**Interpreting and reporting logistic regression results**

**Log likelihoods**

A key concept for understanding the tests used in logistic regression
(and many other procedures using maximum likelihood methods) is that of
**log likelihood**. Likelihood just means probability, though it tends
to be used by statisticians of a **Bayesian** orientation. It always
means probability *under a specified hypothesis*. In thinking about
logistic regression, two hypotheses are likely to be of interest: the null
hypothesis, which is that all the coefficients in the regression equation
take the value zero, and the hypothesis that the model currently under
consideration is accurate. We then work out the likelihood of observing
the exact data we actually did observe under each of these hypotheses.
The result is nearly always a frighteningly small number, and to make it
easier to handle, we take its natural logarithm (i.e. its log to base *e*),
giving us a log likelihood. Probabilities are always less than one, so
log likelihoods are always negative; often, we work with **negative log
likelihoods** for convenience.

**Goodness of fit**

Logistic regression does not give rise to an *R*^{2}_{adj}
statistic. Darlington (1990, page 449) recommends the following statistic
as a measure of goodness of fit:

*LRFC*_{1} = ( exp[(*LL*_{model} - *LL*_{0})/*N*] - 1 ) / ( exp(-*LL*_{0}/*N*) - 1 )

where exp refers to the exponential function (the inverse of the log
function), *N* as usual is sample size, and *LL*_{model}
and *LL*_{0} are the log likelihoods of the data under the
model and the null hypothesis respectively. (Note that I have changed Darlington's
notation a little to make it fit in with that used in the rest of these
notes.) Darlington's statistic is useful because it takes values between
0 and 1 (or 0% and 100%) which have much the same interpretation as values
of *R*^{2} or *R*^{2}_{adj}
in a linear regression, although unfortunately it looks from the formula
that, of the two, it is more closely analogous to *R*^{2}.
Unfortunately SPSS does not report this statistic. However, it does report
*negative* log likelihoods, multiplied by 2, so with a little adjustment
these can be inserted in the equation for *LRFC*_{1}.
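
As a sketch of that adjustment (hypothetical Python; the -2*LL* values are invented), recovering the log likelihoods is just a change of sign and a division by 2:

```python
import math

def lrfc1(minus2ll_model, minus2ll_null, n):
    """Darlington's goodness-of-fit statistic from SPSS's -2LL values."""
    ll_model = -minus2ll_model / 2   # recover the log likelihoods
    ll_null = -minus2ll_null / 2
    return ((math.exp((ll_model - ll_null) / n) - 1)
            / (math.exp(-ll_null / n) - 1))

# e.g. lrfc1(minus2ll_model=95.2, minus2ll_null=138.6, n=100)
# (these -2LL values are invented for illustration)
```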

Rather than using a goodness of fit statistic, though, we often want
to look at the proportion of cases we have managed to classify correctly.
For this we need to look at the **classification table** printed out
by SPSS, which tells us how many of the cases where the observed value
of the dependent variable was 1 have been predicted with a value 1, and
so on. An advantage of the classification table is that we can get one
out of either logistic regression or discriminant analysis, so we can use
it to compare the two approaches. Statisticians claim that logistic regression
tends to classify a higher proportion of cases correctly.
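
A classification table is easy to build by hand from predicted probabilities. Here is a minimal Python sketch (the names are invented), with the cutoff made explicit because we will want to vary it below:

```python
def classification_table(observed, predicted_p, cutoff=0.5):
    """Cross-tabulate observed 0/1 values against predicted categories."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(observed, predicted_p):
        table[(y, int(p > cutoff))] += 1     # (observed, predicted) cell
    correct = table[(0, 0)] + table[(1, 1)]  # cases on the main diagonal
    return table, correct / len(observed)
```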

Another very useful piece of information for
assessing goodness of fit can be gained by using the /CLASSPLOT subcommand.
This causes SPSS to print distributions of predicted probabilities, distinguishing
the observed category values. The resulting plot is very useful for spotting
possible outliers. It will also tell you whether it might be better to
separate the two predicted categories by some rule other than the simple
one SPSS uses, which is to predict value 1 if logit(*p*) is greater
than 0 (i.e. if *p* is greater than 0.5). A better separation of categories
might result from using a different criterion. We might also want to use
a different criterion if the *a priori* probabilities of the two categories
were very different (one might be a rare disease, for example), or if the
costs of mistakenly predicting someone into the two categories differ (suppose
the categories were "found guilty of murder" and "not guilty",
for example). The following is an example of such a CLASSPLOT:

[CLASSPLOT output: a frequency histogram of the predicted probability of membership in the "found guilty" group, with predicted probability running from 0 to 1 and frequencies up to about 32. Symbols: n = not guilty, f = found guilty; each symbol represents 2 cases. A tall column of f's sits at the extreme right of the plot (predicted probability near 1), while the n's, with a scattering of f's among them, spread across the lower probabilities.]

If we were called as expert witnesses to advise the court about the probability that the person accused had committed murder, using the variables in this particular logistic regression model, we might want to set a predicted probability criterion of .9 rather than .5.
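
With the classification-table sketch above, trying such a rule is a one-line change: call `classification_table(observed, predicted_p, cutoff=0.9)` (again, hypothetical names) and compare the proportion classified correctly with what the default cutoff of .5 gives.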

SPSS will offer you a variety of statistical tests. Usually, though,
overall significance is tested using what SPSS calls the *Model Chi*-*square*,
which is derived from the likelihood of observing the actual data under
the assumption that the model that has been fitted is accurate. It is convenient
to use -2 times the log (base *e*) of this likelihood; we call this
-2*LL*. The difference between -2*LL* for the best-fitting model
and -2*LL* for the null hypothesis model (in which all the *b*
values are set to zero) is distributed like chi-squared, with degrees of
freedom equal to the number of predictors; this difference is the *Model
chi*-*square* that SPSS refers to. Very conveniently, the difference
between -2*LL* values for models with successive terms added also
has a chi-squared distribution, so when we use a stepwise procedure, we
can use chi-squared tests to find out if adding one or more extra predictors
significantly improves the fit of our model.
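
In code, the test is a one-liner once the two -2*LL* values are known. A sketch (Python with scipy; the numbers are invented):

```python
from scipy.stats import chi2

minus2ll_null = 138.6    # -2LL with all b values set to zero (invented)
minus2ll_model = 95.2    # -2LL for the fitted model (invented)
n_predictors = 4

model_chisq = minus2ll_null - minus2ll_model     # the Model chi-square
p_value = chi2.sf(model_chisq, df=n_predictors)  # upper-tail probability
print(model_chisq, p_value)
```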

How can we *describe* the effect of a single regressor in logistic
regression? The fundamental equation for logistic regression tells us that
with all other variables held constant, there is a constant increase of
*b*_{1} in logit(*p*) for every 1-unit increase in *x*_{1},
and so on. But what does a constant increase in logit(*p*) mean? Because
the logit transformation is non-linear, it does not mean a constant increase
in *p*; so the increase in *p* associated with a 1-unit increase
in *x*_{1} changes with the value of *x*_{1}
you begin with.

It turns out that a constant increase in logit(*p*) does have a
reasonably straightforward interpretation. It corresponds to a constant
*multiplication* (by exp(*b*)) of the **odds** that the dependent
variable takes the value 1 rather than 0. So, suppose *b*_{1}
takes the value 2.30 - we choose this value as an example because exp(2.30)
equals 10, so the arithmetic will be easy. Then if *x*_{1}
increases by 1, the odds that the dependent variable takes the
value 1 increase tenfold. So, with this value of *b*_{1},
let us suppose that with all other variables at their mean values, and
*x*_{1} taking the value 0, we predict a logit(*p*) of
0; this means that there is an even chance of the dependent variable taking
the value 1. Now suppose *x*_{1} increases to 1. The odds
that the dependent variable takes the value 1 rise by a factor of ten,
so they go from an even chance (1:1) to 10:1, i.e. *p* changes to
0.909. If *x*_{1} further increases to 2, then the odds will
move to 100:1, a *p* value of 0.990; and so on. This leads to a convenient
way of representing the results of logistic regression by a plot showing
the odds change produced by unit changes in different independent variables.
A good example is the figure in Johnson et al's (1992) report on risk factors
for contracting AIDS.
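
The arithmetic of this example can be traced in a few lines of Python (a sketch, using the same illustrative value of *b*_{1}):

```python
import math

b1 = 2.30                       # exp(2.30) is almost exactly 10
for x1 in [0, 1, 2]:
    log_odds = b1 * x1          # predicted logit(p), other terms at zero
    odds = math.exp(log_odds)   # 1, 10, 100: a tenfold increase per unit
    p = odds / (1 + odds)       # 0.5, 0.909, 0.990
    print(x1, round(odds), round(p, 3))
```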

SPSS prints out the value of what it calls the **Wald statistic**
for each regressor in each model, together with a corresponding significance
level. The Wald statistic has a chi-squared distribution, but apart from
that it is used in just the same way as the *t* values for individual
regressors in linear regression. However, the Wald test gives the wrong
results for very high coefficient values, and if you encounter those, you
should use the difference of -2*LL* values for models with and without
the predictor instead.
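
The Wald statistic itself is simple to compute from a coefficient and its standard error (a Python sketch with invented values; SPSS reports the statistic directly):

```python
from scipy.stats import chi2

b, se_b = 1.8, 0.6            # invented coefficient and standard error
wald = (b / se_b) ** 2        # Wald statistic, 1 degree of freedom
p_value = chi2.sf(wald, df=1)
print(wald, p_value)          # 9.0, about .0027
```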

Some parts of this section may have seemed rather complex, and you may be tempted to give up on logistic regression at this point, deciding that you will never understand it. Remember, though, that the quantitative interpretation of the coefficients does not matter too much if all you want to do is any or all of the following:

- see how well you can classify people into groups from a knowledge of independent variables; this is addressed by the classification table and the *LRFC*_{1} goodness of fit statistic discussed above;
- see whether the independent variables as a whole significantly affect the dependent variable; this is addressed by the Model Chi-square statistic;
- identify the best variables to use in prediction. This is more complex than with linear regression, because SPSS does not give you **beta** values directly in the logistic regression output. But you can if necessary compare regressors by multiplying each coefficient by the standard deviation of the corresponding variable (see the sketch after this list). The results will not be beta values, but their ranking will reflect the relative importance of the regressors in the same way as beta values do;
- determine whether particular independent variables have significant effects on the dependent variable; this can be done using the Wald statistics which SPSS produces, or by comparing the -2*LL* values for models with and without the variables concerned.
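
For the third point, the ranking calculation might look like this (a sketch with invented coefficients and standard deviations):

```python
# Coefficients from the logistic regression output and the standard
# deviations of the corresponding variables (all values invented)
coefficients = {"age": 0.04, "class": 0.85, "att1": 0.30}
std_devs = {"age": 12.0, "class": 0.9, "att1": 2.5}

scaled = {name: coefficients[name] * std_devs[name] for name in coefficients}
for name, value in sorted(scaled.items(), key=lambda kv: -abs(kv[1])):
    print(name, round(value, 2))   # ranked as beta values would rank them
```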

**References and further reading**

- SPSS Advanced Statistics manual, for versions 4 onwards.
- Darlington, R. B. (1990). *Regression and linear models*. New York: McGraw-Hill. Chapter 18.
- Johnson, A. M., Wadsworth, K., Bradshaw, S., & Field, J. (1992). Sexual lifestyles and HIV risk. *Nature*, *360*, 410-412.
- Press, S. J., & Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. *Journal of the American Statistical Association*, *73*.

**Examples**

- The SPSS system file /singer1/eps/psybin/stats/atmmini.sys is a cut-down version of a file obtained from the ESRC data archive in Essex. This file is also available (in portable format) on the PSYCHO fileserver, currently in directory \scratch\segl\stats. Copy the system file into your own file space.
- The study examined the factors that influenced whether or not people had cash cards, and the original report consists of a long series of cross-tabulations. But obviously the data are ideally suited to logistic regression. The variables we have included are HAVECARD (the dependent variable), age, sex, 10 attitude scales, judgements of how serious various problems were (prob1 to prob10), income (inc), and problems in use (useprob1 to useprob3). Carry out two logistic regressions, using first FSTEP and then BSTEP.
- Carry out a couple of discriminant analyses on these data (using different methods). Compare the results with those of the logistic regression.

Remember not to take assumptions for granted.


Stephen Lea

University of Exeter Department of Psychology

Washington Singer Laboratories

Exeter EX4 4QG

United Kingdom

Tel +44 1392 264626

Fax +44 1392 264623
