University of Exeter

DEPARTMENT OF PSYCHOLOGY


PSY6003 Advanced statistics: Multivariate analysis II: Manifest variables analyses

Topic 4: Logistic regression and discriminant analysis


Contents of this handout: The problem of dichotomous dependent variables; Discriminant analysis; Logistic regression - theory; Logistic regression (and discriminant analysis) in practice; Interpreting and reporting logistic regression results; References and further reading; Examples.

The Problem: Categorical dependent variables

A limitation of ordinary linear models is the requirement that the dependent variable is numerical rather than categorical. But many interesting variables are categorical - patients may live or die, people may pass or fail MScs and so on. A range of techniques have been developed for analysing data with categorical dependent variables, including discriminant analysis, probit analysis, log-linear regression and logistic regression. To contrast it with these, the kind of regression we have used so far is usually referred to as linear regression.

The various techniques listed above are applicable in different situations: for example, log-linear regression requires all regressors to be categorical, whilst discriminant analysis strictly requires them all to be continuous (though dummy variables can be used as in multiple regression). In SPSS at least, logistic regression is easier to use than discriminant analysis when we have a mixture of numerical and categorical regressors, because it includes procedures for generating the necessary dummy variables automatically.


Discriminant analysis

The major purpose of discriminant analysis is to predict membership in two or more mutually exclusive groups from a set of predictors, when there is no natural ordering on the groups. So we may ask whether we can predict whether people vote Labour or Conservative from a knowledge of their age, their class, attitudes, values etc etc.

Discriminant analysis is just the inverse of a one-way MANOVA, the multivariate analysis of variance. The levels of the independent variable (or factor) for MANOVA become the categories of the dependent variable for discriminant analysis, and the dependent variables of the MANOVA become the predictors for discriminant analysis. In MANOVA we ask whether group membership produces reliable differences on a combination of dependent variables. If the answer to that question is 'yes' then clearly that combination of variables can be used to predict group membership.

Mathematically, MANOVA and discriminant analysis are the same; indeed, the SPSS MANOVA command can be used to print out the discriminant functions that are at the heart of discriminant analysis, though this is not usually the easiest way of obtaining them. These discriminant functions are the linear combinations of the standardised independent variables which yield the biggest mean differences between the groups. If the dependent variable is a dichotomy, there is one discriminant function; if there are k levels of the dependent variable, up to k-1 discriminant functions can be extracted, and we can test how many it is worth extracting. Successive discriminant functions are orthogonal to one another, like principal components, but they are not the same as the principal components you would obtain if you just did a principal components analysis on the independent variables, because they are constructed to maximise the differences between the values of the dependent variable.
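For readers who want to see a discriminant function computed outside SPSS, the following is a minimal sketch in Python using scikit-learn (the package, the simulated data and the variable names are illustrations only and are not part of the course materials):

# A minimal sketch, not from the handout: extracting a discriminant
# function with scikit-learn from simulated data. The group labels
# (Labour vs Conservative) and predictors are purely illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, n)               # 0 = Labour, 1 = Conservative (hypothetical)
age = rng.normal(40 + 5 * group, 10, n)     # group means differ on age
attitude = rng.normal(0.8 * group, 1, n)    # and on an attitude scale

X = np.column_stack([age, attitude])
lda = LinearDiscriminantAnalysis().fit(X, group)

# With a dichotomous dependent variable there is a single discriminant
# function; its coefficients give the linear combination of predictors
# that best separates the two groups.
print("discriminant function coefficients:", lda.scalings_.ravel())
print("proportion correctly classified:", lda.score(X, group))

With k groups rather than two, scalings_ would contain up to k-1 columns, one for each discriminant function.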

The commonest use of discriminant analysis is where there are just two categories in the dependent variable; but as we have seen, it can be used for multi-way categories (just as MANOVA can be used to test the significance of differences between several groups, not just two). This is an advantage over logistic regression, which is always described for the problem of a dichotomous dependent variable.

You will encounter discriminant analysis fairly often in journals. But it is now being replaced with logistic regression, as this approach requires fewer assumptions in theory, is more statistically robust in practice, and is easier to use and understand than discriminant analysis. So we will concentrate on logistic regression.


Logistic regression: theory

Just like linear regression, logistic regression gives each regressor a coefficient b which measures the regressor's independent contribution to variations in the dependent variable. But there are technical problems with dependent variables that can only take values of 0 and 1. What we want to predict from a knowledge of relevant independent variables is not a precise numerical value of the dependent variable, but rather the probability (p) that it is 1 rather than 0. We might think that we could use this probability as the dependent variable in an ordinary regression, i.e. as a simple linear function of the regressors, but we cannot, for two reasons. First, numerical regressors may be unlimited in range. If we expressed p as a linear function of income, we might then find ourselves predicting that p is greater than 1 (which cannot be true, as probabilities can only take values between 0 and 1). Second, there is a problem of additivity. Imagine that we are trying to predict success at a task from two dichotomous variables, training and gender. Among untrained individuals, 50% of men succeed and 70% of women. Among trained men, 90% succeed. If we thought of p as a linear function of gender and training we would have to estimate the proportion of trained women as 70% plus 40% = 110% (which again cannot be true).

We get over this problem by making a logistic transformation of p, also called taking the logit of p. Logit(p) is the log (to base e) of the odds or likelihood ratio that the dependent variable is 1. In symbols it is defined as:

logit(p)=log(p/(1-p))

Whereas p can only range from 0 to 1, logit(p) ranges from negative infinity to positive infinity. The logit scale is symmetrical around the logit of 0.5 (which is zero), so the table below only includes a couple of negative values.

Table 1. The relationship between probability of success (p) and logit(p)

p:          .2      .3      .5      .7      .8      .9      .95     .99
logit(p): -1.386  -0.847   0.000   0.847   1.386   2.197   2.944   4.595

This table makes it clear that differences between extreme probabilities are spread out; the difference in logits between success rates of .95 and .99 is much bigger than that between .5 and .7. In fact the logit scale is approximately linear in the middle range and logarithmic at extreme values.

We do not know that the logit scale is the best possible scale but it does seem intuitively reasonable. If we consider the example of training and gender used above, we can see how it works. On the logit scale, for untrained individuals, the difference of logits between men (success rate 0.50, logit 0.0) and women (success rate 0.70, logit 0.847) is 0.847. The success rate for trained men is .9 (logit 2.197), so we conclude that training makes a difference of logits of 2.197. We therefore predict for trained women a logit of 2.197 + 0.847 = 3.044 - which corresponds to a success probability of .955.
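To make the arithmetic concrete, here is a minimal sketch in Python (not part of the original handout) that reproduces this worked example by adding effects on the logit scale and converting the result back to a probability:

# A minimal sketch reproducing the training-and-gender example:
# effects are additive on the logit scale, and the combined logit is
# converted back to a probability at the end.
import math

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Probability corresponding to a given logit."""
    return 1 / (1 + math.exp(-x))

gender_effect = logit(0.70) - logit(0.50)     # about 0.847
training_effect = logit(0.90) - logit(0.50)   # about 2.197

predicted_logit = logit(0.50) + gender_effect + training_effect
print(round(predicted_logit, 3))                 # about 3.04
print(round(inverse_logit(predicted_logit), 3))  # about 0.955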

It follows that logistic regression involves fitting to the data an equation of the form:

logit(p)= a + b1x1 + b2x2 + b3x3 + ...

The meaning of the coefficients b1, b2, etc is discussed below.

Although logistic regression finds a "best fitting" equation just as linear regression does, the principles on which it does so are rather different. Instead of using a least-squares criterion for the best fit, it uses a maximum likelihood method, which maximises the probability of obtaining the observed results given the fitted regression coefficients. A consequence of this is that the goodness of fit and overall significance statistics used in logistic regression are different from those used in linear regression.
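As an illustration of the fitting process, the following Python sketch (assuming the statsmodels package, which is not mentioned in the handout; the simulated data and coefficient values are invented for the example) fits an equation of the above form by maximum likelihood:

# A minimal sketch: fitting logit(p) = a + b1*x1 + b2*x2 by maximum
# likelihood to simulated dichotomous data (statsmodels assumed).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * x1 + 0.5 * x2)))  # true model on the logit scale
y = rng.binomial(1, p)                               # dichotomous dependent variable

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.Logit(y, X).fit()          # iteratively maximises the likelihood
print(fit.params)                   # estimates of a, b1 and b2
print(fit.llf)                      # the maximised log-likelihood

The maximised log-likelihood printed at the end is the quantity on which the goodness of fit and overall significance statistics mentioned above are based.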


Logistic regression (and discriminant analysis) in practice

Logistic regression is not available in Minitab but is one of the features relatively recently added to SPSS. The advanced statistics manuals for SPSS versions 4 onwards describe it well. If you are already familiar with the REGRESSION command, LOGISTIC REGRESSION is fairly straightforward to use, and we suggest that you browse through the menu version of SPSS to learn the details. A simple example will illustrate the parallels. Imagine that we had carried out a study of voting and wished to know how best to predict whether people had voted Conservative or Labour. The commands would be:

LOGISTIC REGRESSION /VARIABLES voting WITH age sex class
    att1 att2 att3 att4 extro psycho neuro
    /METHOD FSTEP(LR)
    /CLASSPLOT.

The dependent variable is separated from the independent variables by the term WITH. The METHOD subcommand uses the keyword FSTEP to specify a forward stepwise procedure; we could also use BSTEP, which does a backward stepwise analysis, i.e. it starts by entering all the variables and then takes them out one at a time; or ENTER if we were engaged in hypothesis testing rather than exploratory analysis. If no METHOD subcommand is given, ENTER will be assumed. The (LR) term after FSTEP specifies that likelihood ratio considerations will be used in selecting variables to add to or delete from the model; this is preferable but can slow computation, so it may be necessary to omit it. The /CLASSPLOT line is not strictly necessary but aids interpretation - see below.

A useful property of the LOGISTIC REGRESSION command is that it can cope automatically with categorical independent variables; we don't have to write a loop as we do for linear regression. All we have to do is declare any categorical variables on a /CATEGORICAL subcommand as well as on the /VARIABLES subcommand. The /CONTRAST subcommand should be used to control which category is dropped out when the dummy variables are formed; if the control or modal category of, say, a variable DIAGNOST was its third value, we would use the subcommand /CONTRAST(DIAGNOST)=INDICATOR(3) to tell LOGISTIC REGRESSION to drop level 3 of the variable in forming dummy variables. Although this is an improvement over what we have to do when using SPSS to carry out linear regression, there is a snag. /CONTRAST likes its category levels specified in rather an odd way: it numbers the categories in order (1, 2, 3, ...) rather than using the values with which they are coded, so in the example 3 might not be the value used to code the modal category in DIAGNOST. For example, if psychotic, neurotic and normal people were coded 0, 1 and 2, the correct entry in /CONTRAST for the normal group would be 3, not 2. Look, I didn't write this idiot system, I'm just trying to tell you about it.
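The following Python sketch (using pandas purely for illustration; it is not part of the SPSS procedure, and the data are hypothetical) shows what indicator coding with a dropped reference category amounts to for a three-level DIAGNOST variable:

# A minimal sketch of what INDICATOR contrasts do: a three-level
# categorical variable becomes two dummy variables, with one level
# dropped as the reference category. The data are hypothetical.
import pandas as pd

diagnost = pd.Series(["psychotic", "neurotic", "normal", "neurotic", "normal"],
                     name="DIAGNOST")

# Drop the "normal" column so that normal people act as the reference
# category, analogous to /CONTRAST(DIAGNOST)=INDICATOR(3) above.
dummies = pd.get_dummies(diagnost).drop(columns="normal")
print(dummies)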

As in linear regression, there is no need to declare dichotomous independent variables as categorical.

We can also use SPSS to carry out discriminant analysis. For the example just considered, the commands would be:

DISCRIMINANT GROUPS=voting(0,1)
    /VARIABLES = age sex class att1 to att4 extro psycho neuro
    /METHOD=MINRESID
    /STATISTICS=TABLE.

Note that we have to specify the two possible levels of the dependent variable (voting). We can use the /METHOD subcommand to request a variety of stepwise methods (RAO is another you might like to try), or to ENTER all or a subset of variables. The subcommand /STATISTICS=TABLE is needed to obtain the classification table used for assessing goodness of fit (see below).
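For comparison, here is a minimal Python sketch (scikit-learn assumed; the simulated data and variable names are illustrative only) of the kind of classification table that /STATISTICS=TABLE produces, cross-tabulating actual against predicted group membership:

# A minimal sketch of a classification table for a discriminant
# analysis: rows are actual groups, columns are predicted groups, and
# the diagonal counts the correctly classified cases.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(2)
n = 300
voting = rng.integers(0, 2, n)                         # 0/1 outcome, as in GROUPS=voting(0,1)
predictors = rng.normal(voting[:, None], 1.0, (n, 3))  # three predictors shifted by group

lda = LinearDiscriminantAnalysis().fit(predictors, voting)
predicted = lda.predict(predictors)

print(confusion_matrix(voting, predicted))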


Interpreting and reporting logistic regression results


References and further reading



Examples

  1. The SPSS system file /singer1/eps/psybin/stats/atmmini.sys is a cut-down version of a file obtained from the ESRC data archive in Essex. This file is also available (in portable format) on the PSYCHO fileserver, currently in directory \scratch\segl\stats. Copy the system file into your own file space.
  2. The study examined the factors that influenced whether or not people had cash cards, and the original report consists of a long series of cross-tabulations. But obviously the data are ideally suited to logistic regression. The variables we have included are HAVECARD (the dependent variable) and age, sex, 10 attitude scales, judgements of how serious various problems were (prob1 to prob10), income (inc) and problems in use (useprob1 to useprob3). Carry out two logistic regressions, using first FSTEP and then BSTEP.
  3. Carry out a couple of discriminant analyses on these data (using different methods). Compare the results with those of the logistic regression.

Remember to check the assumptions of each technique rather than taking them for granted.



Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623




Document revised 11th March 1997