PSY6003 Advanced statistics: Multivariate analysis II: Manifest variables analyses

This FAQ file includes (or will soon include) questions and answers about the PSY6003 multivariate analysis II course component, covering: basic techniques of linear regression; more advanced linear regression topics; path analysis; logistic regression; and ordered logistic regression.

The PSY6003 Multivariate Analysis II course component

Q: I am a student or researcher at another institution. I've found your course materials on the web. Is it OK for me to use them? Do I have to pay anything?

A: Welcome. Yes, it's perfectly OK for you to use our materials, but please read our copyright and disclaimer notice. Basically it says you are welcome to read and use our notes, but please credit us in any reports or papers they contribute to, and you're not allowed to pass them off as your own or sell them. I might add that we're always interested to hear from you if you find them useful - please email Stephen Lea and let us know.

Basic techniques of linear regression

Q: I'm struggling to see how ancova and multiple regression are related and different.

A: Let's see if I can do any better at clearing it up. Anova and multiple regression are both special cases of the General Linear Model. Anova is the special case that is appropriate when ALL the independent variables (in this case called factors) can be arranged to be independent of each other (orthogonal, in the jargon) - usually because we are doing an experiment. Multiple regression is the special case we have to use when NONE of the independent variables can be arranged to be independent of one another, so there is likely to be some correlation between any two of them. Ancova is simply what we use in the more general case where we have a mixture of the two: SOME (at least 2) of the independent variables can be arranged to be independent of each other, and SOME (at least one) cannot be arranged to be independent of the others. So the effects of the main factors in an ancova are analysed by anova techniques; the effects of the covariates are analysed by regression techniques.

If we have independent variables which we are able to make independent of each other by design, it is always better to use anova or ancova than multiple regression. This is because anova/ancova allow us to ask more complex questions of the data than multiple regression, most especially about interactions; correspondingly, anova requires less in the way of assumptions about how the effects of the variables combine. There are ways of getting at interactions through multiple regression, but they are fairly clumsy.

Just to confound confusion further, Minitab includes a command called GLM (General Linear Model) which does anovas/ancovas by a regression technique. This allows us to cope with factorial designs where the numbers of subjects in the various groups are unbalanced - thus introducing a correlation between the factors, but not the usual sort.
There are some good notes about GLM within some notes on repeated measures analysis of variance using Minitab prepared by Frank Bokhurst of the University of Cape Town; the URL is:

http://www.uct.ac.za/depts/psychology/bok/repeat.mini

Q: I don't understand how two sets of data can have identical regression constants (a and b values) and different R2 values.

A: The issue is that:

1. a and b say WHAT the best fitting relationship is, but R2 says HOW WELL it describes the data.
2. In consequence, R2 doesn't have any units - it is a "pure number" - changing the units of x and y can't affect how good the fit between them is. But a and b do have units - since b expresses a slope, its units are (units of y)/(units of x). So if we change the units of x or y, we have to change the units of b.

It also happens to be true that:

• the square root of R2 is the ordinary correlation coefficient, r
• if we change x and y to "standard deviation units", which we can do by subtracting the mean of the x scores from each x score and then dividing by the standard deviation of x, and likewise subtracting the mean of the y scores from each y score and dividing by the standard deviation of y, then it turns out (this is not something that should be obvious - it has to be proved mathematically) that b=r.
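Both facts are easy to verify numerically. A quick sketch with made-up data (the particular intercept, slope and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(size=200)   # invented data with some scatter

b = np.polyfit(x, y, 1)[0]        # ordinary regression slope on the raw scores
r = np.corrcoef(x, y)[0, 1]       # ordinary correlation coefficient

# Convert both variables to "standard deviation units" (z-scores)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
b_std = np.polyfit(zx, zy, 1)[0]  # slope after standardizing

print(f"b = {b:.4f}, r = {r:.4f}, standardized b = {b_std:.4f}")  # b_std equals r
```

Changing the units of x or y (say, multiplying x by 100) changes b but leaves r and the standardized slope untouched.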

More advanced topics in multiple regression

Q: When choosing the best regression model, the handout says that "We could choose the model with the highest value of R2adj, or we could choose the model with the highest value of F". While I understand when to use the R2adj criterion, I cannot understand why the F value is used when we want an economical model.

A: There isn't, so far as I know, an objective justification. The justification for using the F value AT ALL is that it is a reasonable (being related to the amounts of variance attributable to the model and to "error", i.e. unexplained causes) and convenient (because it is calculated anyway) criterion. The justification for using it when you want an economical model is that, if R2adj and F pick out different models, the one that F picks out will have fewer variables in it.

Q: I don't understand why the regression coefficients will be unstable if the independent variables are multicollinear.

A: The handout didn't try to explain why - this is something that can be proved mathematically. However, you can get an intuitive grasp on it. If two variables are closely related to each other, the relative size of their regression coefficients will be strongly affected by the relatively small number of cases in the sample where they disagree. If we have two samples of equal size and similar characteristics, the number and properties of these disagreement cases is likely to change a lot, proportionately, just because of ordinary sampling error. So the regression coefficients of the related variables will change a lot from sample to sample, which is what we mean by saying they are unstable.
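The instability is easy to demonstrate by simulation. In this sketch (all data invented), x2 is almost a copy of x1; refitting on random half-samples makes the individual coefficients jump around, even though their sum stays stable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # x2 almost identical to x1: multicollinear
y = x1 + x2 + rng.normal(size=n)

def coefs(idx):
    """Least-squares coefficients (a, b1, b2) fitted on the cases in idx."""
    X = np.column_stack([np.ones(len(idx)), x1[idx], x2[idx]])
    return np.linalg.lstsq(X, y[idx], rcond=None)[0]

# Refit on repeated random half-samples and watch b1 and b2 swing around
for _ in range(5):
    idx = rng.choice(n, size=n // 2, replace=False)
    a, b1, b2 = coefs(idx)
    print(f"b1 = {b1:6.2f}, b2 = {b2:6.2f}, b1 + b2 = {b1 + b2:5.2f}")
```

Notice that b1 + b2 is well determined - it is only the division of credit between the two near-duplicate variables that is unstable.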

Q: I don't understand why it is that if you have as many independent variables as you have data points, you can predict the dependent variable value perfectly - but you are not explaining it at all.

A: This point is related to the concept of degrees of freedom. If I want to write an equation to fit 3 data points as well as possible, and I allow myself two independent variables, the equation will in general terms be:

y = a + b1x1 + b2x2

This equation has only 3 unknown constants (a, b1, b2). Suppose the 3 data points are

y=1, x1=2, x2=2
y=2, x1=2, x2=1
y=3, x1=3, x2=3

Then I can substitute these 3 points into my general equation & get three equations:

1 = a + 2b1 + 2b2
2 = a + 2b1 + b2
3 = a + 3b1 + 3b2

This gives me 3 equations in 3 unknowns, so I can solve it exactly and find the values of a, b1 and b2 that make it exactly true. It's not a "best fit" but a "perfect fit". But it can't be said to predict the values of y from the x1 and x2 values, because the two sets of information are exactly equivalent: we either have to know three y values, or three regression constants (a, b1 and b2).
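This can be checked by solving the three equations directly; a quick NumPy sketch:

```python
import numpy as np

# Coefficient matrix: one row per data point, columns for a, b1, b2
A = np.array([[1, 2, 2],    # 1 = a + 2*b1 + 2*b2
              [1, 2, 1],    # 2 = a + 2*b1 + 1*b2
              [1, 3, 3]], dtype=float)
y = np.array([1, 2, 3], dtype=float)

a, b1, b2 = np.linalg.solve(A, y)
print(a, b1, b2)         # a = -3, b1 = 3, b2 = -1 (up to rounding)
print(A @ [a, b1, b2])   # reproduces y exactly: zero residuals, a "perfect fit"
```

Because the residuals are exactly zero, there are no degrees of freedom left over with which to assess how well the model would generalise.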

Q: I don't see how to interpret the regression coefficient for a dummy variable.

A: The simplest way of looking at the coefficient for a dummy variable (say, for district 4 in the example in the handout) is to see it as the difference, with all other variables held constant, between dependent variable values (e.g. rated neighbourliness) for a respondent in district 4 and for a respondent in the district you dropped (say, district 1).

Q: When I run a stepwise regression on the 'neighbourliness' data, the analysis picks up on the second of the districts as having a high partial correlation with 'neighbourliness' and adds it into the model. However, aren't we interested to see whether the variable of district as a whole predicts neighbourliness? Can this be done with a variable that has been divided into dummy variables, in a stepwise regression?

A: This is a very common puzzle. So far as I can see (I have never found a textbook treatment), there are two possible responses: (a) decide that what we are interested in is the entire categorical variable, so if the stepwise procedure picks up any of its dummies, force the rest into the final model, on the grounds that separating one of them from the rest is artificial (you can do this using the method=enter subcommand in SPSS); (b) decide that what the stepwise procedure is telling you is that we can collapse the original "districts 1-4" categorization into a simpler "district 2 vs all others" categorization (rather as one might following a contrasts analysis in an anova design); in this case retaining the single dummy that has been picked out by stepwise will do just what we want. I tend to adopt (b), but I think which is sounder might depend on what the variables are, and what stage of the research you are at - in other words, it is not entirely a mathematical/statistical question.

Q: You state that ordinal dependent variables can usually be used with ordinary multiple regression, so long as there are enough levels. How many is enough?

A: 7 is generally reckoned to be OK; 3 is definitely not OK. In between is a grey area. Personally I would stick with ordered logit up to 5 categories. However, it does depend on the distribution of observed values - if that is very odd (heavily skewed, say, or bimodal) I would expect ordered logit to be more robust.

Path analysis

Q: Why do we use 1-R2 and not 1-R2adj to calculate the error variance?

A: R2 is directly related to the proportions of variance accounted for by the model and by error. R2adj, in its efforts to take the number of degrees of freedom into account, makes that relationship more obscure.

Q: A colleague of mine told me that the model you presented in your handout is "saturated", that is, there are arrows going from every part of the model to every other part. Is it better to fix some relationships between parts of the model to zero, so that we have more degrees of freedom?

A: It depends on the nature of your research question and the stage the research has reached. With simple models, the saturated model is likely to be best; with complex ones, it is likely to be impossible. But there may be some relationships that can be discounted completely on theoretical grounds, and if there are any such, you should certainly get rid of them. The simpler the model, the more powerful the tests of the hypotheses it does test - but, of course, the fewer hypotheses it tests. As so often, it is a compromise, and we have to use scientific rather than statistical knowledge to decide where to draw the line.

Q: I have recently read in academic journals about techniques more advanced than path analysis, called structural equation modelling techniques. Do you know whether we can include feedback loops in a model using these techniques instead of path analysis?

A: I don't have a lot of experience with structural equation modelling (LISREL, AMOS and similar techniques) but my understanding is that you can't model feedback relations in them, either. If you have a problem where feedback is inherently involved, you probably need to get into control theory. There are books about this kind of analysis for psychologists, mostly in the area of motivation - Toates and McFarland have both done a lot of work in this area.

Logistic regression and discriminant analysis

Q: What do you mean by "a linear combination of variables" when describing discriminant analysis?

A: Suppose x1, x2, x3 are independent variables: then a linear combination is simply any function y such that y = a + b1x1 + b2x2 + b3x3, where a, b1, b2, b3 are constants. A discriminant function is a linear combination of variables, as just defined, such that the difference(s) of mean y values between 2 or more groups is maximised: i.e. the function discriminates maximally between the 2 (or more) groups.
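For the two-group case there is a closed-form solution: Fisher's rule makes the coefficient vector proportional to Sw^-1 (m2 - m1), where Sw is the pooled within-group covariance and m1, m2 are the group mean vectors. A small illustrative sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
g1 = rng.normal([0, 0, 0], 1.0, size=(50, 3))   # group 1 scores on x1, x2, x3
g2 = rng.normal([1, 2, 0], 1.0, size=(50, 3))   # group 2 (invented means)

m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
Sw = np.cov(g1, rowvar=False) + np.cov(g2, rowvar=False)  # pooled within-group spread
b = np.linalg.solve(Sw, m2 - m1)   # Fisher's rule: b proportional to Sw^-1 (m2 - m1)

# y = b1*x1 + b2*x2 + b3*x3: the linear combination that best separates the groups
y1, y2 = g1 @ b, g2 @ b
print("group mean difference on y:", y2.mean() - y1.mean())
```

Any constant a and any rescaling of b leaves the discrimination unchanged; what matters is the direction of the coefficient vector.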

Q: "Successive discriminant functions are orthogonal to one another, like principal components." Do you mean that they have very low or no correlation among them?

A: They have ZERO correlation. In matrix terms, the cross products of their coefficient vectors (the sum of the products of corresponding coefficients) are zero.

Q: When I ran the discriminant function analysis in the SPSS for Windows the computer gave me two options regarding the classification of the subjects to one of the two locus of control groups: (1) Specify equal prior probabilities for the two groups or (2) Let the observed group sizes in my sample determine the probabilities of group membership. Which should I use?

A: The two different possibilities with discriminant analysis are akin to the decision to adjust the group membership criterion, or not, by inspection of the classification table and CLASSPLOT following a logistic regression. "Specifying equal priors" is similar to dividing the two groups at the point where logit(p)=0. "Letting observed group sizes determine probabilities of group membership" is similar (not identical) to using the CLASSPLOT to put the criterion in the most advantageous place. Strictly, it involves adjusting the criterion so the numbers predicted into each group are the same as the numbers observed in each group. As so often, the choice between the two ways of setting the criterion is a scientific rather than a statistical matter - it depends on the meaning you attach to the data and the use you will make of the results.

Q: In carrying out logistic regression, which method of independent variable selection is most appropriate? The ones available in SPSS for Windows are Enter, Forward: Conditional, Forward: LR, Forward: Wald, Backward: Conditional, Backward: LR, Backward: Wald.

A: (i) Whether you use Enter or Forward/Backward depends on what stage you are at with your research. If you are testing the hypothesis that your independent variables taken together predict your dependent variable, you should use Enter. If you are exploring to find variables which you can test as predictors in a subsequent study or hold-out sample, you should use Forward/Backward.

(ii) If you use any kind of Forward method, you should also use the corresponding Backward method, to check against local maxima.

(iii) Generally speaking, the LR method is theoretically preferable to Wald, but more expensive in computer time. So if you are prepared to wait, use LR. I am not sure about Conditional. Try the SPSS Advanced Statistics Manual for more information - it is quite good on Logistic Regression.

Q: In a forward stepwise logistic regression - I appear to have a '-2 Log Likelihood' of 145.104, but I thought probabilities were always less than 1?

A: Indeed they are. But what you are looking at is not a probability but -2 times the LOG of a likelihood RATIO. That is, we have found 2 probabilities - both lying, to be sure, between 0 and 1. But then we have taken the ratio of them - which could give us any positive number whatsoever. Then we've taken the log of that, which means from a positive number we can get anything from minus infinity to plus infinity. Finally we've multiplied that by -2. Since -2LLR comes out positive, LLR must have been negative; so the LR was less than one (remember that log 1 is zero, log x is positive for x>1 and negative for x<1).
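The arithmetic can be traced step by step. The two likelihood values here are invented, purely to show the scale of the numbers involved:

```python
import math

# Invented likelihoods of the observed data under two hypotheses
L_0     = 1e-45   # under the null (constant-only) model
L_model = 1e-30   # under the fitted model

LR = L_0 / L_model        # ratio of two probabilities: any positive number
LLR = math.log(LR)        # its log: anywhere from -inf to +inf (here negative)
minus2LLR = -2 * LLR      # so "-2 Log Likelihood" comes out positive
print(minus2LLR)          # about 69.1 - comfortably bigger than 1
```

Values such as 145.104 arise the same way: both raw likelihoods are tiny, but the statistic reports the (doubled, negated) log of their ratio, not either probability itself.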

Q: O.K. I can understand that if -2LLR is positive, LLR is negative. What I fail to understand is how the '-2 Log likelihood' that SPSS gives you relates to the LRFC statistic, i.e. how it helps us to see how well we can classify people into groups from a knowledge of the independent variables (point 1 of 'a reassuring coda').

A: The "Model Chi-Square" that SPSS gives you is twice the difference (LLmodel - LL0), which you need for the numerator of the LRFC formula. The denominator is just LL0, which is the -2LLR value given at the beginning of the logistic regression output, under the heading "Initial Log Likelihood Function".

Q: In the log likelihood paragraph of your handout you write : "We then work out the likelihood of observing the exact data we actually did observe under each of these hypotheses. The result is nearly always a frighteningly small number". How can the probability values in each case both be very small, since the two hypotheses are quite different from each other?

A: The probability of any EXACT configuration of data is ALWAYS very small. Even with a binomial distribution, if I toss a fair coin 100 times, the probability of any one exact sequence of heads and tails is 0.5 to the power 100, which is about 8 x 10^-31. (Even the probability of getting exactly 50 heads in some order or other, which is 100!/(50! x 50!) x 0.5^100, is only about 0.08.) But this does not stop one likelihood being very much larger than another: under the alternative hypothesis that the coin comes up heads with probability 0.6, a sequence containing 50 heads and 50 tails has probability 0.6^50 x 0.4^50, and the fair-coin hypothesis makes that same sequence about 7.7 times more likely.
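The numbers are easy to check directly; the biased coin with P(heads) = 0.6 is just an invented comparison hypothesis:

```python
# Probability of one exact sequence of 100 fair-coin tosses
p_fair = 0.5 ** 100
print(p_fair)              # about 7.9e-31: "frighteningly small"

# The same sequence (50 heads, 50 tails) under a hypothesised biased coin, P(heads)=0.6
p_biased = 0.6 ** 50 * 0.4 ** 50

# Both are tiny, but their ratio is perfectly informative
print(p_fair / p_biased)   # about 7.7: the fair coin fits these data much better
```

It is this ratio of two vanishingly small likelihoods, not either likelihood on its own, that the log-likelihood machinery works with.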

Q: Where does logit(p) come into the interpretation?

A: The logit(p) value predicted for a particular case tells us how likely it is that that particular case has a dependent variable value of 1 (rather than 0). We can use this either in looking for anomalous cases, or in using our logistic regression model to predict the results of cases whose outcome is unknown.

Q: Is the idea of the CLASSPLOT that if there are lots of 'No's' above odds ratio 1:1 (ie. logit (p)>0), one should use a different rule to predict the data?

A: Yes: you might want to predict a 'Yes' only with an odds ratio of 2:1, say.

Q: In your example, you find that the success rate for trained men is .9 (whose logit is 2.197) and for trained women it is .955, with a logit of 3.044. Since, as you mention on the previous page, "What we want to predict from a knowledge of relevant independent variables is...the probability that (the dependent variable) is 1 rather than 0", what do these logit numbers tell us about the classification of men and women?

A: Standard logistic regression procedure is that if the predicted value of logit(p) for an observation (predicted from the regression equation) is greater than 0.0 (so that the predicted value of p is greater than 0.5), that observation is predicted to lie in group 1; otherwise it is predicted to lie in group 0. This makes sense: p is the probability that the observation is in group 1, so if that is greater than 0.5, we ought to predict membership in group 1. However, examination of the CLASSPLOT, or the relative costs of the two possible kinds of misclassification, might suggest that either the number of errors, or their cost, could be reduced if the criterion is set somewhere different, and it is legitimate to do that in such a case.
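The standard rule, and the effect of moving the criterion, can be sketched like this (the logit value 2.197 is the trained-men figure quoted above; the alternative criterion of 3.0 is arbitrary):

```python
import math

def predict_group(logit_p, criterion=0.0):
    """Standard rule: predict group 1 when logit(p) exceeds the criterion."""
    return 1 if logit_p > criterion else 0

logit_p = 2.197                       # e.g. the trained-men value
p = 1 / (1 + math.exp(-logit_p))      # back-transform the logit to a probability
print(round(p, 3), predict_group(logit_p))    # p = 0.9, predicted into group 1

# Moving the criterion (e.g. after inspecting the CLASSPLOT) changes the prediction
print(predict_group(logit_p, criterion=3.0))  # now predicted into group 0
```

With the default criterion of 0, logit(p) > 0 is exactly equivalent to p > 0.5, which is why the two statements of the rule are interchangeable.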

Q: If an independent variable has a b value on SPSS of .800, for example, does this mean that for each unit increase in that independent variable, the odds that the dependent variable takes the value 1 increase by a factor of exp(b), in this case 2.2255?

A: Let's think it through... it means that logit(p) increases by b for a unit increase in x; logit(p) is log(p/(1-p)); so p/(1-p) goes up by a factor of exp(b)... yup, you're right. The only thing to note is that this isn't an additive increase but a multiplicative one - adding b = 0.8 to the log of the odds means multiplying the odds themselves by exp(0.8) = 2.2255.
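The arithmetic can be checked directly; the starting probability of .30 here is invented:

```python
import math

b = 0.800
p = 0.30                         # starting probability (made-up)
odds = p / (1 - p)               # 0.4286

# A one-unit increase in x adds b to logit(p), i.e. multiplies the odds by exp(b)
new_odds = odds * math.exp(b)    # exp(0.8) is about 2.2255
new_p = new_odds / (1 + new_odds)
print(round(math.exp(b), 4), round(new_p, 3))   # 2.2255 0.488
```

Note that the probability itself does not go up by a constant amount or factor - only the odds do, which is why logistic regression coefficients are reported on the odds scale.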

Q: Does a high b mean that the odds of the dependent variable changing value when that independent variable increases are higher than when independent variables with lower b values increase by the same amount? I've interpreted this to imply that the independent variables with high b values are the best ones to use in prediction - is this correct?

A: The first part is right, but the second part is wrong. To find the best independent variables to use in prediction, you need to multiply each b value by the standard deviation of the corresponding independent variable, and compare the products.

Q: When working out relative importances of variables, I can never remember whether to divide the b values by the standard deviation of the corresponding independent variable, or to multiply them.

A: I can never remember this either, and at various times I have got it wrong in the handouts. However, there is a logic to it, so the thing to do is to understand it, then you can work it out when needed. The trick is to think in terms of the units of the quantities involved, and to remember that to make comparisons we shall need "pure numbers", i.e. quantities that are the same regardless of the particular units of measurement we happen to have used. Consider a simple case, where the dependent variable was distance travelled. Then the b value would have units of (say) "per hour". So to get to a units-free value which we can use for comparisons, we need to multiply by something which has units of time. The s.d. of any variable has the same units as the variable. So we are going to have to multiply by the s.d. of the independent variable.
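With that rule in hand, the comparison is just an element-wise product. The b values and standard deviations below are invented for illustration:

```python
import numpy as np

# Hypothetical fitted b values and sample s.d.s of three independent variables
b  = np.array([0.80, 0.05, 1.50])     # logistic regression coefficients
sd = np.array([0.50, 20.0, 0.10])     # s.d. of each independent variable

importance = b * sd                   # multiply, so the units of each x cancel out
print(importance)                     # [0.4  1.   0.15]: x2 matters most here
```

Despite having much the smallest b, the second variable dominates because it varies over a far wider range - exactly the situation the units argument above is designed to catch.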

Q: We are a bit confused about the interpretation of dummy variables in logistic regression. How do you interpret the odds - are you comparing one of the dummy categories with all of the other dummy categories including the one which has been dropped, or only with the category that was dropped? For example, say you had 5 categories, recoded them all as dummy variables, and dropped category 1 from the analysis. Would the exp(b) value for category 2 represent the odds of changing to ANY of the other categories including category 1, or the odds of changing ONLY from category 2 to category 1?

A: Both your possible answers are right in a way. Say the dummy variable is nationality within the UK, & you have coded 1=English, 2=Scottish, 3=Welsh, 4=N Ireland; & you drop out 1 as the reference category. Then exp(b) for Scottish gives the extra odds of being in the Yes-group of the dependent variable if you are Scottish rather than not Scottish. However, everyone EXCEPT the English scores 1 on one of the other dummies, and so picks up that dummy's exp(b) as well. So exp(b) for Scottish only gives you DIRECTLY the extra odds of being a Yes if you are Scottish rather than English. If you want the extra odds of being in the Yes-group if you are Scottish rather than Welsh, you have to multiply by exp(b) for Scottish & divide by exp(b) for Welsh.
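The bookkeeping is easier to see with numbers; the b values here are invented, with English as the dropped reference category:

```python
import math

# Hypothetical fitted coefficients for the nationality dummies
b_scottish, b_welsh = 0.40, -0.20

odds_scottish_vs_english = math.exp(b_scottish)   # read directly off the output
odds_scottish_vs_welsh = math.exp(b_scottish) / math.exp(b_welsh)
# ...which is the same as exp(b_scottish - b_welsh)

print(round(odds_scottish_vs_english, 3), round(odds_scottish_vs_welsh, 3))
```

Dividing the exp(b) values on the odds scale is equivalent to subtracting the b values on the logit scale, which is why the reference category can be switched without refitting.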

Q: I have carried out both a discriminant analysis and a logistic regression on my data. The logistic regression places more of the cases into the correct groups. Which should I prefer?

A: Logistic regression is superior (the statisticians assure us) on theoretical grounds, and it also tends, as in your case, to give better results. So we should prefer it. That means that we should normally not bother running the discriminant analysis as well, or we may be undermining the significance levels of our logistic regression.

Ordered logit

Q: In the Ordered Logit notes it says that regression coefficients can be interpreted in the usual way - does this mean like you would report the coefficients of normal linear regression or like you report those of logistic regression? I think you would report them like logistic regression, seeing as you are still dealing with log likelihoods etc.

A: Your reasoning is correct, and so is the answer you have reached.

Q: What is the formula for relative importance of independent variables in ordered logit? Is it the same as for logistic regression?

A: Yes, it is. So you multiply each b value by the standard deviation of the corresponding independent variable, and compare the products.

Q: I've come up against a problem when trying to interpret the results of an ordered logit analysis - the output does not give any Exp(b) figures which I can use to find out the effect of a unit change in each independent variable on the odds.

A: You can get from b to Exp(b) by using a calculator. The button is likely to be labelled e^x (not EXP, which usually does something different, to do with the way numbers are displayed). Sometimes you have to press a shift key to get this button to work. To make sure you are doing it right, try it out first on some logistic regression output, where both b and Exp(b) are given to you, and check that you can reproduce the Exp(b) column.

Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623