These notes cover the material of the first lecture, which is designed to remind you briefly of the main ideas in multiple regression. They are not full explanations; they assume you have at least met multiple regression before. If you haven't, you will probably need to read Bryman & Cramer, pp. 177-186 and pp. 235-246. The words and phrases printed in bold type are all things which you should understand by the end of the course. Many of them you will already know; some will be explained in the course of this lecture. In some cases we will explain them later in the course. Some of the material in these notes will not be gone through in the lecture, and you should make sure to read it over and ask us for explanations if you don't understand it.
Multiple regression is the simplest of all the multivariate statistical techniques. Mathematically, multiple regression is a straightforward generalisation of simple regression, the process of fitting the best straight line through the dots on an x-y plot or scattergram. We will discuss what "best" means later in the lecture.
Regression (simple and multiple) techniques are closely related to the analysis of variance (anova). Both are special cases of the General Linear Model (GLM), and you can in fact do an anova using the regression commands in statistical packages (though the process is clumsy). You can combine the two, when what you have is an analysis of covariance (ancova), which we will discuss briefly later in this course.
What distinguishes multiple regression from other techniques? The following are the main points:
This means that multiple regression is useful in the following general class of situations. We observe one dependent variable, whose variation we want to explain in terms of a number of other independent variables, which we can also observe. These other variables are not under experimental control - we just have to accept the variations in them that happen to occur in the sample of people or situations we can observe. We want to know which, if any, of these independent variables is significantly correlated with the dependent variable, taking into account the various correlations that may exist between the independent variables. So typically we use multiple regression to analyse data that come from "natural" rather than experimental situations. This makes it very useful in social psychology, and social science generally. Note, however, that it is inherently a correlational technique; it cannot of itself tell us anything about the causalities that may underlie the relationships it describes.
There are some additional rules that have to be obeyed if multiple regression is to be useful:
Like many statistical procedures, multiple regression has two functions: to summarise some data, and to examine it for (statistically) significant trends. The first of these is part of descriptive statistics, the second of inferential statistics. We spend most of our time in elementary statistics courses thinking about inferential statistics, because at that level they are usually more difficult. But at any level, descriptive statistics are more important. In this section, we concentrate on how multiple regression describes a set of data.
Any number we use to summarise a set of numbers is called a descriptive statistic. Many different descriptive statistics can be calculated for a given set of numbers, and different ones are useful for different purposes. In many cases, a descriptive statistic is chosen because it is in some sense the best summary of a particular type. But what do we mean by "best"?
Consider the best known of all descriptive statistics, the arithmetic mean - what lay people call the average. Why is this the best summary of a set of numbers? There is an answer, but it isn't obvious. The mean is the value from which the numbers in the set have the minimum sum of squared deviations. For the meaning of this, see Figure 1.
Consider observation 1. Its y value is y1. If we consider an "average" value ȳ, we define the deviation from the average as y1 - ȳ, the squared deviation from the average as (y1 - ȳ)², and the sum of squared deviations as Σi(yi - ȳ)². The arithmetic mean turns out to be the value of ȳ that makes this sum lowest. It also, of course, has the property that Σi(yi - ȳ) = 0; that, indeed, is its definition.
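If you want to convince yourself of this numerically, here is a minimal sketch in Python (not one of the packages used on this course; the data values are made up) showing that the sum of squared deviations is smallest when deviations are taken from the mean:

# Minimal check that the arithmetic mean minimises the sum of
# squared deviations; the y values are made up for illustration.
y = [3.0, 7.0, 8.0, 12.0, 15.0]
mean = sum(y) / len(y)

def ssq(centre):
    # sum of squared deviations of the y values from 'centre'
    return sum((yi - centre) ** 2 for yi in y)

print(ssq(mean))        # the smallest achievable value
print(ssq(mean - 1.0))  # any other centre gives a larger sum
print(ssq(mean + 2.5))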
If we look at Figure 1, it's obvious that we could summarise the data better if we could find some way of representing the fact that the observations with high y values tend to be those with high x values. Graphically, we can do this by drawing a straight line on the graph so it passes through the cluster of points, as in Figure 2. Simple regression is a way of choosing the best straight line for this job.
This raises two problems: what is the best straight line, and how can we describe it when we have found it?
Let's deal first with describing a straight line. This is GCSE maths. Any straight line can be described by an equation relating the y values to the x values. In general, we usually write,
y = mx + c
Here m and c are constants whose values tell us which of the infinite number of possible straight lines we are looking at. m (from French monter) tells us about the slope or gradient of the line. Positive m means the line slopes upwards to the right; negative m that it slopes downwards. High m values mean a steep slope, low values a shallow one. c (from French couper) tells us about the intercept, i.e. where the line cuts the y axis: positive c means that when x is zero, y has a positive value, negative c means that when x is zero, y has a negative value. But for regression purposes, it's more convenient to use different symbols. We usually write:
y = a + bx
This is just the same equation with different names for the constants: a is the intercept, b is the gradient.
The problem of choosing the best straight line then comes down to finding the best values of a and b. We define "best" in the same way as we did when we explained why the mean is the best summary: we choose the a and b values that give us the line such that the sum of squared deviations from the line is minimised. This is illustrated in Figure 3. The best line is called the regression line, and the equation describing it is called the regression equation. The deviations from the line are also called residuals.
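For those who like to see the arithmetic, here is a minimal Python sketch (illustrative only, with made-up data) of how the least-squares values of a and b can be computed directly, using the standard closed-form solutions b = Σi(xi - x̄)(yi - ȳ) / Σi(xi - x̄)² and a = ȳ - b x̄:

# Simple least-squares regression by hand; the data are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b is the value that minimises the sum of squared deviations from the line
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar   # the best line always passes through (x_bar, y_bar)

print(f"y = {a:.3f} + {b:.3f}x")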
Having found the best straight line, the next question is how well it describes the data. We measure this by the fraction
        (sum of squared deviations from the line)
1  -  ---------------------------------------------
        (sum of squared deviations from the mean)
This is called the variance accounted for, symbolised by VAC or R2. Its square root is the Pearson correlation coefficient. R2 can vary from 0 (the points are completely random) to 1 (all the points lie exactly on the regression line); quite often it is reported as a percentage (e.g. 73% instead of 0.73). The Pearson correlation coefficient (usually symbolised by r) is always reported as a decimal value. It can take values from -1 to +1; if the value of b is negative, the value of r will also be negative.
Note that two sets of data can have identical a and b values and very different R2 values, or vice versa. Correlation measures the strength of a linear relationship: it tells you how much scatter there is about the best-fitting straight line through a scattergram. a and b, on the other hand, tell you what the line is. The values of a and b will depend on the units of measurement used, but the value of r is independent of units. If we transform y and x to z-scores, which involves rescaling them so they have means of zero and standard deviations of 1, b will equal r.
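Continuing the Python sketch from above (again purely illustrative, not part of the handout's own examples), we can compute R2 from the two sums of squares and check that the slope of the regression on z-scores equals r:

import statistics

# Same made-up data and coefficients as in the earlier sketch
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

ss_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # about the line
ss_mean = sum((yi - y_bar) ** 2 for yi in y)                     # about the mean
r_squared = 1 - ss_line / ss_mean

# z-score both variables; the regression slope is then Pearson's r
sx, sy = statistics.stdev(x), statistics.stdev(y)
r = sum(((xi - x_bar) / sx) * ((yi - y_bar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)

print(r_squared, r ** 2)   # in simple regression these two agree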
Note carefully that a, b, R2 and r are all descriptive statistics. We have not said anything about significance tests. Given a set of paired x and y values, we can use virtually any statistics package to find the corresponding values of a, b and R2. It will also do some significance tests for us. The way to do this is described later. All the calculations can also be done by hand, or on a pocket calculator that has statistical functions.
What happens if we have more than one independent variable? In most cases, we can't draw graphs to illustrate the relationship between them all. But we can still represent the relationship by an equation. This is what multiple regression does. It's a straightforward extension of simple regression. If there are n independent variables, we call them x1, x2, x3 and so on up to xn. Multiple regression then finds values of a, b1, b2, b3 and so on up to bn which give the best fitting equation of the form
y = a + b1x1 + b2x2 + b3x3 + ... + bnxn
b1 is called the coefficient of x1, b2 is the coefficient of x2, and so forth. The equation is exactly like the one for simple regression, except that it is very laborious to work out the values of a, b1 etc by hand. Most statistics packages, however, do it with exactly the same command as for simple regression.
What do the regression coefficients mean? The coefficient of each independent variable tells us what relation that variable has with y, the dependent variable, when all the other independent variables are held constant. So, if b1 is high and positive, that means that if x2, x3 and so on up to xn do not change, then increases in x1 will correspond to large increases in y.
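As an illustration (the course itself uses Minitab and SPSS, so this Python sketch with made-up data is just to make the idea concrete), multiple regression finds all the coefficients at once by least squares:

import numpy as np

# Made-up data: one dependent variable y and two regressors x1, x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.9, 12.1])

# Design matrix: a leading column of ones carries the intercept a
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution: minimises the sum of squared residuals
coefs, *rest = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs
print(f"y = {a:.3f} + {b1:.3f} x1 + {b2:.3f} x2")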
In multiple regression, as in simple regression, we can work out a value for R2. However, every time we add another independent variable, we necessarily increase the value of R2 (you can get a feel for how this happens if you compare Fig 3 with Fig 1). Therefore, in assessing the goodness of fit of a regression equation, we usually work in terms of a slightly different statistic, called R2-adjusted or R2adj. This is calculated as
R2adj = 1 - (1-R2)(N-1)/(N-n-1)
where N is the number of observations in the data set (usually the number of people) and n the number of independent variables or regressors. This allows for the extra regressors. You can see that R2adj will always be lower than R2 whenever there is at least one regressor (unless the fit is perfect). There is also another way of assessing goodness of fit in multiple regression, using the F statistic, which is discussed below. It is possible in principle to take the square root of R2 or R2adj to get what is called the multiple correlation coefficient, but we don't usually bother.
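You can check this formula against the Minitab worked example at the end of the handout, where N = 16 observations and n = 3 regressors give R-sq = 52.0% and R-sq(adj) = 39.9%. A one-line check (in Python, for illustration):

# R2-adjusted from the worked example: N = 16, n = 3, R2 = 0.52
N, n, r_squared = 16, 3, 0.520
r2_adj = 1 - (1 - r_squared) * (N - 1) / (N - n - 1)
print(r2_adj)   # 0.40, matching Minitab's 39.9% apart from rounding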
Regression equations can also be used to obtain predicted or fitted values of the dependent variable for given values of the independent variables. If we know the values of x1, x2, ... xn, it is obviously a simple matter to calculate the value of y which, according to the equation, should correspond to them: we just multiply x1 by b1, x2 by b2, and so on, and add all the products to a. We can do this for combinations of independent variables that are represented in the data, and also for new combinations. We need to be careful, though, about extending the independent variable values far outside the range we have observed (extrapolating), as there is no guarantee that the regression equation will still hold accurately there.
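For instance, taking the regression equation from the Minitab worked example below (depress = 68.3 - 0.0934 income + 3.31 m0f1 - 0.162 age), the fitted depression score for a hypothetical 40-year-old woman earning 200 pounds per week can be computed like this (an illustrative Python sketch):

# Fitted value from the worked example's regression equation
a, b_income, b_gender, b_age = 68.3, -0.0934, 3.31, -0.162
income, gender, age = 200.0, 1.0, 40.0   # 200 pounds/week, female (coded 1), aged 40

fitted = a + b_income * income + b_gender * gender + b_age * age
print(fitted)   # about 46, the predicted depression score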
Multiple regression enables us to answer five main questions about a set of data, in which n independent variables (regressors), x1 to xn, are being used to explain the variation in a single dependent variable, y.
Either the nature of the data, or the regression results, may suggest further questions. For example, you may want to obtain means and standard deviations or histograms of variables to check on their distributions; or plot one variable against another, or obtain a matrix of correlations, to check on first order relationships. You should also check for unusual observations or "outliers": these will be discussed in the next session.
Reporting regression results
Research articles frequently report the results of several different regressions done on a single data set. In this case, it is best to present the results in a table. Where a single regression is done, however, that is unnecessary, and the results can be reported in text. The wording should be something like the following - this is for the depression vs age, income and gender example used as a Minitab example below:
The data were analysed by multiple regression, using as regressors age, income and gender. The regression was a rather poor fit (R2adj = 40%), but the overall relationship was significant (F3,12 = 4.32, p < 0.05). With other variables held constant, depression scores were negatively related to age and income, decreasing by 0.16 for every extra year of age, and by 0.09 for every extra pound per week income. Women tended to have higher scores than men, by 3.3 units. Only the effect of income was significant (t12 = 3.18, p < 0.01).
Normally you will need to go on to discuss the meaning of the trends you have described.
Note the following pitfalls for the unwary:
At the end of the handout there is a complete worked example on some made-up data, in which we attempt to predict scores on a paper and pencil test of depression (running from 0 to 100) from income (in pounds/week), gender (coded 0 for men and 1 for women) and age. Note that the REGRESS command, which actually carries out the regression, needs us to tell it how many independent variables there are. It is very important to make sure that we then provide the corresponding number of columns - if we provide too many, Minitab will not warn us of the error, but will write some detailed results into the extra columns, thus overwriting any data we might have in them, and producing mystifying errors later in our analysis.
The SPSS example uses a set of data on the psychology of tax avoidance. An appropriate command file would be as follows:
title test regression
get file='/singer1/eps/psybin/stats/tax.sys'
regression variables=index free1 to law5
  /statistics=defaults
  /missing=meansubstitution
  /dependent=index
  /method=enter
finish
Output from this file is given at the end of this handout. It shows that the 15 questionnaire items do quite a good job of predicting tax avoidance.
1. The following are the IQ scores on the Verbal and Numerical scales of a certain test for a group of students:
Verbal:     98  120   85   97  100  132  124   88   91  144
Numerical:  92  105  100   92   93  144  143   75   85  121
Use Minitab to calculate the mean and standard deviation of the scores on each scale. Use LET to work out the difference between them and put it in a new column. Use TTEST on this column to see whether there is a significant difference between the verbal and numerical scores.
2. Using the data from the previous example, work out the regression line for predicting Numerical scores (dependent variable) from Verbal scores (independent variable).
3. A social psychologist observes the scores achieved on a video game in a pub, by the first new (previously unobserved) player to use the machine after each half hour through the evening. They are as follows:
Time:    6pm  6.30   7pm  7.30   8pm  8.30   9pm  9.30  10pm  10.30
Score:  1760   995  2130   770  1535  3975  2120  5660  3341   4995
Use SPSS to investigate whether the data support the psychologist's hypothesis that more expert players use the machine later in the evening. What would be the most likely score to observe at 9.45pm?
4. The following data show the levels of anxiety recorded by a paper-and-pencil test just before a group of students took an examination, together with the exam marks obtained. Use Minitab's PLOT command to decide whether it would be appropriate to use linear regression to summarize these data.
Anxiety score:   5  17  10  12   3  19   2  11   9   8  13  18   4   7
Exam mark:      45  20  55  72  45  39  50  75  60  57  58  52  43  57
5. The Singer file /singer1/eps/psybin/stats/teengamb.DAT contains, for each of 47 teenagers, the following information:
Each line of the file contains all 6 data items for a single person. These are real data, collected during an undergraduate project a few years ago, and since published (Ide-Smith & Lea, 1988, Journal of Gambling Behavior, 4, 110-118). Note, though, that you won't get quite the same results as in the published article, because I've cut out the data from some subjects that would have given you problems.
Set up a Minitab worksheet with columns with appropriate names, and read these data into it using READ. Note that you don't need to type the file extension (.DAT) because this is the default for READ, but if you do type it, you must use CAPITALS. The rest of the filename must be typed in lower case.
MTB > set c1
DATA> 74 82 15 23 35 54 12 28 66 43 55 31 83 29 53 32
DATA> end
MTB > set c2
DATA> 120 55 350 210 185 110 730 150 61 175 121 225 45 325 171 103
DATA> end
MTB > set c3
DATA> 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 1
DATA> end
MTB > set c4
DATA> 33 28 47 55 32 63 59 68 27 32 42 51 47 33 51 20
DATA> end
MTB > name c1 'depress'
MTB > name c2 'income'
MTB > name c3 'm0f1'
MTB > name c4 'age'
MTB > regress c1 3 c2-c4

The regression equation is
depress = 68.3 - 0.0934 income + 3.31 m0f1 - 0.162 age

Predictor       Coef     Stdev   t-ratio       p
Constant       68.28     15.44      4.42   0.001
income      -0.09336   0.02937     -3.18   0.008
m0f1           3.306     8.942      0.37   0.718
age          -0.1617    0.3436     -0.47   0.646

s = 17.70     R-sq = 52.0%     R-sq(adj) = 39.9%

Analysis of Variance

SOURCE       DF        SS        MS      F       p
Regression    3    4065.4    1355.1   4.32   0.028
Error        12    3760.0     313.3
Total        15    7825.4

SOURCE       DF    SEQ SS
income        1    3940.5
m0f1          1      55.5
age           1      69.4

Continue? y

Unusual Observations
Obs.   income   depress      Fit   Stdev.Fit   Residual   St.Resid
  7       730     12.00    -6.10       15.57      18.10     2.15RX

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.
(some blank lines have been removed)
* * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Mean Substituted for Missing Data

Equation Number 1    Dependent Variable..   INDEX     Evasion measure

Block Number 1.  Method:  Enter

Variable(s) Entered on Step Number
   1..  LAW5
   2..  FREE2
   3..  ALIEN2
   4..  LAW3
   5..  LAW4
   6..  LAW2
   7..  LAW1
   8..  ALIEN4
   9..  ALIEN1
  10..  ALIEN5
  11..  FREE5
  12..  ALIEN3
  13..  FREE3
  14..  FREE4
  15..  FREE1

Multiple R           .93111
R Square             .86696
Adjusted R Square    .80460
Standard Error      1.93857

Analysis of Variance
                DF      Sum of Squares      Mean Square
Regression      15           783.65847         52.24390
Residual        32           120.25820          3.75807

F =  13.90179       Signif F = .0000

* * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Equation Number 1    Dependent Variable..   INDEX     Evasion measure

------------------ Variables in the Equation ------------------

Variable            B       SE B      Beta        T   Sig T
LAW5          .103593    .172898   .046600     .599   .5533
FREE2       -1.278802    .641764  -.399007   -1.993   .0549
ALIEN2       -.177951    .513296  -.064325    -.347   .7311
LAW3         -.269736    .224503  -.117014   -1.201   .2384
LAW4          .294076    .286945   .112165    1.025   .3131
LAW2          .224659    .350312   .084970     .641   .5259
LAW1          .106083    .234746   .037769     .452   .6544
ALIEN4        .353269    .339096   .130039    1.042   .3053
ALIEN1       1.227092    .252610   .567362    4.858   .0000
ALIEN5        .272150    .293067   .125177     .929   .3600
FREE5        -.833464    .339398  -.340064   -2.456   .0197
ALIEN3       -.059760    .345620  -.025101    -.173   .8638
FREE3       -1.531610    .668878  -.485904   -2.290   .0288
FREE4       -2.142148    .657277  -.670692   -3.259   .0027
FREE1        4.401877    .721189  1.828551    6.104   .0000
(Constant)   -.845601   4.572997             -.185    .8545