Contents of this handout: What is multiple regression, where does it fit in, and what is it good for? The idea of a regression equation; From simple regression to multiple regression; interpreting and reporting multiple regression results; Carrying out multiple regression; Exercises; Worked examples using Minitab and SPSS

These notes cover the material of the first lecture, which is designed
to remind you briefly of the main ideas in multiple regression. They are
not full explanations; they assume you have at least met multiple regression
before. If you haven't, you will probably need to read Bryman & Cramer,
pp. 177-186 and pp. 235-246. The words and phrases printed in **bold type**
are all things which you should understand by the end of the course. Many
of them you will already know; some will be explained in the course of
this lecture. In some cases we will explain them later in the course. Some
of the material in these notes will not be gone through in the lecture,
and you should make sure to read it over and ask us for explanations if
you don't understand it.

Multiple regression is the simplest of all the **multivariate** statistical
techniques. Mathematically, multiple regression is a straightforward generalisation
of **simple regression**, the process of fitting the **best straight
line** through the dots on an **x-y** **plot** or **scattergram**.
We will discuss what "best" means later in the lecture.

Regression (simple and multiple) techniques are closely related to the
**analysis of variance (anova)**. Both are special cases of the **General
Linear Model** or **GLM**, and you can in fact do an anova using
the regression commands in statistical packages (though the process is
clumsy). You can combine the two, when what you have is an **analysis
of covariance (ancova)**, which we will discuss briefly later in this
course.

What distinguishes multiple regression from other techniques? The following are the main points:

- In multiple regression, we work with **one dependent variable** and **many independent variables**. In simple regression, there is only one independent variable; in factor analysis, cluster analysis and most other **latent variable** multivariate techniques, there are many dependent variables.
- In multiple regression, the **independent variables may be correlated**. In analysis of variance, we arrange for all the independent variables to vary completely independently of each other.
- In multiple regression, the **independent variables can be continuous**. For analysis of variance, they have to be categorical, and if they are naturally continuous, we have to force them into categories, for example by a **median split**.

This means that multiple regression is useful in the following general
class of situations. We observe one dependent variable, whose variation
we want to explain in terms of a number of other independent variables,
which we can also observe. These other variables are not under experimental
control - we just have to accept the variations in them that happen to
occur in the sample of people or situations we can observe. We want to
know which if any of these independent variables is significantly correlated
with the dependent variable, taking into account the various correlations
that may exist between the independent variables. So typically we use multiple
regression to analyse data that come from "natural" rather than
experimental situations. This makes it very useful in social psychology,
and social science generally. **Note, however, that it is inherently a
correlational technique; it cannot of itself tell us anything about the
causalities that may underlie the relationships it describes**.

There are some additional rules that have to be obeyed if multiple regression is to be useful:

- The units (usually people) we observe should be a **random sample** from some well defined population. This is a basic requirement for all statistical work if we want to draw any kind of general inference from the observations we have made.
- The dependent variable should be measured on an **interval**, **continuous** scale. In practice an **ordinal** (ranking or rating) scale is usually good enough unless the number of levels is small. If the dependent variable is only measured on a **nominal** (unordered category, including **dichotomies**) scale, we have to use **discriminant analysis** or **logistic regression** instead. These are dealt with in a later lecture.
- The independent variables should be measured on interval scales. However, this is not a serious restriction, since most ordinal scale measurement will be acceptable in practice; 2-valued categorical variables (dichotomies) can be used directly; and there is a way of dealing with **k-valued** categorical variables (k usually stands for any **integer** greater than 2), by **dummy variables**, which we will discuss in the next lecture.
- The distributions of all the variables should be **normal**. If they are not roughly normal, this can often be corrected by using an appropriate **transformation** (e.g. taking **logarithms** of all the measurements).
- The relationships between the dependent variable and the independent variables should be **linear**. That is, it should be possible to draw a rough straight line through an x-y scattergram of the observed points. If the line looks curved, but is **monotonic** (increases or decreases all the time), things are not too bad and could be made better by transformation. If the line looks U-shaped, we will need to take special steps before regression can be used.
- Although the independent variables can be correlated, there must be no perfect (or near-perfect) correlations among them, a situation called **multicollinearity** (which will be explained later in the course).
- There must be no **interactions**, in the anova sense, between independent variables - the effect of each on the dependent variable must be roughly independent of the effects of all others. However, if interactions are obviously present, and not too complex, there are special steps we can take to cope with the situation.

Like many statistical procedures, multiple regression has two functions:
to *summarise* some data, and to *examine it for (statistically)
significant trends*. The first of these is part of **descriptive statistics**,
the second of **inferential statistics**. We spend most of our time
in elementary statistics courses thinking about inferential statistics,
because at that level they are usually more difficult. But at any level,
descriptive statistics are more important. In this section, we concentrate
on how multiple regression describes a set of data.

Any number we use to summarise a set of numbers is called a **descriptive
statistic**. Many different descriptive statistics can be calculated
for a given set of numbers, and different ones are useful for different
purposes. In many cases, a descriptive statistic is chosen because it is
in some sense the best summary of a particular type. But what do we mean
by "best"?

Consider the best known of all descriptive statistics, the **arithmetic
mean** - what lay people call the average. Why is this the best summary
of a set of numbers? There is an answer, but it isn't obvious. The mean
is the value from which the numbers in the set have the **minimum sum
of squared deviations**. For the meaning of this, see Figure 1.

**Figure 1**

Consider observation 1. Its *y* value is *y*_{1}.
If we consider an "average" value *ȳ*, we define the
deviation from the average as *y*_{1}-*ȳ*, the squared
deviation from the average as (*y*_{1}-*ȳ*)^{2},
and the sum of squared deviations as Σ_{i}(*y*_{i}-*ȳ*)^{2}.
The arithmetic mean turns out to be the value of *ȳ* that makes
this sum lowest. It also, of course, has the property that Σ_{i}(*y*_{i}-*ȳ*)
= 0; that, indeed, is its definition.
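
The claim that the mean minimises the sum of squared deviations can be checked with a short piece of calculus. The following is a sketch, writing *c* for any candidate summary value and *N* for the number of observations (*c* is a symbol introduced here for illustration only):

```latex
\begin{aligned}
S(c) &= \sum_{i=1}^{N} (y_i - c)^2 \\
\frac{dS}{dc} &= -2 \sum_{i=1}^{N} (y_i - c) = 0
\quad\Longrightarrow\quad
c = \frac{1}{N} \sum_{i=1}^{N} y_i \\
\frac{d^2 S}{dc^2} &= 2N > 0
\end{aligned}
```

Setting the derivative to zero is the same as requiring the deviations to sum to zero, which is the property of the mean noted above. The positive second derivative confirms that this value gives a minimum rather than a maximum.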

If we look at Figure 1, it's obvious that we could summarise the data
better if we could find some way of representing the fact that the observations
with high *y* values tend to be those with high *x* values. Graphically,
we can do this by drawing a straight line on the graph so it passes through
the cluster of points, as in Figure 2. **Simple regression** is a way
of choosing the best straight line for this job.

**Figure 2**

This raises two problems: what is the best straight line, and how can
we describe it when we have found it?

Let's deal first with describing a straight line. This is GCSE maths. Any
straight line can be described by an equation relating the *y* values
to the *x* values. In general, we usually write,

*y* = *mx* + *c*

Here *m* and *c* are constants whose values tell us which
of the infinite number of possible straight lines we are looking at. *m*
(from French *monter*) tells us about the slope or **gradient**
of the line. Positive *m* means the line slopes upwards to the right;
negative *m* that it slopes downwards. High *m* values mean a
steep slope, low values a shallow one. *c* (from French *couper*)
tells us about the **intercept**, i.e. where the line cuts the y axis:
positive *c* means that when *x* is zero, *y* has a positive
value, negative *c* means that when *x* is zero, *y* has
a negative value. But for regression purposes, it's more convenient to
use different symbols. We usually write:

*y* = *a* + *bx*

This is just the same equation with different names for the constants:
*a* is the intercept, *b* is the gradient.

The problem of choosing the best straight line then comes down to finding
the best values of *a* and *b*. We define "best" in
the same way as we did when we explained why the mean is the best summary:
we choose the *a* and *b* values that give us the line such that
the sum of squared deviations *from the line* is minimised. This is
illustrated in Figure 3. The best line is called the **regression line**,
and the equation describing it is called the **regression equation**.
The deviations from the line are also called **residuals**.
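
As an illustration of what "best" means here, the following sketch fits a simple regression line to the income and depression scores from the worked Minitab example at the end of this handout, using only one regressor. The function names `fit_line` and `sum_sq_residuals` are made up for this example; a statistics package would normally do this for you.

```python
# Data from the worked Minitab example at the end of this handout.
income = [120, 55, 350, 210, 185, 110, 730, 150, 61, 175, 121, 225, 45, 325, 171, 103]
depress = [74, 82, 15, 23, 35, 54, 12, 28, 66, 43, 55, 31, 83, 29, 53, 32]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_sq_residuals(x, y, a, b):
    """Sum of squared deviations of the y values from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(income, depress)
best = sum_sq_residuals(income, depress, a, b)
print(f"a = {a:.2f}, b = {b:.4f}, sum of squared residuals = {best:.1f}")

# Moving either constant away from its fitted value makes the sum larger.
print(sum_sq_residuals(income, depress, a + 1, b) > best)
print(sum_sq_residuals(income, depress, a, b + 0.01) > best)
```

Any other choice of *a* and *b* gives a larger sum of squared residuals, which is exactly the sense in which the regression line is the best one.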

**Figure 3**

Having found the best straight line, the next question is how well it describes the data. We measure this by the fraction

1 - (sum of squared deviations from the line) / (sum of squared deviations from the mean)

This is called the **variance accounted for**, symbolised by VAC
or *R*^{2}. Its square root is the **Pearson correlation
coefficient**. *R*^{2} can vary from 0 (the points are completely
random) to 1 (all the points lie exactly on the regression line); quite
often it is reported as a percentage (e.g. 73% instead of 0.73). The Pearson
correlation coefficient (usually symbolised by *r*) is always reported
as a decimal value. It can take values from -1 to +1; if the value of *b*
is negative, the value of *r* will also be negative.
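
The relationship between these quantities can be checked numerically. This is a minimal sketch using the income and depression scores from the worked Minitab example at the end of the handout, with income as the only regressor; the variable names are chosen for this illustration.

```python
from math import sqrt

# Data from the worked Minitab example at the end of this handout.
income = [120, 55, 350, 210, 185, 110, 730, 150, 61, 175, 121, 225, 45, 325, 171, 103]
depress = [74, 82, 15, 23, 35, 54, 12, 28, 66, 43, 55, 31, 83, 29, 53, 32]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(depress) / n
sxx = sum((x - mean_x) ** 2 for x in income)
syy = sum((y - mean_y) ** 2 for y in depress)  # squared deviations from the mean
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, depress))

# Least-squares line, then squared deviations from that line.
b = sxy / sxx
a = mean_y - b * mean_x
ss_line = sum((y - (a + b * x)) ** 2 for x, y in zip(income, depress))

r_squared = 1 - ss_line / syy
r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient

print(f"R-squared = {r_squared:.3f} ({100 * r_squared:.0f}%)")
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, b = {b:.4f}")
```

Here *b* is negative, so *r* is negative too, and squaring *r* recovers the variance accounted for.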

Note that two sets of data can have identical *a* and *b*
values and very different *R*^{2} values, or vice versa. Correlation
measures the strength of a linear relationship: it tells you how much scatter
there is about the best fitting straight line through a scattergram. *a*
and *b*, on the other hand, tell you what the line is. The values
of *a* and *b* will depend on the units of measurement used,
but the value of *r* is independent of units. If we transform *y*
and *x* to **z-scores**, which involves rescaling them so they
have means of zero and standard deviations of 1, *b* will equal *r*.
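
The following sketch illustrates both points, again using the income and depression scores from the worked example at the end of the handout. Converting income from pounds per week to pounds per year (an illustrative change of units) alters *b* but not *r*, and after converting both variables to z-scores the gradient equals *r*. The helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Data from the worked Minitab example at the end of this handout.
income = [120, 55, 350, 210, 185, 110, 730, 150, 61, 175, 121, 225, 45, 325, 171, 103]
depress = [74, 82, 15, 23, 35, 54, 12, 28, 66, 43, 55, 31, 83, 29, 53, 32]


def slope(x, y):
    """Gradient b of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def correlation(x, y):
    """Pearson correlation coefficient r."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def z_scores(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


income_per_year = [52 * v for v in income]

b_week = slope(income, depress)
b_year = slope(income_per_year, depress)
r_week = correlation(income, depress)
r_year = correlation(income_per_year, depress)
b_z = slope(z_scores(income), z_scores(depress))

print(f"b per pound/week = {b_week:.5f}, b per pound/year = {b_year:.6f}")
print(f"r = {r_week:.4f} in either unit; b on z-scores = {b_z:.4f}")
```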

Note carefully that *a*, *b*, *R*^{2} and *r*
are all descriptive statistics. We have not said anything about significance
tests. Given a set of paired *x* and *y* values, we can use virtually
any statistics package to find the corresponding values of *a*, *b*
and *R*^{2}. It will also do some significance tests for us.
The way to do this is described later. All the calculations can also be
done by hand, or on a pocket calculator that has statistical functions.

What happens if we have more than one independent variable? In most
cases, we can't draw graphs to illustrate the relationship between them
all. But we can still represent the relationship by an equation. This is
what multiple regression does. It's a straightforward extension of simple
regression. If there are *n* independent variables, we call them *x*_{1},
*x*_{2}, *x*_{3} and so on up to *x*_{n}.
Multiple regression then finds values of *a*, *b*_{1},
*b*_{2}, *b*_{3} and so on up to *b*_{n}
which give the best fitting equation of the form

*y* = *a* + *b*_{1}*x*_{1}
+ *b*_{2}*x*_{2} + *b*_{3}*x*_{3}
+ ... + *b*_{n}*x*_{n}

*b*_{1} is called the **coefficient** of *x*_{1},
*b*_{2} is the coefficient of *x*_{2}, and so
forth. The equation is exactly like the one for simple regression, except
that it is very laborious to work out the values of *a*, *b*_{1}
etc by hand. Most statistics packages, however, do it with exactly the
same command as for simple regression.
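
To show what the package is doing behind the scenes, here is a minimal sketch that finds *a*, *b*_{1}, *b*_{2} and *b*_{3} for the worked depression example at the end of this handout by solving the least-squares "normal equations". It assumes nothing beyond the Python standard library; the function names `solve` and `multiple_regression` are invented for this illustration, and a real analysis would normally use a statistics package or a numerical library instead.

```python
# Data from the worked Minitab example at the end of this handout.
depress = [74, 82, 15, 23, 35, 54, 12, 28, 66, 43, 55, 31, 83, 29, 53, 32]
income = [120, 55, 350, 210, 185, 110, 730, 150, 61, 175, 121, 225, 45, 325, 171, 103]
female = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]
age = [33, 28, 47, 55, 32, 63, 59, 68, 27, 32, 42, 51, 47, 33, 51, 20]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * size
    for row in range(size - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, size))
        solution[row] = (aug[row][size] - known) / aug[row][row]
    return solution


def multiple_regression(y, *regressors):
    """Return [a, b1, ..., bn] minimising the sum of squared residuals."""
    rows = [[1.0, *values] for values in zip(*regressors)]  # leading 1 gives the intercept
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


a, b_income, b_female, b_age = multiple_regression(depress, income, female, age)
print(f"depress = {a:.1f} + ({b_income:.4f}) income + ({b_female:.2f}) m0f1 + ({b_age:.3f}) age")
```

The printed equation can be compared with the regression equation in the Minitab output at the end of the handout.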

What do the regression coefficients mean? The coefficient of each independent
variable tells us what relation that variable has with *y*, the dependent
variable, *when all the other independent variables are held constant*.
So, if *b*_{1} is high and positive, that means that if *x*_{2},
*x*_{3} and so on up to *x*_{n} do not change,
then increases in *x*_{1} will correspond to large increases
in *y*.

In multiple regression, as in simple regression, we can work out a value
for *R*^{2}. However, every time we add another independent
variable, we necessarily increase the value of *R*^{2} (you
can get a feel for how this happens if you compare Fig 3 with Fig 1). Therefore,
in assessing the goodness of fit of a regression equation, we usually work
in terms of a slightly different statistic, called *R*^{2}-adjusted
or *R*^{2}_{adj}. This is calculated as

*R*^{2}_{adj} = 1 - (1-*R*^{2})(*N*-1)/(*N*-*n*-1)

where *N* is the number of observations in the data set (usually
the number of people) and *n* the number of independent variables
or **regressors**. This allows for the extra regressors. You can see
that *R*^{2}_{adj} will be lower than *R*^{2}
whenever there is at least one regressor, unless the fit is perfect. There is also another way of assessing
goodness of fit in multiple regression, using the *F* statistic which
is discussed below. It is possible in principle to take the square root
of *R*^{2} or *R*^{2}_{adj} to get what
is called the **multiple correlation coefficient**, but we don't usually
bother.
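
As a check on the adjustment, written in its usual form 1 - (1-*R*^{2})(*N*-1)/(*N*-*n*-1), the sketch below applies it to the sums of squares printed in the Minitab output at the end of this handout (*N* = 16 observations, *n* = 3 regressors). The function name is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Shrink R-squared to allow for the number of regressors used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Sums of squares copied from the anova table in the Minitab output.
ss_error = 3760.0
ss_total = 7825.4

r_squared = 1 - ss_error / ss_total
r_squared_adj = adjusted_r_squared(r_squared, n_obs=16, n_regressors=3)

print(f"R-sq = {100 * r_squared:.1f}%, R-sq(adj) = {100 * r_squared_adj:.1f}%")
```

The same formula also agrees with the SPSS output at the end of the handout, where *N* = 48 and *n* = 15.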

Regression equations can also be used to obtain **predicted** or
**fitted** values of the dependent variable for given values of the
independent variable. If we know the values of *x*_{1}, *x*_{2},
... *x*_{n}, it is obviously a simple matter to calculate
the value of *y* which, according to the equation, should correspond
to them: we just multiply *x*_{1} by *b*_{1},
*x*_{2} by *b*_{2}, and so on, and add all the
products to *a*. We can do this for combinations of independent variables
that are represented in the data, and also for new combinations. We need
to be careful, though, of extending the independent variable values far
outside the range we have observed (**extrapolating**), as it is not
guaranteed that the regression equation will still hold accurately.
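
A fitted value is just the regression equation evaluated at chosen regressor values. The sketch below uses the rounded equation from the Minitab output at the end of this handout; the function name and the two example people are made up for illustration. Incomes in that data set run from 45 to 730 pounds per week, so the second prediction is an extrapolation.

```python
def predict_depression(income, female, age):
    """Fitted depression score from the rounded regression equation
    depress = 68.3 - 0.0934 income + 3.31 m0f1 - 0.162 age."""
    return 68.3 - 0.0934 * income + 3.31 * female - 0.162 * age


# A combination of values inside the observed range of the data.
inside = predict_depression(income=200, female=1, age=40)

# Income far outside the observed range: the equation still gives a number,
# but it is negative, which is impossible on a test scored from 0 to 100.
outside = predict_depression(income=1000, female=1, age=40)

print(f"income 200:  fitted score = {inside:.1f}")
print(f"income 1000: fitted score = {outside:.1f}")
```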

Multiple regression enables us to answer five main questions about a
set of data, in which *n* independent variables (regressors), *x*_{1}
to *x*_{n}, are being used to explain the variation in a single
dependent variable, *y*.

- How well do the regressors, taken together, explain the variation in the dependent variable? This is assessed by the value of *R*^{2}_{adj}. As a very rough guide, in psychological applications we would usually reckon an *R*^{2}_{adj} of above 75% as very good; 50-75% as good; 25-50% as fair; and below 25% as poor and perhaps unacceptable. Alas, *R*^{2}_{adj} values above 90% are rare in psychological data, and if you get one, you should wonder whether there is some artefact in your data.
- Are the regressors, taken together, significantly associated with the dependent variable? This is assessed by the statistic *F* in the "Analysis of Variance" or anova part of the regression output from a statistics package. This is the Fisher *F* as used in the ordinary anova, so its significance depends on its **degrees of freedom**, which in turn depend on sample sizes and/or the nature of the test used. As in anova, *F* has two degrees of freedom associated with it. In general they are referred to as the **numerator** and **denominator** degrees of freedom (because *F* is actually a ratio). In regression, the numerator degrees of freedom are associated with the regression (and equal the number of regressors used), and the denominator degrees of freedom with the **residual** or **error**; you can find them in the Regression and Error rows of the anova table in the output from a statistics package. If you were finding the significance of an *F* value by looking it up in a book of tables, you would need the degrees of freedom to do it. Statistics packages normally work out significances for you, and you will find them in the anova table next to the *F* value; but you need to use the degrees of freedom when reporting the results (see below). It is useful to remember that the higher the value of *F*, the more significant it will be for given degrees of freedom.

- What relationship does each regressor have with the dependent variable
when all other regressors are held constant? This is answered by looking
at the regression coefficients. Some statistics packages (e.g. Minitab)
report these twice, once in the form of a regression equation and again
(to an extra decimal place) in a table of regression coefficients and associated
statistics. Note that regression coefficients have units. So if the dependent
variable is number of cigarettes smoked per week, and one of the regressors
is annual income, the coefficient for that regressor would have units of
(cigarettes per week) per (pound of income per year). That means that if
we changed the units of one of the variables, the regression coefficient
would change - but the relationship it is describing, and what it is saying
about it, would not. So the size of a regression coefficient doesn't tell
us anything about the strength of the relationship it describes until we
have taken the units into account. The fact that regression coefficients
have units also means that we can give a precise interpretation to each
coefficient. So, staying with smoking and income, a coefficient of 0.062
in this case would mean that, with all other variables held constant, increasing
someone's income by one pound per year is associated with an increase of
cigarette consumption of 0.062 cigarettes per week (we might want to make
this easier to grasp by saying that an increase in income of 1 pound per
week would be associated with an increase in cigarette consumption of 52
* 0.062 = 3.2 cigarettes per week). Negative coefficients mean that when
the regressor increases, the dependent variable decreases. If the regressor is a **dichotomous** variable (e.g. gender), the size of the coefficient tells us the size of the difference between the two classes of individual (again, with all other variables held constant). So a gender coefficient of 2.6, with women coded 0 and men coded 1, would mean that with all other variables held constant, men's dependent variable scores would average 2.6 units higher than women's.
- Which independent variable has most effect on the dependent variable? It is not possible to give a fully satisfactory answer to this question, for a number of reasons. The chief one is that we are always looking at the effect of each variable *in the presence of all the others*; since the independent variables need not be independent of one another, it is hard to be sure which one is contributing to a joint relationship (or even to be sure that that means anything). However, the usual way of addressing the question is to look at the **standardised regression coefficients** or **beta weights** for each variable; these are the regression coefficients we would get if we converted all variables (independent and dependent) to **z-scores** before doing the regression. SPSS reports beta weights for each independent variable in its regression output; Minitab, unfortunately, does not.
- Are the relationships of each regressor with the dependent variable statistically significant, with all other regressors taken into account? This is answered by looking at the *t* values in the table of regression coefficients. The degrees of freedom for *t* are those for the residual in the anova table, but statistics packages work out significances for us, so we need to know the degrees of freedom only when it comes to reporting results. Note that if a regression coefficient is negative, most packages will report the corresponding *t* value as negative, but if you were looking it up in tables, you would use the **absolute** (unsigned) value, and the sign should be dropped when reporting results.
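
Several of the statistics in this list can be recomputed from other numbers in the output, which is a useful way of seeing how they fit together. The sketch below takes figures from the Minitab output at the end of this handout. It rebuilds *F* as a ratio of mean squares and each *t* as a coefficient divided by its standard deviation, and works out beta weights (which Minitab does not print) by rescaling each coefficient by the ratio of standard deviations. The dictionary layout and variable names are chosen for this illustration only.

```python
from statistics import stdev

# Anova table figures copied from the Minitab output.
ss_regression, df_regression = 4065.4, 3
ss_error, df_error = 3760.0, 12

# F is the regression mean square divided by the error mean square.
f_value = (ss_regression / df_regression) / (ss_error / df_error)
print(f"F({df_regression},{df_error}) = {f_value:.2f}")

# (coefficient, standard deviation of coefficient) copied from the Minitab output.
coefficients = {
    "income": (-0.09336, 0.02937),
    "m0f1": (3.306, 8.942),
    "age": (-0.1617, 0.3436),
}

# Each t value is the coefficient divided by its standard deviation.
t_values = {name: coef / sd for name, (coef, sd) in coefficients.items()}

# Raw data, needed only for the beta weights.
depress = [74, 82, 15, 23, 35, 54, 12, 28, 66, 43, 55, 31, 83, 29, 53, 32]
data = {
    "income": [120, 55, 350, 210, 185, 110, 730, 150, 61, 175, 121, 225, 45, 325, 171, 103],
    "m0f1": [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1],
    "age": [33, 28, 47, 55, 32, 63, 59, 68, 27, 32, 42, 51, 47, 33, 51, 20],
}

# Beta weight = coefficient * (sd of regressor) / (sd of dependent variable).
betas = {
    name: coefficients[name][0] * stdev(values) / stdev(depress)
    for name, values in data.items()
}

for name in coefficients:
    print(f"{name:7s} t({df_error}) = {t_values[name]:6.2f}   beta = {betas[name]:6.3f}")
```

On these figures income has by far the largest beta weight, which matches it being the only regressor with a significant *t* value.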

Either the nature of the data, or the regression results, may suggest
further questions. For example, you may want to obtain means and standard
deviations or histograms of variables to check on their distributions;
or plot one variable against another, or obtain a matrix of correlations,
to check on first order relationships. You should also check for unusual
observations or "**outliers**": these will be discussed in
the next session.

**Reporting regression results**

Research articles frequently report the results of several different regressions done on a single data set. In this case, it is best to present the results in a table. Where a single regression is done, however, that is unnecessary, and the results can be reported in text. The wording should be something like the following - this is for the depression vs age, income and gender example used as a Minitab example below:

The data were analysed by multiple regression, using as regressors age,
income and gender. The regression was a rather poor fit (*R*^{2}_{adj}
= 40%), but the overall relationship was significant (*F*_{3,12}
= 4.32, *p* < 0.05). With other variables held constant, depression
scores were negatively related to age and income, decreasing by 0.16 for
every extra year of age, and by 0.09 for every extra pound per week income.
Women tended to have higher scores than men, by 3.3 units. Only the effect
of income was significant (*t*_{12} = 3.18, *p* <
0.01).

Normally you will need to go on to discuss the meaning of the trends you have described.

Note the following pitfalls for the unwary:

- The above brief paragraph does not exhaust what you can say about a set of regression results. There may be features of the data you should look at - "Unusual observations", for example.
- Always report what happened before moving on to its significance - so *R*^{2}_{adj} values before *F* values, regression coefficients before *t* values. Remember, descriptive statistics are more important than significance tests.
- Degrees of freedom for both *F* and *t* values must be given. Usually they are written as subscripts. For *F* the numerator degrees of freedom are given first. You can also put degrees of freedom in parentheses, or report them explicitly, e.g.: "*F*(3,12) = 4.32" or "*F* = 4.32, d. of f. = 3, 12".
- Significance levels can either be reported exactly (e.g. *p* = 0.032) or in terms of conventional levels (e.g. *p* < 0.05). There are arguments in favour of either, so it doesn't much matter which you do. But you should be consistent in any one report.
- Beware of highly significant *F* or *t* values, whose significance levels will be reported by statistics packages as, for example, 0.0000. It is an act of statistical illiteracy to write *p* = 0.0000, because significance levels can never be exactly zero - there is always *some* probability that the observed data could arise if the **null hypothesis** was true. What the package means is that this probability is so low it can't be represented with the number of columns available. We should write it as *p* < 0.00005 (or, if we are using conventional levels, *p* < 0.001).
- Beware of **spurious precision**, i.e. reporting coefficients etc to huge numbers of **significant figures** when, on the basis of the sample you have, you couldn't possibly expect them to replicate to anything like that degree of precision if someone repeated the study. *F* and *t* values are conventionally reported to two decimal places, and *R*^{2}_{adj} values to the nearest percentage point (sometimes to one decimal place). For coefficients, you should be guided by the sample size: with a sample size of 16 as in the example above, two significant figures is plenty, but even with more realistic samples, in the range of 100 to 1000, three significant figures is usually as far as you should go. This means that you will usually have to round off the numbers that statistics packages give you.
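
If you prepare results with a script, the last two pitfalls can be guarded against mechanically. The following is a minimal sketch with two hypothetical helper functions (`round_sig` and `format_p` are names invented here, not part of Minitab or SPSS): one rounds a coefficient to a chosen number of significant figures, the other never prints a significance level as exactly zero.

```python
from math import floor, log10


def round_sig(value, figures):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    return round(value, figures - 1 - floor(log10(abs(value))))


def format_p(p, decimals=4):
    """Format a significance level to `decimals` places, replacing anything
    that would print as zero by an upper bound, e.g. 'p < 0.00005'."""
    threshold = 0.5 * 10 ** -decimals
    if p < threshold:
        return f"p < {threshold:.{decimals + 1}f}"
    return f"p = {p:.{decimals}f}"


# Coefficients from the Minitab example, rounded to two significant figures.
for coefficient in (68.28, -0.09336, 3.306, -0.1617):
    print(coefficient, "->", round_sig(coefficient, 2))

print(format_p(0.0288))        # an ordinary significance level
print(format_p(0.000000001))   # one a package would print as .0000
```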

At the end of the handout there is a complete worked example on some made-up data, in which we attempt to predict scores on a paper and pencil test of depression (running from 0 to 100) from income (in pounds/week), gender (coded 0 for men and 1 for women) and age. Note that the REGRESS command, which actually carries out the regression, needs us to tell it how many independent variables there are. It is very important to make sure that we then provide the corresponding number of columns - if we provide too many, Minitab will not warn us of the error, but will write some detailed results into the extra columns, thus overwriting any data we might have in them, and producing mystifying errors later in our analysis.

The SPSS example uses a set of data on the psychology of tax avoidance. An appropriate command file would be as follows:

    title test regression
    get file='/singer1/eps/psybin/stats/tax.sys'
    regression variables=index free1 to law5
      /statistics=defaults
      /missing=meansubstitution
      /dependent=index
      /method=enter
    finish

- The **regression** command indicates that one or several regression analyses are to be carried out, and is followed by a list of all the variables that are to be used, either as dependent or as independent variables. In this case they include an index of tax evasion, and 15 questionnaire items measuring alienation, free-rider tendencies and attitudes to the law.
- The **/statistics** line can be used to control what sort of output we get. The default output is very similar to Minitab's regression output.
- The **/missing** line tells the system how to deal with missing values. Replacing them with the mean, as here, is not very satisfactory, but was necessary with this data set because there were too many missing values to discard all cases involving missing values on any variable (the more usual procedure).
- The **/dependent** line tells the system which variable will be the dependent variable. If we give no other information, all the others will be used as independent variables. This line must come immediately before the /method line.
- The **/method** line tells the system how to use the independent variables. The simple **enter** option used here will run one regression, using all the independent variables. We shall look at other possibilities in a later lecture.

Output from this file is given at the end of this handout. It shows that the 15 questionnaire items do quite a good job of predicting tax avoidance.

1. The following are the IQ scores on the Verbal and Numerical scales of a certain test for a group of students:

    Verbal:     98 120  85  97 100 132 124  88  91 144
    Numerical:  92 105 100  92  93 144 143  75  85 121

Use Minitab to calculate the mean and standard deviation of the scores on each scale. Use LET to work out the difference between them and put it in a new column. Use TTEST on this column to see whether there is a significant difference between the verbal and numerical scores.

2. Using the data from the previous example, work out the regression line for predicting Numerical scores (dependent variable) from Verbal scores (independent variable).

3. A social psychologist observes the scores achieved on a video game in a pub, by the first new (previously unobserved) player to use the machine after each half hour through the evening. They are as follows:

    Time:    6pm  6.30   7pm  7.30   8pm  8.30   9pm  9.30  10pm 10.30
    Score:  1760   995  2130   770  1535  3975  2120  5660  3341  4995

Use SPSS to investigate whether the data support the psychologist's hypothesis that more expert players use the machine later in the evening. What would be the most likely score to observe at 9.45pm?

4. The following data show the levels of anxiety recorded by a paper-and-pencil test just before a group of students took an examination, together with the exam marks obtained. Use Minitab's PLOT command to decide whether it would be appropriate to use linear regression to summarize these data.

    Anxiety score:   5 17 10 12  3 19  2 11  9  8 13 18  4  7
    Exam mark:      45 20 55 72 45 39 50 75 60 57 58 52 43 57

5. The Singer file **/singer1/eps/psybin/stats/teengamb.DAT** contains,
for each of 47 teenagers, the following information:

- subject number
- gender (0=male, 1=female)
- status (arbitrary scale based on parents' occupation. Higher numbers => higher status)
- income (pocket money+earnings) in pounds/wk
- verbal intelligence (number of words out of 12 correctly defined)
- estimate (from questionnaire answers) of expenditure on all forms of gambling, in pounds/yr

Each line of the file contains all 6 data items for a single person. These are real data, collected during an undergraduate project a few years ago, and since published (Ide-Smith & Lea, 1988, Journal of Gambling Behavior, 4, 110-118). Note, though, that you won't get quite the same results as in the published article, because I've cut out the data from some subjects whose data would have given you problems.

Set up a Minitab worksheet with columns with appropriate names, and read these data into it using READ. Note that you don't need to type the file extension (.DAT) because this is the default for READ, but if you do type it, you must use CAPITALS. The rest of the filename must be typed in lower case.

- Use Minitab's DESCRIBE command to get an overview of these data.
- Use multiple regression to see whether gambling can be predicted from status, income, and verbal intelligence. How good is the prediction? Which of the variables has most effect? Give a description in words, as precise as you can make it, of the most significant effect.
- Use TWOT to find out how gambling was affected by gender in this sample.
- Find out what happens if you include gender as a predictor variable in the REGRESS command.
- Write out a command file that would have carried out the same analyses using SPSS.

    MTB > set c1
    DATA> 74 82 15 23 35 54 12 28 66 43 55 31 83 29 53 32
    DATA> end
    MTB > set c2
    DATA> 120 55 350 210 185 110 730 150 61 175 121 225 45 325 171 103
    DATA> end
    MTB > set c3
    DATA> 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 1
    DATA> end
    MTB > set c4
    DATA> 33 28 47 55 32 63 59 68 27 32 42 51 47 33 51 20
    DATA> end
    MTB > name c1 'depress'
    MTB > name c2 'income'
    MTB > name c3 'm0f1'
    MTB > name c4 'age'
    MTB > regress c1 3 c2-c4

    The regression equation is
    depress = 68.3 - 0.0934 income + 3.31 m0f1 - 0.162 age

    Predictor       Coef     Stdev   t-ratio       p
    Constant       68.28     15.44      4.42   0.001
    income      -0.09336   0.02937     -3.18   0.008
    m0f1           3.306     8.942      0.37   0.718
    age          -0.1617    0.3436     -0.47   0.646

    s = 17.70    R-sq = 52.0%    R-sq(adj) = 39.9%

    Analysis of Variance

    SOURCE        DF        SS        MS      F      p
    Regression     3    4065.4    1355.1   4.32  0.028
    Error         12    3760.0     313.3
    Total         15    7825.4

    SOURCE        DF    SEQ SS
    income         1    3940.5
    m0f1           1      55.5
    age            1      69.4

    Continue? y

    Unusual Observations
    Obs.  income  depress     Fit  Stdev.Fit  Residual  St.Resid
      7      730    12.00   -6.10      15.57     18.10    2.15RX

    R denotes an obs. with a large st. resid.
    X denotes an obs. whose X value gives it large influence.

Output from the SPSS regression on the tax data set

some blank lines have been removed

    * * * *  M U L T I P L E   R E G R E S S I O N  * * * *

    Mean Substituted for Missing Data

    Equation Number 1    Dependent Variable..   INDEX   Evasion measure

    Block Number  1.  Method:  Enter

    Variable(s) Entered on Step Number
       1..    LAW5
       2..    FREE2
       3..    ALIEN2
       4..    LAW3
       5..    LAW4
       6..    LAW2
       7..    LAW1
       8..    ALIEN4
       9..    ALIEN1
      10..    ALIEN5
      11..    FREE5
      12..    ALIEN3
      13..    FREE3
      14..    FREE4
      15..    FREE1

    Multiple R           .93111
    R Square             .86696
    Adjusted R Square    .80460
    Standard Error      1.93857

    Analysis of Variance
                  DF    Sum of Squares    Mean Square
    Regression    15         783.65847       52.24390
    Residual      32         120.25820        3.75807

    F =   13.90179     Signif F =  .0000

    * * * *  M U L T I P L E   R E G R E S S I O N  * * * *

    Equation Number 1    Dependent Variable..   INDEX   Evasion measure

    ------------------ Variables in the Equation ------------------

    Variable              B       SE B       Beta        T   Sig T
    LAW5            .103593    .172898    .046600     .599   .5533
    FREE2         -1.278802    .641764   -.399007   -1.993   .0549
    ALIEN2         -.177951    .513296   -.064325    -.347   .7311
    LAW3           -.269736    .224503   -.117014   -1.201   .2384
    LAW4            .294076    .286945    .112165    1.025   .3131
    LAW2            .224659    .350312    .084970     .641   .5259
    LAW1            .106083    .234746    .037769     .452   .6544
    ALIEN4          .353269    .339096    .130039    1.042   .3053
    ALIEN1         1.227092    .252610    .567362    4.858   .0000
    ALIEN5          .272150    .293067    .125177     .929   .3600
    FREE5          -.833464    .339398   -.340064   -2.456   .0197
    ALIEN3         -.059760    .345620   -.025101    -.173   .8638
    FREE3         -1.531610    .668878   -.485904   -2.290   .0288
    FREE4         -2.142148    .657277   -.670692   -3.259   .0027
    FREE1          4.401877    .721189   1.828551    6.104   .0000
    (Constant)     -.845601   4.572997               -.185   .8545

Stephen Lea

University of Exeter Department of Psychology

Washington Singer Laboratories

Exeter EX4 4QG

United Kingdom

Tel +44 1392 264626

Fax +44 1392 264623
