Minitab and multiple regression: Introduction
These notes are designed to help you remember what the introductory
lecture to the multiple regression part of the course was about. They are
not explanations! For those, you will have to listen to the lecture and/or
do some reading. In particular, the terms printed in bold type are
all terms you should understand by the end of the course. Many of
them you will already know; some will be explained in the course of this
lecture; others we will explain later in the course.
The first multiple regression lecture has three aims:
- To remind you to get yourselves ready to use the department's network
server computer, singer, to access the statistics package called Minitab,
and use it for simple statistical operations.
- To explain what the statistical procedure called Multiple Regression
is, how it relates to other procedures, and what its uses are.
- To warn you of some rules that must be obeyed if multiple regression
is to give meaningful results.
1. Using Minitab on singer.
You have already been taught how to do this. Before next week's class,
take a few minutes to remind yourself how to do it. Remember that you must
have a network password and user name valid for the current academic year.
If for some reason you do not, you must go to the computer centre in
the Laver building, taking your Guild card, and get a new password. Then
you must log in to singer using the password issued to you, and change
it to something more sensible and memorable.
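If singer is a Unix system (an assumption on my part; follow whatever
instructions the computer centre gives you), changing your password once
logged in might look roughly like this:

    singer% passwd    # hypothetical prompt; passwd asks for the old password, then the new one twice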
At next week's class, I shall assume that you can:
- Turn on one of the Apple Macs in room 220, and use the NCSA Telnet
program to set it up as a terminal.
- Log in to singer.
- Call Minitab.
- Enter data into Minitab using the SET command.
- Use Minitab's DESCRIBE command to find the mean, standard deviation,
median, and range of a set of variables, and work out their variance.
- Use Minitab's HISTOGRAM command to look at the distribution of a set
of data.
- Use Minitab's PLOT command to plot a scattergram.
- Stop Minitab when you have finished with it.
- Log out of singer.
Even if you think you can do all these things, please try them again before
next week's class, to make sure; a sketch of such a session is given below.
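For reference, a minimal sketch of such a session follows. The data and
column numbers are invented for illustration, and the exact prompts and
syntax may differ slightly in the Minitab version installed on singer:

    MTB > SET C1                  # enter data into column C1
    DATA> 12 15 11 18 14 16
    DATA> END
    MTB > NAME C1 'SCORE'
    MTB > DESCRIBE C1             # mean, standard deviation, median, min and max
    MTB > HISTOGRAM C1            # look at the distribution of the data
    MTB > SET C2
    DATA> 3 5 2 7 4 6
    DATA> END
    MTB > PLOT C1 C2              # scattergram of C1 (vertical) against C2 (horizontal)
    MTB > STOP                    # leave Minitab when you have finished

Note that DESCRIBE reports the standard deviation rather than the variance;
to work out the variance, square the standard deviation.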
2. What is multiple regression, where does it fit
in, and what is it good for?
Multiple regression is the simplest of the large family of multivariate
statistical techniques, that is, techniques that deal with several variables
at the same time. Other multivariate techniques used in psychology include
factor analysis, item analysis, multivariate analysis of variance (manova),
discriminant analysis, path analysis, cluster analysis, and multidimensional
scaling. Multiple regression is a manifest variables technique (i.e.
it says things about the variables you actually measured), not a latent
variables technique (one that uses hypothetical underlying quantities to
account for the observed data).
Mathematically, multiple regression is a straightforward generalisation
of simple regression, the process of fitting the best straight
line through the dots on an x-y plot or scattergram. We will
discuss what "best" means in this context in the next lecture.
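To make the generalisation concrete (the notation here is introduced just
for illustration): in simple regression we fit an equation of the form

    Y = a + bX

whereas in multiple regression, with k independent variables, we fit

    Y = b0 + b1X1 + b2X2 + ... + bkXk

where Y is the dependent variable, X1 to Xk are the independent variables,
and the coefficients b0 to bk are chosen to make the fitted values match
the observed ones as closely as possible.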
Regression techniques (simple and multiple) are closely related to the
analysis of variance (anova) which you studied last term. Both
regression and anova are special cases of a single underlying mathematical
model. You can combine the two; the result is an analysis of
covariance (ancova), which we will introduce briefly later this term.
Two main points distinguish multiple regression from these other techniques:
- In multiple regression, we work with one dependent variable and
many independent variables. In simple regression, there is only
one independent variable; in factor analysis, cluster analysis and most
other multivariate techniques, there are many dependent variables.
- In multiple regression, the independent variables may be correlated.
In analysis of variance, we arrange for all the independent variables to
vary completely independently of each other.
This means that multiple regression is useful in the following general
class of situations. We observe one dependent variable, whose variation
we want to explain in terms of a number of other independent variables,
which we can also observe. These other variables are not under experimental
control; we just have to accept the variations in them that happen to occur
in the sample of people or situations we can observe. We want to know which,
if any, of these independent variables are significantly correlated with
the dependent variable, taking into account the various correlations that
may exist between the independent variables. So typically we use multiple
regression to analyse data that come from "natural" rather than experimental
situations. This makes it very useful in social psychology, and social
science generally, and also in biological field work. Note, however, that
it is inherently a correlational technique; it cannot of itself tell us
anything about the causalities that may underlie the relationships it describes.
Also, as with all statistical inference, the data need to be a random
sample from some specified population; the technique will allow
us to draw inferences from our sample to that population, but not to any other.
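As a preview, running such an analysis in classic Minitab session syntax
might look roughly like this (the variable names and column numbers are
hypothetical; check the exact syntax with Minitab's HELP command if in doubt):

    MTB > NAME C1 'WELLBEING' C2 'INCOME' C3 'AGE'
    MTB > REGRESS C1 2 C2 C3      # regress C1 on 2 predictors, C2 and C3

The output includes the fitted regression equation, a significance test for
each predictor's coefficient, and overall summary statistics.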
3. Rules for using multiple regression
There are some additional rules that have to be obeyed if multiple regression
is to be useful:
- The dependent variable should be measured on an interval (continuous)
scale. In practice an ordinal (ranking or rating) scale is usually
good enough. If it is only measured on a nominal (unordered category,
including dichotomies) scale, we have to use other techniques (discriminant
analysis or logistic regression). These are beyond the scope of this course,
though the course should put you in a position where you could learn about
them (e.g. from the notes for the PSY6003 course) if you had to.
- The independent variables should be measured on interval scales. However,
most ordinal scale measurement will be acceptable in practice; 2-valued
categorical variables (dichotomies) can be used directly; and there is a
way of dealing with k-valued categorical variables (k usually stands for
any integer greater than 2), using dummy variables, which we will discuss
later in the course (see also the sketch after this list).
- The distributions of all the variables should be normal. If they
are not roughly normal, this can often be corrected by using an appropriate
transformation (e.g. taking logarithms of all the measurements, as in the
sketch after this list).
- The relationships between the dependent variable and the independent
variables should be linear. That is, it should be possible to draw a rough
straight line through an x-y scattergram of the observed points. If the
line looks curved, but is monotonic (increases or decreases all
the time), things are not too bad and could be made better by transformation.
If the line looks U-shaped, we will need to take special steps before regression
can be used.
- There must be no interactions, in the anova sense, between independent
variables: the effect of each on the dependent variable must be roughly
independent of the effects of all the others. However, if interactions are
obviously present, and not too complex, there are special steps we can
take to cope with the situation.
- Although the independent variables can be correlated, there must be no
perfect (or near-perfect) correlations among them, a situation called
multicollinearity (which will be explained later in the course).
- There are also requirements on the distributions of errors, too technical
to be considered in this course.
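By way of illustration, some of the remedies mentioned above might look
roughly like this in classic Minitab session commands (the column numbers
are hypothetical, and command names such as INDICATOR and CORRELATION
should be checked against the version you are using):

    MTB > LET C10 = LOGE(C2)      # log transformation of a skewed variable
    MTB > HISTOGRAM C10           # check that the distribution now looks roughly normal
    MTB > INDICATOR C3 C11-C13    # dummy variables from a 3-valued categorical variable in C3
    MTB > CORRELATION C2 C4 C5    # inspect correlations among independent variables

If the correlations among the independent variables approach 1 (or -1),
multicollinearity is likely to be a problem.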
Stephen Lea
University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623
Document revised 2nd January 1997