 Minitab and multiple regression: Introduction

These notes are designed to help you remember what the introductory lecture to the multiple regression part of the course was about. They are not explanations! For those, you will have to listen to the lecture and/or do some reading. In particular, the terms printed in bold type are all things which you should understand by the end of the course. Many of them you will already know; some will be explained in the course of this lecture. In some cases we will explain them later in the course.

The first multiple regression lecture has three aims:

1. To remind you to get yourselves ready to use the department's network server computer, singer, to access the statistics package called Minitab, and use it for simple statistical operations.
2. To explain what the statistical procedure called Multiple Regression is, how it relates to other procedures, and what its uses are.
3. To warn you of some rules that must be obeyed if multiple regression is to give meaningful results.

1. Using Minitab on singer.

You have already been taught how to do this. Before next week's class, take a few minutes to remind yourself how to do it. Remember that you must have a network password and user name valid for the current academic year. If for some reason you have not, you must go to the computer centre in the Laver building, taking your Guild card, and get a new password. Then you must log in to singer using the password issued to you, and change it to something more sensible and memorable.
At next week's class, I shall assume that you can:
• Turn on one of the Apple Macs in room 220, and use the NCSA Telnet program to set it up as a terminal.
• Call Minitab.
• Enter data into Minitab using the SET command
• Use Minitab's DESCRIBE command to find the mean, standard deviation, median, and range of a set of variables, and work out their variance.
• Use Minitab's HISTOGRAM command to look at the distribution of a set of data.
• Use Minitab's PLOT command to plot a scattergram.
• Stop Minitab when you have finished with it.
• Log out of singer.
Even if you think you can do all these things, please try them again before next week's class, to make sure.
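If you want to check a set of summary statistics by hand, the following is a minimal sketch in Python (not part of the course software; the scores are invented for illustration). It assumes the sample standard deviation, i.e. the one that divides by n - 1, and shows that the variance is simply the square of the standard deviation.

```python
import statistics

# Invented example scores; any list of numbers would do.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)            # sample standard deviation (divides by n - 1)
median = statistics.median(scores)
value_range = max(scores) - min(scores)  # range = largest value minus smallest value
variance = sd ** 2                       # variance is the square of the standard deviation

print(f"mean={mean} sd={sd:.3f} median={median} range={value_range} variance={variance:.3f}")
```

The same quantities in Minitab come from the DESCRIBE command; only the last step (squaring the standard deviation) has to be done by you.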

2. What is multiple regression, where does it fit in, and what is it good for?

Multiple regression is the simplest of the large family of multivariate statistical techniques. That means it deals with numerous variables at the same time. Other multivariate techniques used in psychology include factor analysis, item analysis, multivariate analysis of variance (manova), discriminant analysis, path analysis, cluster analysis, and multidimensional scaling. Multiple regression is a manifest variables technique (i.e. it says things about the variables you actually measured), not a latent variables technique (these use hypothetical underlying quantities to account for the observed data).
Mathematically, multiple regression is a straightforward generalisation of simple regression, the process of fitting the best straight line through the dots on an x-y plot or scattergram. We will discuss what "best" means in this context in the next lecture.
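As a concrete picture of simple regression, here is a minimal Python sketch that fits a straight line to a handful of x-y points. It assumes the usual least-squares criterion (the line that minimises the sum of squared vertical distances from the points), which is one definition of "best"; the function name fit_line and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares, both taken about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return intercept, slope


# Invented data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x")
```

Multiple regression does the same job with several x variables at once, so that the fitted equation has one slope for each independent variable.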
Regression (simple and multiple) techniques are closely related to the analysis of variance (anova) which you studied last term. Both regression and anova are special cases of a single underlying mathematical model. You can combine the two, giving what is called an analysis of covariance (ancova), which we will introduce briefly later this term.
Two main points distinguish multiple regression from these other techniques:
• In multiple regression, we work with one dependent variable and many independent variables. In simple regression, there is only one independent variable; in factor analysis, cluster analysis and most other multivariate techniques, there are many dependent variables.
• In multiple regression, the independent variables may be correlated. In analysis of variance, we arrange for all the independent variables to vary completely independently of each other.
This means that multiple regression is useful in the following general class of situations. We observe one dependent variable, whose variation we want to explain in terms of a number of other independent variables, which we can also observe. These other variables are not under experimental control; we just have to accept the variations in them that happen to occur in the sample of people or situations we can observe. We want to know which, if any, of these independent variables is significantly correlated with the dependent variable, taking into account the various correlations that may exist between the independent variables. So typically we use multiple regression to analyse data that come from "natural" rather than experimental situations. This makes it very useful in social psychology, and social science generally, and also in biological field work. Note, however, that it is inherently a correlational technique; it cannot of itself tell us anything about the causalities that may underlie the relationships it describes. Also, as with all statistical inference, the data need to be a random sample from some specified population; the technique will allow us to draw inferences from our sample to that population, but not to any other.
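To make the situation concrete, here is a minimal Python sketch of a multiple regression with one dependent variable and two independent variables that are correlated with each other. It is an illustration under stated assumptions, not the procedure Minitab uses internally: it assumes a least-squares fit obtained by solving the normal equations, and the function names (solve, fit_multiple_regression) and the data are invented.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest entry in this column, for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(predictors, y):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1 gives the intercept
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented data: x1 and x2 are correlated, but not perfectly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
# The dependent variable is constructed to follow y = 3 + 2*x1 - x2 exactly.
y = [3 + 2 * a - b for a, b in zip(x1, x2)]

coefs = fit_multiple_regression(list(zip(x1, x2)), y)
print([round(c, 3) for c in coefs])
```

Even though x1 and x2 rise and fall together, the fit recovers a separate coefficient for each. With real data the points will not lie exactly on the fitted surface, and the coefficients will only be estimates.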

3. Rules for using multiple regression

There are some additional rules that have to be obeyed if multiple regression is to be useful:
• The dependent variable should be measured on an interval (continuous) scale. In practice an ordinal (ranking or rating) scale is usually good enough. If it is only measured on a nominal (unordered category, including dichotomies) scale, we have to use other techniques (discriminant analysis or logistic regression). These are beyond the scope of this course, though the course should put you in a position where you could learn about them (e.g. from the notes for the PSY6003 course) if you had to.
• The independent variables should be measured on interval scales. However, most ordinal scale measurement will be acceptable in practice; 2-valued categorical variables (dichotomies) can be used directly; and there is a way of dealing with k-valued categorical variables (k usually stands for any integer greater than 2), by dummy variables, which we will discuss later in the course.
• The distributions of all the variables should be normal. If they are not roughly normal, this can often be corrected by using an appropriate transformation (e.g. taking logarithms of all the measurements).
• The relationships between the dependent variable and each independent variable should be linear. That is, it should be possible to draw a rough straight line through an xy scattergram of the observed points. If the line looks curved, but is monotonic (increases or decreases all the time), things are not too bad and could be made better by transformation. If the line looks U-shaped, we will need to take special steps before regression can be used.
• There must be no interactions, in the anova sense, between independent variables: the effect of each on the dependent variable must be roughly independent of the effects of all others. However, if interactions are obviously present, and not too complex, there are special steps we can take to cope with the situation.
• Although the independent variables can be correlated, there must be no perfect (or near-perfect) correlations among them, a situation called multicollinearity (which will be explained later in the course).
• There are also requirements on the distributions of error, too technical to be considered in this course.
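Two of the rules above lend themselves to a short illustration. The Python sketch below is a rough picture only, with invented names and data. It assumes one common dummy-coding scheme, in which a k-valued categorical variable becomes k - 1 variables coded 0 or 1, with the remaining category acting as the reference. It also assumes that looking at the correlation between a pair of independent variables is an adequate first screen for multicollinearity.

```python
import math


def dummy_code(values):
    """Turn a k-valued categorical variable into k - 1 dummy (0/1) variables.

    The first category in sorted order is treated as the reference category
    and gets no dummy variable of its own (an assumption of this sketch).
    """
    categories = sorted(set(values))
    others = categories[1:]
    return {cat: [1 if v == cat else 0 for v in values] for cat in others}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented 3-valued categorical variable: "east" becomes the reference category.
region = ["north", "south", "east", "south", "north", "east"]
dummies = dummy_code(region)
print(dummies)

# Invented example of perfect correlation: the same heights in two different units.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]
r = pearson_r(height_cm, height_in)
print(f"r = {r:.4f}")
```

Two independent variables as closely related as the two height measures carry the same information, so only one of them should be entered into the regression.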

Stephen Lea

University of Exeter

Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623