When we are at the exploratory stage of some research, we quite often
collect data on a large number of independent variables, in the hope of
finding out which of them will best predict a dependent variable. How
can we do this?
The obvious solution is to put them all into a multiple regression. However,
the result may be quite disappointing: we may get a low R²adj
value, and a lot of the regressors may have small, nonsignificant coefficients.
It's quite possible that a regression using some subset of the regressors
would have been better in R²adj terms. In
that case it would also, obviously, be a more efficient and compact description
of the data. How can we go about looking for such a simpler model?
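One way would simply be to drop out any regressors that don't have significant
coefficients. However, it can be shown that this isn't the best solution: when
regressors are correlated with one another, dropping one changes the
coefficients and significance of those that remain.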
Minitab offers two better techniques, using commands BREG and STEPWISE.
Before we can consider how to use them, we need to know what makes one
regression model (which is what we call a set of regressors) better
than another.
If we consider all the possible models with the same number of regressors,
the answer is straightforward: we should choose the model with the highest
value of R²adj. This will also be the model
with the highest value of R², and the highest value of
F. Usually, though, the question is which of two models involving
different numbers of regressors is better: in particular, whether
it is worth adding an additional independent variable, or worth dropping
out one we originally put in. In this case, there are two possible answers.
We could choose the model with the highest value of R²adj,
or we could choose the model with the highest value of F. There
are arguments in favour of either policy. The model with the highest R²adj
will almost always have more regressors in it than the model with the highest
F. Therefore, a reasonable rule is: If you want the most complete
description of the data, which is also reasonably efficient, choose the
model with the highest R²adj value. If you
want the most efficient description, which is also reasonably complete,
choose the model with the highest F. Different research projects
will call for different strategies.
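
To see how the two criteria can disagree, here is a minimal sketch in Python (not Minitab). It uses the standard formulas for R²adj and the overall F in terms of R², the number of cases n and the number of regressors k. The two models and their R² values are invented purely for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k regressors fitted to n cases."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def overall_f(r2, n, k):
    """Overall F for the same model, on k and n - k - 1 degrees of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Hypothetical figures: 50 cases, a 2-regressor and a 5-regressor model.
n = 50
small = {"k": 2, "r2": 0.40}
large = {"k": 5, "r2": 0.46}

for model in (small, large):
    model["r2adj"] = adjusted_r2(model["r2"], n, model["k"])
    model["f"] = overall_f(model["r2"], n, model["k"])
    print(model["k"], round(model["r2adj"], 3), round(model["f"], 2))
```

With these made-up figures the five-regressor model has the higher R²adj (about 0.40 against 0.37), while the two-regressor model has much the higher F (about 15.7 against 7.5).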
But how can we find models having these desirable properties? Minitab
will do it for us. The most straightforward way is to use command BREG.
This simply tries every possible regression model with a given set of regressors.
This is easy to understand, but it can take a long time, and for this reason,
Minitab will only do it with a limited number of regressors (not more than
20). With 20 regressors, there are over 1 million models to be tried (2^20 - 1 = 1,048,575).
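
The figure of just over a million comes from counting every non-empty subset of the regressors. A quick check, using Python purely as a calculator (the function name count_models is made up for this example):

```python
from math import comb


def count_models(k):
    """Number of distinct regression models using at least one of k regressors."""
    return sum(comb(k, size) for size in range(1, k + 1))


models_for_6 = count_models(6)    # a small problem with six regressors
models_for_20 = count_models(20)  # the largest case mentioned above
print(models_for_6, models_for_20)  # prints: 63 1048575
```

Each extra regressor roughly doubles the number of models, which is why an exhaustive search soon becomes impractical.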
Using BREG is easy. If the dependent variable was in C10, and you had 6
independent variables in C1 to C6, you would just type
BREG C10 C1-C6
and Minitab would do the rest. Notice that you don't have to tell BREG
how many regressors there are (though it won't matter if you do). There
is a sample of BREG output with the examples sheet for this class.
The output from BREG tells you which regressors are included in the best
two models for each possible number of regressors. It also tells you their
R²adj values, so you can pick out the best
model of all in R²adj terms directly. It doesn't
tell you their F values, so if you want the best model in F
terms, you will have to use the REGRESS command on the best R²adj
model for each number of regressors, and look to see which one has the
best F.
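
The idea behind this kind of output can be sketched in a few lines of Python. This is a brute-force illustration only, and should not be taken as Minitab's actual algorithm: it fits every subset by ordinary least squares and keeps the best model of each size in R²adj terms (BREG reports the best two). The data, the column names and the helper functions solve, r_squared and best_subsets are all made up for the example.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, columns):
    """R-squared of the least-squares regression of y on columns plus a constant."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def best_subsets(y, predictors):
    """For each model size k, return (adjusted R-squared, names) of the best subset."""
    n = len(y)
    names = sorted(predictors)
    best = {}
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            r2 = r_squared(y, [predictors[name] for name in subset])
            r2adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
            if k not in best or r2adj > best[k][0]:
                best[k] = (r2adj, subset)
    return best


# Invented data: y was built as 1 + 2*C1 + 3*C2 plus a little noise; C3 is irrelevant.
predictors = {
    "C1": [1, 2, 3, 4, 5, 6, 7, 8],
    "C2": [5, 3, 8, 1, 7, 2, 6, 4],
    "C3": [1, 0, 1, 0, 0, 1, 1, 0],
}
y = [18.1, 13.8, 31.1, 12.0, 31.9, 19.2, 32.9, 29.0]

best = best_subsets(y, predictors)
for k in sorted(best):
    print(k, best[k][1], round(best[k][0], 4))
```

With these invented data the best one-regressor model uses C2 and the best two-regressor model uses C1 and C2, which is how the data were constructed.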
Even if you are interested in the best R²adj
model, you should proceed to use REGRESS on the set of regressors it identifies,
so you can find out the values of the coefficients and their significance.
You'll also need the F value to assess the significance of the model
as a whole, though you should note that choosing a model like this biases
the procedure in favour of producing large R²adj
values, and so undermines the logic of significance testing; so it isn't
very surprising if the best regression model is reported as significant.
Because of the amount of computation involved in doing a BREG, this sort
of command has only recently been introduced. Many statistical packages
don't have it, and as we've seen, Minitab won't use it for large numbers
of regressors. So you may have to use the alternative procedure, STEPWISE,
which is almost always available. In Minitab, it looks for the model with
the best F, rather than the best R²adj.
It works by starting with no regressors and trying each
one in turn until it finds the one that gives the biggest gain in F.
It then repeats this, adding a second regressor, and then a third, etc.,
until the gains in F become negligible. Once it has at least one regressor
in the model, it will also try taking each one out at each stage. You can
also make it start with all the regressors in the model and work downwards.
This procedure isn't as thorough as BREG, and it can find a local maximum
of F instead of the true maximum. STEPWISE is also a lot more complicated
to use than BREG, so for this course you just need to know that it exists:
you'll be able to do all but one of the examples, and all the test papers,
with BREG.
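
The local maximum problem can be shown with a toy example in Python. This is a simplified sketch of the step-by-step search described above, not Minitab's actual STEPWISE procedure, and the F values in the table are invented purely for illustration (a real search would fit a regression for each model). The names F_VALUES and stepwise are made up for this example.

```python
# Hypothetical F values for every non-empty subset of three regressors A, B, C.
# These numbers are invented; they do not come from real data.
F_VALUES = {
    frozenset("A"): 10.0,
    frozenset("B"): 8.0,
    frozenset("C"): 7.0,
    frozenset("AB"): 11.0,
    frozenset("AC"): 10.5,
    frozenset("BC"): 14.0,
    frozenset("ABC"): 9.5,
}
REGRESSORS = frozenset("ABC")


def stepwise(f_values, regressors):
    """Start with no regressors; add or drop one at a time while F improves."""
    current = frozenset()
    current_f = float("-inf")
    while True:
        # Neighbouring models: add one regressor, or drop one already in.
        neighbours = [current | {r} for r in regressors - current]
        neighbours += [current - {r} for r in current if len(current) > 1]
        if not neighbours:
            return current, current_f
        best = max(neighbours, key=lambda model: f_values[model])
        if f_values[best] <= current_f:
            return current, current_f  # no single step improves F: stop
        current, current_f = best, f_values[best]


found, found_f = stepwise(F_VALUES, REGRESSORS)
true_best = max(F_VALUES, key=F_VALUES.get)  # exhaustive search, as BREG would do
print(sorted(found), found_f)                # prints: ['A', 'B'] 11.0
print(sorted(true_best), F_VALUES[true_best])  # prints: ['B', 'C'] 14.0
```

The search picks A first because it is the best single regressor, and from then on no single addition or removal leads to the model with B and C, even though that model has the highest F of all.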
To do this week's examples, you'll need to be able to use the Minitab
commands RETRIEVE and SAVE. Before we can explain what they
do, we need to understand the difference between text files and
binary files.
A text file contains information in a very simple, standard code which
can be interpreted by a wide variety of programs; the code most often used
is called ASCII, but text files usually don't use the full list of 128
ASCII codes. The list of symbols allowed in text files varies a bit between
programs, but you can rely on being allowed the 26 letters of the English
alphabet in both capitals and lower case; the digits 0-9; some but not all
punctuation symbols; and some but not all mathematical symbols. In addition
you will always be allowed the control code ENTER (used to mark ends of
lines). The advantage of text files is that almost any program and almost
any computer can use them. So they are used for moving data between one
computer and another, or one program and another. For example, we can prepare
a data file in a word processor on a Macintosh, output it as a text file,
transfer that to singer, display it on the screen, edit it, or send it
to a printer, and read it into Minitab for statistical work.
Almost all programs also make use of binary files. These are files that
contain codes specific to a particular program or group of programs. Often
they are specific to a particular computer as well. For example, word processors
produce binary files that contain codes for different fonts, page layouts
etc, as well as the text you are writing. You couldn't take the binary
file produced by a word processor and read it straight into Minitab: the
codes in it would make no sense. The advantage of binary files is that
they are handled faster than the corresponding text files, and can
contain more varied information.
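
The difference can be seen by storing the same three numbers both ways. The sketch below is in Python and is only a generic illustration: the binary layout used here (8-byte floating-point numbers packed with the struct module) is an assumption chosen for the example, and is not the format of a Minitab worksheet or of any word processor.

```python
import struct

values = [12.5, 7.0, 3.25]

# Text form: one number per line, using only digits, a point and line ends.
text_bytes = "\n".join(str(v) for v in values).encode("ascii") + b"\n"

# Binary form: each number packed as an 8-byte double (little-endian).
binary_bytes = struct.pack("<3d", *values)

print(text_bytes)         # readable by almost any program
print(len(binary_bytes))  # 24 bytes that only make sense if you know the layout

# A program that knows the layout can get the numbers back exactly.
recovered = list(struct.unpack("<3d", binary_bytes))
print(recovered)
```

Any program can display the text version, but the binary version is meaningless unless the reading program knows exactly how it was written.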
Minitab uses both text and binary files. Its text files just contain columns
of data, usually numbers. Its binary files are called worksheet
files. They can contain data in columns, but also column names and other
information.
So far, we have been using text files to get data into Minitab, using the command READ. If the data are in a worksheet file, we have to use the command RETRIEVE instead. It is easier to use than READ. A worksheet file is the complete list of the columns you have in use, together with their names and the data set into them. To get a worksheet file into Minitab, type RETRIEVE followed by the filename. Note the following rules carefully:
Once you've used RETRIEVE it's a good idea to type INFO to check what
data you have retrieved.
It's very important to learn the difference between READ and RETRIEVE.
Learn the following by heart:
(i) READ is used with text files, RETRIEVE with worksheet (binary) files;
(ii) both commands are followed by a filename in single quotes;
(iii) READ expects filenames to end in .DAT, RETRIEVE expects them to end in
.MTW. These default extensions can be left off when typing the filename
within Minitab;
(iv) with READ you have to specify column numbers after the filename, with RETRIEVE
you do not;
(v) after using READ you will have to set up column names using the NAME command
(and any old names will be left intact until you do - this can be very
misleading); RETRIEVE wipes out old column names and puts in the names from
the worksheet file.
More time is wasted, in tests and in real use of Minitab, by getting mixed
up between these two commands than in any other way.
Sometimes we want to get data out of Minitab as well as into it. Each
of the commands READ and RETRIEVE has an "opposite number" which
puts data out instead of getting it in. READ's opposite number is called
WRITE, but we don't need it very often. RETRIEVE's opposite number is called
SAVE, and we need it frequently. We use it to create a worksheet file after
we have got some data into Minitab from READ, given it column names, and
so forth. To use it, we just type SAVE followed by a filename. When using
SAVE, note carefully:
(i) the filename must be given in single quotes.
(ii) if there are no dots in the filename, Minitab will add .MTW to the
end of your filename.
(iii) if you change your worksheet after you've SAVEd, the changes won't
be included in the file: SAVE again if you want to keep them.
(iv) if you are using SAVE to save some work you have been doing in Minitab,
you must do it BEFORE typing STOP to leave Minitab: as soon as you type
STOP all your work is lost;
(v) the file produced by SAVE is a worksheet file, not a text file. So
you won't be able to look at it on the screen, print it out, etc. BUT
(vi) if you reenter Minitab on a subsequent occasion, even if you have
meanwhile logged right out of the computer, you can call the worksheet
back in and continue working where you left off when you SAVEd, by calling
the file back in using RETRIEVE.
Often, we want a printout ("hard copy") of Minitab's output. To do this, we have to first put the output in a file. This is done by using Minitab's command OUTFILE, for example by typing
OUTFILE 'SNOEK'
(note the quotes - they are essential). After you have typed this, all
subsequent output to the screen will also be sent to a unix file
called SNOEK.LIS, until you cancel the instruction by using the command
NOOUTFILE (you don't have to specify the filename). Obviously, you can
use more sensible filenames. Note that results that came out before
you typed OUTFILE won't be in the file.
Once you have typed NOOUTFILE, you can use unix commands to work
with SNOEK.LIS. For example, you can use command ls to show that
it is now in your directory. It is a text file, so you can also look at
it with the UNIX command more, print it with lpr or pcprint,
or download it to a Mac or PC for editing. There are three ways you can
do this:
(i) Leave Minitab permanently by typing STOP. Don't forget to SAVE your work first if you have modified the worksheet and want to keep the modifications.
(ii) Leave Minitab temporarily by typing SYSTEM. You can then type some unix commands, finishing with the unix command exit which should bring you back into Minitab where you left it. Just in case it fails, make sure you SAVE first if you want to keep your work.
(iii) If you just want to give a single unix command, for example to print the output file, you can type SYSTEM followed by the unix command, all within Minitab, e.g.
SYSTEM lpr SNOEK.LIS
(obviously, you use your own file name instead of SNOEK)
Note, in this last case, that:
This will send the printout to the printer by the blackboard in the undergraduate lab. This printer is rather slow and cannot cope with huge amounts of student output; it also produces a lot of waste paper. So please be moderate in your use of it, and think carefully whether you have got all the information you need before asking for a printout. Your printout will be identified with a "banner" containing your userid; don't take anyone else's by mistake! There is another way of getting a printout in the department, by copying the file to a Mac and printing it on one of the Mac network printers. If you know how to do this, fine; if you don't, don't worry about it.
Text files can be moved freely between computers, e.g. from singer to
a PC or Mac, or vice versa; that is one of the reasons we use them. However,
if you have been doing a lot of work on some data, which you have stored
in a worksheet, you might want to move the whole worksheet to a different
type of computer without going right back to the data file. You can do
that by using the Minitab subcommand PORTABLE on the SAVE and RETRIEVE
commands. To use subcommands, you have to end a command line with a semi-colon;
Minitab then gives you the prompt SUBC>. End the last subcommand with
a full stop. So to save a worksheet into a file called SNOEK that you could
move to a different type of computer, you would type
MTB > SAVE 'SNOEK';
SUBC> PORTABLE.
To load this file into Minitab, you would type
MTB > RETRIEVE 'SNOEK';
SUBC> PORTABLE.
It often happens that in a large study, some items of data are not available for some respondents - perhaps because of errors, perhaps because of non-response, perhaps because of the design of the study, or perhaps because the data simply couldn't be collected. Minitab allows you to specify such "missing values" when entering data. Usually, missing values are indicated by an asterisk when data are entered or printed out. Most Minitab commands will behave sensibly if they encounter them. For example, if REGRESS, BREG or STEPWISE encounter a missing value in one of the variables they are working on, they will ignore all the data for that person.
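
What "ignoring all the data for that person" amounts to can be sketched in Python. This is an illustration of the idea only, not Minitab's own code; the data are invented, None stands in for the asterisk, and the function name complete_cases is made up for the example.

```python
MISSING = None  # stands in for the asterisk Minitab uses for a missing value

# One dictionary per respondent; all values are invented for illustration.
cases = [
    {"C1": 4.0, "C2": 11.0, "C3": 2.0, "C10": 30.0},
    {"C1": 6.0, "C2": MISSING, "C3": 5.0, "C10": 41.0},
    {"C1": 5.0, "C2": 9.0, "C3": MISSING, "C10": 35.0},
    {"C1": 7.0, "C2": 14.0, "C3": 1.0, "C10": MISSING},
]


def complete_cases(cases, columns):
    """Keep only respondents with a value in every column the analysis uses."""
    return [case for case in cases
            if all(case[c] is not MISSING for c in columns)]


used_all = complete_cases(cases, ["C10", "C1", "C2", "C3"])
used_two = complete_cases(cases, ["C10", "C1", "C2"])
print(len(used_all), len(used_two))  # prints: 1 2
```

Only the variables actually used in the analysis matter: the third respondent, whose only missing value is in C3, is dropped from the first analysis but kept in the second.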