Choosing between regression models; Minitab commands RETRIEVE and SAVE; missing values

Finding the best regression equation

When we are at the exploratory stage of some research, we quite often collect data on a large number of independent variables, in the hope of finding out which of them will best predict a dependent variable. How can we do this?
The obvious solution is to put them all into a multiple regression. However, the result may be quite disappointing: we may get a low R2adj value, and a lot of the regressors may have small, nonsignificant coefficients. It's quite possible that a regression using some subset of the regressors would have been better in R2adj terms. In that case it would also, obviously, be a more efficient, compact, description of the data. How can we go about looking for such a simpler model?
One way would simply be to drop out any regressors that don't have significant coefficients. However, it can be shown that this isn't the best solution. Minitab offers two better techniques, using commands BREG and STEPWISE. Before we can consider how to use them, we need to know what makes one regression model (which is what we call a set of regressors) better than another.
If we consider all the possible models with the same number of regressors, the answer is straightforward: we should choose the model with the highest value of R2adj. This will also be the model with the highest value of R2, and the highest value of F. Usually, though, the question is which of two models involving different numbers of regressors is better: in particular, whether it is worth adding an additional independent variable, or worth dropping out one we originally put in. In this case, there are two possible answers. We could choose the model with the highest value of R2adj, or we could choose the model with the highest value of F. There are arguments in favour of either policy. The model with the highest R2adj will almost always have more regressors in it than the model with the highest F. Therefore, a reasonable rule is: If you want the most complete description of the data, which is also reasonably efficient, choose the model with the highest R2adj value. If you want the most efficient description, which is also reasonably complete, choose the model with the highest F. Different research projects will call for different strategies.
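As a rough illustration of how the two criteria can disagree, the following Python sketch computes R2adj and F from R2, the number of cases n and the number of regressors k, using the standard formulas R2adj = 1 - (1 - R2)(n - 1)/(n - k - 1) and F = (R2/k) / ((1 - R2)/(n - k - 1)). The function names and the two sets of figures are invented for the example; they are not Minitab output.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k regressors fitted to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F for the regression, on k and n - k - 1 degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical figures: two models fitted to the same 30 cases.
small = {"r2": 0.50, "n": 30, "k": 2}
large = {"r2": 0.60, "n": 30, "k": 5}

for label, model in (("small", small), ("large", large)):
    print(label, round(adjusted_r2(**model), 3), round(f_statistic(**model), 2))
```

With these made-up figures the larger model has the higher R2adj, while the smaller model has the higher F, which is exactly the situation the rule above is meant to resolve.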

The Minitab commands BREG and STEPWISE.

But how can we find models having these desirable properties? Minitab will do it for us. The most straightforward way is to use command BREG. This simply tries every possible regression model with a given set of regressors. This is easy to understand, but it can take a long time, and for this reason, Minitab will only do it with a limited number of regressors (not more than 20). With 20 regressors, there are over 1 million models to be tried (2^20 - 1 = 1,048,575).
Using BREG is easy. If the dependent variable was in C10, and you had 6 independent variables in C1 to C6, you would just type
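The count of models can be checked with a short Python sketch: every non-empty subset of the regressors is one candidate model. The function name count_models and the column labels are just for illustration.

```python
from itertools import combinations
from math import comb


def count_models(n_regressors):
    """Number of non-empty subsets of the regressors, i.e. models to try."""
    return sum(comb(n_regressors, size) for size in range(1, n_regressors + 1))


# Every model that can be built from three regressors, listed explicitly.
columns = ["C1", "C2", "C3"]
models = [subset
          for size in range(1, len(columns) + 1)
          for subset in combinations(columns, size)]

print(len(models), count_models(3))   # both are 2**3 - 1 = 7
print(count_models(20))               # 2**20 - 1 = 1048575
```

Each extra regressor roughly doubles the number of models, which is why an exhaustive search soon becomes impractical.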

BREG C10 C1-C6

and Minitab would do the rest. Notice that you don't have to tell BREG how many regressors there are (though it won't matter if you do). There is a sample of BREG output with the examples sheet for this class.
The output from BREG tells you which regressors are included in the best two models for each possible number of regressors. It also tells you their R2adj values, so you can pick out the best model of all in R2adj terms directly. It doesn't tell you their F values, so if you want the best model in F terms, you will have to use the REGRESS command on the best R2adj model for each number of regressors, and look to see which one has the best F.
Even if you are interested in the best R2adj model, you should proceed to use REGRESS on the set of regressors it identifies, so you can find out the values of the coefficients and their significance. You'll also need the F value to assess the significance of the model as a whole, though you should note that choosing a model like this biases the procedure in favour of producing large R2adj values, and so undermines the logic of significance testing, so it isn't very surprising if the best regression model is reported as significant.
Because of the amount of computation involved in doing a BREG, this sort of command has only recently been introduced. Many statistical packages don't have it, and as we've seen, Minitab won't use it for large numbers of regressors. So you may have to use the alternative procedure, STEPWISE, which is almost always available. In Minitab, it looks for the model with the best F, rather than the best R2adj. It starts with no regressors and tries putting each one in, in turn, to find the one that gives the biggest gain in F. It then repeats this, adding a second regressor, and then a third, etc., until the gains in F become negligible. Once it has at least one regressor in the model, it will also try taking each one out at each stage. You can also make it start with all the regressors in the model and work downwards. This procedure isn't as thorough as BREG, and it can find a local maximum of F instead of the true maximum. STEPWISE is also a lot more complicated to use than BREG, so for this course you just need to know that it exists: you'll be able to do all but one of the examples, and all the test papers, with BREG.
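The kind of exhaustive search that BREG carries out can be sketched in plain Python. This is only an illustration of the idea (fit every subset, keep the best R2adj for each number of regressors); it is not Minitab's own algorithm, it keeps one model per size where BREG reports two, and the data and function names are invented for the example.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r2_adj(y, columns):
    """Adjusted R-squared of y regressed on the columns plus a constant."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k + 1)]
           for i in range(k + 1)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k + 1)]
    beta = solve(xtx, xty)                      # least-squares coefficients
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - k - 1)


def best_subsets(y, regressors):
    """Return {size: (best R2adj, column names)} over every non-empty subset."""
    best = {}
    names = sorted(regressors)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            score = r2_adj(y, [regressors[name] for name in subset])
            if size not in best or score > best[size][0]:
                best[size] = (score, subset)
    return best


# Made-up data: y depends on C1 and C2 (plus small noise) but not on C3.
regressors = {
    "C1": [1, 2, 3, 4, 5, 6, 7, 8],
    "C2": [2, 1, 4, 3, 6, 5, 8, 7],
    "C3": [3, 7, 1, 8, 2, 6, 5, 4],
}
y = [9.1, 7.8, 19.1, 18.0, 28.9, 28.2, 38.9, 38.0]

best = best_subsets(y, regressors)
for size, (score, subset) in sorted(best.items()):
    print(size, subset, round(score, 4))
```

As with BREG, the coefficients and F value for whichever model you choose would still have to come from a separate regression on that set of regressors.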

Text files and binary files

To do this week's examples, you'll need to be able to use the Minitab commands RETRIEVE and SAVE. Before we can explain what they do, we need to understand the difference between text files and binary files.
A text file contains information in a very simple, standard code which can be interpreted by a wide variety of programs; the code most often used is called ASCII, but text files usually don't use the full list of ASCII codes (128 in the standard set, 256 in the common 8-bit extensions). The list of symbols allowed in text files varies a bit between programs, but you can rely on being allowed the 26 letters of the English alphabet in both capitals and lower case; the digits 0 to 9; some but not all punctuation symbols; and some but not all mathematical symbols. In addition you will always be allowed the control code ENTER (used to mark ends of lines). The advantage of text files is that almost any program and almost any computer can use them. So they are used for moving data between one computer and another, or one program and another. For example, we can prepare a data file in a word processor on a Macintosh, output it as a text file, transfer that to singer, display it on the screen, edit it, or send it to a printer, and read it into Minitab for statistical work.
Almost all programs also make use of binary files. These are files that contain codes specific to a particular program or group of programs. Often they are specific to a particular computer as well. For example, word processors produce binary files that contain codes for different fonts, page layouts etc, as well as the text you are writing. You couldn't take the binary file produced by a word processor and read it straight into Minitab: the codes in it would make no sense. The advantage of binary files is that they are handled faster than the corresponding text files, and can contain more varied information.
Minitab uses both text and binary files. Its text files just contain columns of data, usually numbers. Its binary files are called worksheet files. They can contain data in columns, but also column names and other information.
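The difference between the two kinds of file can be seen in a small Python sketch that stores the same two numbers both ways. The binary layout used here (8-byte doubles packed with Python's struct module) is just an assumption chosen for illustration; it is not the format of a Minitab worksheet file.

```python
import struct

values = (3.5, 12.0)

# Text form: characters that any editor or program can display.
text_bytes = " ".join(str(v) for v in values).encode("ascii") + b"\n"

# Binary form: each number packed as an 8-byte IEEE double (little-endian).
binary_bytes = struct.pack("<2d", *values)

print(text_bytes)      # readable characters
print(binary_bytes)    # raw bytes, meaningless without knowing the layout

# Only a program that knows the layout can get the numbers back.
print(struct.unpack("<2d", binary_bytes))
```

The text version can be read by eye or by any program; the binary version is only useful to software that knows exactly how it was written, which is why a worksheet file has to be read back with RETRIEVE rather than displayed or printed.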

So far, we have been using text files to get data into Minitab, using the command READ. If the data are in a worksheet file, we have to use the command RETRIEVE instead. It is easier to use than READ. A worksheet file is the complete list of the columns you have in use, together with their names and the data set into them. To get a worksheet file into Minitab, type RETRIEVE followed by the filename. Note the following rules carefully:

1. the filename must be enclosed in single quotes;
2. if the filename ends in .MTW, you can leave that bit off; otherwise you must give the whole filename;
3. you don't need to specify any column numbers, since these are specified within the worksheet;
4. in UNIX filenames the difference between capitals and lower case matters, so copy filenames exactly.

Once you've used RETRIEVE it's a good idea to type INFO to check what data you have retrieved.
It's very important to learn the difference between READ and RETRIEVE. Learn the following by heart:
READ is used with text files, RETRIEVE with worksheet (binary) files;
Both commands are followed by a filename in single quotes;
READ expects filenames to end in .DAT, RETRIEVE expects them to end in .MTW. These default extensions can be left off when typing the filename within Minitab;
With READ you have to specify column numbers after the filename, with RETRIEVE you do not;
After using READ you will have to set up column names using the NAME command (and any old names will be left intact until you do - this can be very misleading); RETRIEVE wipes out old column names and puts on names out of the worksheet file.
More time is wasted, in tests and in real use of Minitab, by getting mixed up between these two commands than in any other way.

Storing data: WRITE and SAVE.

Sometimes we want to get data out of Minitab as well as into it. Each of the commands READ and RETRIEVE has an "opposite number" which puts data out instead of getting it in. READ's opposite number is called WRITE, but we don't need it very often. RETRIEVE's opposite number is called SAVE, and we need it frequently. We use it to create a worksheet file after we have got some data into Minitab from READ, given it column names, and so forth. To use it, we just type SAVE followed by a filename. When using SAVE, note carefully:
(i) the filename must be given in single quotes.
(ii) if there are no dots in the filename, Minitab will add .MTW to the end of your filename.
(iii) if you change your worksheet after you've SAVEd, the changes won't be included in the file - SAVE again if you want to keep them.
(iv) if you are using SAVE to save some work you have been doing in Minitab, you must do it BEFORE typing STOP to leave Minitab: as soon as you type STOP all your work is lost;
(v) the file produced by SAVE is a worksheet file, not a text file. So you won't be able to look at it on the screen, print it out, etc. BUT
(vi) if you reenter Minitab on a subsequent occasion, even if you have meanwhile logged right out of the computer, you can call the worksheet back in and continue working where you left off when you SAVEd, by calling the file back in using RETRIEVE.

Storing Minitab output

Often, we want a printout ("hard copy") of Minitab's output. To do this, we have to first put the output in a file. This is done by using Minitab's command OUTFILE, for example by typing

OUTFILE 'SNOEK'

(note the quotes - they are essential). After you have typed this, all subsequent output to the screen will also be sent to a UNIX file called SNOEK.LIS, until you cancel the instruction by using the command NOOUTFILE (you don't have to specify the filename). Obviously, you can use more sensible filenames. Note that results that came out before you typed OUTFILE won't be in the file.
Once you have typed NOOUTFILE, you can use UNIX commands to work with SNOEK.LIS. For example, you can use the command ls to show that it is now in your directory. It is a text file, so you can also look at it with the UNIX command more, print it with lpr or pcprint, or download it to a Mac or PC for editing. There are three ways you can get at UNIX commands to do this:

Leave Minitab permanently by typing STOP. Don't forget to SAVE your work first if you have modified the worksheet and want to keep the modifications.

Leave Minitab temporarily by typing SYSTEM. You can then type some UNIX commands, finishing with the UNIX command exit, which should bring you back into Minitab where you left it. Just in case it fails, make sure you SAVE first if you want to keep your work.

If you just want to give a single unix command, for example to print the output file, you can type SYSTEM followed by the unix command, all within Minitab, e.g.

SYSTEM lpr SNOEK.LIS

Note, in this last case, that:

1. Quotes are not required round the filename because when you use it you are talking to unix rather than to Minitab;
2. .LIS is required, and MUST be typed in CAPITALS.

This will send the printout to the printer by the blackboard in the undergraduate lab. This printer is rather slow and cannot cope with huge amounts of student output; it also produces a lot of waste paper. So please be moderate in your use of it, and think carefully whether you have got all the information you need before asking for a printout. Your printout will be identified with a "banner" containing your userid; don't take anyone else's by mistake! There is another way of getting a printout in the department, by copying the file to a Mac and printing it on one of the Mac network printers. If you know how to do this, fine; if you don't, don't worry about it.

Moving worksheets between computers

Text files can be moved freely between computers, e.g. from singer to a PC or Mac, or vice versa; that is one of the reasons we use them. However, if you have been doing a lot of work on some data, which you have stored in a worksheet, you might want to move the whole worksheet to a different type of computer without going right back to the data file. You can do that by using the Minitab subcommand PORTABLE on the SAVE and RETRIEVE commands. To use subcommands, you have to end a command line with a semi-colon; Minitab then gives you the prompt SUBC>. End the last subcommand with a full stop. So to save a worksheet into a file called SNOEK that you could move to a different type of computer, you would type
MTB > SAVE 'SNOEK';
SUBC> PORTABLE.
To load this file into Minitab, you would type
MTB > RETRIEVE 'SNOEK';
SUBC> PORTABLE.

Missing values.

It often happens that in a large study, some items of data are not available for some respondents - perhaps because of errors, perhaps because of non-response, perhaps because of the design of the study, or perhaps because the data simply couldn't be collected. Minitab allows you to specify such "missing values" when entering data. Usually, missing values are indicated by an asterisk when data are entered or printed out. Most Minitab commands will behave sensibly if they encounter them. For example, if REGRESS, BREG or STEPWISE encounter a missing value in one of the variables they are working on, they will ignore all the data for that person.
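The effect of ignoring all the data for a person with a missing value can be sketched in Python as follows. The asterisk marker follows the convention described above; the data and the function name complete_cases are invented for the example, and this is an illustration of the idea rather than of how Minitab does it internally.

```python
MISSING = "*"


def complete_cases(rows):
    """Keep only the rows (respondents) with no missing value in any column."""
    return [row for row in rows if MISSING not in row]


# Hypothetical data: one row per respondent, columns C1, C2 and C10.
rows = [
    [4.0, 1.2, 10.5],
    [3.0, MISSING, 9.1],
    [5.0, 0.7, 12.0],
    [MISSING, 1.9, 8.4],
]

usable = complete_cases(rows)
print(len(usable), "of", len(rows), "respondents would enter the regression")
```

Note that a respondent is dropped even if only one of the variables in the regression is missing, so with many regressors the number of usable cases can shrink quickly.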

Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623