# Examples on choosing the best regression model

All these examples use data held in the Singer file /singer1/eps/psybin/stats/debt.MTW. This is a Minitab worksheet containing some of the data from a large postal survey on the psychology of debt. The data in the file are, for each of 464 respondents,

• income group (1=lowest, 5=highest)
• security of housing tenure (1=rent, 2=mortgage, 3=owned outright)
• number of children in household
• is the respondent a single parent?
• age group (1=youngest)
• does the respondent have a bank account?
• does the respondent have a building society account?
• self-rating of money management skill (high values=high skill)
• how often did s/he use credit cards (1=never... 3=regularly)
• does s/he buy Christmas presents for children?
• score on a locus of control scale (high values=internal)
• score on a scale of attitudes to debt (high values=favourable to debt)

All yes/no questions are coded 0=no, 1=yes. These are real data (Lea, Webley & Walker, 1995, Journal of Economic Psychology, 16, 681-701), though the published paper also deals with many other variables. Locus of control is a personality measure introduced by Rotter, which claims to differentiate people according to how far they feel that the things that happen to them result from processes within themselves (internal locus of control) or from outside events (external locus of control).

1. Get the data into Minitab. Use INFO to find out what columns are in use; use PRINT on some of these columns to see how Minitab reports missing values; and use DESCRIBE on these columns to see what Minitab does when there are values missing in data on which it is doing calculations.
2. Store this worksheet into your own filespace. Use the command SYSTEM ls to check that you have stored the worksheet correctly (note that ls is a Unix command, so it must be in lower case).
3. Use simple t-tests to find out whether there are significant differences in debt attitudes between (a) smokers and non-smokers, and (b) those with and without bank accounts. Repeat these tests for locus of control.
4. Use BREG to find what combination of all the other variables in the list above gives the best explanation of variations in attitude to debt.
5. Use REGRESS to find out which of those variables are significantly associated with attitude, and to discover what the nature of the associations is. You may find that the R2adj value reported by REGRESS is not the same as the one you obtained from BREG; can you see why?
6. Get a printout of the full results of your best regression model.
7. Use BREG to find out what combination of variables gives the most efficient explanation of variations in attitude to debt.
8. (Optional). Use STEPWISE to answer the previous question in a different way, and see whether you get the same results as you did before. HELP STEPWISE will tell you more about how STEPWISE works.
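For readers who want to see what the t-tests in step 3 compute, here is a minimal sketch in Python rather than Minitab. It is an illustration only: the function name `two_sample_t` and the ten data values are made up for the example and do not come from the debt worksheet. It uses the unpooled (Welch) form of the test and drops any case with a missing value before calculating, which is the behaviour step 1 asks you to look out for; Minitab's own output may differ in detail.

```python
from math import sqrt

def two_sample_t(scores, groups):
    """Welch (unpooled) t statistic and approximate df for group 1 vs group 0.

    Cases whose score or group code is missing (None) are dropped first.
    Hypothetical helper written for this illustration.
    """
    samples = {0: [], 1: []}
    for score, group in zip(scores, groups):
        if score is None or group is None:
            continue  # skip cases with missing values
        samples[group].append(score)

    stats = []
    for values in (samples[0], samples[1]):
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
        stats.append((n, mean, var))

    (n0, m0, v0), (n1, m1, v1) = stats
    se2 = v0 / n0 + v1 / n1  # squared standard error of the difference
    t = (m1 - m0) / sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((v0 / n0) ** 2 / (n0 - 1) + (v1 / n1) ** 2 / (n1 - 1))
    return t, df

# Made-up attitude scores and bank-account codes (0=no, 1=yes); None = missing
attitude = [12, 15, None, 11, 14, 18, 17, None, 16, 19]
bank_acc = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

t, df = two_sample_t(attitude, bank_acc)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Note that the two cases with missing scores simply reduce the group sizes; they do not count as zeros.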

### Sample of BREG output

This sample shows how BREG would be used to look for the best model to fit the teenage gambling data used in the introductory multiple regression examples. It assumes we have already read in the data and named the columns appropriately.

```
MTB > BREG C6 C2

Best Subsets Regression of gambling

                                         p v
                                         o e
                                       s c r
                                       t m b
                                     m a o i
               Adj.                  0 t n n
Vars   R-sq    R-sq    C-p        s  1 s y l
   1   38.7    37.3   11.4   24.948      X
   1   16.6    14.8   31.0   29.094  X
   2   50.1    47.9    3.2   22.754  X   X
   2   40.3    37.6   12.0   24.904    X X
   3   52.6    49.3    3.0   22.434  X   X X
   3   50.6    47.1    4.9   22.915  X X X
   4   52.7    48.2    5.0   22.690  X X X X

```

Note the following:

• In calling BREG, the dependent variable comes first, followed by the full set of possible independent variables.
• BREG reports the R2 and R2adj values for the best and second best model for each number of regressors. The first column gives the number of variables included in each model. Why does BREG only report on one 4-regressor model?
• You can ignore the columns labelled C-p and s.
• BREG lists all the possible independent variables, by name, spelt vertically. This can be quite difficult to spot, and it is still difficult to read even when you have spotted it.
• It then puts an X in the column corresponding to each independent variable that is included in the model being reported in a given row. So in the example above, the best 1-variable model includes the regressor 'pocmoney'.
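The idea behind a best-subsets table can be sketched in a few lines of Python. This is not how Minitab implements BREG; it is a minimal illustration, assuming ordinary least squares with an intercept, and every name in it (`solve`, `r_squared`, `adjusted_r_squared`, `best_subsets`, and the predictors `x1` to `x3`) is hypothetical, as are the data values. For each number of regressors it fits every possible subset, then keeps the two with the highest R-sq, reporting the adjusted value alongside.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(y, columns):
    """R-sq (as a proportion) for a least-squares fit of y on columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjust R-sq for the number of regressors p, given n cases."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def best_subsets(y, predictors, keep=2):
    """For each model size, return the `keep` subsets with the highest R-sq."""
    n = len(y)
    table = []
    for size in range(1, len(predictors) + 1):
        fits = []
        for names in combinations(sorted(predictors), size):
            r2 = r_squared(y, [predictors[name] for name in names])
            fits.append((r2, adjusted_r_squared(r2, n, size), names))
        fits.sort(reverse=True)  # highest R-sq first
        table.extend((size, r2, adj, names) for r2, adj, names in fits[:keep])
    return table

# Made-up data, for illustration only
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [5, 3, 8, 1, 9, 2, 7, 4],
}
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8]

for size, r2, adj, names in best_subsets(y, predictors):
    print(f"{size}  {100 * r2:5.1f}  {100 * adj:5.1f}  {' '.join(names)}")
```

Each printed row corresponds to one row of the BREG table: the number of variables, R-sq and adjusted R-sq as percentages, and the names of the regressors included (where BREG prints an X under each name instead).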

Stephen Lea

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623