University of Exeter


PSY2005 Statistics and Research Methods: Quantitative data analysis component

Dummy test on Multiple regression and factor analysis: Answer sheet and marking guide

Where more than one mark is available for something, partly correct answers should be given some of the marks. Use half marks if you have got a 1-mark section partly right. The pass mark for the whole test is 40%, and it is not necessary to pass both parts separately, so a really good performance on either part of the test would enable you to pass overall.

In some questions, the marks available add up to more than the maximum allowed. This is to allow different ways of getting full credit, but you are not allowed to get more than the maximum mark for the question.

Reminder: all the data were entirely fictitious and should not be taken as giving information about real psychological phenomena.


Question    Maximum marks
A1    7
A2    8
A3    15
A4    20
Total for Section A    50

B1    6
B2    5
B3    8
B4    5
B5    15
B6    11
Total for Section B    50

Grand total    100


Section A: Minitab and multiple regression

Question A1

To make the data available for analysis:
READ 'singer1/eps/psybin/stats/tests/students' C1-C5
1 mark. Use READ because this is a text file.
To name the variables:
NAME C1 'version' C2 'f1m2' C3 'lasttrm%'
NAME C4 'faculty' C5 'mark/50'
1 mark. The naming could all be done on a single line.
To answer question (a):
TWOT C5 C1
1 mark. You must use TWOT, and get the right variables in the right order. You can then use the TWOT output to examine the difference between the two means, the value of the t statistic reported, and its significance level. If t is not significant, there is no real case for proceeding further.
Assuming that in part (a) there is a difference between the means and it is significant, to answer question (b) use:
INDICATOR C4 C11-C13
1 mark, to form dummy variables from the faculty codes.
NAME C11 'arts' C12 'socstuds' C13 'science'
1 mark, to name the dummy variables.
TALLY C4
1 mark, to examine which faculty code is modal, i.e. occurs most often.
Supposing faculty code 2 is found to be modal:
REGRESS C5 5 C1-C3 C11 C13
1 mark. Note that neither C4, the categorical variable itself, nor C12, the dummy variable for the modal category, is included in the regression model.
To see whether version has an effect when all the other variables are taken into account, we look at the regression coefficient for C1 and ask (i) is it comparable in size to the difference of means observed in (a), and (ii) is the corresponding t value significant? 2 marks
An alternative approach is to use
BREG C5 C1-C3 C11 C13
and consider whether C1 is included in the best regression model. 2 marks
Maximum available for question A1 7 marks
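
The dummy-variable step in part (b) can be illustrated outside Minitab. The following is a minimal Python sketch, not the method required in the test: the faculty codes are invented for illustration, and the mapping of codes 1, 2 and 3 to arts, social studies and science is assumed from the order of the NAME command above.

```python
from collections import Counter

# Hypothetical faculty codes (assumed: 1 = arts, 2 = social studies, 3 = science).
# These values are made up for illustration; they are not the test data.
faculty = [2, 1, 2, 3, 2, 1, 3, 2]

# Rough equivalent of TALLY: count each code and find the modal one.
counts = Counter(faculty)
modal_code = counts.most_common(1)[0][0]

# Rough equivalent of INDICATOR: one 0/1 column per faculty code.
dummies = {code: [1 if f == code else 0 for f in faculty] for code in sorted(counts)}

# Leave out the modal category so that it acts as the reference group.
predictors = {code: col for code, col in dummies.items() if code != modal_code}

print(modal_code)          # 2
print(sorted(predictors))  # [1, 3]
```

Only the dummy columns for the non-modal codes would then go into the regression, which mirrors the use of C11 and C13 but not C12 in the REGRESS command.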

Question A2

(a) The most complete prediction of photocopier use that is reasonably efficient is obtained from the model with the maximum adjusted R², which includes the variables teaching contact hours per week, psychoticism, and neuroticism. 1 mark
We do not have the data available in this printout to find the model which is most efficient while being also reasonably complete, which would require F values for each model in the table. 1 mark
(b) The appropriate Minitab command would be REGRESS C1 3 C3 C6 C8 1 mark
(c) There is a high negative correlation (-0.72) between the seniority measure and the number of copies made. Since low seniority scores mean high seniority, more senior staff tend to make more copies. 1 mark
However, the seniority variable is not included in the best regression model. Further examination of the correlation matrix suggests that this is because of a strong correlation between seniority and teaching contact hours (which is in the model), with more senior staff doing more teaching. 1 mark
The fact that it is teaching hours rather than seniority that is included in the best regression model suggests that the apparent relation between seniority and usage is due only to the extra teaching done by senior staff (though because there is such a high correlation between seniority and teaching load, there is an identification problem here and the conclusion can only be reached tentatively). 1 mark
(d) The only variables retained in the best regression model for photocopier usage are teaching hours and personality variables. It is unlikely that the department can do much about the personality of its staff. Therefore it needs to look at the relation between teaching hours and copies made, and consider whether the pattern of teaching could be changed so it was not so dependent on the production of photocopies. Perhaps it would be more efficient to rely more on textbooks and less on handouts, though the costs of buying extra copies of texts for the library would have to be taken into account. 2 marks
Maximum for question A2 8 marks

Question A3.

(a) The means are as follows:
  • Fear of death 7.2
  • Eysenck N score 6.6
  • Grandparents died 1.7
2 marks for getting all of them

Lose 1 mark for spurious precision (more than 1 decimal place reported).

The index of closer deaths, and respondent gender, are not quantitative data, and their means should not be reported. Lose 1 mark for giving their means.
(b) The regression equation is a fair fit, 1 mark
since the adjusted R² value is 66.6%. 1 mark
The regression equation accounts for a significant proportion of the variance in fear of death scores (F(4,40) = 22.98, p < 0.0005). 2 marks
The regression equation is

fear of death score = 3.9 + 0.53 * EysenckN - 0.35 * GPsdead + 3.5 * otherdeaths + 0.32 * m1f2

2 marks
With all other variables taken into account, the associations of fear of death with the Eysenck N score and with the occurrence of deaths among close relatives are significant: the t(40) values are 8.51 (p < 0.0005) and 2.82 (p < 0.01) respectively. 2 marks
Fear of death scores increase by about half a scale point for every one scale point increase in the Eysenck N score, and are about 3.5 units higher for respondents who have suffered a close bereavement than for other respondents. 2 marks
The effect of the number of grandparents who have died approaches significance (t(40) = 1.73, p < 0.10): the more grandparents have died within the respondent's lifetime, the lower the fear of death score. It might be worth pursuing this question with a larger sample. 1 mark
Examination of the Unusual Observations report suggests that there may be several outliers. Plotting fear of death scores against the Eysenck N scores suggests that the only one of these likely to be serious is observation 41. Rerunning the regression with this observation deleted slightly reduces the significance of the trends reported above, but does not change them qualitatively, so they can be accepted as reasonable. 2 marks
Good reporting style 2 marks
Maximum for question A3 15 marks
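
As an illustration of how the coefficients in A3(b) are read, the sketch below computes predicted scores from the rounded regression equation. It assumes that 'otherdeaths' is coded 0 or 1, and it takes the GPsdead coefficient as negative, in line with the statement that more grandparent deaths go with lower fear of death scores. The respondent is hypothetical.

```python
def predicted_fear(eysenck_n, gps_dead, other_deaths, m1f2):
    """Predicted fear of death score from the rounded coefficients in A3(b)."""
    return (3.9 + 0.53 * eysenck_n - 0.35 * gps_dead
            + 3.5 * other_deaths + 0.32 * m1f2)

# Hypothetical respondent: N score 7, two grandparents died, coded 1 on m1f2.
# other_deaths is assumed to be a 0/1 indicator of a close bereavement.
without_loss = predicted_fear(7, 2, 0, 1)
with_loss = predicted_fear(7, 2, 1, 1)

print(round(without_loss, 2))              # 7.23
print(round(with_loss - without_loss, 2))  # 3.5
```

The difference between the two predictions is the coefficient of 'otherdeaths', which is why it can be reported as the effect of a close bereavement with the other variables held constant.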

Question A4.

(a) The median value of 'hoard' is 173.5, which tells us that there is no unique median hamster on this variable. We can choose either hamster 5 (hoard = 167) or hamster 15 (hoard = 180) as a median animal. Since the mean of 'hoard' is higher than the median, it might be better to take the higher value, and use hamster 15. 1 mark for choosing either 5 or 15
1 extra for a good rationale for preferring one of them
For hamster 15, weight during the experiment was 105% of its pre-experimental value; the hamster came from supplier 2 and was female. She established her nest 1.31 metres from the food source. 1 mark
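
The reason there is no unique median animal is that, with an even number of observations, the median is the average of the two middle values. A short Python sketch follows; only the two middle values (167 and 180) come from the answer above, and the other hoard sizes are invented.

```python
import statistics

# Hypothetical hoard sizes; only 167 and 180 are taken from the answer above.
hoard = [120, 150, 167, 180, 210, 260]

median = statistics.median(hoard)
mean = statistics.mean(hoard)

print(median)          # 173.5
print(median in hoard) # False: no animal has exactly the median value
print(mean > median)   # True for these made-up values
```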
(b) Supplier is an unordered categorical variable. Therefore, before we can carry out regression using this variable, we need to produce dummy variables corresponding to the three suppliers. 1 mark for recognizing this
1 mark for doing it correctly
We will also have to decide which supplier to drop from the analysis. Since none of them is in any sense a control or normal group, we use TALLY or TABLE to find the mode (supplier 3) and drop that one. 1 mark
To investigate how the other variables affect hoard size, we use Best or Stepwise regression. 1 mark
Using BREG, we find that the regression model with the highest adjusted R² value (68.7%) is the three-variable model including sex, distance to nest site, and supplier 1. However, the two-variable model using only sex and distance to nest site has a better F value (22.45 as against 17.84), so if we want the most economical model we would prefer that. The best one-variable model, using distance to nest site, does not have such a good F (21.62). The wording of the question suggests using the three-variable model, for a better description. 2 marks
(for identifying either the best adjusted R² or the best F model)
The best fitting regression equation is:

hoard size = 60 - 94 * sex + 269 * distance to nest + 61 * supplier 1.

1 mark
It is a fair fit, with adjusted R² equal to 68.7% 1 mark
and accounts for a significant proportion of the variation in hoard size (F(3,20) = 17.84, p < 0.0005) 1 mark
though it must be borne in mind that its significance will be inflated since it has been selected as the best model 1 mark
With all other variables held constant, the effects of sex and distance to the nest site are significant (t(20) values of 3.29 and 6.49, p < 0.01 for sex and p < 0.0005 for distance to nest). 1 mark
though these significance levels will also be inflated 1 mark
Females hoard about 93 more pellets per day than males, and mean hoard size rises by about 27 pellets per day for each 10cm by which the nest is distant from the food source. 1 mark
Hamsters from supplier 1 hoard about 61 pellets per day more than those from the other two suppliers, and this difference approaches significance (t(20) = 1.85, p < 0.10). 1 mark
investigating the supplier variable as a whole at any stage 2 marks
good reporting style 2 marks
(c) Plotting the relation between hoarding and the distance from food to nest shows a noticeable outlier (observation 9), which is also picked out by the Unusual Observations report on the 3-variable model presented above. It would be worth repeating the entire analysis with the observation dropped. 1 mark for spotting the outlier
1 mark for carrying out a further analysis
(d) It would be worth repeating the study with a larger group from supplier 1, and checking that the same gender and distance relationships held regardless of supplier. 1 mark
The relationship between nest site distance and hoard size is much more regular for males than it is for females (you would need to use a selective COPY command to find this out), so future studies should include enough of each gender for their data to be studied separately. 2 marks
Maximum for question A4 20 marks
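
The fit statistics quoted in A3 and A4 can be cross-checked against each other, because the overall F value and its degrees of freedom determine R² and hence the adjusted R². The Python sketch below assumes an ordinary regression with an intercept; the function names are illustrative only.

```python
def r_squared_from_f(f_value, df_model, df_error):
    """Recover R-squared from the overall F statistic and its degrees of freedom."""
    return (df_model * f_value) / (df_model * f_value + df_error)

def adjusted_r_squared(r_sq, df_model, df_error):
    """Adjust R-squared for the number of predictors (n - 1 = df_model + df_error)."""
    n_minus_1 = df_model + df_error
    return 1 - (1 - r_sq) * n_minus_1 / df_error

# F values and degrees of freedom as reported in A3(b) and A4(b).
results = {}
for label, f_value, df_model, df_error in [("A3", 22.98, 4, 40), ("A4", 17.84, 3, 20)]:
    r_sq = r_squared_from_f(f_value, df_model, df_error)
    results[label] = round(100 * adjusted_r_squared(r_sq, df_model, df_error), 1)
    print(label, results[label])
# A3 66.6
# A4 68.7
```

Both values agree with the adjusted R² figures given in the answers (66.6% and 68.7%).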

Total available for Section A 50 marks

Section B: SPSS and factor analysis

B1 A factor is a hypothetical construct, or 'latent variable', which is derived from other, directly observable variables, and helps to explain the correlations between a range of different responses or behaviours. A basic assumption of Factor Analysis is that the observed correlations between observed variables result from their sharing a smaller set of underlying variables. up to 6 marks for a full answer
B2 A 'scree' test is a method of deciding how many factors are needed to capture the important dimensions in the data. It is done by looking at the plot of eigenvalues against their associated factors and looking for a sharp change, or 'elbow' in the plot that occurs when a steep drop gives way to a shallower slope, resembling the rubble that piles up at the bottom of a scree slope. up to 5 marks for a full answer
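
The scree test is normally done by eye, but the idea of an 'elbow' can be sketched numerically. The Python example below is a crude heuristic rather than a standard procedure: it looks for the point where the drop between successive eigenvalues shrinks the most. The eigenvalues are hypothetical.

```python
def elbow_position(eigenvalues):
    """Return the 1-based position at which the drop between successive
    eigenvalues shrinks the most (a crude stand-in for judging by eye)."""
    drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
    changes = [d1 - d2 for d1, d2 in zip(drops, drops[1:])]
    return changes.index(max(changes)) + 2

# Hypothetical eigenvalues, in the order the factors were extracted.
eigenvalues = [3.1, 1.8, 0.7, 0.6, 0.5, 0.3]
elbow = elbow_position(eigenvalues)

print(elbow)      # 3: the slope flattens at the third factor
print(elbow - 1)  # 2: factors lying above the elbow
```

Under the common convention of keeping the factors that lie above the elbow, these made-up values would suggest two factors, though the plot itself should always be inspected.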
B3 There are 598 men and 721 women 4 marks
and 344 people in the 'salariat' class. 2 marks
GVTTRUST: mean 3.157; minimum 1.000; maximum 5.000
GVTBENEF: mean 3.308; minimum 1.000; maximum 5.000
2 marks
B4 Yes, there is a significant difference between men and women on GETNEED, with a Pearson chi-square probability level of .00001. 5 marks
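
For reference, the Pearson chi-square statistic that SPSS reports for a crosstabulation can be computed by hand from the observed counts. The Python sketch below uses an invented 2 x 3 table (rows for men and women, columns for three response categories); it is not the GETNEED data, and it gives the statistic and degrees of freedom only, not the probability level.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df

# Hypothetical counts: first row men, second row women.
table = [[30, 20, 10],
         [10, 20, 30]]
chi_sq, df = pearson_chi_square(table)
print(chi_sq, df)  # 20.0 2
```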
B5 Variables loading at .3 and above on Factor 1 are GVTTRUST, GVTBENEF, TRIAL, EQOPP, GETNEED, and VOTERCH. On factor 2, the variables with loadings above .3 are: REWEFFRT, EQOPP, GETNEED, REWSKILL, and VOTERCH. 12 marks
You have to use your own judgement in interpreting factor 1, but if you look just at those which load highly on factor 1 and not on factor 2, it could mean something like 'Britain has a fair political and legal system'. Note that some of the variables are somewhat complex, with loadings above .3 on both factors. We would have to be cautious about using some of these as a scale. 3 marks
B6 Cronbach's alpha value = .69; 6 marks
Yes, it would be a reasonably reliable scale for a sample of this size 2 marks
In principle, it could be improved (though this would leave only a 2-item 'scale') by dropping TRIAL, and the value of Cronbach's alpha would then go up to .80 (rounded). 3 marks
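
Cronbach's alpha depends on the number of items and on how the item variances compare with the variance of the total score. The Python sketch below shows the usual formula on made-up responses; the three items and five respondents are hypothetical and are not the survey variables.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list with one score
    per respondent (all lists the same length, at least two items)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(statistics.variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / statistics.variance(totals))

# Hypothetical responses of five people to three items.
item1 = [1, 2, 3, 4, 5]
item2 = [2, 2, 3, 5, 5]
item3 = [1, 3, 3, 4, 4]

alpha_all = cronbach_alpha([item1, item2, item3])
alpha_without_item3 = cronbach_alpha([item1, item2])
print(round(alpha_all, 2), round(alpha_without_item3, 2))
```

Comparing alpha with and without an item is the same check as the 'alpha if item deleted' figure used above to decide whether dropping TRIAL would improve the scale.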
Total for Section B 50 marks

Total marks for the paper 100

Stephen Lea, Carole Burgoyne

University of Exeter
Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623

Document revised 10th January 1997