Contents: How do we choose a descriptive statistic?; Describing data with a simple regression equation; Goodness of fit in regression
Like many statistical procedures, multiple regression has two functions: to summarise some data, and to examine it for (statistically) significant trends. The first of these is part of descriptive statistics, the second of inferential statistics. Most of the time, we think more about inferential statistics, because they are usually more difficult. But descriptive statistics are more important. These notes concentrate on how regression describes a set of data.
Figure 1
Consider observation 1. Its y value is y₁. If we consider an "average" value ȳ, we define the deviation from the average as (y₁ − ȳ), the squared deviation from the average as (y₁ − ȳ)², and the sum of squared deviations over all observations as Σ(yᵢ − ȳ)². The arithmetic mean turns out to be the value of ȳ that makes this sum smallest. Note that the sum of the (unsquared) deviations from the mean is zero, in symbols Σ(yᵢ − ȳ) = 0: if you think about it, you will see that this follows from the definition of the arithmetic mean.
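To see this numerically, here is a minimal sketch in plain Python (the data values are made up for illustration, not taken from these notes):

```python
# Numerical check that the arithmetic mean minimises the
# sum of squared deviations. Data are made-up illustrative values.
y = [2.0, 3.0, 5.0, 8.0, 12.0]
mean = sum(y) / len(y)  # arithmetic mean: 6.0 here

def ss(centre):
    """Sum of squared deviations of y about a candidate 'average'."""
    return sum((yi - centre) ** 2 for yi in y)

print(ss(mean))        # 66.0  -- the smallest achievable value
print(ss(mean - 0.5))  # 67.25 -- any other centre does worse
print(ss(mean + 0.5))  # 67.25

# The unsquared deviations from the mean sum to zero:
print(sum(yi - mean for yi in y))  # 0.0
```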
Figure 2
With two variables, we can try to summarise the data with a straight
line (Figure 2). This raises two problems: what is the best straight
line, and how can we describe it when we have found it?
Let's deal first with describing a straight line. This is GCSE maths.
Any straight line can be described by an equation relating the y
values to the x values. In general, we usually write,
y = mx + c
Here m and c are constants whose values tell us which of the infinite number of possible straight lines we are looking at. m (from French monter) tells us about the slope or gradient of the line. Positive m means the line slopes upwards to the right; negative m that it slopes downwards. High m values mean a steep slope, low values a shallow one. The value of c (from French couper) tells us about the intercept, i.e. where the line cuts the y axis: positive c means that when x is zero, y has a positive value, negative c means that when x is zero, y has a negative value. But for regression purposes, it's more convenient to use different symbols. We usually write:
y = a + bx
This is just the same equation with different names for the constants:
a is the intercept, b is the gradient.
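As a quick illustration (the values a = 1 and b = 0.5 are made up, not from these notes), here is the equation in use:

```python
# Evaluate an illustrative straight line y = a + b*x.
a, b = 1.0, 0.5  # intercept and gradient (made-up values)

def line(x):
    return a + b * x

print(line(0))  # 1.0 -- at x = 0, y equals the intercept a
print(line(4))  # 3.0 -- four unit steps of gradient 0.5 above the intercept
```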
The problem of choosing the best straight line then comes down to finding
the best values of a and b. We define "best" in the same
way as we did when we explained why the mean is the best summary of a set
of data: we choose the a and b values that give us the line
such that the sum of squared deviations from the line, instead of
from the average, is minimised. This is illustrated in Figure 3. The best
line is called the regression line, and the equation describing
it is called the regression equation.
Figure 3
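Minimising the sum of squared deviations from the line has a standard closed-form solution: b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and a = ȳ − b·x̄. A minimal sketch in plain Python (the data values are made up for illustration):

```python
# Least-squares estimates of a (intercept) and b (gradient),
# minimising the sum of squared vertical deviations from the line.
# Data values are made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar  # the regression line passes through (x_bar, y_bar)

print(a, b)  # regression equation: y = a + b*x
```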
We can measure how well the line summarises the data by comparing the two sums of squares:

1 − (sum of squared deviations from the line) / (sum of squared deviations from the mean)

This is called the variance accounted for, symbolised by VAC or R². Its square root is the Pearson product-moment correlation coefficient. R² can vary from 0 (the points are completely random) to 1 (all the points lie exactly on the regression line); quite often it is reported as a percentage (e.g. 73% instead of 0.73). Two sets of data can have identical a and b values and very different R² values, or vice versa.
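Continuing the sketch above (same made-up data, same a and b), R² can be computed directly from the two sums of squares:

```python
# R^2 = 1 - SS(deviations about the line) / SS(deviations about the mean),
# using x, y, a, b and y_bar from the previous sketch.
ss_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_mean = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - ss_line / ss_mean
print(r_squared)         # close to 1: the points lie near the line
print(r_squared ** 0.5)  # Pearson correlation coefficient (for positive b)
```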