 # The idea of a regression equation

Note to users: This page, and the site of which it is part, was prepared for a course that I no longer teach, and I have therefore not been updating it systematically since 1997. Because many people still use these pages, and have linked to them, I am leaving them available. They will become progressively outdated, especially with regard to the details of computing procedures. The current version of this site can be found at http://www.ex.ac.uk/Psychology/docs/courses/2005/mr/index.html , but will not necessarily cover the same material.

Like many statistical procedures, multiple regression has two functions: to summarise some data, and to examine it for (statistically) significant trends. The first of these is part of descriptive statistics, the second of inferential statistics. Most of the time, we think more about inferential statistics, because they are usually more difficult. But descriptive statistics are more important. These notes concentrate on how regression describes a set of data.

### How do we choose a descriptive statistic?

Any number we use to summarise a set of numbers is called a descriptive statistic. Many different descriptive statistics can be calculated for a given set of numbers, and different ones are useful for different purposes. In many cases, a descriptive statistic is chosen because it is in some sense the best summary of a particular type. But what do we mean by "best"?
Consider the best known of all descriptive statistics, the arithmetic mean, which lay people call the average. Why is this the best summary of a set of numbers? There is an answer, but it isn't obvious. The mean is the value from which the numbers in the set have the minimum sum of squared deviations. For the meaning of this, see Figure 1.

Figure 1

Consider observation 1. Its y value is y1. If we consider an "average" value ȳ, we define the deviation from the average as (y1 − ȳ), the squared deviation from the average as (y1 − ȳ)², and the sum of squared deviations as Σ(yi − ȳ)². The arithmetic mean turns out to be the value of ȳ which makes this sum lowest. Note that the sum of the (unsquared) deviations from the mean is zero, in symbols Σ(yi − ȳ) = 0: if you think about it, you will see that that follows from the definition of the arithmetic mean.
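This minimising property is easy to check numerically. The sketch below, using a small made-up set of y values (not data from the course), computes the mean, verifies that the deviations sum to zero, and confirms that nearby candidate "averages" all give a larger sum of squared deviations:

```python
# Hypothetical data; any list of numbers would do.
y = [3.0, 5.0, 4.0, 7.0, 6.0]
mean = sum(y) / len(y)

def ssd(centre):
    """Sum of squared deviations of y from a candidate 'average' value."""
    return sum((yi - centre) ** 2 for yi in y)

# The (unsquared) deviations from the mean sum to zero...
assert abs(sum(yi - mean for yi in y)) < 1e-12

# ...and no other candidate value gives a smaller sum of squared deviations.
for candidate in (mean - 0.5, mean + 0.5, 4.0, 6.0):
    assert ssd(mean) < ssd(candidate)

print(mean, ssd(mean))  # → 5.0 10.0
```

Trying a denser grid of candidate values makes the same point: the sum-of-squares curve has its minimum exactly at the mean.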

### Describing data with a simple regression equation

If we look at Figure 1, it's obvious that we could summarise the data better if we could find some way of representing the fact that the observations with high y values tend to be those with high x values. Graphically, we can do this by drawing a straight line on the graph so that it passes through the cluster of points, as in Figure 2. Simple regression is a way of choosing the best straight line for this job.

Figure 2

This raises two problems: what is the best straight line, and how can we describe it when we have found it?
Let's deal first with describing a straight line. This is GCSE maths. Any straight line can be described by an equation relating the y values to the x values. In general, we usually write,

y = mx + c

Here m and c are constants whose values tell us which of the infinite number of possible straight lines we are looking at. m (from French monter) tells us about the slope or gradient of the line. Positive m means the line slopes upwards to the right; negative m that it slopes downwards. High m values mean a steep slope, low values a shallow one. The value of c (from French couper) tells us about the intercept, i.e. where the line cuts the y axis: positive c means that when x is zero, y has a positive value, negative c means that when x is zero, y has a negative value. But for regression purposes, it's more convenient to use different symbols. We usually write:

y = a + bx

This is just the same equation with different names for the constants: a is the intercept, b is the gradient.
The problem of choosing the best straight line then comes down to finding the best values of a and b. We define "best" in the same way as we did when we explained why the mean is the best summary of a set of data: we choose the a and b values that give us the line such that the sum of squared deviations from the line, instead of from the average, is minimised. This is illustrated in Figure 3. The best line is called the regression line, and the equation describing it is called the regression equation.

Figure 3
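The least-squares values of a and b have standard closed-form solutions: b is the sum of cross-products of deviations divided by the sum of squared x deviations, and a then follows from the fact that the line passes through the point (mean of x, mean of y). A minimal sketch, using made-up paired data (not data from the course):

```python
# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mx = sum(x) / n  # mean of x
my = sum(y) / n  # mean of y

# Slope: b = Σ(xi − mx)(yi − my) / Σ(xi − mx)²
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)

# Intercept: the line passes through (mx, my), so a = my − b·mx
a = my - b * mx

print(round(a, 3), round(b, 3))
```

For these particular numbers the fitted line is close to y = 0.05 + 1.99x; any statistics package (Minitab included) applies the same formulas.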

### Goodness of fit in regression

Having found the best straight line, the next question is how well it describes the data. We measure this by the fraction
```
      (sum of squared deviations from the line)
1 -  -------------------------------------------
      (sum of squared deviations from the mean)
```
This is called the variance accounted for, symbolised by VAC or R². Its square root (with the sign of the slope b attached) is the Pearson product-moment correlation coefficient. R² can vary from 0 (the points are completely random) to 1 (all the points lie exactly on the regression line); quite often it is reported as a percentage (e.g. 73% instead of 0.73). Two sets of data can have identical a and b values but very different R² values, or vice versa.
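The fraction above translates directly into code. This sketch continues with the same made-up data, and assumes intercept and slope values (a = 0.05, b = 1.99) as if they came from a least-squares fit to these numbers:

```python
# Hypothetical paired observations and an assumed fitted line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = 0.05, 1.99  # assumed least-squares intercept and slope for these data

my = sum(y) / len(y)

# Sum of squared deviations from the regression line (residuals).
ss_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Sum of squared deviations from the mean.
ss_mean = sum((yi - my) ** 2 for yi in y)

r_squared = 1 - ss_line / ss_mean
print(round(r_squared, 4))
```

Because these invented points lie almost exactly on a straight line, R² comes out very close to 1; a cloud of random points would instead give a value near 0.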
Note carefully that a, b and R² are all descriptive statistics. We have not said anything yet about significance tests. Given a set of paired x and y values, we can use Minitab to find the corresponding values of a, b and R². It will also do some significance tests for us. The way to do this is described in a separate handout, which also gives you some examples to work on. The calculations can also be done by hand, or on a pocket calculator that has statistical functions.

Stephen Lea

University of Exeter

Department of Psychology
Washington Singer Laboratories
Exeter EX4 4QG
United Kingdom
Tel +44 1392 264626
Fax +44 1392 264623

Document revised 26th August 2002