Regression V – Why does OLS make me so blue?

When we derived our formula for the coefficients of regression, we minimised the sum of the squares of the differences between our predicted values for Y and the true values of Y. This technique is called ordinary least squares. It is sometimes characterised as the best linear unbiased estimator (aka BLUE). What exactly does this mean?

First, let's think about regression again. Suppose we have m samples of a set of n predictors $X_i$ and a response Y. We can say that the predictors and the response are observable, because we have real measurements of their values. We hypothesise that there is an underlying linear relationship of the form

$$Y_j = \sum_{i=1}^{n} X_{i,j} \beta_i + \epsilon_j$$

where the ϵ represent noise or error terms. We can write this in matrix form as

$$y = X\beta + \epsilon$$

Since they are supposed to be noise, let's also assume that the ϵ terms have expectation zero, that they all have the same finite standard deviation σ, and that they are uncorrelated.
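To make the setup concrete, here is a minimal sketch in Python (with NumPy) that simulates data from this model; the sample size, the number of predictors, the "true" β and the noise level σ are all made-up values for illustration.

```python
import numpy as np

# A minimal sketch of the assumed model y = X beta + epsilon.
# The sample size m, number of predictors n, the "true" beta and the
# noise level sigma are all made-up values for illustration.
rng = np.random.default_rng(0)
m, n, sigma = 200, 3, 0.5

X = rng.normal(size=(m, n))             # observable predictors
beta = np.array([1.5, -2.0, 0.7])       # unobservable "true" coefficients
epsilon = rng.normal(0, sigma, size=m)  # noise: mean zero, sd sigma, uncorrelated draws

y = X @ beta + epsilon                  # observable response
```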

This is quite a strong set of assumptions: in particular, we are assuming that there is a real linear relationship between our predictors and response that is constant over all our samples, and that everything else is just uncorrelated random noise. (We're also going to quietly assume that our predictors are linearly independent, so that X has full rank.)

This can be a little unintuitive, because normally we think about regression as knowing our predictors and using those to estimate our response. Now we imagine that we know a set of values for our predictors and response, and we use those to estimate the underlying linear relationship between them. More concretely, the vector β is unobservable, as we cannot measure it directly, and when we "do" a regression we are estimating this value. The ordinary least squares estimate of β is given by

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Usually we don't differentiate between the estimate $\hat{\beta}$ and the true β.
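As a quick illustration, here is a hedged sketch of computing this estimate on simulated data. Numerically it is usually better to solve the normal equations (or call a least-squares routine) than to form the inverse explicitly; the data-generating choices below are again arbitrary.

```python
import numpy as np

# A sketch of the OLS estimate beta_hat = (X^T X)^{-1} X^T y on simulated data
# (the setup mirrors the illustrative simulation above).
rng = np.random.default_rng(0)
m, n, sigma = 200, 3, 0.5
X = rng.normal(size=(m, n))
beta = np.array([1.5, -2.0, 0.7])
y = X @ beta + rng.normal(0, sigma, size=m)

# Solve the normal equations rather than forming (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))  # True
```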

OLS is Linear and Unbiased

First of all, what makes an estimator linear? Well, an estimator is linear if it is a linear combination of y. Or equivalently, if it can be written as a matrix multiplied by y, which we can see is true of $\hat{\beta}$.

Now, what is bias? An estimator is unbiased when the expected value of the estimator is equal to the underlying value. We can see that this is true for our ordinary least squares estimate by using our formula for y and the linearity of expectation,

$$E(\hat{\beta}) = E\left((X^T X)^{-1} X^T y\right) = E\left((X^T X)^{-1} X^T X \beta\right) + E\left((X^T X)^{-1} X^T \epsilon\right)$$

and then remembering that the expectation of the ϵ is zero, which gives

$$E(\hat{\beta}) = E\left((X^T X)^{-1} X^T X \beta\right) = E(\beta) = \beta$$

So now we know that $\hat{\beta}$ is an unbiased linear estimator of β.
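One way to see unbiasedness in action is a small simulation: keep X fixed, redraw the noise many times, and check that the average of the estimates lands close to the true β. The sketch below does exactly that, again with made-up values for X, β and σ.

```python
import numpy as np

# A rough simulation of unbiasedness: with X held fixed and the noise redrawn
# many times, the average of beta_hat should be close to the true beta.
rng = np.random.default_rng(1)
m, n, sigma = 100, 3, 0.5
X = rng.normal(size=(m, n))
beta = np.array([1.5, -2.0, 0.7])   # made-up "true" coefficients

estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(0, sigma, size=m)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))   # close to [1.5, -2.0, 0.7]
```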

OLS is Best

How do we compare different unbiased linear estimators of β? Well, all unbiased estimators have the same expectation, so the one with the lowest variance should be best, in some sense. It is important conceptually to understand that we are thinking about the variance of an estimator of β, that is, how far it typically strays from β.

Now, β is a vector, so there is not a single number representing its variance. We have to look at the whole covariance matrix, not just a single variance term. We say that the variance of an estimator γ is lower than that of another estimator γ′ if the matrix

$$\mathrm{Var}(\gamma') - \mathrm{Var}(\gamma)$$

is positive semidefinite. Or equivalently (by the definition of positive semidefiniteness), if for all vectors c we have

$$\mathrm{Var}(c^T \gamma) \le \mathrm{Var}(c^T \gamma')$$
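In code, this comparison amounts to checking that the difference of the two covariance matrices has no negative eigenvalues. The sketch below uses two small made-up covariance matrices purely to illustrate the check.

```python
import numpy as np

# "gamma has (weakly) lower variance than gamma_prime" means that
# Var(gamma_prime) - Var(gamma) is positive semidefinite, i.e. all of its
# eigenvalues are >= 0 (up to numerical tolerance).
def is_psd(matrix, tol=1e-10):
    sym = (matrix + matrix.T) / 2           # guard against tiny asymmetries
    return bool(np.all(np.linalg.eigvalsh(sym) >= -tol))

var_gamma = np.array([[1.0, 0.2],
                      [0.2, 0.5]])                  # hypothetical Var(gamma)
var_gamma_prime = var_gamma + np.diag([0.3, 0.1])   # hypothetical Var(gamma')

print(is_psd(var_gamma_prime - var_gamma))  # True: gamma has lower variance
```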

First, let's derive the covariance matrix of our estimator $\hat{\beta}$. We have

$$\mathrm{Var}(\hat{\beta}) = \mathrm{Var}\left((X^T X)^{-1} X^T y\right) = \mathrm{Var}\left((X^T X)^{-1} X^T (X\beta + \epsilon)\right)$$

By the usual properties of variance, and because the $(X^T X)^{-1} X^T X \beta$ term is constant and so contributes no variance or covariance terms, this is equal to

$$\mathrm{Var}\left((X^T X)^{-1} X^T \epsilon\right) = \left((X^T X)^{-1} X^T\right) \mathrm{Var}(\epsilon) \left((X^T X)^{-1} X^T\right)^T$$

which equals,

$$\sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$$
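We can sanity-check this formula by simulation: the empirical covariance of $\hat{\beta}$ across many redraws of the noise should be close to $\sigma^2 (X^T X)^{-1}$. The sketch below is again an arbitrary illustration.

```python
import numpy as np

# A sketch checking Var(beta_hat) = sigma^2 (X^T X)^{-1} by simulation;
# the data-generating choices are assumptions made purely for illustration.
rng = np.random.default_rng(2)
m, n, sigma = 100, 3, 0.5
X = rng.normal(size=(m, n))
beta = np.array([1.5, -2.0, 0.7])

estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(0, sigma, size=m)))
    for _ in range(20000)
])

empirical = np.cov(estimates, rowvar=False)      # covariance across simulations
theoretical = sigma**2 * np.linalg.inv(X.T @ X)  # sigma^2 (X^T X)^{-1}
print(np.max(np.abs(empirical - theoretical)))   # small (sampling error only)
```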

Now, let's compare $\hat{\beta}$ to an arbitrary unbiased linear estimator. That is, suppose we are estimating β with some other linear combination of y, given by Cy, for some matrix C, with

$$E(Cy) = \beta$$

Now, let's look at the covariance matrix of Cy. We are going to use a trick here: first we define the matrix $D = C - (X^T X)^{-1} X^T$, and then we have

$$Cy = Dy + (X^T X)^{-1} X^T y = Dy + \hat{\beta}$$

Now, if we take the expectation of this, we have

$$E(Cy) = E(Dy + \hat{\beta}) = E(Dy) + \beta$$

and expanding that expectation, we have

$$E(Dy) = E\left(D(X\beta + \epsilon)\right) = DX\beta$$

putting this all together we have,

$$DX\beta + \beta = E(Cy) = \beta$$

So that gives us $DX\beta = 0$, and since this must hold whatever the value of β, we actually have $DX = 0$. This might not seem very helpful right now, but it will be the key fact when we look at the covariance matrix of Cy below.
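To make this concrete, here is a sketch of one particular alternative unbiased linear estimator: a weighted least squares estimator $C = (X^T W X)^{-1} X^T W$ with an arbitrary positive diagonal W. The choice of W is purely my own illustration, not something from the derivation above. Numerically, $CX = I$ (so the estimator is unbiased) and $DX = 0$, just as derived.

```python
import numpy as np

# An illustrative alternative unbiased linear estimator Cy: weighted least
# squares with an arbitrary positive diagonal weight matrix W (an assumption
# made purely for illustration), C = (X^T W X)^{-1} X^T W.
rng = np.random.default_rng(3)
m, n = 100, 3
X = rng.normal(size=(m, n))

W = np.diag(rng.uniform(0.5, 2.0, size=m))    # arbitrary positive weights
C = np.linalg.solve(X.T @ W @ X, X.T @ W)     # alternative estimator matrix
ols = np.linalg.solve(X.T @ X, X.T)           # OLS matrix (X^T X)^{-1} X^T
D = C - ols

print(np.max(np.abs(C @ X - np.eye(n))))      # ~0: CX = I, so E(Cy) = beta
print(np.max(np.abs(D @ X)))                  # ~0: DX = 0, as derived above
```

With a concrete example in hand, let's look at the covariance matrix of Cy: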

$$\mathrm{Var}(Cy) = C \, \mathrm{Var}(y) \, C^T = C \, \mathrm{Var}(X\beta + \epsilon) \, C^T$$

now, by our assumptions about ϵ, and since X and β are constants and so have no variance, this is equal to

$$\sigma^2 C C^T = \sigma^2 \left(D + (X^T X)^{-1} X^T\right)\left(D + (X^T X)^{-1} X^T\right)^T$$

distributing the transpose this is

$$\sigma^2 \left(D + (X^T X)^{-1} X^T\right)\left(D^T + X (X^T X)^{-1}\right)$$

writing this out in full we have

$$\sigma^2 \left(D D^T + D X (X^T X)^{-1} + (X^T X)^{-1} X^T D^T + (X^T X)^{-1} X^T X (X^T X)^{-1}\right)$$

using our result that $DX = 0$ (and hence $X^T D^T = 0$), and cancelling out some of the X terms, we get

$$\sigma^2 D D^T + \sigma^2 (X^T X)^{-1} = \sigma^2 D D^T + \mathrm{Var}(\hat{\beta})$$

We can rearrange the above as

$$\mathrm{Var}(Cy) - \mathrm{Var}(\hat{\beta}) = \sigma^2 D D^T$$

and a matrix of the form $D D^T$ is always positive semidefinite. So we have shown that the variance of our arbitrary unbiased linear estimator is at least as great as that of $\hat{\beta}$.
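As a final numerical sanity check, the sketch below reuses the same illustrative weighted least squares choice of C from earlier and verifies that $\mathrm{Var}(Cy) - \mathrm{Var}(\hat{\beta})$ equals $\sigma^2 D D^T$ and is positive semidefinite.

```python
import numpy as np

# Verify Var(Cy) - Var(beta_hat) = sigma^2 D D^T, and that it is positive
# semidefinite, for the illustrative weighted least squares estimator above.
rng = np.random.default_rng(4)
m, n, sigma = 100, 3, 1.0
X = rng.normal(size=(m, n))

W = np.diag(rng.uniform(0.5, 2.0, size=m))   # arbitrary positive weights (assumption)
C = np.linalg.solve(X.T @ W @ X, X.T @ W)    # alternative unbiased estimator matrix
ols = np.linalg.solve(X.T @ X, X.T)          # OLS estimator matrix
D = C - ols

var_Cy = sigma**2 * C @ C.T                  # Var(Cy) = sigma^2 C C^T
var_ols = sigma**2 * np.linalg.inv(X.T @ X)  # Var(beta_hat) = sigma^2 (X^T X)^{-1}
diff = var_Cy - var_ols

print(np.max(np.abs(diff - sigma**2 * D @ D.T)))                        # ~0: matches sigma^2 D D^T
print(bool(np.all(np.linalg.eigvalsh((diff + diff.T) / 2) >= -1e-10)))  # True: positive semidefinite
```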

So, the result of all this is that we have a pretty good theoretical justification for using OLS in regression! However, it does not mean that OLS is always the right choice. In some ways, an unbiased estimator is the correct estimator for β, but sometimes there are other things we are considering, and we are actually quite happy with bias! We will see such estimators when we look at feature selection.
