R squared

This quantity is often misunderstood and/or misrepresented. Let’s assume we have some random variable y with mean \bar{y} and some estimator \hat{y} for it. The commonly accepted fundamental definition of R^2 is:

R^2 = 1 - \frac{\sum(\hat{y}_i - y_i)^2}{\sum(\bar{y} - y_i)^2} = 1 - \frac{SSE}{SST}

Sometimes people call this the “fraction of explained variance”. But that’s only the right way to look at it under special circumstances. All the equation above shows is that R^2 “compares” the model \hat{y} to the baseline model \bar{y}. It’s a little like comparing a random walk model to something someone clever cooked up hoping to beat it. Note immediately that there’s nothing stopping R^2 from being negative, so the square in the notation is unfortunate in that respect. If you choose a bad enough model, say the constant guess \hat{y} = \bar{y} + \alpha for any real \alpha \neq 0, then R^2 will be negative*.
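As a quick numerical check of that claim, here is a minimal sketch in plain numpy (the r_squared helper is just my own name for the definition above, nothing standard):

```python
import numpy as np

def r_squared(y, y_hat):
    # Textbook R^2: 1 - SSE/SST, with SST taken around the sample mean.
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

rng = np.random.default_rng(0)
y = rng.normal(size=100)

# The baseline guess \bar{y} scores exactly zero...
print(r_squared(y, np.full_like(y, y.mean())))
# ...and any other constant guess \bar{y} + alpha scores below zero.
print(r_squared(y, np.full_like(y, y.mean() + 0.5)))
```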

But the comparison made here is not simply between the errors of the two models. Instead it compares two variances (which harkens back to Principal Component Analysis, and I will update this blog with that discussion) and then subtracts that ratio from one. If the model for \hat{y} has the property that the errors are zero on average, then the numerator is proportional to the variance of the errors, while the denominator is proportional to the variance of y. So, if the errors are denoted by \epsilon, that ratio is just \frac{V(\epsilon)}{V(y)}, and whenever the variance of the errors is greater than the variance of y, R^2 ends up in negative territory.
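Spelling that out, with V(\cdot) denoting the sample variance (the second equality uses the zero-mean-error assumption above):

R^2 = 1 - \frac{\frac{1}{n}\sum(\hat{y}_i - y_i)^2}{\frac{1}{n}\sum(y_i - \bar{y})^2} = 1 - \frac{V(\epsilon)}{V(y)}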

As for the “fraction of explained variance” interpretation, R^2 turns out to be non-negative only when Cov(y, \hat{y}) \geq \frac{1}{2}V(\hat{y}), which follows from the simple relation below (with \epsilon = y - \hat{y}):

V(\epsilon) = V(y) + V(\hat{y}) - 2Cov(y, \hat{y})
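To make that explicit: R^2 \geq 0 exactly when V(\epsilon) \leq V(y), and substituting the identity above gives

V(y) + V(\hat{y}) - 2Cov(y, \hat{y}) \leq V(y) \iff Cov(y, \hat{y}) \geq \frac{1}{2}V(\hat{y})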

Although that all seems obvious given the simple equations above, it’s actually *not* always handled consistently in mainstream practice. For example, the statsmodels package in Python gets this mixed up: fit sm.OLS(Y, X) with no constant column in X and check out the excellent R^2 it reports (statsmodels quietly switches to an uncentered R^2 there, with \sum y_i^2 in the denominator instead of \sum (y_i - \bar{y})^2). Better yet, check out statsmodels’ VIF calculation, which makes the same mistake. I will add more discussion around that to this blog.
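To see what that looks like, here is a minimal sketch (assuming numpy and statsmodels are installed; the data are made up purely for illustration). It fits an OLS model with no constant column and compares the R^2 statsmodels reports against the textbook 1 - SSE/SST:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(loc=3.0, scale=1.0, size=n)  # regressor with a nonzero mean
y = 5.0 + rng.normal(size=n)                # y does not actually depend on x

X = x.reshape(-1, 1)                        # note: no constant column added
fit = sm.OLS(y, X).fit()

sse = np.sum((y - fit.fittedvalues) ** 2)
sst = np.sum((y - y.mean()) ** 2)

print("statsmodels R^2 :", fit.rsquared)    # large and positive despite the useless fit
print("1 - SSE/SST     :", 1 - sse / sst)   # negative: worse than just guessing the mean
```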

* To see that any constant guess other than the mean itself will result in a negative R squared, consider one such guess as \hat{y} = \bar{y} + \alpha. Then we have,

SSE = \sum (\bar{y} - y_i + \alpha)^2 = \sum \left( \bar{y}^2 + y_i^2 + \alpha^2 + 2 \bar{y}\alpha - 2 y_i \alpha - 2 \bar{y} y_i \right)

= \sum \left[ (\bar{y} - y_i)^2 + \alpha^2 - 2\alpha(y_i - \bar{y}) \right]

= SST + \delta

where \delta collects the leftover terms (a new symbol, to avoid clashing with the errors \epsilon above), and it remains to show that \delta is positive:

\delta = \sum \left[ \alpha^2 - 2\alpha(y_i - \bar{y}) \right] = n\alpha^2 - 2\alpha \sum y_i + 2n\alpha \bar{y} = n\alpha^2, \quad \text{since } \sum y_i = n\bar{y}

And therefore R^2 = 1 - SSE/SST = 1 - (SST + \delta)/SST = -\delta/SST = -n\alpha^2/SST < 0.
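A quick numeric sanity check of that closed form (plain numpy, with an illustrative \alpha of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=50)
alpha = 0.7
n = y.size

sse = np.sum((y - (y.mean() + alpha)) ** 2)  # constant guess y_bar + alpha
sst = np.sum((y - y.mean()) ** 2)

print(1 - sse / sst)          # R^2 from the definition
print(-n * alpha**2 / sst)    # the closed form derived above; the two agree
```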

Some useful discussions below

https://www.mathworks.com/help/stats/coefficient-of-determination-r-squared.html

Sec10.pdf

https://en.wikipedia.org/wiki/Coefficient_of_determination
