What does the R-Squared value of a regression refer to?
The R-squared value of a linear regression is the percentage of variation in your response variable (Y) that is explained by your model.
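More precisely, it compares the residual variation around the fitted line to the total variation around the mean:

##R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}##

where ##\hat{y}_i## are the fitted values and ##\bar{y}## is the mean of the observed Y.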
Let us take a dataset with an explanatory variable X and a response variable Y, and fit a linear regression of the form:
##Y = aX+b##
We obtain what we see in figure 1.
We obtain an R-squared value of 0.667, which means that 66.7% of the variation in Y is explained by X. The higher the R-squared, the better your model fits your data, and the closer the fitted values are to the observed values.
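If you want to reproduce this yourself, here is a minimal sketch in Python (assuming numpy and scipy are installed); the data are the first of Anscombe's four datasets, taken from the reference below, which happens to give exactly this R-squared:

```python
import numpy as np
from scipy.stats import linregress

# Anscombe's first dataset (values from the reference below)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Fit Y = aX + b by ordinary least squares
fit = linregress(x, y)
print(f"a = {fit.slope:.3f}, b = {fit.intercept:.3f}")  # a ≈ 0.500, b ≈ 3.000
print(f"R-squared = {fit.rvalue**2:.3f}")               # ≈ 0.667
```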
As a consequence, a relatively high R-squared is desirable when you want to make predictions. What counts as a high R-squared? It depends on the field: in ecology, for example, it is rare to see an R-squared above 50%.
However, whenever you fit a model, be careful to check that its assumptions hold; for simple linear regression these include linearity, independence of the errors, constant variance, and normally distributed residuals. You cannot trust the R-squared if the assumptions are violated.
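A simple first check is to plot the residuals against the fitted values. A sketch (assuming matplotlib is available, and reusing the same illustrative data as above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Same illustrative data as above (Anscombe's first dataset)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

fit = linregress(x, y)
fitted = fit.slope * x + fit.intercept
residuals = y - fitted

# For a well-specified linear model, residuals should scatter randomly around 0
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--", color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```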
Here is a classic example, Anscombe's quartet: a linear regression fitted to each of four very different datasets gives the same R-squared (0.667):
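A sketch reproducing this with the four datasets from the reference below (the first three share the same x-values):

```python
from scipy.stats import linregress

# Anscombe's quartet (values from the reference below)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

for i, (x, y) in enumerate(datasets, start=1):
    fit = linregress(x, y)
    # Each dataset yields a ≈ 0.50, b ≈ 3.00, R² ≈ 0.67 despite very different shapes
    print(f"dataset {i}: R-squared = {fit.rvalue**2:.3f}")
```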
This means that you have to examine your data before running a linear regression. Is a linear regression the right model to use? If the answer is yes, as in figure 1, you can reasonably interpret the R-squared. However, if it is not (figure 2, for example), your R-squared will be unreliable.
Reference:
F. J. Anscombe (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17-21.