Linear regression summarizes a dataset by fitting a straight line through it. It is often used to describe the relationship between a continuous response variable and a continuous predictor variable. Examples are numerous; the relationship between body weight and height is one of them.
Let’s use the following dataset as an example:
# variable1: body weight (the response)
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
# variable2: body size, i.e. height (the predictor)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
# dataframe
my.dataframe <- data.frame(bodyweight, size)
As usual, everything starts with a plot. A scatter plot of the dataset is usually a good way to begin assessing the relationship between two continuous variables:
library(ggplot2)
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5)
We may try to fit a linear model with the function lm(), which we have already encountered when performing analysis of variance (ANOVA).
lm(bodyweight~size)
##
## Call:
## lm(formula = bodyweight ~ size)
##
## Coefficients:
## (Intercept)         size
##    -56.2716       0.7661
The output gives us the intercept and the slope, which is everything we need to draw the line on our original plot.
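For intuition, the two coefficients can be reproduced by hand. The sketch below assumes the usual least-squares formulas for a single predictor (slope = covariance / variance of the predictor, intercept chosen so that the line passes through the point of means); the names slope_by_hand and intercept_by_hand are only illustrative.

```r
# Same data as above, repeated so the snippet runs on its own
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

# Least-squares slope: covariance of predictor and response
# divided by the variance of the predictor
slope_by_hand <- cov(size, bodyweight) / var(size)

# The fitted line passes through (mean(size), mean(bodyweight))
intercept_by_hand <- mean(bodyweight) - slope_by_hand * mean(size)

c(intercept = intercept_by_hand, slope = slope_by_hand)

# Compare with the coefficients returned by lm()
coef(lm(bodyweight ~ size))
```

Both calls should print the same pair of numbers as the lm() output above, up to rounding.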
Using the function summary(), we get much more from the linear model. We of course get the slope and intercept in the column Estimate of the table Coefficients, along with their standard errors, t values and p-values. We also get a summary of the residuals (minimum, quartiles, median and maximum), among other useful values:
summary(lm(bodyweight~size))
##
## Call:
## lm(formula = bodyweight ~ size)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.1828  -4.5643   0.4942   3.0471  12.9374
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.2716    23.7016  -2.374 0.044953 *
## size          0.7661     0.1437   5.331 0.000702 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.282 on 8 degrees of freedom
## Multiple R-squared: 0.7803, Adjusted R-squared: 0.7529
## F-statistic: 28.42 on 1 and 8 DF, p-value: 0.000702
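The printed summary is convenient to read, but its pieces can also be pulled out as ordinary R objects. The following is a minimal sketch; the object names fit and fit_summary are assumptions introduced here for illustration.

```r
# Same data as above, repeated so the snippet runs on its own
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

fit <- lm(bodyweight ~ size)
fit_summary <- summary(fit)

# The "Residuals" line of the printout: quantiles of the residuals
quantile(residuals(fit))

# The "Coefficients" table as a numeric matrix
coef_table <- coef(fit_summary)
coef_table

# A single cell, e.g. the p-value of the slope
slope_p_value <- coef_table["size", "Pr(>|t|)"]
slope_p_value

# Residual standard error
fit_summary$sigma
```

Working with these objects avoids copying rounded numbers from the console by hand.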
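Of particular interest is the R-squared (R2) near the bottom of the output, which is the proportion of the variance in the response that the model explains: here 0.7803, or about 78%. The adjusted R-squared (0.7529) corrects this value for the number of predictors in the model. (NB: be careful when interpreting R-squared, see this blogpost for some info).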
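To make these two numbers less of a black box, here is a minimal sketch that recomputes them from the residuals, assuming the standard definitions (R2 = 1 - residual sum of squares / total sum of squares, and the usual degrees-of-freedom correction for the adjusted version). The variable names are illustrative.

```r
# Same data as above, repeated so the snippet runs on its own
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

fit <- lm(bodyweight ~ size)

# Variation left unexplained by the line
ss_res <- sum(residuals(fit)^2)
# Total variation of the response around its mean
ss_tot <- sum((bodyweight - mean(bodyweight))^2)

r_squared <- 1 - ss_res / ss_tot

# Adjusted R-squared with n observations and p predictors
n <- length(bodyweight)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

c(r_squared = r_squared, adj_r_squared = adj_r_squared)

# The same values as stored in the summary object
summary(fit)$r.squared
summary(fit)$adj.r.squared
```

With a single predictor, R2 is also the squared correlation between the two variables, so cor(size, bodyweight)^2 gives the same value.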
With the slope and intercept in hand, we may add the regression line “manually” to our previous scatter plot:
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_abline(slope = 0.7661, intercept = -56.2716)
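Typing the rounded coefficients works, but it is easy to mistype them and the line will not update if the data change. One possible alternative, sketched here, is to store the model and pass its coefficients to geom_abline(); the names fit, line_coefs and p are assumptions made for this example.

```r
library(ggplot2)

# Same data as above, repeated so the snippet runs on its own
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
my.dataframe <- data.frame(bodyweight, size)

# Fit once, then reuse the unrounded coefficients
fit <- lm(bodyweight ~ size, data = my.dataframe)
line_coefs <- coef(fit)

p <- ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_abline(slope = line_coefs[["size"]],
              intercept = line_coefs[["(Intercept)"]])
p
```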
Note that the function geom_smooth() in ggplot2 can draw the line directly on top of the scatter plot. Make sure that you use the argument method="lm":
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_smooth(method = "lm")
The gray area along the line is the confidence interval around the fitted line (95% by default). If this is irrelevant, it may be removed with the extra argument se=FALSE:
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_smooth(method = "lm", se = FALSE)
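If the numbers behind the confidence band are needed rather than the picture, predict() can return them. The sketch below is only an illustration: the three sizes in new_sizes are arbitrary values chosen for this example, and the 95% level is assumed to match the default used by geom_smooth().

```r
# Same data as above, repeated so the snippet runs on its own
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

fit <- lm(bodyweight ~ size)

# Hypothetical sizes at which to evaluate the fitted line
new_sizes <- data.frame(size = c(140, 165, 190))

# Columns: fit (value on the line), lwr and upr (confidence limits)
band <- predict(fit, newdata = new_sizes,
                interval = "confidence", level = 0.95)
band

# Width of the interval at each size
band_width <- band[, "upr"] - band[, "lwr"]
band_width
```

The interval is narrowest close to the mean of size and widens towards the extremes, which is why the gray area in the plot flares out at both ends.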