Linear regression

Linear regression summarizes the relationship between two variables by fitting a straight line through the data. It is often used to model how a continuous response variable depends on a continuous predictor variable. Examples are numerous: the relationship between body weight and height is one of them.

Let’s use the following dataset as an example:

# variable1
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)

# variable2
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

# dataframe
my.dataframe <- data.frame(bodyweight, size)

As usual, everything starts with a plot. A scatter plot is usually a good first step for assessing the relationship between two continuous variables:

# load ggplot2 for plotting
library(ggplot2)

ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5)

Fitting a linear model with `lm()`

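We can fit a linear model with the function lm(), which we have already encountered when performing analysis of variance (ANOVA). The formula bodyweight ~ size reads "bodyweight as a function of size": the response goes to the left of the tilde and the predictor to the right.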

lm(bodyweight~size)

## 
## Call:
## lm(formula = bodyweight ~ size)
## 
## Coefficients:
## (Intercept)         size  
##    -56.2716       0.7661

The output gives us everything we need to draw a line in our original plot:

the intercept is -56.2716,
the slope is 0.7661.
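
The same two numbers can be recovered from the closed-form least-squares formulas, which is a handy check on what lm() computes. This is a minimal sketch that redefines the two vectors so it runs on its own; the names fit, slope and intercept are introduced here for illustration only:

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

# slope = covariance of predictor and response / variance of predictor
slope <- cov(size, bodyweight) / var(size)

# the least-squares line passes through the point of means
intercept <- mean(bodyweight) - slope * mean(size)

# compare with the coefficients stored in the model object
fit <- lm(bodyweight ~ size)
coef(fit)
c(intercept, slope)
```

Reading the coefficients with coef() rather than copying them from the printed output avoids rounding errors in later steps.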

Using the function summary(), we get much more from the linear model. The slope and intercept appear in the column Estimate of the table Coefficients, together with their standard errors, t values and p-values. The output also gives the minimum, first quartile, median, third quartile and maximum of the residuals, among other useful values:

summary(lm(bodyweight~size))

## 
## Call:
## lm(formula = bodyweight ~ size)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1828  -4.5643   0.4942   3.0471  12.9374 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.2716    23.7016  -2.374 0.044953 *  
## size          0.7661     0.1437   5.331 0.000702 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.282 on 8 degrees of freedom
## Multiple R-squared:  0.7803, Adjusted R-squared:  0.7529 
## F-statistic: 28.42 on 1 and 8 DF,  p-value: 0.000702
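
The values printed above can also be pulled out of the summary object instead of being read off the screen. A minimal sketch, assuming the same data as before; fit, fit.summary, resid.quantiles and slope.p are names introduced here for illustration:

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

fit <- lm(bodyweight ~ size)
fit.summary <- summary(fit)

# min, quartiles and max of the residuals (the "Residuals" block)
resid.quantiles <- quantile(residuals(fit))
resid.quantiles

# the "Coefficients" table as a numeric matrix
coef(fit.summary)

# p-value for the slope, selected by row and column name
slope.p <- coef(fit.summary)["size", "Pr(>|t|)"]
slope.p
```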

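Of interest is the R-squared (R²) at the bottom of the output, which describes how well the model matches the data: Multiple R-squared is the proportion of the variance in bodyweight explained by size (here 0.78), and Adjusted R-squared corrects this value for the number of predictors in the model (NB: be careful when interpreting R-squared, see this blogpost for some info).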
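
To make the meaning of R-squared concrete, it can be recomputed from the residuals. The sketch below assumes the same data as before; rss, tss, r.squared and adj.r.squared are names introduced here, and p is the number of predictors (one in this model):

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

fit <- lm(bodyweight ~ size)

# residual sum of squares: variation left unexplained by the line
rss <- sum(residuals(fit)^2)
# total sum of squares: variation of the response around its mean
tss <- sum((bodyweight - mean(bodyweight))^2)

r.squared <- 1 - rss / tss

# adjusted R-squared penalizes for the number of predictors p
n <- length(bodyweight)
p <- 1
adj.r.squared <- 1 - (1 - r.squared) * (n - 1) / (n - p - 1)

# both values are also stored in the summary object
c(r.squared, summary(fit)$r.squared)
c(adj.r.squared, summary(fit)$adj.r.squared)
```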

Plotting the line

With these values, we can add the line “manually” to our previous scatter plot:

ggplot(my.dataframe, aes(size, bodyweight)) + 
  geom_point(color = "blue", size = 2.5) +
  geom_abline(slope=0.7661,intercept=-56.2716)
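
Typing rounded coefficients by hand is easy to get wrong, so one alternative is to read them from the model object. A minimal sketch, assuming ggplot2 is installed; the names fit and p are introduced here for illustration:

```r
library(ggplot2)

bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
my.dataframe <- data.frame(bodyweight, size)

fit <- lm(bodyweight ~ size, data = my.dataframe)

# coef(fit) is a named vector; [[ ]] extracts a single unnamed value
p <- ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_abline(slope = coef(fit)[["size"]],
              intercept = coef(fit)[["(Intercept)"]])
p
```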

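Note that the function geom_smooth() in ggplot2 can draw the line directly on top of the scatter plot. Make sure that you use the argument method="lm"; without it, geom_smooth() fits a smooth curve (loess for a small dataset like this one) rather than a straight line: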

ggplot(my.dataframe, aes(size, bodyweight)) + 
  geom_point(color = "blue", size = 2.5) +
  geom_smooth(method="lm")

The gray area along the line is the confidence interval (95% by default) around the fitted line. If it is not needed, it may be removed with the extra argument se=FALSE:

ggplot(my.dataframe, aes(size, bodyweight)) + 
  geom_point(color = "blue", size = 2.5) +
  geom_smooth(method="lm", se=FALSE)
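
The limits of such a confidence band can also be computed outside ggplot2 with predict(). The sketch below is an illustration under assumptions: the three sizes in new.sizes are arbitrary values chosen here, and fit and ci are names introduced for this example:

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

fit <- lm(bodyweight ~ size)

# arbitrary predictor values at which to evaluate the fitted line
new.sizes <- data.frame(size = c(150, 165, 180))

# fitted value plus lower and upper 95% confidence limits
ci <- predict(fit, newdata = new.sizes,
              interval = "confidence", level = 0.95)
ci
```

The result is a matrix with the columns fit, lwr and upr, one row per value in new.sizes.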
