**Linear regression** helps you summarize a dataset by fitting a straight line through it. It is often used to describe the relationship between a *continuous* response variable and a *continuous* predictor variable. Examples are numerous: the relationship between bodyweight and height is one of them.

Let’s use the following dataset as an example:

```
# variable1
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
# variable2
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
# dataframe
my.dataframe <- data.frame(bodyweight, size)
```

As usual, everything starts with a plot. A scatter plot of the dataset is usually a good way to start assessing the relationship between continuous variables:

```
# ggplot2 must be loaded before calling ggplot()
library(ggplot2)
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5)
```

`lm()`

We may try to fit a linear model with the function `lm()`, which we have already encountered when performing analysis of variance (ANOVA):

`lm(bodyweight~size)`

```
##
## Call:
## lm(formula = bodyweight ~ size)
##
## Coefficients:
## (Intercept) size
## -56.2716 0.7661
```

The output gives us everything we need to draw a line in our original plot:

- the intercept is -56.2716,

- the slope is 0.7661.
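Rather than copying these numbers from the printed output, you can pull them out of the model object. The following is a minimal sketch: the object names `fit`, `intercept`, `slope` and `new.point`, as well as the value 170 used for the prediction, are chosen here for illustration and are not part of the original example.

```r
# Same dataset as above
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
my.dataframe <- data.frame(bodyweight, size)

# Store the fitted model in an object (the name `fit` is our choice)
fit <- lm(bodyweight ~ size, data = my.dataframe)

# coef() returns a named numeric vector holding the intercept and the slope
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["size"]]

# Predicted bodyweight for an illustrative new observation with size = 170
new.point <- data.frame(size = 170)
predicted <- predict(fit, newdata = new.point)

# The same prediction computed by hand from the equation of the line
by.hand <- intercept + slope * 170
```

Both `predicted` and `by.hand` should hold the same value, since `predict()` simply evaluates the fitted line at the new `size`.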

Using the function `summary()`, we get much more from the linear model. The slope and intercept are of course still there, in the column *Estimate* of the table *Coefficients*, but we also get a summary of the residuals (minimum, first quartile, median, third quartile and maximum), the standard error, t value and p-value of each coefficient, among other useful values:

`summary(lm(bodyweight~size))`

```
##
## Call:
## lm(formula = bodyweight ~ size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1828 -4.5643 0.4942 3.0471 12.9374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.2716 23.7016 -2.374 0.044953 *
## size 0.7661 0.1437 5.331 0.000702 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.282 on 8 degrees of freedom
## Multiple R-squared: 0.7803, Adjusted R-squared: 0.7529
## F-statistic: 28.42 on 1 and 8 DF, p-value: 0.000702
```

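Of interest is the (adjusted) R-squared (R²) at the bottom of the output, which describes how well the model fits the data: R-squared is the proportion of the variance in bodyweight that the model explains, and the adjusted version corrects it for the number of predictors (NB: be careful when interpreting R-squared, see this blogpost for some info).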
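The values printed by `summary()` can also be retrieved programmatically, which is handy when you want to reuse them. Here is a short sketch, assuming the same dataset; the names `fit`, `fit.summary`, `ss.res`, `ss.tot` and `r2.by.hand` are illustrative. It also recomputes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, to show where the number comes from.

```r
# Same dataset as above
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
my.dataframe <- data.frame(bodyweight, size)

fit <- lm(bodyweight ~ size, data = my.dataframe)
fit.summary <- summary(fit)

# Coefficient table as a matrix (Estimate, Std. Error, t value, Pr(>|t|))
coef.table <- fit.summary$coefficients

# R-squared and adjusted R-squared as plain numbers
r2 <- fit.summary$r.squared
adj.r2 <- fit.summary$adj.r.squared

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss.res <- sum(residuals(fit)^2)
ss.tot <- sum((bodyweight - mean(bodyweight))^2)
r2.by.hand <- 1 - ss.res / ss.tot
```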

Having these values, we may thus add this line “manually” to our previous scatter plot:

```
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_abline(slope = 0.7661, intercept = -56.2716)
```
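Typing the rounded coefficients by hand is error-prone, so a possible variant is to pass them to `geom_abline()` straight from the model. This is only a sketch: it assumes `ggplot2` is installed, and the names `fit` and `p` are chosen here for illustration.

```r
library(ggplot2)

# Same dataset as above
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
my.dataframe <- data.frame(bodyweight, size)

fit <- lm(bodyweight ~ size, data = my.dataframe)

# Take slope and intercept from the model instead of hard-coding them
p <- ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_abline(slope = coef(fit)[["size"]],
              intercept = coef(fit)[["(Intercept)"]])

# Printing the object draws the plot
p
```

If the data change, the line follows automatically, with no rounded values to update.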

Note that the function `geom_smooth()` in `ggplot2` can draw the line directly on top of the scatter plot. Make sure that you use the argument `method="lm"`:

```
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_smooth(method = "lm")
```

The gray area along the line is the confidence interval of the fitted line (95% by default). If you do not need it, you may remove it with the extra argument `se=FALSE`:

```
ggplot(my.dataframe, aes(size, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_smooth(method = "lm", se = FALSE)
```
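To see which numbers lie behind the gray band, you can compute a confidence interval for the fitted line yourself with `predict()`. The sketch below assumes the same dataset and a 95% level; the names `fit`, `grid` and `band`, and the choice of five evenly spaced `size` values, are illustrative. The result should correspond to what `geom_smooth(method="lm")` draws with its default settings.

```r
# Same dataset as above
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
my.dataframe <- data.frame(bodyweight, size)

fit <- lm(bodyweight ~ size, data = my.dataframe)

# A few evenly spaced values covering the observed range of size
grid <- data.frame(size = seq(min(size), max(size), length.out = 5))

# Matrix with columns fit (the line), lwr and upr (the confidence limits)
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

# Show the size values next to the fitted line and its limits
cbind(grid, band)
```

The interval is narrowest near the mean of `size` and widens towards the ends of the range, which is why the gray area flares out at both sides of the plot.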