When plotting your dataset, you will often realize (or have the feeling) that there is no way that a simple, straight line can represent the data. You might think that there exists a curved relationship between the response variable and the predictor variable, in which case polynomial regression may help you.
Let’s take an example. We follow the growth of a rat (bodyweight in grams) between the 4th and the 20th week after birth. Here is the code for the dataframe:
# variable1
bodyweight <- c(65,99,123,148,172,194,212,230,248,276,288,296,307,321,325,337,345)
# variable2
week <- 4:20
# dataframe
my.dataframe <- data.frame(week, bodyweight)
Let’s start with a scatter plot of the dataset:
library(ggplot2)
ggplot(my.dataframe, aes(week, bodyweight)) +
  geom_point(color = "blue", size = 2.5)
A quick look at this plot shows that growth has not been very linear. Let’s fit a linear model anyway to see how well it does:
ggplot(my.dataframe, aes(week, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  geom_smooth(method = "lm", se = FALSE)
The linear model does not appear to be a great fit… The regression line underpredicts the data in the central part of the range while it overpredicts the data at both the start and the end of the range. You might thus be interested in running a polynomial regression to find a better curve to fit the data.
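The pattern described above can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it fits the straight line with lm() and inspects the residuals (observed minus fitted). The names fit.linear and res are chosen here for illustration.

```r
# Same data as in the example above
bodyweight <- c(65,99,123,148,172,194,212,230,248,276,288,296,307,321,325,337,345)
week <- 4:20
my.dataframe <- data.frame(week, bodyweight)

# Straight-line fit
fit.linear <- lm(bodyweight ~ week, data = my.dataframe)

# Residual = observed - fitted; a negative value means the line overpredicts
res <- residuals(fit.linear)
data.frame(week, residual = round(res, 1))

# Residuals against week: a curved pattern hints at a missing polynomial term
plot(week, res, xlab = "week", ylab = "residual")
abline(h = 0, lty = 2)
```

If the straight line were adequate, the residuals would scatter around zero without any systematic shape. Here they should be negative at both ends of the range and positive in the middle.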
lm() and poly()
The way to go is to use lm() and introduce the function poly(predictor, n) to set the order of the polynomial. Here predictor is the predictor variable, and n is the order of the polynomial. Assuming that we wish to get a second-order polynomial, we can run the following code:
lm(bodyweight ~ poly(week,2))
##
## Call:
## lm(formula = bodyweight ~ poly(week, 2))
##
## Coefficients:
##    (Intercept)  poly(week, 2)1  poly(week, 2)2
##         234.47          347.89          -63.91
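By default, poly() builds orthogonal polynomial terms, so the coefficients above are not the coefficients of week and week^2. As a sketch (the object names fit.orth and fit.raw are assumptions made for this illustration), the same model can be fitted with raw = TRUE to obtain coefficients on the original scale, and the two sets of fitted values can then be compared:

```r
# Same data as in the example above
bodyweight <- c(65,99,123,148,172,194,212,230,248,276,288,296,307,321,325,337,345)
week <- 4:20
my.dataframe <- data.frame(week, bodyweight)

# Default: orthogonal polynomial terms
fit.orth <- lm(bodyweight ~ poly(week, 2), data = my.dataframe)

# raw = TRUE: plain week and week^2 terms
fit.raw <- lm(bodyweight ~ poly(week, 2, raw = TRUE), data = my.dataframe)

# The coefficients differ between the two parameterizations...
coef(fit.orth)
coef(fit.raw)

# ...but the fitted values agree up to numerical tolerance
all.equal(unname(fitted(fit.orth)), unname(fitted(fit.raw)))
```

The orthogonal form is usually the more stable one numerically, while the raw form is easier to read when the equation of the curve has to be written out.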
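To draw the corresponding curve on top of the scatter plot, we use stat_smooth(method="lm", formula = y ~ poly(x, 2, raw = TRUE)). The formula is written in terms of x and y rather than week and bodyweight: x and y are the aesthetics that were already mapped to these variables in ggplot(aes(week, bodyweight)), so there is no need to name the variables again. Setting raw = TRUE changes how the coefficients are parameterized but not the fitted curve, so the line drawn is the same as for the model fitted with lm().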
ggplot(my.dataframe, aes(week, bodyweight)) +
  geom_point(color = "blue", size = 2.5) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), colour = "red")
The new curve (in red) fits the data much better.
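The visual impression can be backed up with numbers. One possible approach, sketched here as an addition to the original example, is to compare the R-squared of the two models and to run an F-test on the nested fits with anova(). The names fit.linear and fit.quad are chosen for illustration.

```r
# Same data as in the example above
bodyweight <- c(65,99,123,148,172,194,212,230,248,276,288,296,307,321,325,337,345)
week <- 4:20
my.dataframe <- data.frame(week, bodyweight)

# Straight-line and second-order polynomial fits
fit.linear <- lm(bodyweight ~ week, data = my.dataframe)
fit.quad   <- lm(bodyweight ~ poly(week, 2), data = my.dataframe)

# Proportion of variance explained by each model
r2.linear <- summary(fit.linear)$r.squared
r2.quad   <- summary(fit.quad)$r.squared
c(linear = r2.linear, quadratic = r2.quad)

# F-test: does the quadratic term improve on the linear model?
anova(fit.linear, fit.quad)
```

A small p-value in the anova() table would suggest that the quadratic term is worth keeping. Since adding a term can never lower R-squared, the adjusted R-squared (summary(fit)$adj.r.squared) is a fairer basis for comparison.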