A question that often comes up when working with polynomial regression and fitting a model is: “when do I stop adding degrees to the polynomial?”

Generally, it is best to keep things simple! Extra degrees should be added to the polynomial only when it makes sense, both in terms of fit to the data and in terms of biology. The model that you are building must be meaningful!

The following method is a “try and see” procedure: we start with a linear regression and then add one extra degree at a time to see whether it helps the curve fit the data. The fit is first assessed visually, by plotting the model(s) over the data. Then, every time we add a degree, we verify the significance of that added degree by running an F-test between the two models (the whole procedure is condensed into a small loop at the end of this post).

Let’s reuse the example from polynomial regression:

# response variable
bodyweight <- c(65,99,123,148,172,194,212,230,248,276,288,296,307,321,325,337,345)

# predictor variable
week <- 4:20

# dataframe
my.dataframe <- data.frame(week, bodyweight)
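
A quick look at the first rows confirms how weeks and body weights are paired:

head(my.dataframe)
##   week bodyweight
## 1    4         65
## 2    5         99
## 3    6        123
## 4    7        148
## 5    8        172
## 6    9        194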



Null model

We can start by creating a null model, in which the fitted line is restricted to the average of the dataset, in other words a constant that is independent of the predictor variable:

null.model.lm <- lm(bodyweight~1)
null.model.lm
## 
## Call:
## lm(formula = bodyweight ~ 1)
## 
## Coefficients:
## (Intercept)  
##       234.5
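
As a sanity check, this intercept is nothing but the average of the dataset:

mean(bodyweight)
## [1] 234.4706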

This value defines the horizontal line represented here:

library(ggplot2)   # make sure ggplot2 is loaded

ggplot(my.dataframe, aes(x = week, y = bodyweight)) + 
  geom_point(color = "blue", size = 2.5) +                                 # scatter plot
  geom_smooth(method='lm', formula= y~1, se=FALSE, col="green", lwd=1)     # null model



Linear regression

Now we step up to the linear regression. The following code fits the linear model:

one.degree.lm <- lm(bodyweight~week)
one.degree.lm
## 
## Call:
## lm(formula = bodyweight ~ week)
## 
## Coefficients:
## (Intercept)         week  
##       27.79        17.22
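
These coefficients translate directly into predictions. As a quick illustration (week 10 is just an arbitrary query point):

# predicted body weight at week 10: roughly 27.79 + 17.22 * 10 ≈ 200
predict(one.degree.lm, newdata = data.frame(week = 10))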

The line defined by this intercept and slope can be added to the previous figure:

ggplot(my.dataframe, aes(week, bodyweight)) + 
  geom_point(color = "blue", size = 2.5) +                                    # scatter plot
  geom_smooth(method='lm', formula= y~1, se=FALSE, col="green", lwd=1) +      # null model   
  geom_smooth(method='lm', formula= y~x, se=FALSE, col="red", lwd=1.5)        # linear model



Comparing these two models with an F-test

To compare this linear model to the first one (the null model), we can run an F-test by means of the function anova(). The syntax anova(model1, model2, test="F") will do:

anova(null.model.lm, one.degree.lm, test='F')
## Analysis of Variance Table
## 
## Model 1: bodyweight ~ 1
## Model 2: bodyweight ~ week
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     16 125272                                  
## 2     15   4246  1    121026 427.56 1.946e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output of the test tells us whether the two models are similar (which corresponds to the null hypothesis H0 of the F-test) or whether there is a significant difference between them. If the two models are similar, the extra degree brings no significant improvement; you may then step back to the previous model and possibly accept it as the best fit so far in your investigation. In our present example, the F-test rejects the null hypothesis H0 (the p-value is indeed very low).
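
If you ever need this p-value programmatically (for instance inside a loop), it can be extracted from the anova table via the Pr(>F) column shown above:

f.test <- anova(null.model.lm, one.degree.lm, test='F')
f.test$`Pr(>F)`[2]    # ≈ 1.9e-12, the value reported in the table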

Stepping up to a polynomial regression

Now we step up once more. Let’s go for a second order polynomial regression:

two.degrees.lm <- lm(bodyweight~poly(week,2))
two.degrees.lm
## 
## Call:
## lm(formula = bodyweight ~ poly(week, 2))
## 
## Coefficients:
##    (Intercept)  poly(week, 2)1  poly(week, 2)2  
##         234.47          347.89          -63.91
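
One note on this output: poly() builds orthogonal polynomials by default, so these coefficients are not the raw coefficients of week and week^2. The fitted curve is identical either way, as a quick check with raw = TRUE shows (raw.two.degrees.lm is just an illustrative name):

raw.two.degrees.lm <- lm(bodyweight ~ poly(week, 2, raw = TRUE))
all.equal(fitted(two.degrees.lm), fitted(raw.two.degrees.lm))
## [1] TRUE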

And let’s add the curve to the figure:

ggplot(my.dataframe, aes(week, bodyweight)) + 
  geom_point(color = "blue", size = 2.5) +                                           # scatter plot
  geom_smooth(method='lm', formula= y~1, se=FALSE, col="green", lwd=1) +             # null model
  geom_smooth(method='lm', formula= y~x, se=FALSE, col="red", lwd=1) +               # linear model
  geom_smooth(method='lm', formula= y~poly(x,2), se=FALSE, col="orange", lwd=1.5)    # 2nd order polynomial 



The orange curve fits the data much better. Let’s see what the F-test between the linear regression and the polynomial regression tells us.

Comparing the last two models with an F-test

anova(one.degree.lm, two.degrees.lm, test='F')
## Analysis of Variance Table
## 
## Model 1: bodyweight ~ week
## Model 2: bodyweight ~ poly(week, 2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     15 4245.9                                  
## 2     14  161.4  1    4084.5 354.23 2.446e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, the test shows a significant difference between the two models. Combined with the visually better fit of the curve and the much lower RSS (Residual Sum of Squares), this makes the second order polynomial model clearly acceptable.
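
As a cross-check, an information criterion leads to the same choice: with such a large drop in RSS for a single extra parameter, the quadratic model comes out with by far the lower (better) AIC:

AIC(one.degree.lm, two.degrees.lm)    # lower AIC is better

But what about a third order?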

Stepping up to a 3rd order polynomial

Now we step up once more.

three.degrees.lm <- lm(bodyweight~poly(week,3))
three.degrees.lm
## 
## Call:
## lm(formula = bodyweight ~ poly(week, 3))
## 
## Coefficients:
##    (Intercept)  poly(week, 3)1  poly(week, 3)2  poly(week, 3)3  
##       234.4706        347.8883        -63.9102          0.3694

Note that the third degree coefficient (0.37) is tiny compared with the first two (347.89 and -63.91); since poly() uses an orthogonal basis (see above), this already hints that the cubic term adds very little. Let’s add the curve to the figure anyway:

ggplot(my.dataframe, aes(week, bodyweight)) + 
  geom_point(color = "blue", size = 2.5) +                                             # scatter plot
  geom_smooth(method='lm', formula= y~1, se=FALSE, col="green", lwd=1) +               # null model
  geom_smooth(method='lm', formula= y~x, se=FALSE, col="red", lwd=1) +                 # linear model
  geom_smooth(method='lm', formula= y~poly(x,2), se=FALSE, col="orange", lwd=1) +      # 2nd order polynomial 
  geom_smooth(method='lm', formula= y~poly(x,3), se=FALSE, col="yellow", lwd=1.5)      # 3rd order polynomial 



This time, the yellow curve (3rd order polynomial) overlaps almost perfectly with the orange one (2nd order polynomial) that we drew earlier. This strong similarity between the last two models casts serious doubt on the usefulness of adding a third degree to the polynomial.

Comparing the last two models with an F-test

Let’s see whether the F-test reaches the same conclusion:

anova(two.degrees.lm, three.degrees.lm, test='F')
## Analysis of Variance Table
## 
## Model 1: bodyweight ~ poly(week, 2)
## Model 2: bodyweight ~ poly(week, 3)
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     14 161.43                          
## 2     13 161.29  1   0.13648 0.011 0.9181

… and indeed the RSS values for the two models are nearly equal, the F value is close to zero and the p-value is close to 1. We therefore fail to reject the null hypothesis H0 that the two models are equivalent: the third order term must be abandoned.
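
For completeness, the whole sequence of model comparisons can be obtained in a single call. Note that when several models are passed, R uses the residual mean square of the largest model as the common error term, so the F values differ somewhat from the pairwise tests above:

anova(null.model.lm, one.degree.lm, two.degrees.lm, three.degrees.lm, test='F')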

Conclusion: the best model that we tested here is the second order polynomial regression. There is no need to use a third order polynomial regression.
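
Finally, the “try and see” procedure can be condensed into a loop. This is just a minimal sketch: the 0.05 threshold and the maximum degree of 5 are arbitrary choices, and the visual inspection step is skipped.

best.fit <- lm(bodyweight ~ 1)                                # start from the null model
for (d in 1:5) {
  next.fit <- lm(bodyweight ~ poly(week, d))                  # add one degree
  p <- anova(best.fit, next.fit, test='F')$`Pr(>F)`[2]        # F-test between the two models
  if (p >= 0.05) break                                        # no significant improvement: stop
  best.fit <- next.fit                                        # otherwise keep the richer model
}
best.fit    # with our data, this ends with the 2nd order polynomial model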