Statistics in R
Basic Statistics

Before we start

You should be familiar with basic statistical theory, basics of R, continuous and categorical data, hypothesis testing, statistical modelling, and linear models.

Introduction

In this section, we will look at how we can use simple linear regression to analyse data with a continuous numeric response and a numeric explanatory variable. Linear regression is a type of linear model.


Linear regression has two motivations. The first is called inference. This is when you want to say something about a population from a sample. The second is prediction, where we use models to predict values of the response for specified values of the explanatory variable. These predictions can be for observed values of the explanatory variable (mainly for plotting), for unobserved values within the same range as the observations, or for novel values outside the range of observations (this is more risky! - more on this later).


Example inference question:

does the height of plants increase with increasing temperatures?

Example prediction question:

how tall will a plant be if mean temperatures increase by 2°C?


Figure 1: Cartoon of plant height and temperature. Created by Emily G. Simmonds


In simple terms, we fit a straight line to:

  1. estimate a relationship between \(X\) and \(Y\)

  2. predict change in \(Y\) from change in \(X\).


Linear regression assumes a causal relationship between \(X\) and \(Y\), i.e. it assumes that \(X\) does something to \(Y\). However, we can never actually test this with a regression alone; we can only quantify the patterns. To test whether \(X\) really does have a causal effect on \(Y\), you would need experiments. It is our job when analysing data to decide whether we can assume causality or not. For some examples of when causality is not a sensible assumption click here.


Figure 2: Scatterplot of number of maths PhDs awarded in a year against the amount of uranium stored in US power plants annually. Smooth line indicates a regression line for this data with the 95% confidence interval as the shaded area.


Which questions?


Example questions you can answer with linear regression:

  • How does temperature change with time?
  • How does the amount of crime change with temperature?
  • How does relative plant biomass change with light intensity?
  • What will the temperature be in 2100?
  • How tall will someone be if they have hair 30cm long?


Type of data


Theory


A linear regression is used when you have a continuous numeric response variable and a continuous numeric explanatory variable. A simple linear regression has only one explanatory variable; a multiple linear regression has more than one. There is no upper limit in theory, but if you add too many explanatory variables, your model becomes nearly as complex as the data themselves and will be overfitted.


Examples of continuous numeric variables:

  • Temperature
  • Rainfall
  • Distance
  • Height
  • Weight


Always remember to check that the way your variables are classified in R is the same as the format you expect. It is a common mistake for a variable that should be numeric to be classified as a Factor, or a Factor as a character string etc.


You can check the way R has classified your variables using the Environment panel, printing the object by typing its name in the R console, or using the str() function.


If the classification R has given the variables doesn’t match what you expect, you can change the classification using functions such as: as.numeric(), as.factor(), or as.integer(). These as. functions exist for every data type.
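
As a minimal sketch (MyData and its Temp column are hypothetical names, not part of the examples below), checking and correcting a classification might look like this:

str(MyData)                              # check how each column is classified
MyData$Temp <- as.numeric(MyData$Temp)   # convert a character column to numeric
str(MyData)                              # confirm the change
# Note: a factor storing numbers should go via as.character() first,
# e.g. as.numeric(as.character(MyData$Temp)), otherwise you get the level codes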


In the Example Data below, you can see that x and y are both numeric variables. On the other hand, z is a Factor.

ExampleData
## # A tibble: 100 x 3
##         x     y z    
##     <dbl> <dbl> <fct>
##  1   4.77  1    1    
##  2   5.02  1.49 2    
##  3  -7.98  1.99 3    
##  4  -7.3   2.48 4    
##  5 -23.0   2.98 5    
##  6  13.2   3.47 6    
##  7  16.4   3.97 7    
##  8   5.71  4.46 8    
##  9  26.6   4.96 9    
## 10  11.2   5.45 10   
## # ... with 90 more rows


Worked example


For this worked example we will be using some data on dive depths of marine mammals.

Figure 1: Illustration of marine mammals by Emily G. Simmonds.


Introduction to the data


These data have four variables; species, maximum recorded dive depth in metres, body weight in kilograms, and taxonomic group.

The data were collected from a variety of sources and cover several taxonomic groups (including polar bear, sea otters, baleen whales, toothed whales, seals, sea lions).




You can find the data here if you want to follow along with this example. It is a .csv file with column headings.

The first thing we want to do with the data once it is imported is to look at it. We will do this by looking at the raw data and making a plot.


DiveData
## # A tibble: 30 x 4
##    Species                 MaxDepth BodySizeKG Group   
##    <fct>                      <int>      <dbl> <fct>   
##  1 Orca                         264       4500 Twhale  
##  2 WeddellSeal                  741        500 Pinniped
##  3 LongFinnedPilot              828       1300 Twhale  
##  4 ShortFinnedPilot            1019       2000 Twhale  
##  5 BlainvillesBeakedWhale      1408        900 Twhale  
##  6 NorthernBottlenoseWhale     1453       8000 Twhale  
##  7 SouthernElephantSeal        1653       3000 Pinniped
##  8 BairdsBeakedWhale           1777      12000 Twhale  
##  9 SpermWhale                  2035      27000 Twhale  
## 10 CuviersBeakedWhale          2992       2500 Twhale  
## # ... with 20 more rows
ggplot(DiveData, aes(x=BodySizeKG, y=MaxDepth, color=Group))+
  geom_point()+
  theme_minimal()

Figure 2: Scatterplot of body size against maximum dive depth for marine mammals. Colours indicate taxonomic group.


This gives some idea of what our data look like. We can see that two of our variables are factors (Species and Group). One of our variables is clearly continuous numeric (MaxDepth) and one is integer (BodySizeKG). From our knowledge of body size, we know that it is actually continuous, even though the scientists who measured it rounded the values to integers. Therefore, we will change this variable to numeric to ensure R treats it the right way in our analyses.

DiveData <- mutate(DiveData, BodySizeKG = as.numeric(BodySizeKG))

In this example, we are interested in whether body size (kg) influences maximum dive depth (m) in marine mammals. To answer this question we will need the variables: BodySizeKG and MaxDepth. We will not need Species or Group for now, but we might need them later.


Now we have familiarised ourselves with our data, know what question we want to ask, and which variables we need to use, we are ready to start an analysis.


Model details


Theory


When we create a model we want to represent mathematically how the data were generated.

When we use a regression model (this is also true for other linear models), we make an assumption that there is a linear relationship between our two variables. Mathematically, we say that we can capture the data generation process with a straight line and some error.

The line is defined by two parameters: \(\alpha\), the intercept, which is where the line crosses the y-axis, and \(\beta\), the slope of the line (steepness/gradient), which is how much \(Y\) changes for every increase of 1 unit in \(X\). We can alter the position of the line using these two parameters. The final part of the model is \(\epsilon\), the error around the line, which we estimate using a parameter \(\sigma^{2}\): the variance of the error.


We can write these model components as an equation in terms of \(Y\):

\[ Y_i = \color{orange}\alpha + \color{blue}\beta X_i + \color{red}\epsilon_i \]
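
To make the data-generation idea concrete, here is a minimal sketch that simulates data from this model in R (the parameter values are arbitrary illustrations, not estimates from any of the examples):

set.seed(42)                                    # make the simulation reproducible
alpha <- 2                                      # intercept
beta <- 0.5                                     # slope
sigma <- 1                                      # standard deviation of the error
x <- runif(100, min = 0, max = 10)              # explanatory variable
y <- alpha + beta * x + rnorm(100, mean = 0, sd = sigma)  # straight line plus error

Fitting lm(y ~ x) to data simulated this way should give estimates close to alpha and beta.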

Assumptions


There are several assumptions that we make when using a linear regression model:

  • The relationship between X and Y is linear
  • Residuals (error) are normally distributed
  • The residuals have a mean of 0
  • The variance of the residuals is equal for all fitted values (homoscedasticity)
  • There are no outliers
  • Each value of Y is independent


All of these assumptions should be met for the model to work properly. We check five of them after we have fit the model. The independence of Y should be determined during data collection. Data should be collected in a way that ensures each Y is independent of the others.


Writing the model in R


To fit the simple linear regression in R we will use the lm() function.

lm() stands for linear model (should seem familiar) and it takes several arguments:

  • formula in form: y~x
  • data: your data object


The function will fit the regression model using maximum likelihood estimation and give us the maximum likelihood estimates of \(\alpha\) and \(\beta\) as an output. It also estimates \(\sigma^{2}\), the variance of the error, but does not report it here.


To use the lm() function you first need to think about the formula argument, the y~x part. The same way as in the equation above, the letter \(Y\) always corresponds to the response variable (the thing we are trying to explain) and \(X\) to an explanatory variable (the thing we assume affects the response).


Therefore, to write the formula for the lm() function, you need to know which variable is your response and which is your explanatory. You can work this out by thinking about what question you want to ask with your model e.g.


Does temperature influence wing length of butterflies?

Figure 3: Cartoon illustration of the Colias hecla butterfly by Emily G Simmonds.

InsectDataF <- InsectData %>% filter(sex == "f") %>% group_by(year) %>%
                              summarise(AvWingLength = mean(winglength), 
                                        Temp = mean(temp)) %>%
                              ungroup()

ggplot(InsectDataF, aes(x = Temp, y = AvWingLength))+
  geom_point()+
  ylab("Mean wing length in mm")+
  xlab("Temperature for May/June (ºC)")+
  theme_minimal()

Figure 4: Scatterplot of the butterfly wing length data.


For this example we use some data on Colias hecla, the northern clouded yellow butterfly. These data were taken from the paper: High-Arctic butterflies become smaller with rising temperatures by Joseph J. Bowden, Anne Eskildsen, Rikke R. Hansen, Kent Olsen, Carolyn M. Kurle and Toke T. Høye, Biology Letters, Volume 11, Issue 10, published 1 October 2015 (doi:10.1098/rsbl.2015.0574, https://doi.org/10.1098/rsbl.2015.0574). The data are freely available here. The researchers measured the wing length of the butterflies to the nearest 0.01 mm. They measured many individuals over many years. In this example we will use data from only the females and take the mean wing length for each year. The measure of temperature is the average temperature for May and June in ºC.


Then you can see that your explanatory variable (\(X\)) = temperature; it is the variable that does the influencing. The response (\(Y\)) = wing length; it is the result.

We can then plug these variables into the lm() function in the below format using the column names in place of y and x and including our data frame name as the data argument.


InsectModel <- lm(AvWingLength ~ Temp, data = InsectDataF)

I saw an lm() written differently, what’s that about?

You can use the lm() function without the data argument. If you do this, you need to refer to your (\(X\)) and (\(Y\)) variables in the y~x formula using a $ between the data name and the column name.

We do not recommend using this approach. There are several reasons for this, but a key one is that when using the $ syntax, R sees the variable name as the whole entry InsectDataF$AvWingLength rather than as the column name AvWingLength. This makes it difficult to use the model for other things, e.g. to predict.

InsectModel_alt <- lm(InsectDataF$AvWingLength ~ InsectDataF$Temp)
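
To see why this matters, compare how predict() behaves with the two model objects (a sketch; the exact warning text may vary):

# With the recommended form, the new Temp value is used as intended
predict(InsectModel, newdata = data.frame(Temp = 0))

# With the $ form, R looks for a variable literally called "InsectDataF$Temp",
# cannot find it in newdata, and falls back to the original data
# (usually with a warning that 'newdata' had 1 row but the variables had more)
predict(InsectModel_alt, newdata = data.frame(Temp = 0))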


You can then look at the results using the function coef(). This takes the output of lm(), the model object, as its argument and extracts the maximum likelihood estimates of \(\alpha\) and \(\beta\).

coef(InsectModel)
## (Intercept)        Temp 
##  22.9337432  -0.1053717


Worked example


This worked example demonstrates how to fit a linear regression model in R using the lm() function for the dive depths example.


In this example we are asking:

Does body size influence maximum dive depth in marine mammals?


Our question is formulated to suggest a direction of causality: we assume body size has a causal effect on maximum dive depth, therefore maximum dive depth is our response (\(Y\)) and body size is our explanatory variable (\(X\)).


We can put these variables into the lm() function in the below format.

DiveModel <- lm(MaxDepth ~ BodySizeKG, data = DiveData)


Great. We have run a model and assigned it to an object name.

We can look at the maximum likelihood estimates of our model parameters (\(\alpha\) and \(\beta\)) using the function coef().

coef(DiveModel)
##   (Intercept)    BodySizeKG 
## 755.977336598  -0.005384027


We will look at interpreting these in the next part of the worked example.

Parameters


Theory


We introduced the three model parameters of a simple linear regression in the section above: \(\alpha\) = the intercept, \(\beta\) = the slope of the line (steepness/gradient), and \(\sigma^{2}\) the variance of the error.


But what do these parameters really mean?


All regression analyses are fundamentally about using a straight line, called a regression line, to represent the relationship between a response (\(Y\)) and one or more explanatory variables (\(X\)). The parameters of the model determine the placement and gradient of the straight line, as well as representing the distribution of data points around the line.


Figure 5: Illustration of a straight line and data.

We will illustrate the meaning of parameters continuing with the insect example from the previous section.

In this section we will go through each parameter we estimate in a simple linear regression and what it means in terms of the relationship between \(X\) and \(Y\).

\(\alpha\), the intercept

This first parameter gives the value of \(Y\) when \(X\) = 0; it is the point where the straight line crosses the y-axis.

In the figure below, the intercept is highlighted.

ggplot(InsectDataF, aes(x=Temp, y=AvWingLength))+
  geom_point(colour = 'grey70')+
  geom_abline(intercept = coef(InsectModel)[1], slope = coef(InsectModel)[2])+
  geom_point(aes(x=0, y=coef(InsectModel)[1]), colour = 'orange', size = 3)+
  ylab("Mean wing length in mm")+
  xlab("Temperature for May/June (ºC)")+
  ylim(21,24)+
  theme_minimal()

Figure 6: Plot of regression line and data.


If you want to find the maximum likelihood estimate of \(\alpha\) from your lm() object you can use the coef() function as we did in the previous section. The \(\alpha\) value is the one labelled (Intercept), 22.93 in the example below.

coef(InsectModel)
## (Intercept)        Temp 
##  22.9337432  -0.1053717


\(\beta\), the slope

This second parameter gives the amount of change in \(Y\) for every change of 1 in \(X\), it is the gradient of the regression line.

In the figure below, the slope is highlighted.

ggplot(InsectDataF, aes(x=Temp, y=AvWingLength))+
  geom_point(colour = 'grey70')+
  geom_abline(intercept = coef(InsectModel)[1], slope = coef(InsectModel)[2],
              colour = 'blue', size = 2)+
  ylab("Mean wing length in mm")+
  xlab("Temperature for May/June (ºC)")+
  ylim(21,24)+
  theme_minimal()

Figure 7: Plot of regression line and data.


If you want to find the maximum likelihood estimate of \(\beta\) from your lm() object you can use the coef() function as we did in the previous section. The \(\beta\) value is the one labelled with the variable name (Temp here), -0.11 in the example below.


coef(InsectModel)
## (Intercept)        Temp 
##  22.9337432  -0.1053717


Together \(\alpha\) and \(\beta\) control the position of the regression line. They are called the systematic part of the model, the bit that links \(Y\) to the covariate \(X\). You will notice that we use them to plot the line on our graph using geom_abline(). Changing the intercept shifts how high up on the y-axis the line sits and changing the slope alters the steepness (gradient). But it is important to note that these two parameters are not independent.


The regression line must always pass through the point that represents the mean of \(X\) and the mean of \(Y\). (\(\bar{X}\), \(\bar{Y}\)). Therefore, if you change the intercept, the slope must change as well to keep the line going through the (\(\bar{X}\), \(\bar{Y}\)) point. You can have a go at doing this below.



Click here to open the app in full scale in a separate tab.
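
You can also check this property directly in R; a minimal sketch using the butterfly model:

# The fitted value at the mean temperature equals the mean wing length
predict(InsectModel, newdata = data.frame(Temp = mean(InsectDataF$Temp)))
mean(InsectDataF$AvWingLength)

# Equivalently, alpha + beta * mean(X) gives mean(Y)
coef(InsectModel)[1] + coef(InsectModel)[2] * mean(InsectDataF$Temp)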


\(\sigma^{2}\), the variance of error

This is the final parameter we need to estimate for the simple regression and it is a bit different from \(\alpha\) and \(\beta\). This parameter does not relate directly to the shape or position of the regression line; instead, it captures the variance of the data points around that line. This can be called the random part of the model, the error. In the case of a simple linear regression, the error is normally distributed.


As you can see in the figures above, although we have plotted our fitted line, it does not capture all of the variation in the data. We can see that not all of the data points sit on the regression line.


The value of \(Y\) for each \(X\) at the regression line is called a fitted value. The distance between the fitted values and the values of \(Y\) that were actually observed are called residuals. You can extract the residuals from your model object using the function residuals().


InsectPredictions <- predict(InsectModel, type="response")
InsectResiduals <- residuals(InsectModel)
highlight <- tibble(x=c(-0.509, -0.509), prediction = c(22.98737, 23.69016))

ggplot(InsectDataF, aes(x=Temp, y=AvWingLength))+
  geom_point(colour = 'grey70')+
  geom_line(aes(y=InsectPredictions))+
  geom_segment(aes(xend=Temp, yend=InsectPredictions), colour = 'red')+
  geom_point(aes(x=-0.509, y= 22.98737), colour = 'purple')+
  geom_point(aes(x=-0.509, y= 23.69016), colour = 'purple')+
  geom_line(data=highlight, aes(x=x, y=prediction), colour='purple')+
  ylab("Mean wing length in mm")+
  xlab("Temperature for May/June (ºC)")+
  ylim(21,24)+
  theme_minimal()

Figure 8: Plot of regression line with residuals highlighted.


The residual is the difference between the observed value and the predicted/fitted value. Therefore, when a data point is below the regression line, the residual will be negative; above the line, the residual will be positive. For example, at \(X\) = -0.51 the fitted value is 22.99 and the observed value is 23.69, so the residual = 0.70 (highlighted in purple on the graph above).
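
You can confirm this definition in R; a short sketch comparing the extracted residuals with observed minus fitted values:

# These two sets of values should match (up to rounding)
head(residuals(InsectModel))
head(InsectDataF$AvWingLength - fitted(InsectModel))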


Interpreting the parameters

Now that we know what each of the three parameters in a simple linear regression means, we can think about interpreting them.

Which of the three parameters do you think is most important for answering our research question “Does temperature influence wing length of butterflies?”?


I had a go, now show me the answer.

The slope (\(\beta\)). This is because the gradient or slope of the regression line is what tells us how strong and in what direction the relationship between \(X\) and \(Y\) is. While we need to estimate the intercept of the line and the residual variance, these do not directly tell us about the relationship of interest.

To predict, we need all of the parameters.


Worked example


In the previous section of this worked example, we fit a simple linear regression using the lm() function and looked at the estimates of some parameters using the coef() function. In this section, we will use model theory to interpret what those parameters mean.


The intercept and slope

We already know that the parameters of the intercept and slope control the position and gradient of the regression line. It is the estimates of these two parameters that we get from the coef() function.


For our dive depths model the estimates are:

coef(DiveModel)
##   (Intercept)    BodySizeKG 
## 755.977336598  -0.005384027


The intercept is 756 m and the slope of the relationship between body size and dive depth is -0.005.


In this case, the intercept is not that interesting. It tells us the expected value of \(Y\) (maximum dive depth) when \(X\) (body size) = 0. It does not make a lot of biological sense to know the expected dive depth of a marine mammal that weighs 0 kg.


The slope on the other hand is interesting. It tells us the direction and strength of the relationship between body size and maximum dive depth. In this case our model estimates that for every increase of 1 kg in body size, marine mammals will have a maximum dive depth that is 0.005 m less deep. In other words, as the slope is negative, we can see this is a negative relationship and as \(X\) increases \(Y\) decreases.
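
Because a 1 kg change is tiny on the scale of marine mammal body sizes, it can help to rescale the slope; a quick sketch (the 10000 kg step is just an illustrative choice):

# Expected change in maximum dive depth for a 10000 kg increase in body size
coef(DiveModel)["BodySizeKG"] * 10000
# roughly -54 m, based on the estimate above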


Residual variance

The coef() function can give us the maximum likelihood estimates of the intercept and slope parameters, but it does not give any information on the final parameter, \(\sigma^{2}\). To get an estimate of this parameter, we need to use the residuals() function to extract the residuals from our model and the var() function to calculate the variance of those residuals.


DiveResiduals <- residuals(DiveModel)
sigma2 <- var(DiveResiduals)
sigma2
## [1] 508212.3


For this model the \(\sigma^{2}\) = 508212. This number is not very interesting to us in terms of answering whether body size influences dive depth. But we will use it for prediction later.
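
A closely related estimate is the residual standard error reported by summary(); squaring it gives a variance estimate that divides by the residual degrees of freedom (n - 2) rather than the n - 1 used by var(), so it will be slightly larger here:

# Residual standard error from the model summary, squared to give a variance
summary(DiveModel)$sigma^2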


Plotting the results

As well as looking at the maximum likelihood estimates of the parameters from the simple linear regression, we can also plot the results.


To do this, we will use ggplot() with geom_line(). We will also need to use a new function called predict().


To make this first plot, we only need to use two arguments:

  • object = your model object
  • type = “response”, which means predict on the response scale


DepthPredictions <- predict(DiveModel, type="response")


Once we have created predictions of \(Y\) from the model object, we can then plot these using geom_line() as in the code below.

ggplot(DiveData, aes(x=BodySizeKG, y=MaxDepth))+
  geom_point(colour = 'grey70')+
  geom_line(aes(y=DepthPredictions))+
  ylab("Maximum Dive Depth (m)")+
  xlab("Body Size (kg)")+
  theme_minimal()

Figure 4: Scatter plot of dive depths against body size. Line is the regression line from a linear model.


In the next section we will look at how to add uncertainty to these plots and our interpretation.


Quantify uncertainty


Theory


You should already know that statistics does not give a single correct answer. When we estimate the values of parameters in our statistical model, there are many different values that could plausibly have produced our observed data. Some of these values are more likely than others, but many have very similar likelihoods.


A simple linear regression is no different. We still need to consider and present the uncertainty in the parameters we estimate.


The lm() function uses maximum likelihood estimation for parameter estimation. Therefore, our consideration of uncertainty for these models follows the same principles as discussed here. We will quantify uncertainty using standard errors, confidence intervals, and prediction intervals which should be familiar to you but head to the uncertainty pages if you need a recap.


For any regression there are two different types of uncertainty we will look at: uncertainty in our line parameters \(\alpha\) and \(\beta\), and uncertainty in a prediction of \(Y\).


Uncertainty in the estimates of \(\alpha\) and \(\beta\)

Standard error

The standard error of a parameter is the standard deviation of its sampling distribution. It gives a measure of the spread of the sampling distribution i.e. our uncertainty. To find the standard errors for the estimates of \(\alpha\) and \(\beta\) we can use the summary() function. The argument that summary() takes is a model object, the output from lm(). This function gives a big table with lots of information. The first line shows the model formula used for the model object. The second line shows a summary of the residuals of the model and the standard errors are shown as the second column in the third part, Coefficients:.

summary(InsectModel)
## 
## Call:
## lm(formula = AvWingLength ~ Temp, data = InsectDataF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54745 -0.19618 -0.09290  0.07555  0.70279 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.9337     0.1396 164.227   <2e-16 ***
## Temp         -0.1054     0.0945  -1.115    0.282    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3417 on 15 degrees of freedom
## Multiple R-squared:  0.07655,    Adjusted R-squared:  0.01498 
## F-statistic: 1.243 on 1 and 15 DF,  p-value: 0.2824


If we take the summary() of the InsectModel, we can see the standard error of the intercept (\(\alpha\)) is 0.1396, and the standard error of the slope (\(\beta\)) is 0.0945.
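
If you want to extract the standard errors directly rather than reading them off the printed table, they are stored in the coefficient matrix returned by summary(); a minimal sketch:

# The "Std. Error" column of the coefficient table holds the standard errors
summary(InsectModel)$coefficients[, "Std. Error"]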

For both parameters, the standard error is smaller than the estimated effect. But, how small is small enough?


Confidence intervals

For interpretation of the uncertainty, it can be easier to use the standard error to calculate confidence intervals. Confidence intervals indicate the range of plausible values for a parameter. They represent an interval such that, if you were to collect a sample, run the analysis, and repeat this many, many times, each time drawing a confidence interval, then on average 95% of those intervals would contain the true population value of the parameter.


To calculate a confidence interval from a standard error you need to use the formulas:

\[ \begin{aligned} UpperCI &= estimate + (1.96 \times SE) \\ LowerCI &= estimate - (1.96 \times SE) \end{aligned} \]

1.96 is used because in a standard normal distribution 95% of the distribution lies within 1.96 standard deviations of the mean. In this case the distribution is the sampling distribution, which is normal for \(\alpha\) and \(\beta\), and the standard deviation is the standard error.


Using the formulas above, we can get confidence intervals for the intercept and slope from the InsectModel.


I had a go at calculating, what is the correct answer?

# INTERCEPT

# Upper confidence interval
UpperCI_intercept <- 22.9337 + (1.96*0.1396)
# Lower confidence interval
LowerCI_intercept <- 22.9337 - (1.96*0.1396)

# Print the interval
c(UpperCI_intercept, LowerCI_intercept) 
## [1] 23.20732 22.66008
# SLOPE

# Upper confidence interval
UpperCI_slope <- -0.1054 + (1.96*0.0945)
# Lower confidence interval
LowerCI_slope <- -0.1054 - (1.96*0.0945)

# Print the interval
c(UpperCI_slope, LowerCI_slope) 
## [1]  0.07982 -0.29062


It is also possible to get R to calculate the confidence intervals for you. To do this you can use the confint() function. The argument is a model object.

confint(InsectModel)
##                 2.5 %      97.5 %
## (Intercept) 22.636094 23.23139219
## Temp        -0.306787  0.09604363


Hopefully these confidence intervals look very similar to those you calculated yourself (confint() uses the exact t-distribution quantile rather than 1.96, so the values differ slightly).


Plotting uncertainty

We have seen how to quantify the uncertainty in our line parameters as numbers. But often in science, it is clearer to show these things visually.

We can add the confidence intervals to Figure 8 showing our regression line.

To do this, we need to generate new predictions with the predict() function. This time, we need to add the argument interval = "confidence". The interval argument tells R to include predictions of \(Y\) based on the upper and lower confidence intervals of the estimates of \(\alpha\) and \(\beta\). We can then add these to the plot using geom_ribbon().

InsectPredictions <- predict(InsectModel, type="response", interval = "confidence")

# turn the predictions into a tibble
# and add the x values

InsectPredictions <- as_tibble(InsectPredictions) %>%
                      mutate(x=InsectDataF$Temp)

ggplot()+
  geom_ribbon(data=InsectPredictions, aes(x=x, ymin=lwr, ymax=upr), fill='grey50')+
  geom_point(data=InsectDataF, aes(x=Temp, y=AvWingLength), colour = 'grey70')+
  geom_abline(intercept = coef(InsectModel)[1], slope = coef(InsectModel)[2],
              colour = 'blue', size = 1)+
  ylab("Mean wing length in mm")+
  xlab("Temperature for May/June (ºC)")+
  ylim(21,24)+
  xlim(-2.9, 0.4)+
  theme_minimal()

Figure 9: Plot of regression line and confidence interval for the insect data.

You will notice that the confidence interval is narrower in the middle and wider at the ends. This is partly to do with the constraint that the regression line must go through the (\(\bar{X}\), \(\bar{Y}\)) point so the intercept and slope are not independent. Therefore, the confidence interval will be narrowest close to that point.


Uncertainty in a prediction of \(Y\)

So far, we have looked at uncertainty in parameter estimates using standard errors and confidence intervals but linear regression analyses can also be used for prediction.


When we use a regression for prediction, the prediction has uncertainty. Uncertainty in the relationship is captured by confidence intervals; prediction uncertainty is captured by prediction intervals. A prediction interval gives a range of values that can be considered likely to contain a future observation.


For prediction we need to convey uncertainty in the estimated relationship and in the scatter around the line (the error). So, prediction error includes the variance of the residuals too (\(\sigma^2\) is finally useful!). Therefore, prediction intervals are always wider than confidence intervals.


While confidence intervals are used to represent uncertainty in parameters and to relate them to an underlying population, prediction intervals are used to represent uncertainty in predictions. A prediction interval gives a plausible range for the next observation of the response (i.e. for a new value of \(Y\)). It is a type of confidence interval, but in regression it will be wider than the confidence interval of the line (which only includes uncertainty in our parameters \(\alpha\) and \(\beta\)). A 95% prediction interval tells you that if you were to collect a sample, run the analysis, then go out and collect a new observation of the response variable (\(Y\)) at a particular value of the explanatory variable (\(X\)), repeat this many, many times, and each time draw a prediction interval, then 95% of the time the new observation would fall within the prediction interval.


To find the prediction interval for a prediction you need to change the interval argument in the predict() function to = "prediction". The example below predicts the wing length for a temperature of 10ºC. Note that 10ºC is well outside the range of temperatures in these data, so this is the riskier kind of extrapolation mentioned in the introduction.

predict(InsectModel, newdata=data.frame(Temp=10), 
                             type="response", interval = "prediction")
##        fit      lwr      upr
## 1 21.88003 19.50497 24.25508


The predicted wing length is 21.9mm with a prediction interval of 19.5mm to 24.3mm.


Worked example


At the end of the last section, we created a plot of our dive depth data and the estimated linear regression line. Now, we will add uncertainty to that plot.


First, we should look at the confidence intervals of our parameter estimates.

round(confint(DiveModel),2)
##              2.5 %  97.5 %
## (Intercept) 449.03 1062.93
## BodySizeKG   -0.02    0.01


The confidence intervals have been rounded to 2 decimal places to make them easier to read. The intercept interval spans from approximately 449 to 1063. The slope interval crosses 0, going from -0.02 to +0.01.


To add these intervals to the plot, we need to make new predictions including the interval.

DepthPredictions <- as_tibble(predict(DiveModel, type="response", interval="confidence"))


Once we have created predictions of \(Y\) from the model object, we can plot these using geom_line() and geom_ribbon() as in the code below.

DepthPredictions <- DepthPredictions %>%
                      mutate(x=DiveData$BodySizeKG)

ggplot()+
  geom_ribbon(data=DepthPredictions, aes(x=x, ymin=lwr, ymax=upr), fill='grey50')+
  geom_point(data=DiveData, aes(x=BodySizeKG, y=MaxDepth),colour = 'grey70')+
  geom_line(data=DepthPredictions, aes(y=fit, x=x))+
  ylab("Maximum Dive Depth (m)")+
  xlab("Body Size (kg)")+
  theme_minimal()

Figure 5: Scatter plot of dive depths against body size. Line is the regression line from a linear model. Shaded area is 95% confidence interval


The plot shows that the uncertainty in the estimated relationship increases at higher body sizes.


Predicting dive depths for a body size of 75000kg

A colleague has just found a new species of whale (fictional). The whale washed up on shore in Tromsø and weighed 75000 kg. Based on our linear regression analysis, how deep would we expect it to dive?

predict(DiveModel, newdata=data.frame(BodySizeKG=75000), 
                             type="response", interval = "prediction")
##        fit       lwr      upr
## 1 352.1753 -1328.376 2032.727


The mean prediction is 352 m deep. This seems ok. But when we look at the prediction interval, we see that once we include uncertainty, we are not even sure whether the whale will dive 2 km below the surface or jump 1.3 km into the air. It is clear that, based on the current data and model, we cannot say anything about the possible dive depth of the new whale. We even get biologically unrealistic predictions.


This is something we will look at in the next section.


Model checking

We now have a model, estimates of parameters, and uncertainty. But how do we know if this model is any good?


Theory


To find out if our model is any good from a theoretical perspective, we need to consider whether it meets the assumptions of the linear regression that are stated in the Model details section above.


To do this, we use graphs called diagnostic plots, which can be easily made in R. Each plot tests whether a different assumption has been met. There are four key diagnostic plots that we use for simple linear regression. To make the diagnostic plots, we use the plot() function with our linear model object as the first argument and which = number as the second. The number should be replaced by the number corresponding to the plot you want.
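
You can also display several diagnostic plots in one window before examining them individually; a minimal sketch for the butterfly model:

par(mfrow = c(2, 2))              # arrange plots in a 2 x 2 grid
plot(InsectModel, which = 1:4)    # residuals vs fitted, normal QQ, scale-location, Cook's distance
par(mfrow = c(1, 1))              # reset the plotting layout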


1. Residuals vs fitted plot

This plot tests the assumptions of The relationship between X and Y is linear, The residuals have a mean of 0, and The variance of the residuals is equal for all fitted values (homoscedasticity).


There are two examples of residuals vs fitted plots shown below. If our assumptions are met, we expect a few characteristics of the plot:

  • The red line should be horizontally straight and at 0. This line shows the mean residual value for each fitted value. We expect the mean to be 0. If it is straight, this also suggests that the relationship is linear.
  • There should be no structure in the residuals. The residuals should look like a random cloud of points and the variance (distance above and below the line) should be constant. If points are, for example, close to the line at low fitted values but far from the line at higher fitted values, this would suggest the variance is not equal.
  • The residuals should not curve. Curvature will be shown by the residuals themselves and by the red line because if there is curvature, we would expect the mean residual values to be pulled away from 0 as more observations fall above or below the fitted line.


Figure 10: Example of a good (left) and a bad (right) residuals vs fitted plot.


Figure 10 (left) is an example of a good residuals vs fitted plot, where the assumptions of a simple linear regression are met because there is no structure in the residuals (they look random), the variance is approximately equal across the fitted values, and the residuals are centered on 0 (suggesting linearity).

Figure 10 (right) on the other hand is not good. Here we can see clear curvature, suggesting that the assumption of linearity and of residuals having a mean of 0 are both violated. If you look closely you can also see the variance is lower for lower fitted values and increases as you go from left to right (the residuals have more spread to the right of the plot). This means the equal variance assumption, is also violated.

In reality, it will not be as clear cut as the examples above. You will need to use your own judgement to decide if the assumption is sufficiently met. Do not expect perfection.


Now we can look at the residuals vs fitted plot for the butterfly model.

plot(InsectModel, which = 1)

Figure 11: Residuals vs fitted plot for the butterfly wing length model.

What do you think of Figure 11? Are the assumptions sufficiently met?


Check your answer

This is less straightforward than the examples above.

Firstly, there are fewer points, so it is harder to see a pattern. However, to me, it looks ok. The red line is not perfectly horizontal, but neither is there a clear curve and it is placed at 0. The residuals are evenly distributed above and below 0 and the variance seems pretty constant.

I would be ok with this, even though it is not perfect.


What if the assumptions aren’t met?

If the residuals vs fitted plot suggests these assumptions are not met, there are a few things we can consider to improve our model.


Transform the response

The first option is to transform the response variable. The most common transformations are to take the natural log (log() in R) or the square root (sqrt() in R). These work best for fixing violations of the linearity and equal variance assumptions.
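
As a hypothetical sketch (this refitted model is not part of the worked example), a log transformation of the response would look like this, after which the diagnostic plots should be checked again:

# Refit with the natural log of the response, then re-check the diagnostics
InsectModelLog <- lm(log(AvWingLength) ~ Temp, data = InsectDataF)
plot(InsectModelLog, which = 1)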


Normal QQ plot

This plot tests the assumption that the residuals are normally distributed.


There are two examples of normal QQ plots shown below. If our assumptions are met, we expect:

  • The points lie along the line. The line represents a perfect theoretical normal distribution, the points come from our model.


Figure 12: Example of a good (left) and a bad (right) normal QQ diagnostic plot.


Notice that the y-axes have different values.

Figure 12 (left) is an example of a good normal QQ plot. Here you can see that although the points do not fall perfectly on the line, they are very close. The only deviations are at the extremes, which is most common.

Figure 12 (right) on the other hand is not good. In this case there is much more of a deviation in the upper right of the plot. The lower part also shows a deviation in the same direction. This suggests that the assumption of the residuals following a normal distribution is not met sufficiently.


Now we can look at the normal QQ plot for the butterfly model.

plot(InsectModel, which = 2)

Figure 13: normal QQ plot for the butterfly wing length model.

What do you think of Figure 13? Are the assumptions sufficiently met?


Check your answer

The lower part of the plot looks ok. But for theoretical quantiles greater than 1, there is more deviation between the standardised residuals of our model and the theoretical normal distribution. This suggests some deviation from normality.

Something to consider is that only 4 points deviate from the theoretical normal distribution by a noticeable amount and only 2 are very far away. You might also notice that two of the points furthest from the expectation have numbers next to them (4 and 17). By putting numbers next to them, R is labelling these points as possible outliers. The numbers refer to the rows in the dataframe that these points come from. We will look into these more in the next two plots.

In general, I would be a bit concerned by this plot and want to look a bit further at the others. However, it would not worry me enough to say the assumption definitely isn’t met.


What if the assumptions aren’t met?

If the assumptions aren’t met there are a few things we can check:

  • Do we have enough data? Sample sizes < 30 (like we have in the butterfly example) are more likely to not fit a normal distribution. It is generally better to have more data.
  • Are there any outliers? We will look into this more in the next plots, but outliers could make it seem like the assumption is violated when the rest of the data fit.


Scale-location

This plot tests the assumption of The variance of the residuals is equal for all fitted values (homoscedasticity).


There are two examples of scale-location plots shown below. If our assumptions are met, we expect:

  • The red line to be horizontal, showing that there is no trend in the residuals (similar to the residuals vs fitted plot).
  • No structure in the points; the points should be spread evenly across the plot.


Figure 14: Example of a good (left) and a bad (right) scale-location plot.


Figure 14 (left) is an example of a good scale-location plot. The red line is not perfectly horizontal, but it is quite close and the points are randomly spread.

Figure 14 (right) on the other hand is not good. There is clearly a lot of structure in the points and the variance is not equal across the plot.


Now we can look at the scale-location plot for the butterfly model.

plot(InsectModel, which = 3)

Figure 15: the scale-location plot for the butterfly wing length model.

What do you think of Figure 15?


Check your answer

As is often the case, this plot from real data is not as neat as the example.

In Figure 15 the red line is not completely horizontal but has a slight negative trend. This is driven mostly by the final point, at a fitted value > 23.20. There is only one point at this value, which pulls the line down. Over the whole plot the red line decreases by about 0.5, which is probably ok. A larger change, i.e. of 1 or more, would be a reason for concern.

The points also appear quite randomly spread. Given the small amount of data, I think this is ok.


Cook’s Distance

This plot tests the assumption of no outliers. It lets you identify influential points.


There are two examples of Cook’s Distance plots shown below. If our assumptions are met, we expect:

  • Low values of Cook’s Distance (y-axis) and no points standing out on their own.


Figure 16: Example of a good (left) and a bad (right) Cook’s Distance plot.


Notice that the y-axes have different values.

Figure 16 (left) is an example of a good Cook’s Distance plot. Here you can see that although some of the Cook’s Distances are higher than others, the overall values are low (<0.13) and many points have a similar value. There are none that stand out.

Figure 16 (right) on the other hand is not good. Now the Cook’s Distance values go as high as 1.4 and there is a clear difference between point 99 and all others. Point 99 has a big impact (high influence) on the fitted regression line.


Now we can look at the Cook’s Distance plot for the butterfly model.

plot(InsectModel, which = 4)

Figure 17: Cook’s Distance plot for the butterfly wing length model.

What do you think of Figure 17?


Check your answer

Figure 17 is midway between the two examples in Figure 16. The Cook’s Distance values are as high as 0.6 and point 14 appears quite different to all others, it has more influence. But we already know this to some extent from the normal QQ plot.

To decide if point 14 is an outlier, we would need to go back to the data and check for typos, mistakes, or anything that could be wrong with point 14. As these data are from a published paper, we can assume this has already been done. Therefore, although point 14 is a bit different to the rest, it is still data so should remain in the analysis.


Worked example


Using the theory covered in the previous section, we can now check our DiveModel to ensure that it meets the assumptions of a simple linear regression.


We will use one plot at a time to test specific assumptions.


Residuals vs fitted

plot(DiveModel, which = 1)

Figure 6: Residuals vs fitted plot for dive depths model.


Figure 6 looks quite unusual. There are a few assumptions we are checking with this plot:

  • Is the relationship between X and Y linear? It does seem to be. While the red line is not straight, it does not show a clear pattern until the very end. At the very end, there is a strong negative trend. We might need to come back to this! It seems like there might be one pattern for shallow diving species and another for the deep divers.
  • Do the residuals have a mean of 0? There is a deviation at fitted values of > 700, otherwise yes, the mean of the residuals is approx. 0.
  • Is the variance of the residuals equal for all fitted values (homoscedasticity)? It is not the nice cloud of random points that we expect. But is it a problem? To answer this, we need to look a bit closer. The strange shape comes from the residuals at low fitted values; these are the shallow diving marine mammals. You might notice that there are very few of these values. The majority of the data have fitted values > 700. Where we have more data, the residuals vs fitted plot looks better.

At this point, it might be hard to say how problematic the low variance caused by the lack of data for low fitted values is. But, we can have a look at the points causing the pattern (those with a maximum dive depth < 600m).

filter(DiveData, MaxDepth < 600)
## # A tibble: 19 x 4
##    Species            MaxDepth BodySizeKG Group   
##    <fct>                 <int>      <dbl> <fct>   
##  1 Orca                    264       4500 Twhale  
##  2 HomoSapien              112         85 Other   
##  3 BlueWhale               172     100000 Bwhale  
##  4 FinWhale                128      60000 Bwhale  
##  5 HumpbackWhale           240      36000 Bwhale  
##  6 SouthernRightWhale      180      54000 Bwhale  
##  7 BrydesWhale             300      41000 Bwhale  
##  8 Walrus                  450       1500 Pinniped
##  9 LeopardSeal              16        400 Pinniped
## 10 HarbourSeal             481         85 Pinniped
## 11 CaliforniaSeaLion       536        226 Pinniped
## 12 NZFurSeal               238        100 Pinniped
## 13 NZSeaLion               550        110 Pinniped
## 14 MinkeWhale              106       8000 Bwhale  
## 15 CommonDolphin           260         80 Twhale  
## 16 HarbourPorpoise         100         55 Twhale  
## 17 PolarBear                13        500 Other   
## 18 SeaOtter                101         30 Other   
## 19 BottlenoseDolphin       300        450 Twhale


From these data, we can see that six baleen whales (Group = Bwhale) are included in this subset. This is quite interesting. It could be biologically reasonable that these whales follow a different pattern to the other species because their physiology is very different.

Keep that last point in mind when we get on to interpreting the results.

Normal QQ

plot(DiveModel, which = 2)

Figure 7: Normal QQ plot for dive depth model.


The assumption we are checking with this plot is: are the residuals normally distributed?

As expected, there is not a perfect match between the theoretical normal distribution and the distribution of the residuals. There is some deviation at both tails of the distribution. At lower quantiles, this seems ok. At higher values, points 9 and 10 deviate quite a lot. These points also stood out in Figure 6. We will need to look into them more in Figure 9.


Scale-location

plot(DiveModel, which = 3)

Figure 8: Scale-location plot for dive depth model.


The assumption we are checking with this plot is: Do the residuals have equal variance across fitted values?

Figure 8 shows a very similar picture to Figure 6. While there is a slight increase in variance as the amount of data increases, the amount of change is < 0.5 and there is not much structure in the points. So, this looks like things are ok.


Cook’s distance

plot(DiveModel, which = 4)

Figure 9: Cook’s distance plot for the dive depth model.


The assumption we are checking with this plot is: Are there any outliers?

Figure 9 shows that the Cook’s distances of this model are not very high (max = 0.2). So, it does not seem that any points have that large an influence on the fitted values. However, point 10 does seem to be quite different from the others. This might be worth looking into.


We can find point 10 by looking at the 10th row of our data frame. It is the entry for the Cuvier’s beaked whale. While this is a rare and unusual whale, it is not the only beaked whale in our dataset and we have no reason to believe this data point is a typo. Therefore, we would not consider it an outlier and would not remove it from the data.

DiveData[10,]
## # A tibble: 1 x 4
##   Species            MaxDepth BodySizeKG Group 
##   <fct>                 <int>      <dbl> <fct> 
## 1 CuviersBeakedWhale     2992       2500 Twhale


Summary

Overall, it seems that most of the model assumptions are met well. The only red flag is in the equal variance assumption. However, this seems to be caused in part by a lower amount of data available at the extremes of dive depths. Therefore, we think it is ok to proceed with this model.


In the next section we will interpret our results.


Draw conclusions

Theory

In the previous sections you learned how to run a simple linear regression model, what the parameters of the model mean, how to quantify uncertainty in the parameters, and how to check the assumptions of the model. Now, we can bring everything together to draw some conclusions.


There are several components required in drawing a conclusion:

  • a statement of the maximum likelihood estimate of the parameters of interest (including strength and direction)
  • a statement of the uncertainty in the estimate
  • a statement of how good the model is, i.e. how well the model meets assumptions and the amount of variance explained (using \(R^2\) - explained below)
  • a link between the results and the biology and the question asked
  • a discussion of next directions


Example conclusion for butterfly wing length data

The question we asked in this example was Does temperature influence wing length of butterflies?

To answer this, we need an estimate of the strength of the relationship between wing length and temperature. We find this from the slope of our simple linear regression. The maximum likelihood estimate of the slope is -0.105. This corresponds to a negative relationship between temperature and wing length where wings are 0.105 mm shorter for every 1ºC warmer the environment.


However, there is some uncertainty around this estimate. The 95% confidence interval for the relationship is -0.307 to 0.096. The confidence interval shows that the estimate of the slope has high uncertainty relative to its size. The confidence interval spans 0. In other words, 0 sits between the upper and lower limits because they have different signs (positive and negative). This means that 0 is included in the plausible range of parameter values for the relationship (a slope of 0 = no relationship). As a result, we cannot conclude that there is any effect of temperature on wing length in female butterflies at the population level.


Despite the lack of a clear result, our model did meet the assumptions of a simple linear regression reasonably well. However, the sample size was a bit low (<30), so more data would have been better.


To work out how much of the variation in \(Y\) (wing length) is explained by \(X\) (temperature) we can use a measure called the \(R^2\). It is calculated using the following equation:

\[ R^2 = 1 - \frac{\color{darkblue}{\text{Residual Variance}} }{\text{Total Variance}} \] and can be found in R using summary(YOURModel)$r.squared. The value is a proportion, so between 0 (no variance explained) and 1 (all variance explained). A good value is subjective, but > 0.5 is usually considered good, > 0.7 is very good and a value of 1 is suspicious.

The \(R^2\) for the butterfly wing length model is:

summary(InsectModel)$r.squared
## [1] 0.07654834
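
This matches what you get by applying the formula directly; a minimal sketch using the variance of the residuals and of the observed wing lengths:

# R^2 from the formula: 1 minus residual variance over total variance
1 - var(residuals(InsectModel)) / var(InsectDataF$AvWingLength)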


The \(R^2\) suggests that only 8% of the variation in wing length is explained by temperature. This is very low; there is more unexplained variation than explained.


Overall, it seems that temperature does not have much effect on the wing length of female butterflies based on the data we analysed. The negative effect we estimated is weak and when we consider uncertainty, the direction of the effect is not clear. The model also did not explain much variation in wing length. All of this suggests that there are other drivers that are influencing wing length in female butterflies that we did not measure here.

But we should also not forget that the sample size is low. If the effect really is small, we would need much more data to identify the effect with confidence.


For next steps, we might want to collect new data looking at other explanatory variables that we think might be more important, just increase our sample size, or explore the pattern in male butterflies.


Worked example

This is the final section of our analysis of the data on marine mammal maximum dive depths. We will now bring together all of the results we have obtained and draw a conclusion following the same format as the butterfly example.

A reminder, we were asking: Does body size influence maximum dive depth in marine mammals?


The maximum likelihood estimate of the relationship between body size and maximum dive depth in marine mammals was -0.005. In other words, for every 1 kg increase in body weight, marine mammals dived 0.005 m less deep. Given that some marine mammals can dive 1000s of metres, this change per kg is tiny.


When we look at the uncertainty in this estimate, we see the 95% confidence interval is -0.017 to 0.006. The confidence interval limits have different signs, meaning that 0 is included as a plausible value for the strength of the relationship. Therefore, we cannot conclude that body size has any impact on maximum dive depth in marine mammals. This is supported by the \(R^2\) value, which is 0.03, suggesting only 3% of the variation in maximum dive depth is explained by body size.


In model checking, an interesting pattern appeared: there was little data at the lowest maximum dive depths and a disproportionate representation of baleen whales among those points. If we plot the data by group and allow ggplot to fit a regression line per group, we can see that there seems to be a different pattern for different taxonomic groups. This would make sense based on physiology and might explain why the variance assumption was not met well AND why the estimated relationship was so weak.

ggplot(DiveData, aes(x=BodySizeKG, y=MaxDepth, color=Group))+
  geom_point()+
  facet_wrap(.~Group)+
  geom_smooth(method='lm')+
  theme_minimal()

Figure 10: Plot of the relationship between dive depth and body size faceted by taxonomic group.


While the baleen whales show no relationship between dive depth and body size, both the pinnipeds (seals and sea lions) and the toothed whales (orca, dolphins, porpoises) show quite strong positive relationships between dive depth and body size. By including all groups in one analysis, we might have been masking the true effects.

It might be more appropriate to consider an effect of Group as well as body size. We can do this using Linear models for categorical explanatory variables/ANOVA; check out those pages to continue the analysis of these data.

What’s next?

  • Linear models for categorical explanatory variables/ANOVA for analyses when your explanatory variable is not numeric.
  • Multiple regression for analyses with more than one numeric explanatory variable.
  • Generalised linear models for analyses when your response variable is not normally distributed.