You should be familiar with basic statistical theory, basics of R, continuous and categorical data, hypothesis testing, statistical modeling, and the concept of linear models.
In this section, we will look at how we can use simple linear regression to analyze data with a continuous numeric response and a numeric explanatory variable. Linear regression is a type of linear model.
Linear regression has two motivations. The first is called inference, which is when you want to say something about a population from a sample. Very often data are only available for a subset of the population (a sample), but we want to generalize our conclusions to the whole population. Therefore, we use statistical models to infer effects at the whole population level from what we find at the sample level. The second motivation is prediction, where we use a model to predict values of the response for specific values of the explanatory variable. These predictions can be for observed values of the explanatory variable (mainly used for plotting), for unobserved values of the explanatory variable within the same range as the observations, or for novel values of the explanatory variable outside the range of observations (this is more risky! - more on this later).
Example questions:
inference: does the height of plants increase with increasing temperatures?
prediction: how tall will a plant be if mean temperatures increase by 2°C?
Figure 1: Panel A shows an illustration of inferring the effect of temperature on plant height. Panel B shows an illustration of the prediction of a plant height under a 2°C increase in temperature. Created by Emily G. Simmonds
In simple terms, we fit a straight line to:
estimate a relationship between \(X\) and \(Y\)
predict change in \(Y\) from change in \(X\).
Linear regression assumes a causal relationship between \(X\) and \(Y\), i.e. it assumes that \(X\) influences \(Y\), but not vice versa. It is important to understand that a regression only “quantifies” the pattern between \(X\) and \(Y\), but does not actually test for a causal relationship. To test if \(X\) causes an effect on \(Y\), you need to conduct a scientific experiment. A linear relationship between two variables does not necessarily mean that \(X\) has a causal influence on \(Y\). For example, the number of PhDs awarded in math has nothing to do with the amount of uranium stored in the USA. However, when plotting the two variables against each other, one could assume a perfect relationship (see Figure 2). Therefore, before statistically analyzing data it is essential to make sure there is a biological explanation or assumption for a relationship between \(X\) and \(Y\). For more examples of wrong assumptions of causality click here.
Figure 2: Example of a spurious correlation (one that does not make sense causally). Scatterplot of number of maths PhDs awarded in a year against the amount of uranium stored in US power plants annually. Smooth line indicates a regression line for this data with the 95% confidence interval as the shaded area
Example questions you can answer with a simple linear regression:
Inference
Prediction
A linear regression is used when you have a continuous numeric response variable and a continuous numeric explanatory variable. A simple linear regression has only one explanatory variable; a multiple linear regression has more than one.
Examples of continuous numeric variables:
Always remember to check that the way your variables are classified in R matches the format you expect. It is a common mistake for a variable that should be numeric to be classified as a factor, or a factor as a character string, etc. (Recent versions of R no longer convert strings to factors by default, so this is becoming less of a problem.)
For this worked example we will be using some data on dive depths of marine mammals. We will be answering the question:
Does body size (kg) influence maximum dive depth (m) in marine mammals?
We might expect that differences in foraging strategies, physiological processes, metabolism, and power of the marine mammals would impact how deep the animals can dive. All of these variables correlate with body size. Therefore, it could be expected that larger species can dive deeper as relative movement compared to body size is lower. Or perhaps it is less energetically demanding for smaller species to dive deeper. The direction and strength of the influence of body size on dive depth can be quantified using a linear regression.
Figure 3: Illustration of marine mammals by Emily G. Simmonds
These data have four variables; species, maximum recorded dive depth in metres (m), body weight in kilograms (kg), and taxonomic group.
The data were collected from a variety of sources and cover several taxonomic groups (including polar bear, sea otters, baleen whales, toothed whales, seals, and sea lions). You can find the dataset here if you want to try the analysis for yourself.
First, we want to import the data and have a closer look by making a plot.
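The import and plotting chunk is not shown here, so below is a minimal sketch, assuming a tidyverse workflow and a hypothetical file name dive_data.csv (use the dataset linked above).
library(tidyverse)
# Hypothetical file name; replace with the file you downloaded
dive_data <- read_csv("dive_data.csv")
# Have a quick look at the data
dive_data
# Scatterplot of body size against maximum dive depth, coloured by taxonomic group
ggplot(dive_data, aes(x = body_size_kg, y = max_depth, colour = group)) +
  geom_point()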
## # A tibble: 30 x 6
## species max_depth body_size_kg group Species Group
## <chr> <int> <dbl> <chr> <fct> <fct>
## 1 Orca 264 4500 Twhale Orca Twhale
## 2 WeddellSeal 741 500 Pinnip~ WeddellSeal Pinnip~
## 3 LongFinnedPilot 828 1300 Twhale LongFinnedPilot Twhale
## 4 ShortFinnedPilot 1019 2000 Twhale ShortFinnedPilot Twhale
## 5 BlainvillesBeakedW~ 1408 900 Twhale BlainvillesBeaked~ Twhale
## 6 NorthernBottlenose~ 1453 8000 Twhale NorthernBottlenos~ Twhale
## 7 SouthernElephantSe~ 1653 3000 Pinnip~ SouthernElephantS~ Pinnip~
## 8 BairdsBeakedWhale 1777 12000 Twhale BairdsBeakedWhale Twhale
## 9 SpermWhale 2035 27000 Twhale SpermWhale Twhale
## 10 CuviersBeakedWhale 2992 2500 Twhale CuviersBeakedWhale Twhale
## # ... with 20 more rows
Figure 4: Scatterplot of body size against maximum dive depth for marine mammals. Colours indicate taxonomic group
The output shows that two of the variables are factors (species and group), one is continuous numeric (body_size_kg), and the last one is integer (max_depth). We know that maximum dive depth actually is a continuous variable; it was probably rounded to whole metres when recorded, because these are large numbers. We will change this variable to numeric so it better represents the characteristics of the data.
Remember, we are interested in whether body size (kg) influences maximum dive depth (m) in marine mammals. To answer this question we will need the variables body_size_kg and max_depth. We will not consider species or group at the moment, but we might need them later.
Now we have familiarized ourselves with the data, we know our research question, and we have identified which variables we need for the analysis. We are ready to perform an analysis.
When we create a model we aim to represent mathematically the process that generated the data we observed.
When we use a linear regression model (this is also true for other types of linear models), we make an assumption that there is a linear relationship between the explanatory and response variables. Mathematically, we say that we can capture the data generation process with a straight line and some error.
The line component of the linear regression is defined by two parameters: the intercept, \(\alpha\) (Greek letter alpha), and the slope, \(\beta\) (Greek letter beta). We can alter the position and slope of the line through these two parameters. The final part of the model is \(\varepsilon\) (Greek letter epsilon), which is the error around the regression line. This error is estimated with the parameter \(\sigma^{2}\) (Greek letter sigma, squared), which is the variance of the error. (Greek letters are used to refer to each part of the model in equations.)
We can write a linear regression model as an equation as a function of \(Y\):
\[ Y_i = \color{orange}\alpha + \color{blue}\beta X_i + \color{red}\varepsilon_i \]
There are several assumptions that we make when using a linear regression model:
All of these assumptions should be met for the model to work properly and they ALWAYS need to be checked. We will check five of them after we have fit the model (see below). The last assumption, independence of \(Y\), needs to be assured before or during data collection. For example, if data were collected on leaf length, 20 leaves each from five trees, these would not be independent. It would be better to collect one leaf each from 100 trees.
To fit the simple linear regression in R we will use the lm() function. lm() stands for linear model (which should seem familiar) and it takes several arguments; the two we need are a formula of the form y ~ x and a data argument naming the data frame that contains your variables.
The function will fit the regression model using maximum likelihood estimation and give us the maximum likelihood estimates of \(\alpha\) and \(\beta\) as an output. It also estimates \(\sigma^{2}\) of the error, but it does not report this.
To use the lm() function you first need to think about the formula argument, the y ~ x part. In the same way as in the equation above, \(Y\) always corresponds to the response variable (the thing you are trying to explain) and \(X\) to an explanatory variable (the thing you assume affects the response).
Does temperature influence wing length of butterflies?
The explanatory variable (\(X\)) = temperature, it is the variable that does the influencing. The response variable (\(Y\)) = wing length, it is the result.
You can then plug these variables into the lm() function in the below format, using the column names in place of y and x and including your data frame name as the data argument.
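A sketch of this call, using the generic names response, explanatory, and your_data that appear in the summary() output later on this page:
# Fit a simple linear regression and save it as a model object
model_object <- lm(response ~ explanatory, data = your_data)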
Running lm() as illustrated above fits the linear regression model and saves the output as ‘model_object’.
I saw an lm() written differently, what’s that about?
You can use the lm() function without the data argument. If you do this, you need to refer to your \(X\) and \(Y\) variables in the y ~ x formula using a $ between the data name and the column name. We do not recommend using this approach. There are several reasons for this, but a key one is that when using the $ syntax, R sees the variable name as the whole entry your_data$explanatory rather than as the column name explanatory. This makes it difficult to use this model for other things, e.g. to predict.
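To illustrate the difference, a sketch contrasting the two ways of writing the same model (using the same generic names):
# Not recommended: R treats the whole string 'your_data$explanatory' as the variable name
model_dollar <- lm(your_data$response ~ your_data$explanatory)
# Recommended: the data argument keeps the clean column name 'explanatory'
model_object <- lm(response ~ explanatory, data = your_data)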
The results of the linear regression can be viewed using the function coef(). This takes the output of lm(), the model object, as its argument and extracts the maximum likelihood estimates of \(\alpha\) and \(\beta\). \(\alpha\) will always be labelled (Intercept), but \(\beta\) will be labelled by the name of the \(X\) variable.
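The output below comes from a call of the form:
# Extract the estimated intercept and slope
coef(model_object)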
## (Intercept) explanatory
## 2.7720862 0.7969848
This worked example demonstrates how to fit a linear regression model in R using the lm() function for the dive depths example.
In this example we are asking:
Does body size influence maximum dive depth in marine mammals?
Our question is formulated to suggest a direction of causality: we assume body size has a causal effect on maximum dive depth, therefore maximum dive depth is our response (\(Y\)) and body size is our explanatory variable (\(X\)). We can put these variables into the lm() function in the below format.
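A sketch of this call (the model object name dive_model matches the one used in the prediction code later on):
# Fit maximum dive depth as a function of body size
dive_model <- lm(max_depth ~ body_size_kg, data = dive_data)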
Great. We have run a model and assigned it to an object name.
We can look at the maximum likelihood estimates of our model parameters (\(\alpha\) and \(\beta\)) using the function coef().
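The output below comes from:
# Maximum likelihood estimates of the intercept and slope for the dive model
coef(dive_model)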
## (Intercept) body_size_kg
## 755.977336598 -0.005384027
We will look at interpreting these in the next part of the worked example.
We introduced the three model parameters of a simple linear regression in the section above: \(\alpha\) = the intercept, \(\beta\) = the slope of the line (steepness/gradient), and \(\sigma^{2}\) the variance of the error.
\[ Y_i = \color{orange}\alpha + \color{blue}\beta X_i + \color{red}\varepsilon_i \]
But what do these parameters really mean?
All regression analyses are fundamentally about using straight lines to represent the relationship between a response (\(Y\)) and some explanatory variables (\(X\)), called a regression line. The parameters of the model determine the placement and gradient of the straight line, as well as representing the distribution of data points around the line.
Figure 5: Illustration of components of a linear regression. Intercept is in orange, slope is in blue, and the residuals (explained later) are in red
In this section we will go through these parameters and their meaning in terms of the relationship between \(X\) and \(Y\).
This first parameter gives the value of \(Y\) when \(X\) = 0; it is the point where the regression line crosses the y-axis. A positive intercept means that the value of \(Y\) when \(X\) = 0 is above 0, and a negative intercept means that it is below 0.
This second parameter gives the amount of change in \(Y\) for every unit change in \(X\); it is the slope of the regression line. Positive values indicate a positive relationship between \(X\) and \(Y\), i.e. \(Y\) increases as \(X\) increases. Negative slope values indicate the opposite, a negative relationship where \(Y\) decreases as \(X\) increases. The larger the absolute value of the slope, the stronger the regression relationship.
Together \(\alpha\) and \(\beta\) control the position and steepness of the regression line. They are called the systematic part of the model, the bit that links \(Y\) to the covariate \(X\).
This is the final parameter you need to estimate for the simple linear regression and it is a bit different from \(\alpha\) and \(\beta\). This parameter does not relate directly to the shape or position of the regression line. Instead, this parameter captures the variance of the data points around that line, i.e. how close or far away each data point is from the regression line. The variance is the random part of the model, or in other words the error. Higher error variance values indicate more variation around the model fit. In the case of a simple linear regression, the error is assumed to be normally distributed.
In Figure 5, you can see a plot of a regression line through the data points, but it does not touch all of the points. In other words, it is not capturing all of the variation in the data. The regression line does not explain the exact position of all data points; something else is also going on (this is to be expected for real data).
The regression line is a fitted line that represents the best fit line of a relationship between \(X\) and \(Y\) (based on maximum likelihood estimation).
The value of \(Y\) for each \(X\) on the regression line is called a fitted value. The distances between the fitted values and the values of \(Y\) that were actually observed are called residuals. You can extract the residuals from your model object using the function residuals(), which is useful when checking model assumptions (see below). Data points below the regression line have a negative residual and data points above the line have a positive residual. These are highlighted in red in Figure 5.
The regression line must always pass through the point that represents the mean of \(X\) and the mean of \(Y\). (\(\bar{X}\), \(\bar{Y}\)). Therefore, if you change the intercept, the slope must change as well to keep the line going through the (\(\bar{X}\), \(\bar{Y}\)) point. You can have a go at doing this below.
Below you will see a window containing an app. The aim of this app is to try and find the straight line that best explains the data, by trying different slope values.
There is a slider on the left hand side that lets you control the \(\beta\) value, the \(\alpha\) is fixed at 0. On the right you can see the fitted line (one is already plotted for you).
In this example, the ‘fit’ of the line is measured using something called the sum of squared residuals. This is calculated by squaring the values of all of the residuals and then adding them up to get a single number:
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] where \(y_i\) is the observed value of \(y\) for \(x_i\), \(\hat{y}_i\) is the fitted value of \(y\) for \(x_i\), and \(i\) is an index from 1 to \(n\) (the sample size).
The reason the sum of squares is used to estimate the fit of a model line to data is that there are roughly as many positive residuals as negative, so if you just summed them, the result would be roughly 0. Squaring them before adding means they do not cancel out. This measure tells you how far away the observations are from the fitted line: the lower the number, the better the fit.
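In R, this quantity can be computed directly from a fitted model’s residuals; a sketch using the generic model_object from above:
# Sum of squared residuals: square each residual, then add them up
sum(residuals(model_object)^2)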
Click here to open the app in full scale in a separate window.
What was the best fit you managed to get?
What was the answer?
A slope of approximately 3 should give the best answer (lowest sum of squares), which is 3423.5956.
How confident are you that you found the best line?
It is hard to know by trial and error if you have found the ‘best’ line for the data. It is much easier, repeatable, and reliable to use a simple linear regression instead. The idea is the same as the app, but instead of trying until it looks good, the equation for simple linear regression is used and values for the unknown parameters are found using maximum likelihood estimation.
Now that you know what each of the three parameters in a simple linear regression means, you can think about interpreting them.
Which of the three parameters do you think is most important for answering the research question “Does \(X\) influence \(Y\)?”?
I had a go, now show me the answer.
The slope (\(\beta\)) tells us the strength and direction of the relationship between \(X\) and \(Y\). While the linear regression model also estimates the intercept and the residual variance, these do not directly answer our question of interest. However, to make predictions, we will need all three parameters.
In the previous section of this worked example, we fit a simple linear regression using the lm() function and looked at the estimates of some parameters using the coef() function. In this section, we will use model theory to interpret what those parameters mean.
We already know that the intercept and slope parameters control the position and steepness of the regression line. It is the estimates of these two parameters that we get from the coef() function. For our dive depth model the estimates are:
## (Intercept) body_size_kg
## 755.977336598 -0.005384027
The intercept is 756 m and the slope of the relationship between body size and dive depth is -0.005.
In this case, the intercept is not that interesting. It tells us the expected value of \(Y\) (maximum dive depth) when \(X\) (body size) = 0. It does not make a lot of biological sense to know the expected dive depth of a marine mammal that weighs 0 kg. But, sometimes it can make sense to know the value \(Y\) when \(X\) is 0, for example if \(X\) was temperature.
The slope on the other hand is interesting. It tells us the direction and strength of the relationship between body size and maximum dive depth. In this case our model estimates a negative slope. This means that for every increase of 1 kg in body size of marine mammals, the maximum dive depth decreases by 0.005 m. In other words, there is a negative relationship between body weight and dive depth: as \(X\) increases, \(Y\) decreases (see Figure 6).
The coef() function can give us the maximum likelihood estimates of the intercept and slope parameters, but it does not give any information on the residual variance, \(\sigma^{2}\). To get an estimate of the residual variance, we use the summary() function with the model object as an argument. To extract \(\sigma\) we use $sigma and square the result to get \(\sigma^{2}\).
You cannot take var(residuals) directly because var() uses n-1 as the denominator of the variance equation, whereas to estimate \(\sigma^{2}\) for a linear regression the denominator depends on the number of parameters being estimated. The denominator is the degrees of freedom (n minus the number of parameters estimated). In a simple linear regression, two parameters have been estimated (as well as \(\sigma^{2}\)): the intercept and the slope. Therefore, the denominator is n-2. The more explanatory variables you add to a linear regression, the more parameters you estimate and the fewer degrees of freedom you will have.
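The output below comes from a call along these lines:
# Residual standard error (sigma), squared to give the residual variance sigma^2
summary(dive_model)$sigma^2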
## [1] 526362.7
For this model, \(\sigma^{2} = 5.263627 \times 10^{5}\) (526362.7). This number is abstract and does not mean anything on its own, but it could be used to compare models (though there are much better ways to do this). It does not help in terms of answering whether body size influences dive depth. But we will use it for prediction later.
As well as looking at the maximum likelihood estimates of the parameters from the simple linear regression, we can also plot the estimated regression line.
To do this, we will use ggplot() with geom_line(). We will also go into the second aim of a regression: predicting. Therefore, we need to use a new function called predict().
To make this first plot, we only need to use two arguments:
object = your model object
type = "response", which means predict on the response scale
Once we have created predictions of \(Y\) from the model object, we can then plot these using geom_line() as in the code below.
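The depth_predictions used in the plotting code below can be created like this (a sketch of the missing chunk):
# Fitted values on the response scale, one per observation in dive_data
depth_predictions <- predict(dive_model, type = "response")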
# Plot the data and the predicted regression line
dive_model_fig <- ggplot(dive_data, aes(x = body_size_kg, y = max_depth)) +
  geom_point(colour = 'grey70') +           # observed data points
  geom_line(aes(y = depth_predictions)) +   # predicted (fitted) values
  labs(y = "Maximum Dive Depth (m)",
       x = "Body Size (kg)")
Figure 6: Scatter plot of dive depths against body size. The black line is the regression line predicted from the linear model
In the next section we will look at how to add uncertainty to these plots and our interpretation.
You should already know that statistics does not give a single correct answer. When you estimate the values of parameters in a statistical model, there are many different values that could plausibly have produced the observed data. Some of these are more likely than others, but several will have very similar likelihoods.
A simple linear regression is no different, and a way to cope with this is to calculate and present the uncertainty in the parameters you estimate.
The lm() function produces results that are equivalent to maximum likelihood estimation of the parameters. Therefore, our consideration of uncertainty for these models follows the same principles as discussed here. You can quantify uncertainty using standard errors, confidence intervals, and prediction intervals, which should be familiar to you, but head to the uncertainty pages if you need a recap.
For any regression there are two different types of uncertainty. Here, we cover the uncertainty in the parameters of the regression line, \(\alpha\) and \(\beta\), and the uncertainty in predictions of \(Y\).
The standard error of a parameter is the standard deviation of its sampling distribution. It gives a measure of the spread of the sampling distribution, i.e. the uncertainty. To find the standard errors for the estimates of \(\alpha\) and \(\beta\) you can use the summary() function. The argument that summary() takes is a model object, the output from lm(). This function gives a big table with lots of information. The first line shows the model formula used for the model object. The second line shows a summary of the residuals of the model, and the standard errors are shown in the second column of the third part, Coefficients:.
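The output below comes from a call of the form:
# Full model summary, including the standard errors in the Coefficients table
summary(model_object)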
##
## Call:
## lm(formula = response ~ explanatory, data = your_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1510 -3.0636 0.3854 1.4128 8.1477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.77209 1.11970 2.476 0.0196 *
## explanatory 0.79698 0.05925 13.452 9.61e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.28 on 28 degrees of freedom
## Multiple R-squared: 0.866, Adjusted R-squared: 0.8612
## F-statistic: 181 on 1 and 28 DF, p-value: 9.611e-14
If we take the summary() of the example model, we can see the standard error of the intercept (\(\alpha\)) is 1.11970, and the standard error of the slope (\(\beta\)) is 0.05925.
For interpretation of the uncertainty, it can be easier to use the standard error to calculate confidence intervals (CIs). Confidence intervals indicate the range of plausible values for a parameter. They represent an interval such that, if you were to collect a sample, run the analysis, and draw a confidence interval, then repeat the whole process many, many times, the true population value of the parameter would fall within the confidence interval 95% of the time on average.
If you need a reminder of this click here.
The confidence interval can be calculated using this formula:
\[ \begin{aligned} \text{Upper CI} &= \text{estimate} + (1.96 \times SE) \\ \text{Lower CI} &= \text{estimate} - (1.96 \times SE) \end{aligned} \]
1.96 is used because in a standard normal distribution 95% of the distribution lies within 1.96 standard deviations of the mean. In this case the distribution is the sampling distribution, which is normal for \(\alpha\) and \(\beta\) and the standard deviation is the standard error.
Using the formulas above, calculate the confidence intervals for the intercept and slope from the example model.
I had a go at calculating, what is the correct answer?
# INTERCEPT
# Upper confidence interval
upper_ci_intercept <- summary(model_object)$coefficients[1,1] +
(1.96*summary(model_object)$coefficients[1,2])
# Lower confidence interval
lower_ci_intercept <- summary(model_object)$coefficients[1,1] -
(1.96*summary(model_object)$coefficients[1,2])
# Print the interval
c(upper_ci_intercept, lower_ci_intercept)
## [1] 4.9667025 0.5774699
# SLOPE
# Upper confidence interval
upper_ci_slope <- summary(model_object)$coefficients[2,1] + (1.96*summary(model_object)$coefficients[2,2])
# Lower confidence interval
lower_ci_slope <- summary(model_object)$coefficients[2,1] - (1.96*summary(model_object)$coefficients[2,2])
# Print the interval
c(upper_ci_slope, lower_ci_slope)
## [1] 0.9131062 0.6808633
It is also possible to get R to calculate the confidence intervals for you. To do this you can use the confint() function. The argument is a model object.
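The output below comes from:
# 95% confidence intervals for the intercept and slope
confint(model_object)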
## 2.5 % 97.5 %
## (Intercept) 0.4784802 5.065692
## explanatory 0.6756256 0.918344
Hopefully these confidence intervals look very similar to those you calculated yourself (confint() uses the t-distribution rather than the value 1.96 from the normal distribution, so the numbers differ slightly).
So far, we have covered uncertainty in parameter estimates using standard errors and confidence intervals but linear regression analyses can also be used for prediction. When a linear regression is used for prediction, an interval of confidence in the prediction should also be given. This is a type of confidence interval specific for predictions and can be called a prediction interval.
A 95% prediction interval tells you that, if you were to collect a sample, run the analysis, and then go out and collect a new observation of the response variable (\(Y\)) at a particular value of the explanatory variable (\(X\)), and repeat this many, many times, each time drawing a prediction interval, the new observation would fall within the prediction interval 95% of the time.
To find the prediction interval for a prediction, you use the predict() function with the interval="prediction" argument. You also set the newdata argument to the value of \(X\) you want to predict for.
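For example, a sketch for the generic model, predicting for a hypothetical explanatory value of 10:
# Prediction and 95% prediction interval for a new observation with explanatory = 10
predict(model_object, newdata = data.frame(explanatory = 10),
        interval = "prediction")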
At the end of the last section, we created a plot of our dive depth data and the estimated linear regression line. Now, we will add uncertainty to that plot.
First, we should look at the confidence intervals of our parameter estimates.
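The rounded intervals shown below could be produced with a call like this (a sketch):
# 95% confidence intervals for the dive model, rounded to 2 decimal places
round(confint(dive_model), 2)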
## 2.5 % 97.5 %
## (Intercept) 449.03 1062.93
## body_size_kg -0.02 0.01
The confidence intervals have been rounded to 2 decimal places to make them easier to read. The intercept interval spans from approx 450 to 1060. The slope interval crosses 0, going from -0.02 to +0.01.
To add these intervals to the plot, we need to make new predictions including the confidence interval. In this case we are using confidence intervals even though we are predicting, because we want to show the uncertainty in \(\alpha\) and \(\beta\) rather than in a novel prediction. We have to make predictions in order to plot rather than because we are interested in them.
Once we have created predictions of \(Y\) from the model object, we can then plot these using geom_line() and geom_ribbon() for the confidence interval, as in the code below.
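The depth_predictions object used below now needs the fitted values and confidence limits; one way to create it (an assumption about the original chunk) is:
# Fitted values with a 95% confidence interval; columns are fit, lwr, and upr
depth_predictions <- as.data.frame(predict(dive_model, interval = "confidence"))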
# Add the body sizes as an x column so predictions can be plotted against them
depth_predictions <- depth_predictions %>%
  mutate(x = dive_data$body_size_kg)

# Plot the data, the regression line, and the 95% confidence interval
fig_7 <- ggplot() +
  geom_ribbon(data = depth_predictions, aes(x = x, ymin = lwr, ymax = upr), fill = 'grey50') +  # confidence interval
  geom_point(data = dive_data, aes(x = body_size_kg, y = max_depth), colour = 'grey70') +       # observed data
  geom_line(data = depth_predictions, aes(y = fit, x = x)) +                                    # regression line
  labs(y = "Maximum Dive Depth (m)",
       x = "Body Size (kg)")
Figure 7: Scatter plot of dive depths against body size. Line is the regression line from a linear model. Shaded area is 95% confidence interval
You will notice that the confidence interval is narrower in the middle and wider at the ends. This is partly to do with the constraint that the regression line must go through the (\(\bar{X}\), \(\bar{Y}\)) point so the intercept and slope are not independent. Therefore, the confidence interval will be narrowest close to that point.
The plot shows that the estimated relationship becomes increasingly uncertain at higher body sizes.
A colleague has just found a new species of whale (fictional). The whale washed up on shore in Tromsø and weighed 75,000 kg. Based on our linear regression analysis, how deep would we expect it to dive?
predict(dive_model, newdata=data.frame(body_size_kg=75000),
type="response", interval = "prediction")
## fit lwr upr
## 1 352.1753 -1328.376 2032.727
The mean prediction is 352 m deep. This seems ok. But when we look at the prediction interval, we see that once we include uncertainty, we are not even sure whether the whale will dive 2 km below the surface or leap 1.3 km into the air. When we include uncertainty, it is clear that based on the current data and model, we cannot say anything about the possible dive depth of the new whale. We even get biologically unrealistic predictions.
This is something we will look at in the next section.
You now have a model, estimates of its parameters, and the calculated uncertainty of those parameters. But how can we know if a model is any good?
To find out if the model is any good from a theoretical perspective, you need to check if the model meets the five assumptions of the linear regression that are stated in the Model details section above.
For this, you can use graphs called diagnostic plots.
There are four key diagnostic plots that we use for simple linear regression, and each plot checks whether a different assumption has been met. To make the diagnostic plots, you use the plot() function with your linear model object as the first argument and which = number as the second. The number should be replaced by the number corresponding to the plot you want.
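For plot(), which = 1 gives the residuals vs fitted plot, 2 the normal QQ plot, 3 the scale-location plot, and 4 the Cook’s distance plot. For example:
# Residuals vs fitted values for the dive model
plot(dive_model, which = 1)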
For more on what each plot means, go here: Model checking. On this page, you can find some example plots for a simple linear regression.
Note: for all of these plots, we do not expect perfection, especially for biological data.
There are two examples of residuals vs fitted plots shown below. If the assumptions are met, you can expect a few characteristics of the plot:
Figure 8: Example of a good (left) and a bad (right) residuals vs fitted plot for a simple linear regression
In reality, it will not be as clear cut as the examples above. You will need to use your own judgment to decide if the assumption is sufficiently met. Do not expect perfection.
There are two examples of normal QQ plots shown below. If the assumptions are met, you can expect:
Figure 9: Example of a good (left) and a bad (right) normal QQ diagnostic plot
Notice that the y-axes have different values.
There are two examples of scale-leverage plots shown below. If the assumptions are met, you can expect:
Figure 10: Example of a good (left) and a bad (right) scale-location plot
There are two examples of Cook’s Distance plots shown below. If the assumptions are met, you can expect:
Figure 11: Example of a good (left) and a bad (right) Cook’s Distance plot
Notice that the y-axes have different values.
Using the theory covered in the previous section, we can now check our dive_model to ensure that it meets the assumptions of a simple linear regression.
We will use one plot at a time to test specific assumptions.
Figure 12: Residuals vs fitted plot for dive depths model
Figure 12 looks quite unusual.
There are a few assumptions we are checking with this plot:
At this point, it might be hard to say how problematic the low variance caused by the lack of data for low fitted values is. But, we can have a look at the points causing the pattern (those with a maximum dive depth < 600m).
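One way to extract this subset (a sketch, assuming dplyr is loaded):
# Species with a maximum recorded dive depth shallower than 600 m
filter(dive_data, max_depth < 600)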
## # A tibble: 19 x 6
## species max_depth body_size_kg group Species Group
## <chr> <int> <dbl> <chr> <fct> <fct>
## 1 Orca 264 4500 Twhale Orca Twhale
## 2 HomoSapien 112 85 Other HomoSapien Other
## 3 BlueWhale 172 100000 Bwhale BlueWhale Bwhale
## 4 FinWhale 128 60000 Bwhale FinWhale Bwhale
## 5 HumpbackWhale 240 36000 Bwhale HumpbackWhale Bwhale
## 6 SouthernRightWhale 180 54000 Bwhale SouthernRightWhale Bwhale
## 7 BrydesWhale 300 41000 Bwhale BrydesWhale Bwhale
## 8 Walrus 450 1500 Pinniped Walrus Pinnip~
## 9 LeopardSeal 16 400 Pinniped LeopardSeal Pinnip~
## 10 HarbourSeal 481 85 Pinniped HarbourSeal Pinnip~
## 11 CaliforniaSeaLion 536 226 Pinniped CaliforniaSeaLion Pinnip~
## 12 NZFurSeal 238 100 Pinniped NZFurSeal Pinnip~
## 13 NZSeaLion 550 110 Pinniped NZSeaLion Pinnip~
## 14 MinkeWhale 106 8000 Bwhale MinkeWhale Bwhale
## 15 CommonDolphin 260 80 Twhale CommonDolphin Twhale
## 16 HarbourPorpoise 100 55 Twhale HarbourPorpoise Twhale
## 17 PolarBear 13 500 Other PolarBear Other
## 18 SeaOtter 101 30 Other SeaOtter Other
## 19 BottlenoseDolphin 300 450 Twhale BottlenoseDolphin Twhale
From these data, we can see that five of the six baleen whales (Group = Bwhale) are included in this subset. This is quite interesting. It could be biologically reasonable that these whales follow a different pattern to the other species because their physiology is very different.
Keep that last point in mind when we get on to interpreting the results.
Figure 13: Normal QQ plot for dive depth model
The assumption we are checking with this plot is: are the residuals normally distributed?
As expected, there is not a perfect match between the theoretical normal distribution and the distribution of the residuals. There is some deviation at both tails of the distribution. At lower quantiles, this seems ok. At higher values, points 9 and 10 deviate quite a lot. These points also stood out in Figure 12. We will need to look into them more in Figure 15.
Figure 14: Scale-location plot for dive depth model
The assumption we are checking with this plot is: Do the residuals have equal variance across fitted values?
Figure 14 shows a very similar picture to Figure 12. While there is a slight increase in variance as the amount of data increases, the amount of change is < 0.5 and there is not much structure in the points. So, this looks like things are ok.
Figure 15: Cook’s distance plot for the dive depth model
The assumption we are checking with this plot is: Are there any outliers?
Figure 15 shows that the Cook’s distances for this model are not very high (max = 0.2). So, it does not seem that any points have that large an influence on the fitted values. However, point 10 does seem to be quite different from the others. This might be worth looking into.
We can find point 10 by looking at the 10th row of our data frame. It is the entry for the Cuvier’s beaked whale. While this is a rare and unusual whale, it is not the only beaked whale in our dataset and we have no reason to believe this data point is a typo. Therefore, we would not consider it an outlier and would not remove it from the data.
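A sketch of how to view that row:
# The 10th observation in the data
dive_data[10, ]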
## # A tibble: 1 x 6
## species max_depth body_size_kg group Species Group
## <chr> <int> <dbl> <chr> <fct> <fct>
## 1 CuviersBeakedWhale 2992 2500 Twhale CuviersBeakedWhale Twhale
Overall, it seems that most of the model assumptions are met well. The only red flag is in the equal variance assumption. However, this seems to be caused in part by a lower amount of data available at the extremes of dive depths. Therefore, we think it is ok to proceed with this model.
In the next section we will interpret our results.
In the previous sections you learned how to run a simple linear regression model, what the parameters of the model mean, how to quantify uncertainty in the parameters, and how to check the assumptions of the model. Now, you can bring everything together to draw some conclusions.
There are several components required in drawing a conclusion:
One measure that can be useful for conclusions is to know how much of the variation in \(Y\) is explained by \(X\). To answer this you can use a measure called the \(R^2\). It is calculated using the following equation:
\[ R^2 = 1 - \frac{\color{darkblue}{\text{Residual Variance}} }{\text{Total Variance}} \]
and can be found in R using summary(your_model)$r.squared. The value is a proportion, so it lies between 0 (no variance explained) and 1 (all variance explained). A good value is subjective, but > 0.5 is usually considered good, > 0.7 is very good, and a value of 1 is suspicious. For biological data, achieving an \(R^{2}\) of 0.5 or higher can be challenging because in the real world there are lots of variables influencing each response, more than we could measure and include in an analysis. In these cases, \(R^{2}\) could be as low as, e.g., 0.3 or 0.1, but this does not mean the model is not valid. It only means that other processes are causing variation in the response on top of those included in the model.
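For the dive depth model (used in the conclusion below), this would be:
# Proportion of the variation in maximum dive depth explained by body size
summary(dive_model)$r.squared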
This is the final section of our analysis of the data on marine mammal maximum dive depths. We will now bring together all of the results we have obtained and draw a conclusion following the same format as in the theory section.
A reminder, we were asking: Does body size influence maximum dive depth in marine mammals?
The maximum likelihood estimate of the relationship between body size and maximum dive depth in marine mammals was -0.005. In other words, for every 1 kg increase in body weight, marine mammals dived 0.005 m less deep. Given that some marine mammals can dive thousands of metres, this change per kg is very small.
When we look at the uncertainty in this estimate, we see the 95% confidence interval is -0.017 to 0.006. The confidence interval limits have different signs, meaning that 0 is included as a plausible value for the strength of the relationship. Therefore, we cannot conclude that body size has any impact on maximum dive depth in marine mammals. This is supported by the \(R^2\) value, which is 0.03, suggesting only 3% of the variation in maximum dive depth is explained by body size. This is a good example of the lower \(R^2\) we often get in biology.
We should remember here that when we predicted a dive depth for the fictional whale, the prediction interval included negative (above the surface) values. Linear models fit a straight line, which can extend beyond the realistic values for some variables. This should be noted, and predictions should not be made outside of values that are plausible. This is one reason why predicting outside of the range of your data can be problematic. We can also fit models that are not linear; there is more on those here.
In model checking, there was an interesting pattern shown, of little data at the lowest maximum dive depths and a disproportionate representation of baleen whales. If we plot the data by group and allow ggplot to fit a regression line per group, we can see that there seems to be a different pattern for different taxonomic groups. This would make sense based on physiology and might explain why the variance assumption was not met well AND why the estimated relationship was so weak.
# Plot dive depth against body size, with a separate panel and regression line per group
final_dive_fig <- ggplot(dive_data, aes(x = body_size_kg, y = max_depth, color = group)) +
  geom_point() +
  facet_wrap(. ~ group) +         # one panel per taxonomic group
  geom_smooth(method = 'lm')      # linear regression line (with confidence band) per group
Figure 16: Plot of the relationship between dive depth and body size faceted by taxonomic group
While the baleen whales show no relationship between dive depth and body size, both the pinnipeds (seals and sea lions) and the toothed whales (orca, dolphins, porpoises) show quite strong positive relationships between dive depth and body size. By treating all groups as the same in one analysis, we might have been masking the true effects.
It might be more appropriate to consider an effect of Group as well as body size. We can do this using Linear models for categorical explanatory variables/ANOVA; check out those pages to continue the analysis of these data.