17 Handling missing values
Many datasets have missing values. These could represent data that were not measured, perhaps because of instrument failure, or were impossible to measure, for example water temperature at 200m where the sea is only 100m deep.
Missing values in R are represented by NA
.
x <- c(0, 5, NA)
x
## [1] 0 5 NA
This tutorial shows you how to cope with missing values in R, focusing on manipulating data with the tidyverse
package, running statistical analyses, and making figures with ggplot2
.
The penguins
dataset from palmerpenguins
is used as an example.
17.1 Data handling with missing values
17.1.1 Detecting missing values
NA
values can be detected with the function is.na
(NB lower case).
## [1] FALSE FALSE TRUE
We could use this, for example, to find the number of missing values in a column of the penguins
data.
## # A tibble: 1 × 1
## n_missing
## <int>
## 1 2
17.1.2 Importing data with missing values
When you import a text file (e.g., a csv file) any blank cells, or cells with “NA” will be treated as NA
.
If you have coded missing values as something else, you can use the na
argument to read_delim
# set blank cells, "NA" or "missing" to NA
read_delim(file = "my_file.csv", na = c("", "NA", "missing"))
In readxl::read_excel
, the default for the na
argument is just for blank cells to be made NA
, but other values can be added in the same way as in read_delim
.
17.1.3 Removing rows with missing values
We can remove rows with NA
in particular columns from a data frame using drop_na
.
For example, to remove rows in the penguins
data set with an NA
in the bill_length_mm
or bill_depth_mm
columns, we could use
penguins |>
drop_na(bill_length_mm, bill_depth_mm)
## # A tibble: 342 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 339 more rows, and 1 more variable: year <int>
Alternatively, we can use filter
and is.na
.
This is most useful when removing NA
is one of several arguments to filter
.
## # A tibble: 123 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Gentoo Biscoe 46.1 13.2 211 4500 fema…
## 2 Gentoo Biscoe 50 16.3 230 5700 male
## 3 Gentoo Biscoe 48.7 14.1 210 4450 fema…
## # … with 120 more rows, and 1 more variable: year <int>
You might see some code using na.omit
to remove rows with missing values from a data frame.
This can be dangerous as it will remove rows with an NA
in any column, not just the columns you are interested in.
17.1.4 Replacing missing values
Sometimes you want to replace NA
with another value.
Perhaps you want to make a plot where NA
values are labelled unknown, or you know that the NA
values are actually zeros, or the NA
are values below the detection limit of an instrument and you want to replace these with half the detection limit.
You can use tidyr::replace_na
to do this.
Here, the code replaces missing values for bill length with the mean value for bill length.
penguins |>
mutate(
bill_length_mm = replace_na(
data = bill_length_mm,
replace = mean(bill_length_mm, na.rm = TRUE))
)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 341 more rows, and 1 more variable: year <int>
17.1.5 Replacing NA
in factors
The code in the previous section won’t work if we try to replace an NA
in a factor.
penguins |>
mutate(sex = replace_na(sex, "missing")) |>
distinct(sex)
## Warning in `[<-.factor`(`*tmp*`, !is_complete(data), value = "missing"): invalid
## factor level, NA generated
## # A tibble: 3 × 1
## sex
## <fct>
## 1 male
## 2 female
## 3 <NA>
As the warning indicates, the problem is that “missing” is not one of the levels of the factor sex.
Instead we need to use the function forcats::fct_explicit_na
(the forcats
package is part of tidyverse
for manipulating factors).
## # A tibble: 3 × 2
## sex n
## <fct> <int>
## 1 female 165
## 2 male 168
## 3 missing 11
17.1.6 Replacing NA
with values from another vector
Sometimes you have a vector with NA
and you want to replace the missing values with values from a second vector.
The coalesce
can do this.
## [1] -1 2 3 -4
Here, the first value of x
is NA
, so the first value of y
is used.
The second value of x
is not NA
, and so can be used.
And so on.
17.1.7 Setting mising values
Some data sets use a number to represent a missing value, for example -9999.
Obviously, if you do any calculations with a data set containing -9999 as a missing value, the results could be seriously wrong.
Instead we need to replace these values with NA
.
We can do this with dplyr::na_if
.
## [1] 1 7 NA
17.1.8 Missing values and dplyr filter
dplyr::filter
returns rows where the condition is strictly TRUE.
This is usually what you want: if we want to filter penguins with long bills from the penguins dataset we don’t want the birds with unknown bill length.
Occasionally we do want to keep the rows with NA
.
Perhaps we have a column of comments, many of which are NA
, and we only want to remove rows where the non-NA
values meet some criterion.
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen NA NA NA NA
## 2 Adelie Torgersen 34.1 18.1 193 3475
## 3 Adelie Torgersen 42 20.2 190 4250
## 4 Adelie Torgersen 37.8 17.1 186 3300
## 5 Adelie Torgersen 37.8 17.3 180 3700
## 6 Adelie Dream 37.5 18.9 179 2975
## 7 Gentoo Biscoe 44.5 14.3 216 4100
## 8 Gentoo Biscoe 46.2 14.4 214 4650
## 9 Gentoo Biscoe 47.3 13.8 216 4725
## 10 Gentoo Biscoe 44.5 15.7 217 4875
## 11 Gentoo Biscoe NA NA NA NA
## # … with 2 more variables: sex <fct>, year <int>
Here the |
means OR in Boolean logic.
17.1.9 Missing values and dplyr if_else
and case_when
Sometimes is it useful to set values to NA
with if_else
(when there is a choice of two outcomes) or case_when
(when there is a choice of two or more outcomes).
All possible values returned by if_else
and case_when
need to be of the same type (character, numeric, integer, logical).
If the types are inconsistent, an error is produced.
This is useful as forcing consistency helps avoid unexpected behaviour.
x <- 0:2
case_when(
x == 0 ~ NA,
x == 1 ~ "One",
x == 2 ~ "Two"
)
## Error: must be a logical vector, not a character vector.
The problem occurs here because NA
is treated as a logical vector, while the other values are characters.
The solution is to use NA_character_
which is an NA
with the correct type.
x <- 0:2
case_when(
x == 0 ~ NA_character_,
x == 1 ~ "One",
x == 2 ~ "Two"
)
## [1] NA "One" "Two"
Other typed NA
include NA_real_
for numeric values and NA_integer
for integer values.
17.2 Statistics with missing values
17.2.1 NA
arithemetic
What is five plus an unknown number? The answer is, of course, unknown.
5 + NA
## [1] NA
NA are contagious in calculations: if one value is NA
the result is NA
.
This effects many descriptive statistics.
## [1] NA
mean(x)
## [1] NA
min(x)
## [1] NA
The solution is to use the na.rm
argument to these functions to exclude the NA
from the calculation.
sum(x, na.rm = TRUE)
## [1] 8
mean(x, na.rm = TRUE)
## [1] 4
min(x, na.rm = TRUE)
## [1] 1
17.2.2 NA
in correlations and covariances
The functions for calculating correlation, cor
, and covariance, cov
, work a little differently as these functions can work on two vectors or on a matrix or data frame.
The use
argument is used to control how NA
are treated.
By default, if any values are NA
in either vector, the result is also NA
.
If you want to find the correlation between two vectors without the NA
, then use
cor(x = penguins$bill_length_mm,
y = penguins$bill_depth_mm,
use = "pairwise.complete.obs")
## [1] -0.2350529
If you have a matrix (or data frame), and want to calculate a correlation matrix, then use = "complete.obs"
will calculate this using just the rows that have no NA
, and use = "pairwise.complete.obs"
will calculate the correlation between each pair of variables using all complete pairs of observations on those variables.
penguins |>
select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
cor(use = "pairwise.complete.obs")
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2350529 0.6561813 0.5951098
## bill_depth_mm -0.2350529 1.0000000 -0.5838512 -0.4719156
## flipper_length_mm 0.6561813 -0.5838512 1.0000000 0.8712018
## body_mass_g 0.5951098 -0.4719156 0.8712018 1.0000000
17.2.3 Missing values and regession models
By default regression models such as lm
, glm
, and lmer
remove any case that has an NA
in either the response or predictors.
This behaviour is controlled by the na.action
argument.
This is great if we are interested in the model coefficients, but it can cause problems if there are NA
in the data and we want to add the residuals, fitted values, or predictions into the original data frame for plotting.
## Error: Problem with `mutate()` column `fit`.
## ℹ `fit = fitted(mod)`.
## ℹ `fit` must be size 344 or 1, not 342.
The problem is that there are fewer observation in the model (which omitted the NA
) than the original data frame (which still has them).
We can make this work by using na.action = na.exclude
which will pad the fitted values with NA
so that it is the same length as the original data.
mod <- lm(bill_length_mm ~ body_mass_g, data = penguins, na.action = na.exclude)
penguins |> mutate(fit = fitted(mod))
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 341 more rows, and 2 more variables: year <int>, fit <dbl>
With na.exclude
the NA
are still excluded from the model fitting - most models do not allow NA
values.
When comparing models, for example with anova
, all models need to have been fit to the same dataset.
This can cause problems if NA
have caused different numbers of observations to be removed.
mod2 <- lm(bill_length_mm ~ body_mass_g + sex, data = penguins, na.action = na.exclude)
anova(mod, mod2)
## Error in anova.lmlist(object, ...): models were not all fitted to the same size of dataset
The easiest solution is to remove rows with NA
in any predictor before fitting any of the models.
17.2.4 Imputing missing values
As shown above, observations with missing values are omitted from the model. If a predictor has many missing values, it may be better to exclude the predictor from the model to avoid losing too many observations.
An alternative is to impute the missing data. This should be done with caution as it can bias the results, especially if a substantial proportion of the data are imputed. On the other hand, if missing data are not randomly distributed, omitting observations with missing data can also bias the results.
There are several ways that can be used to impute missing values.
The simplest is to replace a the missing value with the mean or median of the variable as shown in section 17.1.4.
More complex methods use the multivariate relationship between predictors to estimate the missing values.
Several R packages can help with this, e.g., mice
.
Yadav and Roychoudhury (2018) compare the performance of some popular methods.
17.3 Missing values and ggplot2
By default, missing values in the x
or y
aesthetics are dropped by ggplot
with a warning, whereas missing values in the colour
or fill
aesthetics are shown in grey.
This behaviour can be controlled with the na.value
argument to the relevant scale_*_*
function.
#x or y NA
p <- tibble(
x = 1:5,
y = c(1, 2, NA, 4, 5),
colour = c(1, 2, 3, 4, NA)
) |>
ggplot(aes(x = x, y = y, colour = colour)) +
geom_point(size = 3)
p
## Warning: Removed 1 rows containing missing values (geom_point).
# change defaults
p +
scale_y_continuous(na.value = 0) +
scale_colour_continuous(na.value = "hotpink")