17 Handling missing values

Many datasets have missing values. These could represent data that were not measured, perhaps because of instrument failure, or were impossible to measure, for example water temperature at 200m where the sea is only 100m deep.

Missing values in R are represented by NA.

x <- c(0, 5, NA)
x
## [1]  0  5 NA

This tutorial shows you how to cope with missing values in R, focusing on manipulating data with the tidyverse package, running statistical analyses, and making figures with ggplot2.

The penguins dataset from palmerpenguins is used as an example.

17.1 Data handling with missing values

17.1.1 Detecting missing values

NA values can be detected with the function is.na (NB lower case).

x <- c(0, 7, NA)
is.na(x)
## [1] FALSE FALSE  TRUE

We could use this, for example, to find the number of missing values in a column of the penguins data.

penguins |> 
  summarise(n_missing = sum(is.na(bill_length_mm)))
## # A tibble: 1 × 1
##   n_missing
##       <int>
## 1         2

17.1.2 Importing data with missing values

When you import a text file (e.g., a csv file) any blank cells, or cells with “NA” will be treated as NA. If you have coded missing values as something else, you can use the na argument to read_delim

# set blank cells, "NA" or "missing" to NA  
read_delim(file = "my_file.csv", na = c("", "NA", "missing"))

In readxl::read_excel, the default for the na argument is just for blank cells to be made NA, but other values can be added in the same way as in read_delim.

17.1.3 Removing rows with missing values

We can remove rows with NA in particular columns from a data frame using drop_na. For example, to remove rows in the penguins data set with an NA in the bill_length_mm or bill_depth_mm columns, we could use

penguins |> 
  drop_na(bill_length_mm, bill_depth_mm)
## # A tibble: 342 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## # … with 339 more rows, and 1 more variable: year <int>

Alternatively, we can use filter and is.na. This is most useful when removing NA is one of several arguments to filter.

penguins |> 
  filter(species == "Gentoo", !is.na(bill_length_mm))
## # A tibble: 123 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Gentoo  Biscoe           46.1          13.2              211        4500 fema…
## 2 Gentoo  Biscoe           50            16.3              230        5700 male 
## 3 Gentoo  Biscoe           48.7          14.1              210        4450 fema…
## # … with 120 more rows, and 1 more variable: year <int>

You might see some code using na.omit to remove rows with missing values from a data frame. This can be dangerous as it will remove rows with an NA in any column, not just the columns you are interested in.

17.1.4 Replacing missing values

Sometimes you want to replace NA with another value. Perhaps you want to make a plot where NA values are labelled unknown, or you know that the NA values are actually zeros, or the NA are values below the detection limit of an instrument and you want to replace these with half the detection limit.

You can use tidyr::replace_na to do this.

Here, the code replaces missing values for bill length with the mean value for bill length.

penguins |> 
  mutate(
    bill_length_mm = replace_na(
      data = bill_length_mm, 
      replace = mean(bill_length_mm, na.rm = TRUE))
  )
## # A tibble: 344 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## # … with 341 more rows, and 1 more variable: year <int>

17.1.5 Replacing NA in factors

The code in the previous section won’t work if we try to replace an NA in a factor.

penguins |> 
  mutate(sex = replace_na(sex, "missing")) |> 
  distinct(sex)
## Warning in `[<-.factor`(`*tmp*`, !is_complete(data), value = "missing"): invalid
## factor level, NA generated
## # A tibble: 3 × 1
##   sex   
##   <fct> 
## 1 male  
## 2 female
## 3 <NA>

As the warning indicates, the problem is that “missing” is not one of the levels of the factor sex.

Instead we need to use the function forcats::fct_explicit_na (the forcats package is part of tidyverse for manipulating factors).

penguins |> 
  mutate(sex = fct_explicit_na(sex, na_level = "missing")) |> 
  count(sex)
## # A tibble: 3 × 2
##   sex         n
##   <fct>   <int>
## 1 female    165
## 2 male      168
## 3 missing    11

17.1.6 Replacing NA with values from another vector

Sometimes you have a vector with NA and you want to replace the missing values with values from a second vector. The coalesce can do this.

x <- c(NA, 2, 3, NA)
y <- c(-1, -2, -3, -4)
coalesce(x, y)
## [1] -1  2  3 -4

Here, the first value of x is NA, so the first value of y is used. The second value of x is not NA, and so can be used. And so on.

17.1.7 Setting mising values

Some data sets use a number to represent a missing value, for example -9999. Obviously, if you do any calculations with a data set containing -9999 as a missing value, the results could be seriously wrong. Instead we need to replace these values with NA. We can do this with dplyr::na_if.

x <- c(1, 7, -9999)
na_if(x, y = -9999)
## [1]  1  7 NA

17.1.8 Missing values and dplyr filter

dplyr::filter returns rows where the condition is strictly TRUE. This is usually what you want: if we want to filter penguins with long bills from the penguins dataset we don’t want the birds with unknown bill length.

Occasionally we do want to keep the rows with NA. Perhaps we have a column of comments, many of which are NA, and we only want to remove rows where the non-NA values meet some criterion.

penguins |> 
  filter(sex == "Female" | is.na(sex))
## # A tibble: 11 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           NA            NA                  NA          NA
##  2 Adelie  Torgersen           34.1          18.1               193        3475
##  3 Adelie  Torgersen           42            20.2               190        4250
##  4 Adelie  Torgersen           37.8          17.1               186        3300
##  5 Adelie  Torgersen           37.8          17.3               180        3700
##  6 Adelie  Dream               37.5          18.9               179        2975
##  7 Gentoo  Biscoe              44.5          14.3               216        4100
##  8 Gentoo  Biscoe              46.2          14.4               214        4650
##  9 Gentoo  Biscoe              47.3          13.8               216        4725
## 10 Gentoo  Biscoe              44.5          15.7               217        4875
## 11 Gentoo  Biscoe              NA            NA                  NA          NA
## # … with 2 more variables: sex <fct>, year <int>

Here the | means OR in Boolean logic.

17.1.9 Missing values and dplyr if_else and case_when

Sometimes is it useful to set values to NA with if_else (when there is a choice of two outcomes) or case_when (when there is a choice of two or more outcomes).

All possible values returned by if_else and case_when need to be of the same type (character, numeric, integer, logical). If the types are inconsistent, an error is produced. This is useful as forcing consistency helps avoid unexpected behaviour.

x <- 0:2
case_when(
  x == 0 ~ NA,
  x == 1 ~ "One",
  x == 2 ~ "Two"
)
## Error: must be a logical vector, not a character vector.

The problem occurs here because NA is treated as a logical vector, while the other values are characters. The solution is to use NA_character_ which is an NA with the correct type.

x <- 0:2
case_when(
  x == 0 ~ NA_character_,
  x == 1 ~ "One",
  x == 2 ~ "Two"
)
## [1] NA    "One" "Two"

Other typed NA include NA_real_ for numeric values and NA_integer for integer values.

17.2 Statistics with missing values

17.2.1 NA arithemetic

What is five plus an unknown number? The answer is, of course, unknown.

5 + NA
## [1] NA

NA are contagious in calculations: if one value is NA the result is NA. This effects many descriptive statistics.

x <- c(1, 7, NA)
sum(x)
## [1] NA
mean(x)
## [1] NA
min(x)
## [1] NA

The solution is to use the na.rm argument to these functions to exclude the NA from the calculation.

sum(x, na.rm = TRUE)
## [1] 8
mean(x, na.rm = TRUE)
## [1] 4
min(x, na.rm = TRUE)
## [1] 1

17.2.2 NA in correlations and covariances

The functions for calculating correlation, cor, and covariance, cov, work a little differently as these functions can work on two vectors or on a matrix or data frame. The use argument is used to control how NA are treated.

By default, if any values are NA in either vector, the result is also NA. If you want to find the correlation between two vectors without the NA, then use

cor(x = penguins$bill_length_mm,
    y = penguins$bill_depth_mm, 
    use = "pairwise.complete.obs")
## [1] -0.2350529

If you have a matrix (or data frame), and want to calculate a correlation matrix, then use = "complete.obs" will calculate this using just the rows that have no NA, and use = "pairwise.complete.obs" will calculate the correlation between each pair of variables using all complete pairs of observations on those variables.

penguins |> 
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |> 
  cor(use = "pairwise.complete.obs") 
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
## bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
## flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
## body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000

17.2.3 Missing values and regession models

By default regression models such as lm, glm, and lmer remove any case that has an NA in either the response or predictors. This behaviour is controlled by the na.action argument. This is great if we are interested in the model coefficients, but it can cause problems if there are NA in the data and we want to add the residuals, fitted values, or predictions into the original data frame for plotting.

mod <- lm(bill_length_mm ~ body_mass_g, data = penguins)

penguins |> mutate(fit = fitted(mod))
## Error: Problem with `mutate()` column `fit`.
## ℹ `fit = fitted(mod)`.
## ℹ `fit` must be size 344 or 1, not 342.

The problem is that there are fewer observation in the model (which omitted the NA) than the original data frame (which still has them).

We can make this work by using na.action = na.exclude which will pad the fitted values with NA so that it is the same length as the original data.

mod <- lm(bill_length_mm ~ body_mass_g, data = penguins, na.action = na.exclude)

penguins |> mutate(fit = fitted(mod))
## # A tibble: 344 × 9
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## # … with 341 more rows, and 2 more variables: year <int>, fit <dbl>

With na.exclude the NA are still excluded from the model fitting - most models do not allow NA values.

When comparing models, for example with anova, all models need to have been fit to the same dataset. This can cause problems if NA have caused different numbers of observations to be removed.

mod2 <- lm(bill_length_mm ~ body_mass_g + sex, data = penguins, na.action = na.exclude)

anova(mod, mod2)
## Error in anova.lmlist(object, ...): models were not all fitted to the same size of dataset

The easiest solution is to remove rows with NA in any predictor before fitting any of the models.

17.2.4 Imputing missing values

As shown above, observations with missing values are omitted from the model. If a predictor has many missing values, it may be better to exclude the predictor from the model to avoid losing too many observations.

An alternative is to impute the missing data. This should be done with caution as it can bias the results, especially if a substantial proportion of the data are imputed. On the other hand, if missing data are not randomly distributed, omitting observations with missing data can also bias the results.

There are several ways that can be used to impute missing values.

The simplest is to replace a the missing value with the mean or median of the variable as shown in section 17.1.4.

More complex methods use the multivariate relationship between predictors to estimate the missing values. Several R packages can help with this, e.g., mice. Yadav and Roychoudhury (2018) compare the performance of some popular methods.

17.3 Missing values and ggplot2

By default, missing values in the x or y aesthetics are dropped by ggplot with a warning, whereas missing values in the colour or fill aesthetics are shown in grey. This behaviour can be controlled with the na.value argument to the relevant scale_*_* function.

#x or y NA
p <- tibble(
  x = 1:5, 
  y = c(1, 2, NA, 4, 5),
  colour = c(1, 2, 3, 4, NA)
) |> 
  ggplot(aes(x = x, y = y, colour = colour)) +
  geom_point(size = 3)

p
## Warning: Removed 1 rows containing missing values (geom_point).
# change defaults
p + 
  scale_y_continuous(na.value = 0) +
  scale_colour_continuous(na.value = "hotpink")