7 Working with single tables in dplyr
Data frames are usually the most convenient objects for storing, plotting or analysing data in R. We also need to be able to manipulate data in data frames.
This tutorial will show you how to manipulate data frames using the dplyr
package, part of tidyverse
.
Everything that can be done with dplyr
can be done with base R.
The dplyr
solution is usually easier to write and understand, and can be faster.
There are many function in dplyr
, this tutorial focuses on the functions I find most useful.
7.0.1 Load the package
You can load dplyr
with
But is is usually more convenient to load tidyverse
so that you get ggplot2
, dplyr
, readr
and other useful packages with one command.
7.0.2 More feedback tidylog
We can get more feedback on what dplyr
functions have done to the data by loading the tidylog
package.
More than once, tidylog
has helped identified bugs in my code.
7.1 The penguins
dataset
This tutorial will use the penguins
dataset from the palmerpenguins
package.
This dataset includes measurements of three species of penguin.
# Load the data
data("penguins", package = "palmerpenguins")
# Show the data
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 341 more rows, and 1 more variable: year <int>
7.2 Selecting columns with select()
You can choose which columns of the data frame you want with select()
.
The first argument is the data, which is supplied by the pipe |>
, the next arguments are the names of the columns you want.
The names do not need quote marks.
#select species, bill_length_mm & bill_depth_mm
penguins |>
select(species, bill_length_mm, bill_depth_mm)
## select: dropped 5 variables (island, flipper_length_mm, body_mass_g, sex, year)
## # A tibble: 344 × 3
## species bill_length_mm bill_depth_mm
## <fct> <dbl> <dbl>
## 1 Adelie 39.1 18.7
## 2 Adelie 39.5 17.4
## 3 Adelie 40.3 18
## # … with 341 more rows
This is equivalent to the base R code
#select species, bill_length_mm & bill_depth_mm
penguins[, c("species", "bill_length_mm", "bill_depth_mm")]
## # A tibble: 344 × 3
## species bill_length_mm bill_depth_mm
## <fct> <dbl> <dbl>
## 1 Adelie 39.1 18.7
## 2 Adelie 39.5 17.4
## 3 Adelie 40.3 18
## # … with 341 more rows
Remember that if you want to use the output of this code in a further analysis, you need to assign it to an object name with <-
.
Exercise
From the penguins
data, select
species
-
species
andbill_length_mm
- all columns except year
Hint
|>
penguins select(___, ___)
7.2.1 select()
helpers
Sometimes we don’t want to write out the names of all the columns we want to select. We might not even know them all in advance. Fortunately there are some helper functions.
If you want to select()
adjacent columns, you can use the notation first:last
.
#select species to bill_depth_mm
penguins |> select(species:bill_depth_mm)
## select: dropped 4 variables (flipper_length_mm, body_mass_g, sex, year)
## # A tibble: 344 × 4
## species island bill_length_mm bill_depth_mm
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7
## 2 Adelie Torgersen 39.5 17.4
## 3 Adelie Torgersen 40.3 18
## # … with 341 more rows
Sometimes it is easier to remove the columns you don’t want.
You can do this by putting a -
in front of the column name.
#select everything but year and sex
penguins |> select(-year, -sex)
## select: dropped 2 variables (sex, year)
## # A tibble: 344 × 6
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## # … with 341 more rows
If there is a pattern to the column names that we want to select (or remove), there are some helper functions.
For example, to select columns that start with “bill”, we can use starts_with()
.
#select bill_length_mm & bill_depth_mm"
penguins |> select(starts_with("bill"))
## select: dropped 6 variables (species, island, flipper_length_mm, body_mass_g, sex, …)
## # A tibble: 344 × 2
## bill_length_mm bill_depth_mm
## <dbl> <dbl>
## 1 39.1 18.7
## 2 39.5 17.4
## 3 40.3 18
## # … with 341 more rows
Conversely, if we want to select all columns that end with “mm”, we can use ends_with()
.
contains()
is more flexible and matches()
is the most powerful of the helper functions, using regular expressions to identify the columns (see the regular expression tutorial).
Sometimes, you might want to select all the columns of a certain type.
For example, to select all the numeric columns we can use the is.numeric
function inside select()
with the helper where()
.
penguins |> select(where(is.numeric)) # No brackets on the function
## select: dropped 3 variables (species, island, sex)
## # A tibble: 344 × 5
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <dbl> <dbl> <int> <int> <int>
## 1 39.1 18.7 181 3750 2007
## 2 39.5 17.4 186 3800 2007
## 3 40.3 18 195 3250 2007
## # … with 341 more rows
Other is.*
functions exist, for example, is.character
for text.
You can also select columns by number (1 being the first column), but this is generally a bad idea because it makes the code difficult to understand and if a new column is added, or the column order is changed, the code will break.
Which of these strategies works best is context dependent.
7.3 Renaming columns with rename
You can use rename()
to rename columns
penguins |> rename(Species = species)
## rename: renamed one variable (Species)
## # A tibble: 344 × 8
## Species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 341 more rows, and 1 more variable: year <int>
The syntax is new_name = current_name
.
You can also rename a column when selecting.
This is convenient if you are using select()
anyway.
penguins |> select(Species = species)
## select: renamed one variable (Species) and dropped 7 variables
## # A tibble: 344 × 1
## Species
## <fct>
## 1 Adelie
## 2 Adelie
## 3 Adelie
## # … with 341 more rows
7.4 Moving columns with relocate()
Sometimes it is useful to reorder the columns. This is never necessary for data analysis or plotting, but can be needed when making a table for presentation.
penguins |> relocate(island)
## relocate: columns reordered (island, species, bill_length_mm, bill_depth_mm, flipper_length_mm, …)
## # A tibble: 344 × 8
## island species bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Torge… Adelie 39.1 18.7 181 3750 male
## 2 Torge… Adelie 39.5 17.4 186 3800 fema…
## 3 Torge… Adelie 40.3 18 195 3250 fema…
## # … with 341 more rows, and 1 more variable: year <int>
The default is to move the named column first, the .before
and .after
arguments let you move the column into any position.
7.5 Filtering rows with filter()
Filtering rows that meet some condition is a very common task.
For example, to filter rows of penguins
that have a bill length greater than 40 mm, we can use
penguins |> filter(bill_length_mm > 40)
## filter: removed 102 rows (30%), 242 rows remaining
## # A tibble: 242 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 40.3 18 195 3250 fema…
## 2 Adelie Torge… 42 20.2 190 4250 <NA>
## 3 Adelie Torge… 41.1 17.6 182 3200 fema…
## # … with 239 more rows, and 1 more variable: year <int>
This will filter out each row where the condition is TRUE
.
The base R equivalent of this is
penguins[penguins$bill_length_mm > 40, ]
## # A tibble: 244 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 40.3 18 195 3250 fema…
## 2 <NA> <NA> NA NA NA NA <NA>
## 3 Adelie Torge… 42 20.2 190 4250 <NA>
## # … with 241 more rows, and 1 more variable: year <int>
Generally, filter
makes it easier to understand, especially with more complex criteria.
Other tests include
-
==
exactly equals. Often a bad idea to use with numeric data -
near
safe function for testing equality of numeric data as it has a tolerance for rounding errors.
sqrt(2) ^ 2 == 2 # should be true, but rounding errors
## [1] FALSE
sqrt(2) ^ 2 - 2 # the difference
## [1] 4.440892e-16
## [1] TRUE
-
!=
not equal to -
<
less than -
<=
less than or equal to -
>
greater than -
>=
greater than or equal to -
is.na()
for filtering by missing values. -
between()
for filtering values with a range -
%in%
is used when you want to test if a value is in a vector
## filter: removed 124 rows (36%), 220 rows remaining
## # A tibble: 220 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 217 more rows, and 1 more variable: year <int>
#equivalent to
penguins |>
filter(species == "Adelie" | species == "Chinstrap") # with many alternatives, this gets long
## filter: removed 124 rows (36%), 220 rows remaining
## # A tibble: 220 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 217 more rows, and 1 more variable: year <int>
7.5.1 Filtering on multiple criteria
If we want to filter on multiple criteria, we need to decide whether we want all criteria to be TRUE
(AND in Boolean logic), or for one or more to be TRUE
(OR in Boolean logic).
If we want all criteria to be TRUE
, we can separate them by a comma (or by an &
if you want to be explicit).
penguins |>
filter(bill_length_mm > 40, bill_depth_mm > 18)
## filter: removed 263 rows (76%), 81 rows remaining
## # A tibble: 81 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 42 20.2 190 4250 <NA>
## 2 Adelie Torge… 42.5 20.7 197 4500 male
## 3 Adelie Torge… 46 21.5 194 4200 male
## # … with 78 more rows, and 1 more variable: year <int>
If we want rows where any of the criteria is TRUE
, we can separate them by a |
.
penguins |>
filter(bill_length_mm > 40 | bill_depth_mm > 18)
## filter: removed 53 rows (15%), 291 rows remaining
## # A tibble: 291 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 40.3 18 195 3250 fema…
## 3 Adelie Torge… 36.7 19.3 193 3450 fema…
## # … with 288 more rows, and 1 more variable: year <int>
We can negate a criterion by putting !
in front of it.
So to filter rows that do not have bills longer than 40 mm we can use
penguins |> filter(!bill_length_mm > 40)
## filter: removed 244 rows (71%), 100 rows remaining
## # A tibble: 100 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 36.7 19.3 193 3450 fema…
## # … with 97 more rows, and 1 more variable: year <int>
Of course, in this example, we could also use <=
as the test.
7.5.2 Common errors
The commonest error is to use a single =
rather than ==
.
Only the latter is a test of equality.
If you do this, the error message is quite helpful.
penguins |> filter(species = "Chinstrap")
## Error: Problem with `filter()` input `..1`.
## x Input `..1` is named.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `species == "Chinstrap"`?
Another common error is to forget to quote any strings.
penguins |> filter(species == Chinstrap)
## Error: Problem with `filter()` input `..1`.
## ℹ Input `..1` is `species == Chinstrap`.
## x object 'Chinstrap' not found
Exercise
From the penguins
data, filter
- Gentoo penguins
- Gentoo or Adelie penguins
- penguins with a mass greater than or equal to 5000g
- penguins with a bill length between 45 and 50 mm
- Gentoo penguins not from from Dream Island
Hint
#hint 1
|> filter(___ = ___)
penguins
#hint 2
`%in%`
?
#hint 3
|> filter(___ >= ___)
penguins
#hint 4
?between
#hint 5
`!` ?
7.6 Slicing the data with slice()
Sometimes it is useful to extract rows by row number.
penguins |> slice(3:7)
## slice: removed 339 rows (99%), 5 rows remaining
## # A tibble: 5 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 40.3 18 195 3250 fema…
## 2 Adelie Torge… NA NA NA NA <NA>
## 3 Adelie Torge… 36.7 19.3 193 3450 fema…
## 4 Adelie Torge… 39.3 20.6 190 3650 male
## 5 Adelie Torge… 38.9 17.8 181 3625 fema…
## # … with 1 more variable: year <int>
You can use negative numbers to remove rows.
Be careful using slice()
as if the data change, different rows may be returned.
7.7 Distinct rows with distinct()
If there are duplicates in the data, we can remove these with distinct()
.
distinct()
with no extra arguments will remove duplicate rows.
penguins |> distinct()
## distinct: no rows removed
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 341 more rows, and 1 more variable: year <int>
If we are only interested in some of the columns, we can supply the names of these columns.
penguins |> distinct(island)
## distinct: removed 341 rows (99%), 3 rows remaining
## # A tibble: 3 × 1
## island
## <fct>
## 1 Torgersen
## 2 Biscoe
## 3 Dream
Other columns will be removed unless the argument .keep_all = TRUE
is used.
7.8 Random rows
Sometimes you want to sample rows at random from a data.frame.
This can be done with slice_sample()
.
This can either sample a constant n rows or constant fraction of the rows depending on whether the n
or prop
argument is used.
penguins |> slice_sample(n = 10)
## slice_sample: removed 334 rows (97%), 10 rows remaining
## # A tibble: 10 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 38.6 21.2 191 3800
## 2 Gentoo Biscoe 48.7 15.7 208 5350
## 3 Gentoo Biscoe 47.8 15 215 5650
## 4 Adelie Biscoe 35.5 16.2 195 3350
## 5 Adelie Biscoe 38.1 17 181 3175
## 6 Chinstrap Dream 49 19.6 212 4300
## 7 Adelie Dream 40.9 18.9 184 3900
## 8 Adelie Biscoe 37.7 18.7 180 3600
## 9 Adelie Dream 36 17.9 190 3450
## 10 Adelie Torgersen 45.8 18.9 197 4150
## # … with 2 more variables: sex <fct>, year <int>
7.9 Mutating and adding columns with mutate
The function mutate()
can add an new column or replace an existing one.
To make a new column called body_mass_kg
we can use
penguins |>
mutate(body_mass_kg = body_mass_g / 1000)
## mutate: new variable 'body_mass_kg' (double) with 95 unique values and 1% NA
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## # … with 341 more rows, and 2 more variables: year <int>, body_mass_kg <dbl>
There are lots of functions that are useful to use with mutate. Any function that returns either a single value or as many values as are in the data can be used.
mutate()
is very useful when cleaning data.
- See text manipulation tutorial for cleaning text with the
stringr
package. - See date and time tutorial for cleaning dates and times with the
lubridate
package.
7.10 Summarising data with summarise()
summarise()
lets us summarise data.
We can use it if we want to calculate a summary statistic of the data.
Remember to separate arguments with a comma.
penguins |> summarise(
flipper_len_mean = mean(flipper_length_mm, na.rm = TRUE),
flipper_len_sd = sd(flipper_length_mm, na.rm = TRUE)
)
## summarise: now one row and 2 columns, ungrouped
## # A tibble: 1 × 2
## flipper_len_mean flipper_len_sd
## <dbl> <dbl>
## 1 201. 14.1
Only the columns created in the summarise()
and any grouping columns (see below) will be kept.
Exercise
With the penguins
data, find
- the maximum and minimum bill length
Hint
|>
penguins summarise(___ = ___(___))
7.10.1 Summarising multiple columns
Sometimes you want to summarise multiple columns at the same time.
This can be done with the across()
helper function.
across()
needs to be told which columns to process and what function or functions to use.
## summarise: now one row and 2 columns, ungrouped
## # A tibble: 1 × 2
## bill_length_mm bill_depth_mm
## <dbl> <dbl>
## 1 43.9 17.2
#using a list of functions
penguins |>
summarise(across(.cols = starts_with("bill"), .fns = list(sd = sd, mean = mean), na.rm = TRUE))
## summarise: now one row and 4 columns, ungrouped
## # A tibble: 1 × 4
## bill_length_mm_sd bill_length_mm_mean bill_depth_mm_sd bill_depth_mm_mean
## <dbl> <dbl> <dbl> <dbl>
## 1 5.46 43.9 1.97 17.2
You can also use across()
with mutate()
to mutate several columns at the same time.
7.11 Grouping data with group_by
group_by()
changes the way that many of the dplyr
functions work.
Instead of working on the entire dataset, they now work on each group in the data
To find the mean flipper length for each species, we need to group_by()
species and then summarise()
.
## group_by: one grouping variable (species)
## summarise: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
## species mean_flipper_length
## <fct> <dbl>
## 1 Adelie NA
## 2 Chinstrap 196.
## 3 Gentoo NA
Grouped data can be ungrouped with ungroup()
.
This can help prevent surprises!
7.12 Sorting with arrange()
To sort the data frame by one or more of the variables we can use arrange()
.
penguins |> arrange(bill_length_mm, bill_depth_mm)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Dream 32.1 15.5 188 3050 fema…
## 2 Adelie Dream 33.1 16.1 178 2900 fema…
## 3 Adelie Torge… 33.5 19 190 3600 fema…
## # … with 341 more rows, and 1 more variable: year <int>
This will sort smallest first.
To reverse the sort order, use desc()
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Gentoo Biscoe 59.6 17 230 6050 male
## 2 Chinst… Dream 58 17.8 181 3700 fema…
## 3 Gentoo Biscoe 55.9 17 228 5600 male
## # … with 341 more rows, and 1 more variable: year <int>
7.13 Counting rows with count()
and n()
The function n
can count how many rows there are in the each group (or the entire data frame if it is not grouped).
It can be used with either mutate()
or summarise()
.
## group_by: one grouping variable (species)
## summarise: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
Or with count()
penguins |>
count(species)
## count: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
count()
is more concise, but group_by()
and summarise()
is useful when you need to calculate more variables (for example mean and standard deviation).
7.14 Common problems
7.14.1 Non standard names
Ideally column names should follow the standard rules for naming objects in R - UPPER and lower case letters, numbers, “.” and "_" with the first character being a letter (or a dot if you want an invisible object). Sometimes when you import data, it has non-standard names with spaces or extra characters. If you need to refer to a column name that doesn’t follow the rules, you need to enclose it with back-ticks.
df <- tibble(`Region/Country` = "Norway", value = 42)
df
## # A tibble: 1 × 2
## `Region/Country` value
## <chr> <dbl>
## 1 Norway 42
df |> rename(region_country = `Region/Country`)
## rename: renamed one variable (region_country)
## # A tibble: 1 × 2
## region_country value
## <chr> <dbl>
## 1 Norway 42
It is sometimes best to rename these columns to make them easier to refer to.
janitor::clean_names()
is very efficient for making easy-to-use names.
7.15 Resources
- stat545
- R for data science (Online copy of book)
dplyr
cheatsheet