Before we start

You must be familiar with tibbles and data types (such as numeric, character, etc).

In this chapter, you will learn how to display tabular data from an external text file as a tibble. We will use functions from the package readr which is part of the tidyverse package, so make sure that you have activated it with library(tidyverse).

1 Tabular data

1.1 What is tabular data?

Tabular data is data that is organized in the form of a table with rows and columns. A table often has a header, i.e. an additional row that displays variable names. Here is an example of a formatted table with header:

Table 1.1: *A simple table with header*
month	temperature (Celsius)	precipitation (mm)	Wind (m/s)
January	-3.4	29.4	1.3
February	-0.4	15.4	0.9
March	-1.7	57.6	1.7
april	3.8	3.2	0.8
May	5.1	25.2	1.4
June	10.6	52.4	1.1
July	13.1	65.0	1.0
August	12.8	67.4	0.7
September	6.9	79.0	1.2
October	2.5	18.2	0.8
November	-2.2	7.8	0.8
December	1.5	92.0	1.5

In this example, the table consists of 4 variables (columns) and 12 observations (rows) in addition to the header (top row). All cell contents are clearly delimited.

Below is an example of a non-tabular dataset. This dataset is a list of profiles with recurrent fields (Name, Position, Institution). Each profile contains three lines with a field:value pair.

## Name: Aud Halbritter
## Position: Researcher
## Institution: UiB
## -----------------------------
## Name: Jonathan Soule
## Position: Senior Engineer
## Institution: UiB
## -----------------------------
## Name: Richard J. Telford
## Position: Associate Professor
## Institution: UiB
## -----------------------------

1.2 About tidy data

Sooner or later, you will read about tidy data which is a standard, consistent way to organize tabular data in R. Briefly, tidy data follows a short series of rules:

each variable in the data set is presented in a specific column,
each observation in the data set is presented in a specific row,
each cell at the intersection of a row and a column contains a single value.

The following figure illustrates well these rules.

_In a tidy dataset, variables are in columns, observations are in rows, and values are in cells._ --- Source: [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) - [CC BY-NC-ND 3.0 US](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)

Figure 1.1: In a tidy dataset, variables are in columns, observations are in rows, and values are in cells. — Source: R for Data Science - CC BY-NC-ND 3.0 US

The dataset presented in Table 1.1 respects all three rules, and is thus a tidy dataset.

On the contrary, the following dataset is not tidy:

Indeed, at least one of the rules is broken since columns display data matching several variables (Date, time, Weather, etc).

Importing data from a file containing tidy data is a great way to start your work, but it is not a prerequisite to data import. As long as your data is tabular, you will be able to import it in R, and later you will have the chance to tidy it (see chapter Data handling and transforming).

1.3 File formats

Tabular data may be stored in files using various formats, spreadsheets, etc. The most common spreadsheets store data in their own, proprietary file format, e.g. MS Excel which produces .xls and .xlsx files. Such formats may be a limitation to data management in R. Simpler formats such as plain text files with .txt or .csv should always be preferred when saving or exporting data from spreadsheets.

1.3.1 The CSV format

One of the most common format for storing tabular data in plain text files is CSV, which stands for Comma-Separated Values. It is this format that we are going to deal with here. Note that CSV-formatted files often have (but not always) a .csv extension. They may also have a regular .txt extension.

In a CSV-formatted file, the data is stored in a similar manner to this:

Figure 1.2: Contents of a CSV file viewed in the text editor Notepad

For information, this file matches the example in Table 1.1. Each line corresponds to a row in the table (including header) and cell contents are separated with a comma ,. Note that the period symbol . is used as decimal separator.

The use of commas in the CSV format is however not universal. Other symbols such as a semi-colon ; may be used as a delimiter. This is the case in several European countries where commas are decimal separator. Here is the same data set as above, this time in the European format:

Figure 1.3: Contents of another CSV file

Finally, you should know that there exist other delimiter-separated formats, in which delimiters like | or tabs replace commas.

1.4 Know your data!

There are several reasons why different symbols are used in CSV files. Among these reasons are:

locale, i.e. the parameters that define the language and regional settings (currency, time and date, number format setting) in use on your machine,
software-based preferences, the settings which are defined by default by the software that you have used to punch your data,
user-based preferences, the settings that you choose when punching or saving data.

It is thus very important to know what your CSV file is made of. We therefore recommend to systematically inspect your data file before importing in R. One way to do it is to open the file with cat(readLines('file.txt'), sep = '\n') and define:

which symbol is used as decimal separator (, or .),
which symbol is used as delimiter (, or ;).

Here is our previous example:

cat(readLines('data/weather.csv'), sep = '\n')

## month,temperature (Celsius),precipitation (mm),Wind (m/s)
## January,-3.4,29.4,1.3
## February,-0.4,15.4,0.9
## March,-1.7,57.6,1.7
## april,3.8,3.2,0.8
## May,5.1,25.2,1.4
## June,10.6,52.4,1.1
## July,13.1,65.0,1.0
## August,12.8,67.4,0.7
## September,6.9,79.0,1.2
## October,2.5,18.2,0.8
## November,-2.2,7.8,0.8
## December,1.5,92.0,1.5

In this file, the decimal separator is . and the delimiter is ,.

2 Import with readr

readr comes with several functions that read data files. Among them are:

read_delim(),
read_csv(),
read_csv2().

We will focus essentially on read_delim() as it is a general function that can import all the formats. read_csv() and read_csv2() are more specific functions which are preset to import CSV-formatted files in which a comma , or a semi-colon ; is defined as delimiter, respectively.

2.1 read_delim()

read_delim() is a simple function that does several things at once. First, it finds and reads the file to import. Then it looks at the first 1000 rows of the file and guesses the number of variables and rows, and the data type for each variable. Finally, it displays the data as a tibble keeping track of the order of both the columns and rows, and the nature of the data.

NB: the function does not automatically store the data after reading the file. This is why the tibble produced by read_delim() must be assigned to an object using <-.

read_delim() has two main mandatory arguments:

file = " " which defines where to find the data file,
delim = " " which defines the symbol used as delimiter.

When searching for the file, R/RStudio uses the location of the current RStudio project as a root. As a consequence, you may simply write the file name if the file is in the same folder as your .Rproj file. Alternatively, write [subfolder]/[filename] if the file is in a subfolder. For more info about projects, read section 5 of the chapter Getting started with R.

In the following example, we import the file weather.csv located in the subfolder data of the current RStudio project into the object weather. In weather.csv, the delimiter is , and the decimal separator is .

weather <- read_delim(file = "data/weather.csv", 
                      delim = ",")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   month = col_character(),
##   `temperature (Celsius)` = col_double(),
##   `precipitation (mm)` = col_double(),
##   `Wind (m/s)` = col_double()
## )

When the function is executed, the console shows a short series of warnings and messages. In the frame above, the message Column specification tells you that the content of the column month has been recognized as of data type character, and that the three other columns have been recognized as double (which is similar to numeric).

Now the tibble weather is available in R as you may see in the tab Environment (see figure 2.1).

Figure 2.1: The tibble weather is now stored in R.

You can display the table as follows:

weather

## # A tibble: 12 x 4
##    month     `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
##    <chr>                       <dbl>                <dbl>        <dbl>
##  1 January                      -3.4                 29.4          1.3
##  2 February                     -0.4                 15.4          0.9
##  3 March                        -1.7                 57.6          1.7
##  4 april                         3.8                  3.2          0.8
##  5 May                           5.1                 25.2          1.4
##  6 June                         10.6                 52.4          1.1
##  7 July                         13.1                 65            1  
##  8 August                       12.8                 67.4          0.7
##  9 September                     6.9                 79            1.2
## 10 October                       2.5                 18.2          0.8
## 11 November                     -2.2                  7.8          0.8
## 12 December                      1.5                 92            1.5

2.1.1 Choose the right delimiter

The example above is rather simple and straight forward, but you may experience some issues depending on the content of the CSV file. Relatively often, one tries to import CSV files wrongly assuming that , has been used as a delimiter. Here is what happens when doing this assumption for a file (weather2.csv) where ; was in fact used as delimiter:

weather2 <- read_delim(file = "data/weather2.csv", 
                       delim = ",")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   `month;temperature (Celsius);precipitation (mm);Wind (m/s)` = col_character()
## )

The message Column specification tells you that the content of the whole CSV file has been transferred into one single column named month;temperature (Celsius);precipitation (mm);Wind (m/s) and recognized as of data type character. Accordingly, the object weather2, now visible in the tab Environment (see figure 2.2), is described as a tibble with 12 observations and only one variable.

Figure 2.2: The tibble weather2 is now stored, but is wrongly formatted.

The content of the resulting tibble is as follows:

weather2

## # A tibble: 12 x 1
##    `month;temperature (Celsius);precipitation (mm);Wind (m/s)`
##    <chr>                                                      
##  1 January;-3                                                 
##  2 February;-0                                                
##  3 March;-1                                                   
##  4 april;3                                                    
##  5 May;5                                                      
##  6 June;10                                                    
##  7 July;13                                                    
##  8 August;12                                                  
##  9 September;6                                                
## 10 October;2                                                  
## 11 November;-2                                                
## 12 December;1

This shows how important it is to review the data before importing it. A quick check with cat(readLines('data/weather2.csv'), sep = '\n') would have revealed the use of ; as delimiter in weather2.csv, and you would have certainly used the follow code to import it.

weather2 <- read_delim(file = "data/weather2.csv", 
                       delim = ";")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   month = col_character(),
##   `temperature (Celsius)` = col_number(),
##   `precipitation (mm)` = col_number(),
##   `Wind (m/s)` = col_character()
## )

Here, the message Column specification tells you that the content of the columns month and Wind (m/s) has been recognized as of data type character, and that the two other columns have been recognized as number. While read_delim() got things right about the number of variables, something went wrong with the variables as we could expect Wind (m/s) to be recognized as double. To find out about this issue we need to review the imported data in the object weather2 and compare it to the original file weather2.csv.

2.1.2 Review the imported data

The example above shows the importance of carefully verifying that the imported data in R matches the original data set. The following figure compares the content of the file weather2.csv (Fig. 2.3 left) to the content of the object weather2 (Fig. 2.3 right):

Figure 2.3: The data in the object does not match the data in the file.

Looking at the rows in both pictures, one can understand that all commas have simply been ignored, excepted those in the last column. This is a problem that may occur when the decimal separator used in the file is different from the decimal separator defined by your system (a.k.a locale). To solve that issue, you must impose the separator using locale = locale(decimal_mark = ","):

weather3 <- read_delim(file = "data/weather2.csv", 
                       delim = ";", 
                       locale = locale(decimal_mark = ","))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   month = col_character(),
##   `temperature (Celsius)` = col_double(),
##   `precipitation (mm)` = col_double(),
##   `Wind (m/s)` = col_double()
## )

The last column is now recognized as double and the data in the object weather3 matches the data in the file:

Figure 2.4: The data in the object now matches the data in the file.

2.2 read_csv()

As we have seen in the section above, read_delim() can read CSV files where , and ; are used as delimiters. readr also comes with other functions such as read_csv() and its twin read_csv2(). These are “clones” of read_delim() where the argument delim = is preset to delim = "," and delim = ";", respectively. Thus, the only argument that is strictly required is file = " " to determine the path of the file.

Here are two examples. The first one shows the use of read_csv():

weather_csv <- read_csv(file = "data/weather.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   month = col_character(),
##   `temperature (Celsius)` = col_double(),
##   `precipitation (mm)` = col_double(),
##   `Wind (m/s)` = col_double()
## )

The resulting tibble looks correct:

weather_csv

## # A tibble: 12 x 4
##    month     `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
##    <chr>                       <dbl>                <dbl>        <dbl>
##  1 January                      -3.4                 29.4          1.3
##  2 February                     -0.4                 15.4          0.9
##  3 March                        -1.7                 57.6          1.7
##  4 april                         3.8                  3.2          0.8
##  5 May                           5.1                 25.2          1.4
##  6 June                         10.6                 52.4          1.1
##  7 July                         13.1                 65            1  
##  8 August                       12.8                 67.4          0.7
##  9 September                     6.9                 79            1.2
## 10 October                       2.5                 18.2          0.8
## 11 November                     -2.2                  7.8          0.8
## 12 December                      1.5                 92            1.5

The second one shows the use of read_csv2():

weather_csv2 <- read_csv2(file = "data/weather2.csv")

## i Using ',' as decimal and '.' as grouping mark. Use `read_delim()` for more control.

## 
## -- Column specification --------------------------------------------------------
## cols(
##   month = col_character(),
##   `temperature (Celsius)` = col_double(),
##   `precipitation (mm)` = col_double(),
##   `Wind (m/s)` = col_double()
## )

Note the message above Column specification that indicates which symbols read_csv2() has considered when importing the data. The resulting tibble looks correct:

weather_csv2

## # A tibble: 12 x 4
##    month     `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
##    <chr>                       <dbl>                <dbl>        <dbl>
##  1 January                      -3.4                 29.4          1.3
##  2 February                     -0.4                 15.4          0.9
##  3 March                        -1.7                 57.6          1.7
##  4 april                         3.8                  3.2          0.8
##  5 May                           5.1                 25.2          1.4
##  6 June                         10.6                 52.4          1.1
##  7 July                         13.1                 65            1  
##  8 August                       12.8                 67.4          0.7
##  9 September                     6.9                 79            1.2
## 10 October                       2.5                 18.2          0.8
## 11 November                     -2.2                  7.8          0.8
## 12 December                      1.5                 92            1.5

2.3 Useful arguments

Both read_delim(), read_csv() and read_csv2() have a multitude of arguments that allow for adjusting the way data is read and displayed. Here we review a handful of them.

2.3.1 n_max

n_max sets a limit to the number of observations to be read. Note that n_max does not consider the first row (header) of the data file as an observation.

weather <- read_delim(file = "data/weather.csv", 
                      delim = ",",
                      n_max = 6)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   month = col_character(),
##   `temperature (Celsius)` = col_double(),
##   `precipitation (mm)` = col_double(),
##   `Wind (m/s)` = col_double()
## )

weather

## # A tibble: 6 x 4
##   month    `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
##   <chr>                      <dbl>                <dbl>        <dbl>
## 1 January                     -3.4                 29.4          1.3
## 2 February                    -0.4                 15.4          0.9
## 3 March                       -1.7                 57.6          1.7
## 4 april                        3.8                  3.2          0.8
## 5 May                          5.1                 25.2          1.4
## 6 June                        10.6                 52.4          1.1

2.3.2 skip

The argument skip may be used to skip rows when reading the data file.
Be aware that:

the header in the original data file counts as one row,
the first row that comes after those which have been skipped becomes the header for the resulting tibble.

In the following example, we read the data file weather.csv and skip the first row:

weather <- read_delim(file = "data/weather.csv", 
                      delim = ",", 
                      skip = 1)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   January = col_character(),
##   `-3.4` = col_double(),
##   `29.4` = col_double(),
##   `1.3` = col_double()
## )

weather

## # A tibble: 11 x 4
##    January   `-3.4` `29.4` `1.3`
##    <chr>      <dbl>  <dbl> <dbl>
##  1 February    -0.4   15.4   0.9
##  2 March       -1.7   57.6   1.7
##  3 april        3.8    3.2   0.8
##  4 May          5.1   25.2   1.4
##  5 June        10.6   52.4   1.1
##  6 July        13.1   65     1  
##  7 August      12.8   67.4   0.7
##  8 September    6.9   79     1.2
##  9 October      2.5   18.2   0.8
## 10 November    -2.2    7.8   0.8
## 11 December     1.5   92     1.5

As expected, the first row (with the header) is skipped, and the data from the observation January have become the header of the tibble. The argument col_names() introduced below will be useful for solving this issue.

2.3.3 col_names

The argument col_names may be used to define the name of the variables/columns of the tibble. If you use col_names = TRUE, the first row of the data file defines the name of the variables.

weather <- read_delim(file = "data/weather.csv", 
                      delim = ",", 
                      col_names = TRUE)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   month = col_character(),
##   `temperature (Celsius)` = col_double(),
##   `precipitation (mm)` = col_double(),
##   `Wind (m/s)` = col_double()
## )

weather

## # A tibble: 12 x 4
##    month     `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
##    <chr>                       <dbl>                <dbl>        <dbl>
##  1 January                      -3.4                 29.4          1.3
##  2 February                     -0.4                 15.4          0.9
##  3 March                        -1.7                 57.6          1.7
##  4 april                         3.8                  3.2          0.8
##  5 May                           5.1                 25.2          1.4
##  6 June                         10.6                 52.4          1.1
##  7 July                         13.1                 65            1  
##  8 August                       12.8                 67.4          0.7
##  9 September                     6.9                 79            1.2
## 10 October                       2.5                 18.2          0.8
## 11 November                     -2.2                  7.8          0.8
## 12 December                      1.5                 92            1.5

If you use col_names = FALSE, the first row of the data file will be the first observation in the resulting tibble, and the variables will be named X1, X2, etc.

weather <- read_delim(file = "data/weather.csv", 
                      delim = ",", 
                      col_names = FALSE)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )

weather

## # A tibble: 13 x 4
##    X1        X2                    X3                 X4        
##    <chr>     <chr>                 <chr>              <chr>     
##  1 month     temperature (Celsius) precipitation (mm) Wind (m/s)
##  2 January   -3.4                  29.4               1.3       
##  3 February  -0.4                  15.4               0.9       
##  4 March     -1.7                  57.6               1.7       
##  5 april     3.8                   3.2                0.8       
##  6 May       5.1                   25.2               1.4       
##  7 June      10.6                  52.4               1.1       
##  8 July      13.1                  65.0               1.0       
##  9 August    12.8                  67.4               0.7       
## 10 September 6.9                   79.0               1.2       
## 11 October   2.5                   18.2               0.8       
## 12 November  -2.2                  7.8                0.8       
## 13 December  1.5                   92.0               1.5

col_names = may also be followed by the actual variable names that you want to use. In that case, write col_names = c() and list the names between the parentheses. Be aware that:

the names that you input will be placed on top of the tibble,
the header in the original data file will become the first row of the resulting tibble. Here is an example:

weather <- read_delim(file = "data/weather.csv", 
                      delim = ",", 
                      col_names = c("Month", "Temp", "Precip", "Wind"))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Month = col_character(),
##   Temp = col_character(),
##   Precip = col_character(),
##   Wind = col_character()
## )

weather

## # A tibble: 13 x 4
##    Month     Temp                  Precip             Wind      
##    <chr>     <chr>                 <chr>              <chr>     
##  1 month     temperature (Celsius) precipitation (mm) Wind (m/s)
##  2 January   -3.4                  29.4               1.3       
##  3 February  -0.4                  15.4               0.9       
##  4 March     -1.7                  57.6               1.7       
##  5 april     3.8                   3.2                0.8       
##  6 May       5.1                   25.2               1.4       
##  7 June      10.6                  52.4               1.1       
##  8 July      13.1                  65.0               1.0       
##  9 August    12.8                  67.4               0.7       
## 10 September 6.9                   79.0               1.2       
## 11 October   2.5                   18.2               0.8       
## 12 November  -2.2                  7.8                0.8       
## 13 December  1.5                   92.0               1.5

2.3.4 col_types

The argument col_types may be used to modify the data type of the variables. This is useful for example when you need to set a variable as factor whereas R has defined it as character. Here is an example with three simple variables.

counts <- read_delim(file = "data/groups.csv", 
                     delim = ",")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ID = col_double(),
##   group = col_character(),
##   count = col_double()
## )

As the message clearly indicates, the first and last variables are recognized as double (i.e. numeric) while the second one is recognized as character. The tibble in the object counts displays as follows:

counts

## # A tibble: 9 x 3
##      ID group count
##   <dbl> <chr> <dbl>
## 1     1 A         5
## 2     2 A         3
## 3     3 A         2
## 4     4 B         4
## 5     5 B         5
## 6     6 B         8
## 7     7 C         1
## 8     8 C         2
## 9     9 C         9

Let’s use col_types = cols() to manually set the data types to double, factor and integer, respectively.

counts <- read_delim(file = "data/groups.csv", 
                     delim = ",", 
                     col_types = cols(col_double(), col_factor(), col_integer()))

Now the tibble displays like this, with the correct data types for each column:

counts

## # A tibble: 9 x 3
##      ID group count
##   <dbl> <fct> <int>
## 1     1 A         5
## 2     2 A         3
## 3     3 A         2
## 4     4 B         4
## 5     5 B         5
## 6     6 B         8
## 7     7 C         1
## 8     8 C         2
## 9     9 C         9

If you only need to modify one or a few variables, you must name it/them when writing the code:

counts <- read_delim(file = "data/groups.csv", 
                     delim = ",", 
                     col_types = cols(group = col_factor()))

The tibble displays like this:

counts

## # A tibble: 9 x 3
##      ID group count
##   <dbl> <fct> <dbl>
## 1     1 A         5
## 2     2 A         3
## 3     3 A         2
## 4     4 B         4
## 5     5 B         5
## 6     6 B         8
## 7     7 C         1
## 8     8 C         2
## 9     9 C         9

Note that col_types is particularly useful when importing tabular data that includes formatted dates. Dates are usually recognized as character when their format does not match the expected format set as default (locale, etc). In the following example, dates entered as yyyy-mm-dd in the last column are recognized as of type date:

countsdates <- read_delim(file = "data/groups-dates1.csv", 
                     delim = ",")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ID = col_double(),
##   group = col_character(),
##   count = col_double(),
##   date = col_date(format = "")
## )

However, in the example below, dates entered as dd-mm-yyyy are converted to character:

countsdates <- read_delim(file = "data/groups-dates2.csv", 
                     delim = ",")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ID = col_double(),
##   group = col_character(),
##   count = col_double(),
##   date = col_character()
## )

To solve this issue, add the function col_date() in cols() to help R understand how dates are formatted and shall be read.

countsdates <- read_delim(file = "data/groups-dates2.csv", 
                     delim = ",",
                     col_types = cols(date = col_date("%d-%m-%Y")))

The result is the following tibble now correctly formatted:

countsdates

## # A tibble: 9 x 4
##      ID group count date      
##   <dbl> <chr> <dbl> <date>    
## 1     1 A         5 2021-03-01
## 2     2 A         3 2021-03-02
## 3     3 A         2 2021-03-03
## 4     4 B         4 2021-03-01
## 5     5 B         5 2021-03-02
## 6     6 B         8 2021-03-03
## 7     7 C         1 2021-03-01
## 8     8 C         2 2021-03-02
## 9     9 C         9 2021-03-03

You will find the whole list of cols() parameters for col_types here.

Contributors

Jonathan Soulé
Aud Halbritter
Richard Telford

Importing data in R