You must be familiar with tibbles and data types (such as numeric, character, etc).
In this chapter, you will learn how to display tabular data from an external text file as a tibble.
We will use functions from the package readr
which is part of the tidyverse
package, so make sure that you have activated it with library(tidyverse)
.
Tabular data is data that is organized in the form of a table with rows and columns. A table often has a header, i.e. an additional row that displays variable names. Here is an example of a formatted table with header:
month | temperature (Celsius) | precipitation (mm) | Wind (m/s) |
---|---|---|---|
January | -3.4 | 29.4 | 1.3 |
February | -0.4 | 15.4 | 0.9 |
March | -1.7 | 57.6 | 1.7 |
april | 3.8 | 3.2 | 0.8 |
May | 5.1 | 25.2 | 1.4 |
June | 10.6 | 52.4 | 1.1 |
July | 13.1 | 65.0 | 1.0 |
August | 12.8 | 67.4 | 0.7 |
September | 6.9 | 79.0 | 1.2 |
October | 2.5 | 18.2 | 0.8 |
November | -2.2 | 7.8 | 0.8 |
December | 1.5 | 92.0 | 1.5 |
In this example, the table consists of 4 variables (columns) and 12 observations (rows) in addition to the header (top row). All cell contents are clearly delimited.
Below is an example of a non-tabular dataset.
This dataset is a list of profiles with recurrent fields (Name
, Position
, Institution
).
Each profile contains three lines with a field:value
pair.
## Name: Aud Halbritter
## Position: Researcher
## Institution: UiB
## -----------------------------
## Name: Jonathan Soule
## Position: Senior Engineer
## Institution: UiB
## -----------------------------
## Name: Richard J. Telford
## Position: Associate Professor
## Institution: UiB
## -----------------------------
Sooner or later, you will read about tidy data which is a standard, consistent way to organize tabular data in R. Briefly, tidy data follows a short series of rules:
The dataset presented in Table 1.1 respects all three rules, and is thus a tidy dataset.
On the contrary, the following dataset is not tidy:Indeed, at least one of the rules is broken since columns display data matching several variables (Date, time, Weather, etc).
Importing data from a file containing tidy data is a great way to start your work, but it is not a prerequisite to data import. As long as your data is tabular, you will be able to import it in R, and later you will have the chance to tidy it (see chapter Data handling and transforming).
Tabular data may be stored in files using various formats, spreadsheets, etc.
The most common spreadsheets store data in their own, proprietary file format, e.g. MS Excel which produces .xls
and .xlsx
files.
Such formats may be a limitation to data management in R.
Simpler formats such as plain text files with .txt
or .csv
should always be preferred when saving or exporting data from spreadsheets.
One of the most common format for storing tabular data in plain text files is CSV, which stands for Comma-Separated Values.
It is this format that we are going to deal with here.
Note that CSV-formatted files often have (but not always) a .csv
extension. They may also have a regular .txt
extension.
In a CSV-formatted file, the data is stored in a similar manner to this:
For information, this file matches the example in Table 1.1.
Each line corresponds to a row in the table (including header) and cell contents are separated with a comma ,
.
Note that the period symbol .
is used as decimal separator.
The use of commas in the CSV format is however not universal.
Other symbols such as a semi-colon ;
may be used as a delimiter.
This is the case in several European countries where commas are decimal separator.
Here is the same data set as above, this time in the European format:
Finally, you should know that there exist other delimiter-separated formats, in which delimiters like |
or tabs replace commas.
There are several reasons why different symbols are used in CSV files. Among these reasons are:
It is thus very important to know what your CSV file is made of.
We therefore recommend to systematically inspect your data file before importing in R.
One way to do it is to open the file with cat(readLines('file.txt'), sep = '\n')
and define:
,
or .
),,
or ;
).Here is our previous example:
## month,temperature (Celsius),precipitation (mm),Wind (m/s)
## January,-3.4,29.4,1.3
## February,-0.4,15.4,0.9
## March,-1.7,57.6,1.7
## april,3.8,3.2,0.8
## May,5.1,25.2,1.4
## June,10.6,52.4,1.1
## July,13.1,65.0,1.0
## August,12.8,67.4,0.7
## September,6.9,79.0,1.2
## October,2.5,18.2,0.8
## November,-2.2,7.8,0.8
## December,1.5,92.0,1.5
In this file, the decimal separator is .
and the delimiter is ,
.
readr
comes with several functions that read data files.
Among them are:
read_delim()
,read_csv()
,read_csv2()
.We will focus essentially on read_delim()
as it is a general function that can import all the formats.
read_csv()
and read_csv2()
are more specific functions which are preset to import CSV-formatted files in which a comma ,
or a semi-colon ;
is defined as delimiter, respectively.
read_delim()
is a simple function that does several things at once.
First, it finds and reads the file to import.
Then it looks at the first 1000 rows of the file and guesses the number of variables and rows, and the data type for each variable.
Finally, it displays the data as a tibble keeping track of the order of both the columns and rows, and the nature of the data.
NB: the function does not automatically store the data after reading the file.
This is why the tibble produced by read_delim()
must be assigned to an object using <-
.
read_delim()
has two main mandatory arguments:
file = " "
which defines where to find the data file,delim = " "
which defines the symbol used as delimiter.When searching for the file, R/RStudio uses the location of the current RStudio project as a root.
As a consequence, you may simply write the file name if the file is in the same folder as your .Rproj
file.
Alternatively, write [subfolder]/[filename]
if the file is in a subfolder.
For more info about projects, read section 5 of the chapter Getting started with R.
In the following example, we import the file weather.csv
located in the subfolder data
of the current RStudio project into the object weather
.
In weather.csv
, the delimiter is ,
and the decimal separator is .
##
## -- Column specification --------------------------------------------------------
## cols(
## month = col_character(),
## `temperature (Celsius)` = col_double(),
## `precipitation (mm)` = col_double(),
## `Wind (m/s)` = col_double()
## )
When the function is executed, the console shows a short series of warnings and messages.
In the frame above, the message Column specification
tells you that the content of the column month
has been recognized as of data type character
, and that the three other columns have been recognized as double
(which is similar to numeric
).
Now the tibble weather
is available in R as you may see in the tab Environment
(see figure 2.1).
You can display the table as follows:
## # A tibble: 12 x 4
## month `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
## <chr> <dbl> <dbl> <dbl>
## 1 January -3.4 29.4 1.3
## 2 February -0.4 15.4 0.9
## 3 March -1.7 57.6 1.7
## 4 april 3.8 3.2 0.8
## 5 May 5.1 25.2 1.4
## 6 June 10.6 52.4 1.1
## 7 July 13.1 65 1
## 8 August 12.8 67.4 0.7
## 9 September 6.9 79 1.2
## 10 October 2.5 18.2 0.8
## 11 November -2.2 7.8 0.8
## 12 December 1.5 92 1.5
The example above is rather simple and straight forward, but you may experience some issues depending on the content of the CSV file.
Relatively often, one tries to import CSV files wrongly assuming that ,
has been used as a delimiter.
Here is what happens when doing this assumption for a file (weather2.csv
) where ;
was in fact used as delimiter:
##
## -- Column specification --------------------------------------------------------
## cols(
## `month;temperature (Celsius);precipitation (mm);Wind (m/s)` = col_character()
## )
The message Column specification
tells you that the content of the whole CSV file has been transferred into one single column named month;temperature (Celsius);precipitation (mm);Wind (m/s)
and recognized as of data type character
.
Accordingly, the object weather2
, now visible in the tab Environment
(see figure 2.2), is described as a tibble with 12 observations and only one variable.
The content of the resulting tibble is as follows:
## # A tibble: 12 x 1
## `month;temperature (Celsius);precipitation (mm);Wind (m/s)`
## <chr>
## 1 January;-3
## 2 February;-0
## 3 March;-1
## 4 april;3
## 5 May;5
## 6 June;10
## 7 July;13
## 8 August;12
## 9 September;6
## 10 October;2
## 11 November;-2
## 12 December;1
This shows how important it is to review the data before importing it.
A quick check with cat(readLines('data/weather2.csv'), sep = '\n')
would have revealed the use of ;
as delimiter in weather2.csv
, and you would have certainly used the follow code to import it.
##
## -- Column specification --------------------------------------------------------
## cols(
## month = col_character(),
## `temperature (Celsius)` = col_number(),
## `precipitation (mm)` = col_number(),
## `Wind (m/s)` = col_character()
## )
Here, the message Column specification
tells you that the content of the columns month
and Wind (m/s)
has been recognized as of data type character
, and that the two other columns have been recognized as number
.
While read_delim()
got things right about the number of variables, something went wrong with the variables as we could expect Wind (m/s)
to be recognized as double
.
To find out about this issue we need to review the imported data in the object weather2
and compare it to the original file weather2.csv
.
The example above shows the importance of carefully verifying that the imported data in R matches the original data set.
The following figure compares the content of the file weather2.csv
(Fig. 2.3 left) to the content of the object weather2
(Fig. 2.3 right):
Looking at the rows in both pictures, one can understand that all commas have simply been ignored, excepted those in the last column.
This is a problem that may occur when the decimal separator used in the file is different from the decimal separator defined by your system (a.k.a locale).
To solve that issue, you must impose the separator using locale = locale(decimal_mark = ",")
:
weather3 <- read_delim(file = "data/weather2.csv",
delim = ";",
locale = locale(decimal_mark = ","))
##
## -- Column specification --------------------------------------------------------
## cols(
## month = col_character(),
## `temperature (Celsius)` = col_double(),
## `precipitation (mm)` = col_double(),
## `Wind (m/s)` = col_double()
## )
The last column is now recognized as double
and the data in the object weather3
matches the data in the file:
As we have seen in the section above, read_delim()
can read CSV files where ,
and ;
are used as delimiters.
readr
also comes with other functions such as read_csv()
and its twin read_csv2()
.
These are “clones” of read_delim()
where the argument delim =
is preset to delim = ","
and delim = ";"
, respectively.
Thus, the only argument that is strictly required is file = " "
to determine the path of the file.
Here are two examples. The first one shows the use of read_csv()
:
##
## -- Column specification --------------------------------------------------------
## cols(
## month = col_character(),
## `temperature (Celsius)` = col_double(),
## `precipitation (mm)` = col_double(),
## `Wind (m/s)` = col_double()
## )
The resulting tibble looks correct:
## # A tibble: 12 x 4
## month `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
## <chr> <dbl> <dbl> <dbl>
## 1 January -3.4 29.4 1.3
## 2 February -0.4 15.4 0.9
## 3 March -1.7 57.6 1.7
## 4 april 3.8 3.2 0.8
## 5 May 5.1 25.2 1.4
## 6 June 10.6 52.4 1.1
## 7 July 13.1 65 1
## 8 August 12.8 67.4 0.7
## 9 September 6.9 79 1.2
## 10 October 2.5 18.2 0.8
## 11 November -2.2 7.8 0.8
## 12 December 1.5 92 1.5
The second one shows the use of read_csv2()
:
## i Using ',' as decimal and '.' as grouping mark. Use `read_delim()` for more control.
##
## -- Column specification --------------------------------------------------------
## cols(
## month = col_character(),
## `temperature (Celsius)` = col_double(),
## `precipitation (mm)` = col_double(),
## `Wind (m/s)` = col_double()
## )
Note the message above Column specification
that indicates which symbols read_csv2()
has considered when importing the data. The resulting tibble looks correct:
## # A tibble: 12 x 4
## month `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
## <chr> <dbl> <dbl> <dbl>
## 1 January -3.4 29.4 1.3
## 2 February -0.4 15.4 0.9
## 3 March -1.7 57.6 1.7
## 4 april 3.8 3.2 0.8
## 5 May 5.1 25.2 1.4
## 6 June 10.6 52.4 1.1
## 7 July 13.1 65 1
## 8 August 12.8 67.4 0.7
## 9 September 6.9 79 1.2
## 10 October 2.5 18.2 0.8
## 11 November -2.2 7.8 0.8
## 12 December 1.5 92 1.5
Both read_delim()
, read_csv()
and read_csv2()
have a multitude of arguments that allow for adjusting the way data is read and displayed. Here we review a handful of them.
n_max
sets a limit to the number of observations to be read.
Note that n_max
does not consider the first row (header) of the data file as an observation.
##
## -- Column specification --------------------------------------------------------
## cols(
## month = col_character(),
## `temperature (Celsius)` = col_double(),
## `precipitation (mm)` = col_double(),
## `Wind (m/s)` = col_double()
## )
## # A tibble: 6 x 4
## month `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
## <chr> <dbl> <dbl> <dbl>
## 1 January -3.4 29.4 1.3
## 2 February -0.4 15.4 0.9
## 3 March -1.7 57.6 1.7
## 4 april 3.8 3.2 0.8
## 5 May 5.1 25.2 1.4
## 6 June 10.6 52.4 1.1
The argument skip
may be used to skip rows when reading the data file.
Be aware that:
In the following example, we read the data file weather.csv
and skip the first row:
##
## -- Column specification --------------------------------------------------------
## cols(
## January = col_character(),
## `-3.4` = col_double(),
## `29.4` = col_double(),
## `1.3` = col_double()
## )
## # A tibble: 11 x 4
## January `-3.4` `29.4` `1.3`
## <chr> <dbl> <dbl> <dbl>
## 1 February -0.4 15.4 0.9
## 2 March -1.7 57.6 1.7
## 3 april 3.8 3.2 0.8
## 4 May 5.1 25.2 1.4
## 5 June 10.6 52.4 1.1
## 6 July 13.1 65 1
## 7 August 12.8 67.4 0.7
## 8 September 6.9 79 1.2
## 9 October 2.5 18.2 0.8
## 10 November -2.2 7.8 0.8
## 11 December 1.5 92 1.5
As expected, the first row (with the header) is skipped, and the data from the observation January
have become the header of the tibble.
The argument col_names()
introduced below will be useful for solving this issue.
The argument col_names
may be used to define the name of the variables/columns of the tibble.
If you use col_names = TRUE
, the first row of the data file defines the name of the variables.
##
## -- Column specification --------------------------------------------------------
## cols(
## month = col_character(),
## `temperature (Celsius)` = col_double(),
## `precipitation (mm)` = col_double(),
## `Wind (m/s)` = col_double()
## )
## # A tibble: 12 x 4
## month `temperature (Celsius)` `precipitation (mm)` `Wind (m/s)`
## <chr> <dbl> <dbl> <dbl>
## 1 January -3.4 29.4 1.3
## 2 February -0.4 15.4 0.9
## 3 March -1.7 57.6 1.7
## 4 april 3.8 3.2 0.8
## 5 May 5.1 25.2 1.4
## 6 June 10.6 52.4 1.1
## 7 July 13.1 65 1
## 8 August 12.8 67.4 0.7
## 9 September 6.9 79 1.2
## 10 October 2.5 18.2 0.8
## 11 November -2.2 7.8 0.8
## 12 December 1.5 92 1.5
If you use col_names = FALSE
, the first row of the data file will be the first observation in the resulting tibble, and the variables will be named X1
, X2
, etc.
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character()
## )
## # A tibble: 13 x 4
## X1 X2 X3 X4
## <chr> <chr> <chr> <chr>
## 1 month temperature (Celsius) precipitation (mm) Wind (m/s)
## 2 January -3.4 29.4 1.3
## 3 February -0.4 15.4 0.9
## 4 March -1.7 57.6 1.7
## 5 april 3.8 3.2 0.8
## 6 May 5.1 25.2 1.4
## 7 June 10.6 52.4 1.1
## 8 July 13.1 65.0 1.0
## 9 August 12.8 67.4 0.7
## 10 September 6.9 79.0 1.2
## 11 October 2.5 18.2 0.8
## 12 November -2.2 7.8 0.8
## 13 December 1.5 92.0 1.5
col_names =
may also be followed by the actual variable names that you want to use.
In that case, write col_names = c()
and list the names between the parentheses.
Be aware that:
weather <- read_delim(file = "data/weather.csv",
delim = ",",
col_names = c("Month", "Temp", "Precip", "Wind"))
##
## -- Column specification --------------------------------------------------------
## cols(
## Month = col_character(),
## Temp = col_character(),
## Precip = col_character(),
## Wind = col_character()
## )
## # A tibble: 13 x 4
## Month Temp Precip Wind
## <chr> <chr> <chr> <chr>
## 1 month temperature (Celsius) precipitation (mm) Wind (m/s)
## 2 January -3.4 29.4 1.3
## 3 February -0.4 15.4 0.9
## 4 March -1.7 57.6 1.7
## 5 april 3.8 3.2 0.8
## 6 May 5.1 25.2 1.4
## 7 June 10.6 52.4 1.1
## 8 July 13.1 65.0 1.0
## 9 August 12.8 67.4 0.7
## 10 September 6.9 79.0 1.2
## 11 October 2.5 18.2 0.8
## 12 November -2.2 7.8 0.8
## 13 December 1.5 92.0 1.5
The argument col_types
may be used to modify the data type of the variables.
This is useful for example when you need to set a variable as factor
whereas R has defined it as character
.
Here is an example with three simple variables.
##
## -- Column specification --------------------------------------------------------
## cols(
## ID = col_double(),
## group = col_character(),
## count = col_double()
## )
As the message clearly indicates, the first and last variables are recognized as double
(i.e. numeric
) while the second one is recognized as character
.
The tibble in the object counts
displays as follows:
## # A tibble: 9 x 3
## ID group count
## <dbl> <chr> <dbl>
## 1 1 A 5
## 2 2 A 3
## 3 3 A 2
## 4 4 B 4
## 5 5 B 5
## 6 6 B 8
## 7 7 C 1
## 8 8 C 2
## 9 9 C 9
Let’s use col_types = cols()
to manually set the data types to double
, factor
and integer
, respectively.
counts <- read_delim(file = "data/groups.csv",
delim = ",",
col_types = cols(col_double(), col_factor(), col_integer()))
Now the tibble displays like this, with the correct data types for each column:
## # A tibble: 9 x 3
## ID group count
## <dbl> <fct> <int>
## 1 1 A 5
## 2 2 A 3
## 3 3 A 2
## 4 4 B 4
## 5 5 B 5
## 6 6 B 8
## 7 7 C 1
## 8 8 C 2
## 9 9 C 9
If you only need to modify one or a few variables, you must name it/them when writing the code:
The tibble displays like this:
## # A tibble: 9 x 3
## ID group count
## <dbl> <fct> <dbl>
## 1 1 A 5
## 2 2 A 3
## 3 3 A 2
## 4 4 B 4
## 5 5 B 5
## 6 6 B 8
## 7 7 C 1
## 8 8 C 2
## 9 9 C 9
Note that col_types
is particularly useful when importing tabular data that includes formatted dates.
Dates are usually recognized as character
when their format does not match the expected format set as default (locale, etc).
In the following example, dates entered as yyyy-mm-dd
in the last column are recognized as of type date
:
##
## -- Column specification --------------------------------------------------------
## cols(
## ID = col_double(),
## group = col_character(),
## count = col_double(),
## date = col_date(format = "")
## )
However, in the example below, dates entered as dd-mm-yyyy
are converted to character
:
##
## -- Column specification --------------------------------------------------------
## cols(
## ID = col_double(),
## group = col_character(),
## count = col_double(),
## date = col_character()
## )
To solve this issue, add the function col_date()
in cols()
to help R understand how dates are formatted and shall be read.
countsdates <- read_delim(file = "data/groups-dates2.csv",
delim = ",",
col_types = cols(date = col_date("%d-%m-%Y")))
The result is the following tibble now correctly formatted:
## # A tibble: 9 x 4
## ID group count date
## <dbl> <chr> <dbl> <date>
## 1 1 A 5 2021-03-01
## 2 2 A 3 2021-03-02
## 3 3 A 2 2021-03-03
## 4 4 B 4 2021-03-01
## 5 5 B 5 2021-03-02
## 6 6 B 8 2021-03-03
## 7 7 C 1 2021-03-01
## 8 8 C 2 2021-03-02
## 9 9 C 9 2021-03-03
You will find the whole list of cols()
parameters for col_types
here.
You may find the following links useful: