Welcome to this introduction to using R/Rstudio for performing data analysis in BIO241 - Behavioural ecology.
As a part of the course, you will (or might have already) run a behavioural experiment involving a finite number of beetle individuals placed in different contexts or setups.
Here is a quick recap of the background for this experiment: the aim of the study was to explore whether habitat/food type affected the reproductive outcome for bean beetles.
That’s to say, to relate the individual reproductive outcome (Egg_Number
) with the type of habitat/food where bean beetles were born (Origin
) and the type of habitat/food where they were laying eggs (Type
).
In the experiment (Fig. 0.1), bean beetles reared in different habitats (Origin
) were transferred to different types of bean (Type
) until they laid eggs.
Then, the number of eggs was counted.
Our question is: does the type of bean affect the reproductive outcome of the beetles?
The data set obtained in the experiment includes the following variables such as:
Petri
),Egg_Number
),Origin
),Type
).Of course, your data set may include other variables if you have decided to extend the scope of the experiment to other parameters.
In the following sections, you will learn to use R/RStudio to manage this data set, run a statistical analysis using the appropriate statistical model, and draw a plot representing the results.
We at bioST@TS do not expect that you already know how to use R/RStudio. This is why we have prepared three pages that help you start working with the software. On this first page, you will find all you need to know about R/RStudio, where to download it, how to use the interface, etc. The second page shows the basics of coding in R, and describes the different types of variables and objects. Finally on the third page, you will find all you need to know about importing data into R/RStudio.
As for now, we will proceed step-by-step and do the following:
When planning, performing and completing the lab experiment, it is strongly recommended to use a spreadsheet such as Excel to write down variables and observations. Of course, you may write everything on a sheet of paper or your lab journal before transferring to the spreadsheet, but this increases the chance of making mistakes, or even losing the whole data set.
It is smart to organize the data in a table, with observations in rows and variables in columns. You may use colors, lines and other formatting tools if it helps better visualize the data, but keep in mind that anything else than text and numbers will disappear when importing the data set into R.
Figure 1.1 shows an example of a data set with 4 variables and 48 observations registered in Excel:When punching data in Excel, you must think “tidy”. By producing a tidy data set, you make the whole process of data import and analysis in R much simpler as your data set will need minimal (or no) modification prior to analysis. If your data set is not tidy, you may need to transpose parts of the table, remove/rename variables, split cell contents, etc. That may cost you time and energy, and is often a source for errors.
Briefly, tidy data follows this short series of rules:
The data set presented in Figure 1.1 respects all three rules, and is thus a tidy data set.
On the contrary, the data set presented in the figure 1.3 is not tidy:Even though the content of the table is exactly the same as in Fig. 1.1 (but only transposed), the table does not display the variables in columns and observations in rows. This is suboptimal for the work to come as the R functions we will use will not operate properly.
The most practical way to safely transfer data from Excel to R consists in:
In a CSV file, all the values are separated with symbols, often a comma ,
, but also a semi-colon ;
especially in European countries.
To save the data in a CSV file, simply click on File > Save As in the top menu:
CSV UTF-8
, give the file a name (for example beetle_data
) and save it in the location of your choice:
Now that you have created a CSV file (beetle_data.csv
), open it in any text editor such as Notepad and check its content.
Figure 1.6 shows the CSV file that contains the data set introduced in Figure 1.1:
Each line in this file corresponds to a row in the original table (including header).
All cell contents are separated with a semi-colon ;
.
Again, remember that the CSV format may use commas ,
, semi-colons ;
and other symbols as delimiter.
The use of a comma ,
is not universal.
Here starts our work in RStudio and R. Make sure that you have familiarized yourself with the interface of RStudio (see this page).
RStudio allows you to divide your work into projects which are independent from each other. A project has its on working directory in which you can create specific scripts, load data sets, add external files, activate packages, etc. For each project, you will thus be guaranteed to work in a dedicated workspace.
Let’s create such a project for BIO241. To create a project, go to the main menu and select File > New Project… as shown in Figure 2.1:Feel free to import in the project folder all the files that you will need later on, such as original data sets, CSV files, etc.
You may for example create a new folder called data
where you will store your CSV file.
Before we go any further, let’s load the packages that will be necessary in the present project.
Write the following code in the script and use CTRL
+ Enter
(⌘
+ Enter
) to run it:
library("tidyverse")
To import data from the CSV file in R, we will use functions in the readr
package which is part of the tidyverse
package.
readr
comes with several functions that open and read data files. Among them is read_delim()
.
read_delim()
has two main mandatory arguments:
file = " "
which defines where to find the data file,delim = " "
which defines the symbol used as delimiter (i.e. ,
or ;
).With the line of code below, R imports the content of beetle_data.csv
into the object beetle
.
It retrieves the file from the subfolder data
of the current RStudio project and recognizes ;
as a delimiter.
<- read_delim(file = "data/beetle_data.csv", delim = ";") beetle
## Rows: 48 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ";"
## chr (2): Origin, Type
## dbl (2): Petri, Egg_Number
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
NB: if your CSV file uses ,
as a delimiter, then you must write delim = ","
.
When the function is executed, the console shows a short series of warnings and messages. In the frame above, the message Column specification
tells you that the contents of the columns Origin
and Type
have been recognized as of data type character (chr
), and that the two other columns have been recognized as double (dbl
, which is similar to numeric).
beetle
is now visible in the tab Environment
in RStudio, as shown in the figure below.
To view the content of the object beetle
, simply write its name:
beetle
## # A tibble: 48 x 4
## Petri Origin Type Egg_Number
## <dbl> <chr> <chr> <dbl>
## 1 1 Mung Mung 100
## 2 2 Mung Mung 87
## 3 3 Mung Mung 99
## 4 1 Mung Black 74
## 5 2 Mung Black 35
## 6 3 Mung Black 43
## 7 4 Black Mung 48
## 8 5 Black Mung 70
## 9 6 Black Mung 89
## 10 4 Black Black 45
## # ... with 38 more rows
A basic table with rows and columns shows up.
The table displays the 4 variables and the top 10 rows of the data set.
It also tells you which variables are recognized as of type chr
and dbl
.
Note that things may look different (and thus wrong) if you fail to use the correct delimiter in the function readr
.
The following example shows you what happens if you use delim = ","
whereas ;
is the actual delimiter in the original CSV file:
<- read_delim(file = "data/beetle_data.csv", delim = ",") beetle
## Rows: 48 Columns: 1
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Petri;Origin;Type;Egg_Number
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
As you may see, a message tells you that the table is made of 48 rows, but only one column, which is already a sign that things went wrong.
In addition the line starting with chr (1)
means that the only variable that R recognized is of type chr
and its label is Petri;Origin;Type;Egg_Number
which corresponds to all our variable names compressed in one single cell.
Looking carefully at the content of the object beetle
(see below), you can see that the data set does not look like a table with organized columns.
beetle
## # A tibble: 48 x 1
## `Petri;Origin;Type;Egg_Number`
## <chr>
## 1 1;Mung;Mung;100
## 2 2;Mung;Mung;87
## 3 3;Mung;Mung;99
## 4 1;Mung;Black;74
## 5 2;Mung;Black;35
## 6 3;Mung;Black;43
## 7 4;Black;Mung;48
## 8 5;Black;Mung;70
## 9 6;Black;Mung;89
## 10 4;Black;Black;45
## # ... with 38 more rows
This means that readr
has not found the delimiter in the CSV file and thus failed to split the cells correctly.
You will find more info about importing data in R on this page.
It is always a good idea to review your data set and check the integrity of your data before starting the analysis. As we just saw in the previous section, things may go wrong when importing data. Also things may have gone wrong when punching data. Typos are often your worst enemy.
Here are a series of things to check before going further with data analysis.
Are the variable types correctly set or recognized by R?
Have a quick look at the imported data set and check that numerical variables are indeed in a column labeled <dbl>
and that names or text values are labeled <chr>
.
beetle
## # A tibble: 48 x 4
## Petri Origin Type Egg_Number
## <dbl> <chr> <chr> <chr>
## 1 1 Mung Mung 100
## 2 2 Mung Mung 87
## 3 3 Mung Mung 99
## 4 1 Mung Black 74
## 5 2 Mung Black 35
## 6 3 Mung Black 43a
## 7 4 Black Mung 48
## 8 5 Black Mung 70
## 9 6 Black Mung 89
## 10 4 Black Black 45
## # ... with 38 more rows
In the example above, the variable Egg_Number
is incorrectly defined as a text variable (<chr>
) instead of numerical (<dbl>
).
This is the direct consequence of a typo. The 6th observation shows 43a
instead of 43
.
As a result, R thinks the whole column is text…
In such a case, correct the mistake in the CSV file and reimport the data.
Are text values just names? Are the numbers just counts? Or do these text/numerical values define categories such as treatment, bean type, Petri dish number, etc. ? If the values help categorizing observations, then they are factors. This is often very important for data analysis of specific statistical models, and must be defined properly before going further.
For instance, in the data set beetle
, all the variables except for Egg_Number
are in fact factors: Petri
is nothing more than the label of the petri dish in which the beetles are hosted, Origin
is the type of beans the beetles originate from, etc.
To define variables as factors, use the function as.factor()
as follows:
$Petri <- as.factor(beetle$Petri)
beetle$Origin <- as.factor(beetle$Origin)
beetle$Type <- as.factor(beetle$Type) beetle
NB: The $
is used to connect a variable to an object. for example, beetle$Petri
means “the variable Petri
contained in the object beetle
”.
As a result of these operations, the data set looks now like this:
beetle
## # A tibble: 48 x 4
## Petri Origin Type Egg_Number
## <fct> <fct> <fct> <dbl>
## 1 1 Mung Mung 100
## 2 2 Mung Mung 87
## 3 3 Mung Mung 99
## 4 1 Mung Black 74
## 5 2 Mung Black 35
## 6 3 Mung Black 43
## 7 4 Black Mung 48
## 8 5 Black Mung 70
## 9 6 Black Mung 89
## 10 4 Black Black 45
## # ... with 38 more rows
All the variables are now factors (marked with <fct>
), except for Egg_Number
which remains as <dbl>
.
Levels are the distinct values that a factor can take.
For instance, in the case of the variable Type
in beetle
, there are 2 levels which are “Mung” and “Black”.
It is good practice to always check that the levels of a factor in your data set are those expected.
To review the levels of a given factor, simple write the full name of the factor.
In our case, we can use beetle$Type
, beetle$Origin
and beetle$Petri
to get a description of the variables:
$Type beetle
## [1] Mung Mung Mung Black Black Black Mung Mung mung Black Black Black
## [13] Mung Mung Mung Black Black Black Mung Mung Mung Black Black Black
## [25] Mung Mung Mung Black Black Black Mung Mung Mung Black Black Black
## [37] Mung Mung Mung Black Black Black Mung Mung Mung Black Black Black
## Levels: Black mung Mung
$Origin beetle
## [1] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## [13] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## [25] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## [37] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## Levels: Black Mung
$Petri beetle
## [1] 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2
## [39] 3 1 2 3 4 5 6 4 5 6
## Levels: 1 2 3 4 5 6
Evreything loooks great, except for the last line of the result for beetle$Type
which shows Levels: Black mung Mung
while we expect only 2 levels (i.e. Levels: Black Mung
).
“mung” seems to be the unexpected level and you may certainly understand that “Mung” was spelled without capital letter by mistake when punching the data in Excel.
A quick examination of the values in Type
shows that 9th entry is wrong.
That typo has unexpectedly created a third level which could have impacted the whole data analysis.
Once again, go back to the original data set, correct it and reimport it.
When managing numerical variables, you should also check whether the values are in the expected range, providing that you have some degree of knowledge about what to expect.
Here we will visually check the content of the variable Egg_Number
using plot()
, a function that plots all the values on a graph following their order in the data set.
plot(beetle$Egg_Number)
You clearly see that the 5th entry is largely above the others, and is certainly an outlier. Whether this value is a typo or a real/correct data entry that must be kept must be evaluated carefully. If it is a mistake, correct the data set and reimport. In all cases, it is worth your attention before going further with analysis. If it is not a mistake, try to investigate that data point further… there is maybe something interesting or unusual about it.
Now that we have checked the integrity of the data set, we may proceed with statistical analysis. Both the design of your experiment, the type of variables and your working hypothesis will define which type of model and test you shall run.
Fitting the wrong kind of model to your data may have disastrous consequences for your study, results, conclusion and possibly your sanity. If you wonder which model is best to use with a specific data set, try this key.
Let’s recap some important facts about the current experiment:
All of this tells us that the model to consider here is a Generalized Linear Model (GLM) with Poisson distribution.
If you want to fit a GLM with one predictive variable and one response variable, here is the code to apply:
<- glm(formula = response_variable ~ predictive_variable,
model family = poisson,
data = object)
where object
is the object that contains the data set.
NB: we have written family = poisson
since count data follow a Poisson distribution.
If your data is based on proportions, you should write family = binomial
as proportions usually follow a binomial distribution.
If you want to fit a GLM with two predictive variables and one response variable, here is the code to apply:
<- glm(formula = response_variable ~ predictive_variable1 + predictive_variable2,
model family = poisson,
data = object)
If you want to fit a GLM with two predictive variables and one response variable, AND you want to test for an interaction between the two predictive variables, here is the code to apply:
<- glm(formula = response_variable ~ predictive_variable1 * predictive_variable2,
model family = poisson,
data = object)
NB: the only difference between this code and the previous one is the use of *
instead of +
.
Note that the lines of code above store the model in the object model
.
They do not print or display anything.
To get the results of model fitting (and thus know whether there is an effect of the predictive variable(s) on the response variable), we have to use the function summary()
:
summary(model)
With summary()
comes a whole page of results including Deviance Residuals
and Coefficients
.
Even though you should have a good look at the whole document and evaluate carefully the various properties of the model fitting, you will certainly be more interested in the section Coefficients
as it tells you whether there is an effect of one or several of the levels/variables.
This will be clearer with practical examples.
Back to our beetle experiment!
Let’s say that the working hypothesis in this example is that the number of eggs laid by beetles (Egg_Number
) varies with the type of beans the eggs are laid on (Type
) and the origin of the beetles (Origin
).
Let’s write it this way:
<- glm(formula = Egg_Number ~ Origin + Type,
model family = poisson,
data = beetle)
Let’s now get the results of model fitting with summary()
:
summary(model)
##
## Call:
## glm(formula = Egg_Number ~ Origin + Type, family = poisson, data = beetle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.2169 -5.4331 -0.2997 4.6233 9.9556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.38898 0.02738 160.287 < 2e-16 ***
## OriginMung 0.01397 0.03052 0.458 0.647
## TypeMung 0.18685 0.03065 6.095 1.09e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1885.0 on 47 degrees of freedom
## Residual deviance: 1847.4 on 45 degrees of freedom
## AIC: 2140.8
##
## Number of Fisher Scoring iterations: 5
Let’s look at Coefficients
.
The yellow and the green lines report on the separate effect of Origin
and Type
.
First you notice that not all levels seem to be present under the line “(Intercept)”.
The variable Type
has 2 levels; we find “TypeMung”, but not “TypeBlack” (yellow line).
Also, the variable Origin
has 2 levels; we find “OriginMung” but not “OriginBlack” (green line).
In fact, the table omits the first alphabetical level of each variable because these are taken as intercept. Thus the absence of both TypeBlack
and OriginBlack
.
Now, let’s check whether the analysis supports our hypotheses.
The absence of stars and a p-value greater than 0.05 at the end of the green line “OriginMung” tell us that we must accept the null hypothesis: there is no influence of the origin of the beetles on the number of eggs laid.
The presence of three stars and a p-value less than 0.001 at the end of the yellow line “TypeMung” tell us on the contrary that we can reject the null hypothesis and accept our working hypothesis: the egg count is affected by the type of bean they eggs are laid on.
Let’s repeat the analysis, considering this time that there might be an interaction between the variables Origin
and Type
.
Let’s write it this way:
<- glm(formula = Egg_Number ~ Origin * Type,
model family = poisson,
data = beetle)
Let’s now get the results of model fitting with summary()
:
summary(model)
##
## Call:
## glm(formula = Egg_Number ~ Origin * Type, family = poisson, data = beetle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -12.9783 -5.3054 -0.2157 4.9845 10.3748
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.42784 0.03154 140.370 < 2e-16 ***
## OriginMung -0.06474 0.04535 -1.428 0.15343
## TypeMung 0.11457 0.04339 2.641 0.00827 **
## OriginMung:TypeMung 0.14403 0.06135 2.348 0.01889 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1885.0 on 47 degrees of freedom
## Residual deviance: 1841.9 on 44 degrees of freedom
## AIC: 2137.3
##
## Number of Fisher Scoring iterations: 5
Let’s have a look at Coefficients
.
Here we find 3 lines under “(Intercept)”.
Both the yellow and the green lines report on the effect of Origin
and Type
separately, just like in the previous example. In addition, the blue line reports on the effect of an interaction of Type
with Origin
.
As there is an interaction, the intercept now represents “TypeBlack” and “OriginBlack” dishes, so the low p-value on the blue line tells us that the average egg count in “TypeMung:OriginMung” dishes is different from the count in “TypeBlack:OriginBlack”. The high p-value on the green line (“OriginMung”) shows that the average count in “TypeBlack:OriginMung” dishes is not different from “TypeBlack:OriginBlack”. The low p-value on the yellow line (“TypeMung”) shows that the average count in “TypeMung:OriginBlack” dishes is different from “TypeBlack:OriginBlack”.
But we don’t know whether the other combinations are different, for instance “TypeMung:OriginBlack” vs “TypeMung:OriginMung”, “TypeMung:OriginBlack” vs “TypeBlack:OriginMung”, and “TypeMung:OriginBlack” vs “TypeBlack:OriginMung”. That’s why we need to run a post-hoc test.
Until now, we have been able to define whether there exists a significant effect (or interaction) between variables. However, we do not know which groups are significantly different from each other. We thus need to run post hoc pairwise comparisons between groups to come to a valid conclusion to our analysis.
NB: remember that you may run post hoc pairwise comparisons between levels of a variable only if that variable has been found to have a significant effect on the response variable. In our case, only Type
has a significant effect on Egg_Number
, not Origin
. However, we are still allowed to conduct pairwise comparisons considering the interaction between Type
and Origin
.
Here we will use the function emmeans()
in the package emmeans
to perform these pairwise comparisons.
First, we need to load the package:
library(emmeans)
Now let’s use the following code to perform pairwise comparisons considering the levels of a single predictive variable:
emmeans(object,
~ predictive_variable,
pairwise type = "response",
adjust = "none")
where object
is the GLM that we have build in the previous analysis.
Applied to our example, and considering only Type
, the code looks like this:
emmeans(model,
~ Type,
pairwise type = "response",
adjust = "none")
## NOTE: Results may be misleading due to involvement in interactions
## $emmeans
## Type rate SE df asymp.LCL asymp.UCL
## Black 81.1 1.84 Inf 77.6 84.8
## Mung 97.7 2.02 Inf 93.8 101.8
##
## Results are averaged over the levels of: Origin
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df null z.ratio p.value
## Black / Mung 0.83 0.0255 Inf 1 -6.083 <.0001
##
## Results are averaged over the levels of: Origin
## Tests are performed on the log scale
The top of the output gives the average for the levels Black
and Mung
of Type
.
For instance, the average count for Mung
(97.7) is higher than for Black
(81.1).
You will certainly recognize these values when we will draw the first plot in Section 5.1.
The lower part of the output gives the contrasts, with ratio and SE, and shows a p-value less than 0.0001 for the (only) pair Black / Mung
.
Actually, the ratio referred to here is nothing else than the ratio Black / Mung calculated from the rates presented in the top part of the output:
\(\frac{Black}{Mung}\) = \(\frac{81.1}{97.7}\) = 0.83
About NOTE: Results may be misleading due to involvement in interactions
: this reminds you that you may get an incomplete overview of the situation because this analysis does not consider the interaction found in the model.
Let’s thus rerun emmeans()
, this time considering the interaction.
The syntax is as follows:
emmeans(object,
~ predictive_variable1*predictive_variable2,
pairwise type = "response",
adjust = "none")
Applied to Type
and Origin
, the code looks like this:
emmeans(model,
~ Origin*Type,
pairwise type = "response",
adjust = "none")
## $emmeans
## Origin Type rate SE df asymp.LCL asymp.UCL
## Black Black 83.8 2.64 Inf 78.7 89.1
## Mung Black 78.5 2.56 Inf 73.6 83.7
## Black Mung 93.9 2.80 Inf 88.6 99.6
## Mung Mung 101.7 2.91 Inf 96.1 107.5
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df null z.ratio p.value
## Black Black / Mung Black 1.067 0.0484 Inf 1 1.428 0.1534
## Black Black / Black Mung 0.892 0.0387 Inf 1 -2.641 0.0083
## Black Black / Mung Mung 0.824 0.0351 Inf 1 -4.551 <.0001
## Mung Black / Black Mung 0.836 0.0369 Inf 1 -4.062 <.0001
## Mung Black / Mung Mung 0.772 0.0335 Inf 1 -5.962 <.0001
## Black Mung / Mung Mung 0.924 0.0382 Inf 1 -1.919 0.0550
##
## Tests are performed on the log scale
The top of the output gives the average for the 4 groups/combinations “Type Origin”.
The lower part of the output gives the contrasts with ratio, SE, and p-value for each of the pairs.
This part may be a bit confusing in this particular example since the levels of Type
and Origin
are identical.
Remember that the groups are named following this pattern “Type Origin”.
For example “Black Mung” corresponds to “TypeBlack OriginMung”.
Let’s now look at the section $Contrasts
The first three lines (yellow, green and blue) give you similar results to those displayed in the section Coefficients
in Fig. 4.2.
The next three lines (orange-red) give you the results which you were missing in summary()
.
So in fact, you find almost all the results you need in the output of emmeans()
.
Looking at the actual results, the output tells you that there are significant differences between groups tested in line 1, 3, 4 and 6.
However, be careful as this output does not filter out the pairs that you are not allowed to consider due to the lack of significant effect of Origin
on Egg_Number
.
ggplot2
is a package that helps produce clear and reproducible graphs.
Here, we review a few solutions based on ggplot2
that create simple but meaningful graphs based on the data analysis above.
Note that all of the solutions introduced below are boxplots since this type of plot is possibly the most relevant in the present experiment.
Coding for the most basic boxplot in ggplot2
looks like this:
ggplot(dataset, aes(x = predictive_variable, y = response_variable)) +
geom_boxplot()
In brief, ggplot()
uses the function aes()
to map the variables and the function geom_boxplot()
to draw the figure.
Applied to our example and considering the variables Type
and Number
in the data set beetle
, the code and its result look like this:
ggplot(beetle, aes(x = Type, y = Egg_Number)) +
geom_boxplot()
NB: this plot matches the analysis that we performed previously with this code in section 4.5:
emmeans(model,
~ Type,
pairwise type = "response",
adjust = "none")
We may also produce a boxplot that takes two predictive variables into account.
Here we use fill =
to add the second predictive variable in aes()
ggplot(dataset, aes(x = predictive_variable, y = response_variable, )) +
geom_boxplot()
The variable in fill =
will be represented with colors, and the corresponding legend will be added automatically to the right.
To use both Type
and Origin
as predictive variable in the plot, we write:
ggplot(beetle, aes(x = Type, y = Egg_Number, fill = Origin)) +
geom_boxplot()
Note that swapping the predictive variables in aes()
produces a slightly different output:
ggplot(beetle, aes(x = Origin, y = Egg_Number, fill = Type)) +
geom_boxplot()
If you want to split the figure into two panels, each of which shows one level of the second variable, you may add facet_grid(~ predictive_variable2)
to the code. NB: do not forget the +
sign at the end of the previous line.
ggplot(beetle, aes(x = Type, y = Egg_Number)) +
geom_boxplot() +
facet_grid(~ Origin)
Again, you may swap the predictive variables:
ggplot(beetle, aes(x = Origin, y = Egg_Number)) +
geom_boxplot() +
facet_grid(~ Type)
OK, that was quite a lot of information at once! Here is a quick summary of the code that we have use step by step to perform the whole analysis, starting from the CSV file and ending with a plot.
Further below, you will find the same code with all the results.
# Let's start with loading the package `tidyverse`
library("tidyverse")
# We import the data from the CSV file in the object beetle
<- read_delim(file = "data/beetle_data.csv", delim = ";")
beetle
# We set the variables `Petri`, `Origin` and `Type` as factors
$Petri <- as.factor(beetle$Petri)
beetle$Origin <- as.factor(beetle$Origin)
beetle$Type <- as.factor(beetle$Type)
beetle
# We check the factors for abnormal levels
$Type
beetle$Origin
beetle$Petri
beetle
# We plot the count data to check for abnormal values
plot(beetle$Egg_Number)
# Let's fit a GLM to the data
<- glm(formula = Egg_Number ~ Origin * Type,
model family = poisson,
data = beetle)
# Now we get the results of model fitting
summary(model)
# Here we run all the necessary pairwise comparisons
emmeans(model,
~ Origin * Type,
pairwise type = "response",
adjust = "none")
# And finally we draw a plot
ggplot(beetle, aes(x = Type, y = Egg_Number, fill = Origin)) +
geom_boxplot()
# Let's start with loading the package `tidyverse`
library("tidyverse")
# We import the data from the CSV file in the object beetle
<- read_delim(file = "data/beetle_data.csv", delim = ";") beetle
## Rows: 48 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ";"
## chr (2): Origin, Type
## dbl (2): Petri, Egg_Number
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# We set the variables `Petri`, `Origin` and `Type` as factors
$Petri <- as.factor(beetle$Petri)
beetle$Origin <- as.factor(beetle$Origin)
beetle$Type <- as.factor(beetle$Type)
beetle
# We check the factors for abnormal levels
$Type beetle
## [1] Mung Mung Mung Black Black Black Mung Mung Mung Black Black Black
## [13] Mung Mung Mung Black Black Black Mung Mung Mung Black Black Black
## [25] Mung Mung Mung Black Black Black Mung Mung Mung Black Black Black
## [37] Mung Mung Mung Black Black Black Mung Mung Mung Black Black Black
## Levels: Black Mung
$Origin beetle
## [1] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## [13] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## [25] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## [37] Mung Mung Mung Mung Mung Mung Black Black Black Black Black Black
## Levels: Black Mung
$Petri beetle
## [1] 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2
## [39] 3 1 2 3 4 5 6 4 5 6
## Levels: 1 2 3 4 5 6
# We plot the count data to check for abnormal values
plot(beetle$Egg_Number)
# Let's fit a GLM to the data
<- glm(formula = Egg_Number ~ Origin * Type,
model family = poisson,
data = beetle)
# Now we get the results of model fitting
summary(model)
##
## Call:
## glm(formula = Egg_Number ~ Origin * Type, family = poisson, data = beetle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -12.9783 -5.3054 -0.2157 4.9845 10.3748
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.42784 0.03154 140.370 < 2e-16 ***
## OriginMung -0.06474 0.04535 -1.428 0.15343
## TypeMung 0.11457 0.04339 2.641 0.00827 **
## OriginMung:TypeMung 0.14403 0.06135 2.348 0.01889 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1885.0 on 47 degrees of freedom
## Residual deviance: 1841.9 on 44 degrees of freedom
## AIC: 2137.3
##
## Number of Fisher Scoring iterations: 5
# Here we run all the necessary pairwise comparisons
emmeans(model,
~ Origin * Type,
pairwise type = "response",
adjust = "none")
## $emmeans
## Origin Type rate SE df asymp.LCL asymp.UCL
## Black Black 83.8 2.64 Inf 78.7 89.1
## Mung Black 78.5 2.56 Inf 73.6 83.7
## Black Mung 93.9 2.80 Inf 88.6 99.6
## Mung Mung 101.7 2.91 Inf 96.1 107.5
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df null z.ratio p.value
## Black Black / Mung Black 1.067 0.0484 Inf 1 1.428 0.1534
## Black Black / Black Mung 0.892 0.0387 Inf 1 -2.641 0.0083
## Black Black / Mung Mung 0.824 0.0351 Inf 1 -4.551 <.0001
## Mung Black / Black Mung 0.836 0.0369 Inf 1 -4.062 <.0001
## Mung Black / Mung Mung 0.772 0.0335 Inf 1 -5.962 <.0001
## Black Mung / Mung Mung 0.924 0.0382 Inf 1 -1.919 0.0550
##
## Tests are performed on the log scale
# And finally we draw a plot
ggplot(beetle, aes(x = Type, y = Egg_Number, fill = Origin)) +
geom_boxplot()