Welcome to this introduction to using R/Rstudio for performing data analysis in BIO241 - Behavioural ecology.

As a part of the course, you will (or might have already) run a behavioural experiment involving a finite number of beetle individuals placed in different contexts or setups.

Here is a quick recap of the background for this experiment: the aim of the study was to explore whether habitat/food type affected the reproductive outcome for bean beetles. That’s to say, to relate the individual reproductive outcome (Egg_Number) with the type of habitat/food where bean beetles were born (Origin) and the type of habitat/food where they were laying eggs (Type).

In the experiment (Fig. 0.1), bean beetles reared in different habitats (Origin) were transferred to different types of bean (Type) until they laid eggs. Then, the number of eggs was counted. Our question is: does the type of bean affect the reproductive outcome of the beetles?

_With this experimental setup we aim to explore the combined effects of two variables: the type of bean where the beetles were originally hosted (`Origin`) and the type of bean where the beetles laid eggs (`Type`). Origin has 2 levels (i.e., 2 types of beans: mung and black). Type has 2 levels (i.e., 2 types of beans: mung and black). Note that all individuals with the same Origin WERE NOT hosted in the same Petri. For each combination of Petri:Origin:Type we have 4 replicates_

Figure 0.1: With this experimental setup we aim to explore the combined effects of two variables: the type of bean where the beetles were originally hosted (Origin) and the type of bean where the beetles laid eggs (Type). Origin has 2 levels (i.e., 2 types of beans: mung and black). Type has 2 levels (i.e., 2 types of beans: mung and black). Note that all individuals with the same Origin WERE NOT hosted in the same Petri. For each combination of Petri:Origin:Type we have 4 replicates

The data set obtained in the experiment includes the following variables such as:

petri dish number (Petri),
number of eggs (Egg_Number),
type of habitat/food where bean beetles were born (Origin),
type of habitat/food where eggs were laid (Type).

Of course, your data set may include other variables if you have decided to extend the scope of the experiment to other parameters.

In the following sections, you will learn to use R/RStudio to manage this data set, run a statistical analysis using the appropriate statistical model, and draw a plot representing the results.

We at bioST@TS do not expect that you already know how to use R/RStudio. This is why we have prepared three pages that help you start working with the software. On this first page, you will find all you need to know about R/RStudio, where to download it, how to use the interface, etc. The second page shows the basics of coding in R, and describes the different types of variables and objects. Finally on the third page, you will find all you need to know about importing data into R/RStudio.

As for now, we will proceed step-by-step and do the following:

punch and organize your data in an Excel sheet,
import the data set from Excel to R/RStudio,
check the integrity of the imported data set,
perform a statistical analysis,
make a plot or figure that summarizes the results of the experiment.

1 Punch and organize the data

When planning, performing and completing the lab experiment, it is strongly recommended to use a spreadsheet such as Excel to write down variables and observations. Of course, you may write everything on a sheet of paper or your lab journal before transferring to the spreadsheet, but this increases the chance of making mistakes, or even losing the whole data set.

1.1 Punch your data in a table

It is smart to organize the data in a table, with observations in rows and variables in columns. You may use colors, lines and other formatting tools if it helps better visualize the data, but keep in mind that anything else than text and numbers will disappear when importing the data set into R.

Figure 1.1 shows an example of a data set with 4 variables and 48 observations registered in Excel:

Figure 1.1: Example of a simple data set in Excel

1.2 Work tidy

When punching data in Excel, you must think “tidy”. By producing a tidy data set, you make the whole process of data import and analysis in R much simpler as your data set will need minimal (or no) modification prior to analysis. If your data set is not tidy, you may need to transpose parts of the table, remove/rename variables, split cell contents, etc. That may cost you time and energy, and is often a source for errors.

Briefly, tidy data follows this short series of rules:

each variable in the data set is presented in a specific column,
each observation in the data set is presented in a specific row,
each cell at the intersection of a row and a column contains a single value.

Figure 1.2 illustrates well these rules.

_In a tidy dataset, variables are in columns, observations are in rows, and values are in cells._ --- Source: [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) - [CC BY-NC-ND 3.0 US](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)

Figure 1.2: In a tidy dataset, variables are in columns, observations are in rows, and values are in cells. — Source: R for Data Science - CC BY-NC-ND 3.0 US

The data set presented in Figure 1.1 respects all three rules, and is thus a tidy data set.

On the contrary, the data set presented in the figure 1.3 is not tidy:

Figure 1.3: Example of a non tidy data set

Even though the content of the table is exactly the same as in Fig. 1.1 (but only transposed), the table does not display the variables in columns and observations in rows. This is suboptimal for the work to come as the R functions we will use will not operate properly.

1.3 Save the data in a CSV file

The most practical way to safely transfer data from Excel to R consists in:

saving the data in a CSV (Comma-Separated Values)-formatted file,
importing the CSV file into R.

In a CSV file, all the values are separated with symbols, often a comma ,, but also a semi-colon ; especially in European countries. To save the data in a CSV file, simply click on File > Save As in the top menu:

Figure 1.4: Use Save As to save your data in Excel.

Then choose CSV UTF-8, give the file a name (for example beetle_data) and save it in the location of your choice:

Figure 1.5: Choose CSV UTF-8 and give your file a name.

1.4 Check the CSV file

Now that you have created a CSV file (beetle_data.csv), open it in any text editor such as Notepad and check its content. Figure 1.6 shows the CSV file that contains the data set introduced in Figure 1.1:

Figure 1.6: This is how a CSV file looks like in a text editor.

Each line in this file corresponds to a row in the original table (including header). All cell contents are separated with a semi-colon ;. Again, remember that the CSV format may use commas ,, semi-colons ; and other symbols as delimiter. The use of a comma , is not universal.

2 Import the data set in R

Here starts our work in RStudio and R. Make sure that you have familiarized yourself with the interface of RStudio (see this page).

2.1 Create an R project

RStudio allows you to divide your work into projects which are independent from each other. A project has its on working directory in which you can create specific scripts, load data sets, add external files, activate packages, etc. For each project, you will thus be guaranteed to work in a dedicated workspace.

Let’s create such a project for BIO241. To create a project, go to the main menu and select File > New Project… as shown in Figure 2.1:

Figure 2.1: Creating a new project – step 1.

Click again on New Directory > New Project, choose a project name (for example “BIO241”) and a destination on your disk, and click the button “Create Project” as shown in Figure 2.2:

Figure 2.2: Creating a new project – step 2.

Feel free to import in the project folder all the files that you will need later on, such as original data sets, CSV files, etc. You may for example create a new folder called data where you will store your CSV file.

2.2 Load the necessary packages

Before we go any further, let’s load the packages that will be necessary in the present project. Write the following code in the script and use CTRL + Enter (⌘+ Enter) to run it:

library("tidyverse")

2.3 Import with read_delim()

To import data from the CSV file in R, we will use functions in the readr package which is part of the tidyverse package. readr comes with several functions that open and read data files. Among them is read_delim().

read_delim() has two main mandatory arguments:

file = " " which defines where to find the data file,
delim = " " which defines the symbol used as delimiter (i.e. , or ;).

With the line of code below, R imports the content of beetle_data.csv into the object beetle. It retrieves the file from the subfolder data of the current RStudio project and recognizes ; as a delimiter.

beetle <- read_delim(file = "data/beetle_data.csv", delim = ";")

## Rows: 48 Columns: 4

## -- Column specification --------------------------------------------------------
## Delimiter: ";"
## chr (2): Origin, Type
## dbl (2): Petri, Egg_Number

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

NB: if your CSV file uses , as a delimiter, then you must write delim = ",".

When the function is executed, the console shows a short series of warnings and messages. In the frame above, the message Column specification tells you that the contents of the columns Origin and Type have been recognized as of data type character (chr), and that the two other columns have been recognized as double (dbl, which is similar to numeric).

The object beetle is now visible in the tab Environment in RStudio, as shown in the figure below.

Figure 2.3: The object beetle is now visible in the tab Environment.

2.4 Check the imported data in R

To view the content of the object beetle, simply write its name:

beetle

## # A tibble: 48 x 4
##    Petri Origin Type  Egg_Number
##    <dbl> <chr>  <chr>      <dbl>
##  1     1 Mung   Mung         100
##  2     2 Mung   Mung          87
##  3     3 Mung   Mung          99
##  4     1 Mung   Black         74
##  5     2 Mung   Black         35
##  6     3 Mung   Black         43
##  7     4 Black  Mung          48
##  8     5 Black  Mung          70
##  9     6 Black  Mung          89
## 10     4 Black  Black         45
## # ... with 38 more rows

A basic table with rows and columns shows up. The table displays the 4 variables and the top 10 rows of the data set. It also tells you which variables are recognized as of type chr and dbl.

2.4.1 Incorrect delimiter

Note that things may look different (and thus wrong) if you fail to use the correct delimiter in the function readr. The following example shows you what happens if you use delim = "," whereas ; is the actual delimiter in the original CSV file:

beetle <- read_delim(file = "data/beetle_data.csv", delim = ",")

## Rows: 48 Columns: 1

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Petri;Origin;Type;Egg_Number

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

As you may see, a message tells you that the table is made of 48 rows, but only one column, which is already a sign that things went wrong. In addition the line starting with chr (1) means that the only variable that R recognized is of type chr and its label is Petri;Origin;Type;Egg_Number which corresponds to all our variable names compressed in one single cell.

Looking carefully at the content of the object beetle (see below), you can see that the data set does not look like a table with organized columns.

beetle

## # A tibble: 48 x 1
##    `Petri;Origin;Type;Egg_Number`
##    <chr>                         
##  1 1;Mung;Mung;100               
##  2 2;Mung;Mung;87                
##  3 3;Mung;Mung;99                
##  4 1;Mung;Black;74               
##  5 2;Mung;Black;35               
##  6 3;Mung;Black;43               
##  7 4;Black;Mung;48               
##  8 5;Black;Mung;70               
##  9 6;Black;Mung;89               
## 10 4;Black;Black;45              
## # ... with 38 more rows

This means that readr has not found the delimiter in the CSV file and thus failed to split the cells correctly.

2.5 More about importing data

You will find more info about importing data in R on this page.

3 Explore the data

It is always a good idea to review your data set and check the integrity of your data before starting the analysis. As we just saw in the previous section, things may go wrong when importing data. Also things may have gone wrong when punching data. Typos are often your worst enemy.

Here are a series of things to check before going further with data analysis.

3.1 Check the variable types

Are the variable types correctly set or recognized by R?
Have a quick look at the imported data set and check that numerical variables are indeed in a column labeled <dbl> and that names or text values are labeled <chr>.

beetle

## # A tibble: 48 x 4
##    Petri Origin Type  Egg_Number
##    <dbl> <chr>  <chr> <chr>     
##  1     1 Mung   Mung  100       
##  2     2 Mung   Mung  87        
##  3     3 Mung   Mung  99        
##  4     1 Mung   Black 74        
##  5     2 Mung   Black 35        
##  6     3 Mung   Black 43a       
##  7     4 Black  Mung  48        
##  8     5 Black  Mung  70        
##  9     6 Black  Mung  89        
## 10     4 Black  Black 45        
## # ... with 38 more rows

In the example above, the variable Egg_Number is incorrectly defined as a text variable (<chr>) instead of numerical (<dbl>). This is the direct consequence of a typo. The 6^th observation shows 43a instead of 43. As a result, R thinks the whole column is text… In such a case, correct the mistake in the CSV file and reimport the data.

3.2 Set variables as factors

Are text values just names? Are the numbers just counts? Or do these text/numerical values define categories such as treatment, bean type, Petri dish number, etc. ? If the values help categorizing observations, then they are factors. This is often very important for data analysis of specific statistical models, and must be defined properly before going further.

For instance, in the data set beetle, all the variables except for Egg_Number are in fact factors: Petri is nothing more than the label of the petri dish in which the beetles are hosted, Origin is the type of beans the beetles originate from, etc.

To define variables as factors, use the function as.factor() as follows:

beetle$Petri <- as.factor(beetle$Petri) 
beetle$Origin <- as.factor(beetle$Origin)
beetle$Type <- as.factor(beetle$Type)

NB: The $ is used to connect a variable to an object. for example, beetle$Petri means “the variable Petri contained in the object beetle”.

As a result of these operations, the data set looks now like this:

beetle

## # A tibble: 48 x 4
##    Petri Origin Type  Egg_Number
##    <fct> <fct>  <fct>      <dbl>
##  1 1     Mung   Mung         100
##  2 2     Mung   Mung          87
##  3 3     Mung   Mung          99
##  4 1     Mung   Black         74
##  5 2     Mung   Black         35
##  6 3     Mung   Black         43
##  7 4     Black  Mung          48
##  8 5     Black  Mung          70
##  9 6     Black  Mung          89
## 10 4     Black  Black         45
## # ... with 38 more rows

All the variables are now factors (marked with <fct>), except for Egg_Number which remains as <dbl>.

3.3 Check factor levels

Levels are the distinct values that a factor can take. For instance, in the case of the variable Type in beetle, there are 2 levels which are “Mung” and “Black”.

It is good practice to always check that the levels of a factor in your data set are those expected. To review the levels of a given factor, simple write the full name of the factor. In our case, we can use beetle$Type, beetle$Origin and beetle$Petri to get a description of the variables:

beetle$Type

##  [1] Mung  Mung  Mung  Black Black Black Mung  Mung  mung  Black Black Black
## [13] Mung  Mung  Mung  Black Black Black Mung  Mung  Mung  Black Black Black
## [25] Mung  Mung  Mung  Black Black Black Mung  Mung  Mung  Black Black Black
## [37] Mung  Mung  Mung  Black Black Black Mung  Mung  Mung  Black Black Black
## Levels: Black mung Mung

beetle$Origin

##  [1] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## [13] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## [25] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## [37] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## Levels: Black Mung

beetle$Petri

##  [1] 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2
## [39] 3 1 2 3 4 5 6 4 5 6
## Levels: 1 2 3 4 5 6

Evreything loooks great, except for the last line of the result for beetle$Type which shows Levels: Black mung Mung while we expect only 2 levels (i.e. Levels: Black Mung). “mung” seems to be the unexpected level and you may certainly understand that “Mung” was spelled without capital letter by mistake when punching the data in Excel. A quick examination of the values in Type shows that 9^th entry is wrong. That typo has unexpectedly created a third level which could have impacted the whole data analysis. Once again, go back to the original data set, correct it and reimport it.

3.4 Check the range of data

When managing numerical variables, you should also check whether the values are in the expected range, providing that you have some degree of knowledge about what to expect. Here we will visually check the content of the variable Egg_Number using plot(), a function that plots all the values on a graph following their order in the data set.

plot(beetle$Egg_Number)

You clearly see that the 5^th entry is largely above the others, and is certainly an outlier. Whether this value is a typo or a real/correct data entry that must be kept must be evaluated carefully. If it is a mistake, correct the data set and reimport. In all cases, it is worth your attention before going further with analysis. If it is not a mistake, try to investigate that data point further… there is maybe something interesting or unusual about it.

4 Statistical analysis

Now that we have checked the integrity of the data set, we may proceed with statistical analysis. Both the design of your experiment, the type of variables and your working hypothesis will define which type of model and test you shall run.

4.1 Which model to fit the data?

Fitting the wrong kind of model to your data may have disastrous consequences for your study, results, conclusion and possibly your sanity. If you wonder which model is best to use with a specific data set, try this key.

Let’s recap some important facts about the current experiment:

you will be (or have been) counting eggs or beetles in petri dishes; count data follows a non-normal (non-Gaussian) distribution, most likely a Poisson distribution. You may read more about distributions here.
the predictive variables in the data set are categorical.
the working hypothesis is that the egg/beetle counts are affected by one of several of the predictive variables; the null hypothesis H₀ is that there is no difference between counts.

All of this tells us that the model to consider here is a Generalized Linear Model (GLM) with Poisson distribution.

4.2 How to code that in R?

If you want to fit a GLM with one predictive variable and one response variable, here is the code to apply:

model <- glm(formula = response_variable ~ predictive_variable,
             family = poisson,
             data = object)

where object is the object that contains the data set.

NB: we have written family = poisson since count data follow a Poisson distribution. If your data is based on proportions, you should write family = binomial as proportions usually follow a binomial distribution.

If you want to fit a GLM with two predictive variables and one response variable, here is the code to apply:

model <- glm(formula = response_variable ~ predictive_variable1 + predictive_variable2,
             family = poisson,
             data = object)

If you want to fit a GLM with two predictive variables and one response variable, AND you want to test for an interaction between the two predictive variables, here is the code to apply:

model <- glm(formula = response_variable ~ predictive_variable1 * predictive_variable2,
             family = poisson,
             data = object)

NB: the only difference between this code and the previous one is the use of * instead of +.

Note that the lines of code above store the model in the object model. They do not print or display anything. To get the results of model fitting (and thus know whether there is an effect of the predictive variable(s) on the response variable), we have to use the function summary():

summary(model)

With summary() comes a whole page of results including Deviance Residuals and Coefficients. Even though you should have a good look at the whole document and evaluate carefully the various properties of the model fitting, you will certainly be more interested in the section Coefficients as it tells you whether there is an effect of one or several of the levels/variables.

This will be clearer with practical examples.

4.3 A simple example

Back to our beetle experiment! Let’s say that the working hypothesis in this example is that the number of eggs laid by beetles (Egg_Number) varies with the type of beans the eggs are laid on (Type) and the origin of the beetles (Origin).

Let’s write it this way:

model <- glm(formula = Egg_Number ~ Origin + Type,
             family = poisson,
             data = beetle)

Let’s now get the results of model fitting with summary():

summary(model)

## 
## Call:
## glm(formula = Egg_Number ~ Origin + Type, family = poisson, data = beetle)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.2169   -5.4331   -0.2997    4.6233    9.9556  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.38898    0.02738 160.287  < 2e-16 ***
## OriginMung   0.01397    0.03052   0.458    0.647    
## TypeMung     0.18685    0.03065   6.095 1.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1885.0  on 47  degrees of freedom
## Residual deviance: 1847.4  on 45  degrees of freedom
## AIC: 2140.8
## 
## Number of Fisher Scoring iterations: 5

4.3.1 How to decrypt this table?

Let’s look at Coefficients.

Figure 4.1: The section coefficients shows the existence or absence of effect of a predictive variable on the response variable.

The yellow and the green lines report on the separate effect of Origin and Type.

First you notice that not all levels seem to be present under the line “(Intercept)”. The variable Type has 2 levels; we find “TypeMung”, but not “TypeBlack” (yellow line). Also, the variable Origin has 2 levels; we find “OriginMung” but not “OriginBlack” (green line). In fact, the table omits the first alphabetical level of each variable because these are taken as intercept. Thus the absence of both TypeBlack and OriginBlack.

Now, let’s check whether the analysis supports our hypotheses.
The absence of stars and a p-value greater than 0.05 at the end of the green line “OriginMung” tell us that we must accept the null hypothesis: there is no influence of the origin of the beetles on the number of eggs laid. The presence of three stars and a p-value less than 0.001 at the end of the yellow line “TypeMung” tell us on the contrary that we can reject the null hypothesis and accept our working hypothesis: the egg count is affected by the type of bean they eggs are laid on.

4.4 A more advanced example

Let’s repeat the analysis, considering this time that there might be an interaction between the variables Origin and Type.

Let’s write it this way:

model <- glm(formula = Egg_Number ~ Origin * Type,
              family = poisson,
              data = beetle)

Let’s now get the results of model fitting with summary():

summary(model)

## 
## Call:
## glm(formula = Egg_Number ~ Origin * Type, family = poisson, data = beetle)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -12.9783   -5.3054   -0.2157    4.9845   10.3748  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          4.42784    0.03154 140.370  < 2e-16 ***
## OriginMung          -0.06474    0.04535  -1.428  0.15343    
## TypeMung             0.11457    0.04339   2.641  0.00827 ** 
## OriginMung:TypeMung  0.14403    0.06135   2.348  0.01889 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1885.0  on 47  degrees of freedom
## Residual deviance: 1841.9  on 44  degrees of freedom
## AIC: 2137.3
## 
## Number of Fisher Scoring iterations: 5

Let’s have a look at Coefficients.

_The section_ coefficients _shows the existence or absence of effect of- and interaction between the predictive variables on the response variable._

Figure 4.2: The section coefficients shows the existence or absence of effect of- and interaction between the predictive variables on the response variable.

Here we find 3 lines under “(Intercept)”. Both the yellow and the green lines report on the effect of Origin and Type separately, just like in the previous example. In addition, the blue line reports on the effect of an interaction of Type with Origin.

As there is an interaction, the intercept now represents “TypeBlack” and “OriginBlack” dishes, so the low p-value on the blue line tells us that the average egg count in “TypeMung:OriginMung” dishes is different from the count in “TypeBlack:OriginBlack”. The high p-value on the green line (“OriginMung”) shows that the average count in “TypeBlack:OriginMung” dishes is not different from “TypeBlack:OriginBlack”. The low p-value on the yellow line (“TypeMung”) shows that the average count in “TypeMung:OriginBlack” dishes is different from “TypeBlack:OriginBlack”.

But we don’t know whether the other combinations are different, for instance “TypeMung:OriginBlack” vs “TypeMung:OriginMung”, “TypeMung:OriginBlack” vs “TypeBlack:OriginMung”, and “TypeMung:OriginBlack” vs “TypeBlack:OriginMung”. That’s why we need to run a post-hoc test.

4.5 Post hoc pairwise comparisons

Until now, we have been able to define whether there exists a significant effect (or interaction) between variables. However, we do not know which groups are significantly different from each other. We thus need to run post hoc pairwise comparisons between groups to come to a valid conclusion to our analysis.

NB: remember that you may run post hoc pairwise comparisons between levels of a variable only if that variable has been found to have a significant effect on the response variable. In our case, only Type has a significant effect on Egg_Number, not Origin. However, we are still allowed to conduct pairwise comparisons considering the interaction between Type and Origin.

Here we will use the function emmeans() in the package emmeans to perform these pairwise comparisons.

First, we need to load the package:

library(emmeans)

Now let’s use the following code to perform pairwise comparisons considering the levels of a single predictive variable:

emmeans(object,
        pairwise ~ predictive_variable,
        type = "response", 
        adjust = "none")

where object is the GLM that we have build in the previous analysis. Applied to our example, and considering only Type, the code looks like this:

emmeans(model, 
        pairwise ~ Type,
        type = "response", 
        adjust = "none")

## NOTE: Results may be misleading due to involvement in interactions

## $emmeans
##  Type  rate   SE  df asymp.LCL asymp.UCL
##  Black 81.1 1.84 Inf      77.6      84.8
##  Mung  97.7 2.02 Inf      93.8     101.8
## 
## Results are averaged over the levels of: Origin 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast     ratio     SE  df null z.ratio p.value
##  Black / Mung  0.83 0.0255 Inf    1  -6.083  <.0001
## 
## Results are averaged over the levels of: Origin 
## Tests are performed on the log scale

The top of the output gives the average for the levels Black and Mung of Type. For instance, the average count for Mung (97.7) is higher than for Black (81.1). You will certainly recognize these values when we will draw the first plot in Section 5.1.

The lower part of the output gives the contrasts, with ratio and SE, and shows a p-value less than 0.0001 for the (only) pair Black / Mung. Actually, the ratio referred to here is nothing else than the ratio Black / Mung calculated from the rates presented in the top part of the output:

$\frac{Black}{Mung}$ = $\frac{81.1}{97.7}$ = 0.83

About NOTE: Results may be misleading due to involvement in interactions: this reminds you that you may get an incomplete overview of the situation because this analysis does not consider the interaction found in the model.

Let’s thus rerun emmeans(), this time considering the interaction. The syntax is as follows:

emmeans(object, 
        pairwise ~ predictive_variable1*predictive_variable2,
        type = "response", 
        adjust = "none")

Applied to Type and Origin, the code looks like this:

emmeans(model, 
        pairwise ~ Origin*Type,
        type = "response", 
        adjust = "none")

## $emmeans
##  Origin Type   rate   SE  df asymp.LCL asymp.UCL
##  Black  Black  83.8 2.64 Inf      78.7      89.1
##  Mung   Black  78.5 2.56 Inf      73.6      83.7
##  Black  Mung   93.9 2.80 Inf      88.6      99.6
##  Mung   Mung  101.7 2.91 Inf      96.1     107.5
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast                 ratio     SE  df null z.ratio p.value
##  Black Black / Mung Black 1.067 0.0484 Inf    1   1.428  0.1534
##  Black Black / Black Mung 0.892 0.0387 Inf    1  -2.641  0.0083
##  Black Black / Mung Mung  0.824 0.0351 Inf    1  -4.551  <.0001
##  Mung Black / Black Mung  0.836 0.0369 Inf    1  -4.062  <.0001
##  Mung Black / Mung Mung   0.772 0.0335 Inf    1  -5.962  <.0001
##  Black Mung / Mung Mung   0.924 0.0382 Inf    1  -1.919  0.0550
## 
## Tests are performed on the log scale

The top of the output gives the average for the 4 groups/combinations “Type Origin”. The lower part of the output gives the contrasts with ratio, SE, and p-value for each of the pairs. This part may be a bit confusing in this particular example since the levels of Type and Origin are identical. Remember that the groups are named following this pattern “Type Origin”. For example “Black Mung” corresponds to “TypeBlack OriginMung”.

Let’s now look at the section $Contrasts

Figure 4.3: The section contrasts shows the results of the pairwise comparisons.

The first three lines (yellow, green and blue) give you similar results to those displayed in the section Coefficients in Fig. 4.2. The next three lines (orange-red) give you the results which you were missing in summary(). So in fact, you find almost all the results you need in the output of emmeans().

Looking at the actual results, the output tells you that there are significant differences between groups tested in line 1, 3, 4 and 6. However, be careful as this output does not filter out the pairs that you are not allowed to consider due to the lack of significant effect of Origin on Egg_Number.

5 Making a plot

ggplot2 is a package that helps produce clear and reproducible graphs. Here, we review a few solutions based on ggplot2 that create simple but meaningful graphs based on the data analysis above. Note that all of the solutions introduced below are boxplots since this type of plot is possibly the most relevant in the present experiment.

5.1 Example 1

Coding for the most basic boxplot in ggplot2 looks like this:

ggplot(dataset, aes(x = predictive_variable, y = response_variable)) +
  geom_boxplot()

In brief, ggplot() uses the function aes() to map the variables and the function geom_boxplot() to draw the figure. Applied to our example and considering the variables Type and Number in the data set beetle, the code and its result look like this:

ggplot(beetle, aes(x = Type, y = Egg_Number)) +
  geom_boxplot()

NB: this plot matches the analysis that we performed previously with this code in section 4.5:

emmeans(model, 
        pairwise ~ Type,
        type = "response", 
        adjust = "none")

5.2 Example 2

We may also produce a boxplot that takes two predictive variables into account. Here we use fill = to add the second predictive variable in aes()

ggplot(dataset, aes(x = predictive_variable, y = response_variable, )) +
  geom_boxplot()

The variable in fill = will be represented with colors, and the corresponding legend will be added automatically to the right.

To use both Type and Origin as predictive variable in the plot, we write:

ggplot(beetle, aes(x = Type, y = Egg_Number, fill = Origin)) +
  geom_boxplot()

Note that swapping the predictive variables in aes() produces a slightly different output:

ggplot(beetle, aes(x = Origin, y = Egg_Number, fill = Type)) +
  geom_boxplot()

5.3 Example 3

If you want to split the figure into two panels, each of which shows one level of the second variable, you may add facet_grid(~ predictive_variable2) to the code. NB: do not forget the + sign at the end of the previous line.

ggplot(beetle, aes(x = Type, y = Egg_Number)) +
  geom_boxplot() + 
  facet_grid(~ Origin)

Again, you may swap the predictive variables:

ggplot(beetle, aes(x = Origin, y = Egg_Number)) +
  geom_boxplot() + 
  facet_grid(~ Type)

6 In summary

OK, that was quite a lot of information at once! Here is a quick summary of the code that we have use step by step to perform the whole analysis, starting from the CSV file and ending with a plot.

Further below, you will find the same code with all the results.

6.1 The code in a nutshell

# Let's start with loading the package `tidyverse`

library("tidyverse")

# We import the data from the CSV file in the object beetle

beetle <- read_delim(file = "data/beetle_data.csv", delim = ";")


# We set the variables `Petri`, `Origin` and `Type` as factors

beetle$Petri <- as.factor(beetle$Petri) 
beetle$Origin <- as.factor(beetle$Origin)
beetle$Type <- as.factor(beetle$Type)


# We check the factors for abnormal levels

beetle$Type
beetle$Origin
beetle$Petri

# We plot the count data to check for abnormal values

plot(beetle$Egg_Number)

# Let's fit a GLM to the data

model <- glm(formula = Egg_Number ~ Origin * Type,
              family = poisson,
              data = beetle)

# Now we get the results of model fitting

summary(model)


# Here we run all the necessary pairwise comparisons

emmeans(model, 
        pairwise ~ Origin * Type,
        type = "response", 
        adjust = "none")

# And finally we draw a plot

ggplot(beetle, aes(x = Type, y = Egg_Number, fill = Origin)) +
  geom_boxplot()

6.2 The code and its results

# Let's start with loading the package `tidyverse`

library("tidyverse")

# We import the data from the CSV file in the object beetle

beetle <- read_delim(file = "data/beetle_data.csv", delim = ";")

## Rows: 48 Columns: 4

## -- Column specification --------------------------------------------------------
## Delimiter: ";"
## chr (2): Origin, Type
## dbl (2): Petri, Egg_Number

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

# We set the variables `Petri`, `Origin` and `Type` as factors

beetle$Petri <- as.factor(beetle$Petri) 
beetle$Origin <- as.factor(beetle$Origin)
beetle$Type <- as.factor(beetle$Type)


# We check the factors for abnormal levels

beetle$Type

##  [1] Mung  Mung  Mung  Black Black Black Mung  Mung  Mung  Black Black Black
## [13] Mung  Mung  Mung  Black Black Black Mung  Mung  Mung  Black Black Black
## [25] Mung  Mung  Mung  Black Black Black Mung  Mung  Mung  Black Black Black
## [37] Mung  Mung  Mung  Black Black Black Mung  Mung  Mung  Black Black Black
## Levels: Black Mung

beetle$Origin

##  [1] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## [13] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## [25] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## [37] Mung  Mung  Mung  Mung  Mung  Mung  Black Black Black Black Black Black
## Levels: Black Mung

beetle$Petri

##  [1] 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2 3 1 2 3 4 5 6 4 5 6 1 2
## [39] 3 1 2 3 4 5 6 4 5 6
## Levels: 1 2 3 4 5 6

# We plot the count data to check for abnormal values

plot(beetle$Egg_Number)

# Let's fit a GLM to the data

model <- glm(formula = Egg_Number ~ Origin * Type,
              family = poisson,
              data = beetle)

# Now we get the results of model fitting

summary(model)

## 
## Call:
## glm(formula = Egg_Number ~ Origin * Type, family = poisson, data = beetle)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -12.9783   -5.3054   -0.2157    4.9845   10.3748  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          4.42784    0.03154 140.370  < 2e-16 ***
## OriginMung          -0.06474    0.04535  -1.428  0.15343    
## TypeMung             0.11457    0.04339   2.641  0.00827 ** 
## OriginMung:TypeMung  0.14403    0.06135   2.348  0.01889 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1885.0  on 47  degrees of freedom
## Residual deviance: 1841.9  on 44  degrees of freedom
## AIC: 2137.3
## 
## Number of Fisher Scoring iterations: 5

# Here we run all the necessary pairwise comparisons

emmeans(model, 
        pairwise ~ Origin * Type,
        type = "response", 
        adjust = "none")

## $emmeans
##  Origin Type   rate   SE  df asymp.LCL asymp.UCL
##  Black  Black  83.8 2.64 Inf      78.7      89.1
##  Mung   Black  78.5 2.56 Inf      73.6      83.7
##  Black  Mung   93.9 2.80 Inf      88.6      99.6
##  Mung   Mung  101.7 2.91 Inf      96.1     107.5
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast                 ratio     SE  df null z.ratio p.value
##  Black Black / Mung Black 1.067 0.0484 Inf    1   1.428  0.1534
##  Black Black / Black Mung 0.892 0.0387 Inf    1  -2.641  0.0083
##  Black Black / Mung Mung  0.824 0.0351 Inf    1  -4.551  <.0001
##  Mung Black / Black Mung  0.836 0.0369 Inf    1  -4.062  <.0001
##  Mung Black / Mung Mung   0.772 0.0335 Inf    1  -5.962  <.0001
##  Black Mung / Mung Mung   0.924 0.0382 Inf    1  -1.919  0.0550
## 
## Tests are performed on the log scale

# And finally we draw a plot

ggplot(beetle, aes(x = Type, y = Egg_Number, fill = Origin)) +
  geom_boxplot()

BIO241 - Experimenting with bean beetles