Whereas one-way ANOVA allows for comparison of three or more group means based on the different levels of a single factor, factorial design allows for comparison of groups based on several independent variables and their various levels. Thus, comparing observations (such as tree size, egg size…) categorized by two predictor variables (for example location and species) may be done via a two-way ANOVA; if three predictor variables are involved (for example location, species and year), then a three-way ANOVA may be used… In what follows, we will focus on two-way ANOVA, but you’ll quickly understand how the functions may be adapted to more factors.

Using factorial ANOVA, we get to assess main effects, in other words the impact of single factors (predictor variables) on a given response variable (plant size, egg number, individual bodyweight…). We also get to see whether there are interactions between factors, i.e. whether the effect of a given factor is influenced by (one of) the other factors.

As with one-way ANOVA, the function to use is lm(), followed by anova().

Let’s see that in an example similar to the one that we looked at in one-way ANOVA. Here, we check whether the average size of blue ground beetles (Carabus intricatus) differs depending on their location. We now introduce a new factor: measurements were performed in 2005 and in 2015 in the same 3 forests A, B and C. In each location and each year, we measured the size (in millimeters) of 10 individuals (making it a balanced design). In total, 60 individuals were measured. The two factors are location and year.

Here is the code for creating the dataframe:

# response variable
size <- c(25,22,28,24,26,24,22,21,23,25,26,30,25,24,21,27,28,23,25,24,20,22,24,23,22,24,20,19,21,22,24,27,26,24,25,27,22,28,25,24,27,29,26,27,25,27,28,24,24,26,21,23,25,20,25,23,25,19,22,21)

# predictor variables
location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10), rep("ForestA",10), rep("ForestB",10), rep("ForestC",10)))
year <- as.factor(c(rep("2005",30), rep("2015",30)))

# dataframe
my.dataframe <- data.frame(size,location,year)



We start by visualizing the data with boxplots. Since we have two factors, we’ll group the boxes by location and color them by year.

library(ggplot2)   # load ggplot2 (if not already loaded)

ggplot(my.dataframe, aes(x = location, y = size, fill = year)) +
  geom_boxplot()
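
If you also want the group means as numbers (a small optional addition, not part of the original analysis), base R’s aggregate() can summarize the response per combination of factors:

# mean size per location-by-year combination (optional numerical overview)
aggregate(size ~ location + year, data = my.dataframe, FUN = mean)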



Assumptions

The assumptions are globally the same as for one-way ANOVA: observations are independent, residuals are normally distributed, and variances are homogeneous (equal) across groups.


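As a minimal sketch (not part of the original text), these assumptions can be checked on the residuals of the model fitted in the next section, for instance with base R functions:

# fit the model (same call as in the next section)
model <- lm(size ~ location * year, data = my.dataframe)

# normality of the residuals: Shapiro-Wilk test
shapiro.test(residuals(model))

# homogeneity of variances across the six location-by-year groups: Bartlett test
bartlett.test(size ~ interaction(location, year), data = my.dataframe)

# diagnostic plots (residuals vs fitted values, Q-Q plot, ...)
plot(model)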

Running the test

The syntax is lm(response ~ predictor1 * predictor2, data = dataframe), where response is the response variable, predictor1 and predictor2 are the predictor variables or factors (which categorize the observations) and dataframe is the name of the dataframe that contains the data. The * indicates that we want both main effects and their interaction. We first fit the linear model with lm() and store it in the object model; we then compute and display the analysis of variance table using anova():

model <- lm(size ~ location * year, data = my.dataframe)
anova(model)
## Analysis of Variance Table
## 
## Response: size
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## location       2 146.700  73.350 17.8178 1.142e-06 ***
## year           1  14.017  14.017  3.4049   0.07049 .  
## location:year  2   0.633   0.317  0.0769   0.92606    
## Residuals     54 222.300   4.117                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table above shows the F value and p-value for each of the main effects, location and year, as well as for the interaction location:year. It also conveniently indicates with stars which significance levels are reached. For instance, size is significantly affected by the factor location, with a very small p-value (p < 0.001, ***). By contrast, there is no significant effect of year (p = 0.07), nor a significant interaction between the two predictor variables (p = 0.93).
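
One optional way to visualize the (absence of an) interaction, not shown in the original text, is a base R interaction plot, where roughly parallel lines suggest that the effect of location does not depend on year:

# mean size per location, one line per year; roughly parallel lines suggest no interaction
interaction.plot(x.factor = my.dataframe$location,
                 trace.factor = my.dataframe$year,
                 response = my.dataframe$size)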

Since the interaction is not significant, we may remove the interaction term from the model and use this syntax instead: lm(response ~ predictor1 + predictor2, data = dataframe). The only difference is that a + replaces the *:

model2 <- lm(size ~ location + year, data = my.dataframe)
anova(model2)
## Analysis of Variance Table
## 
## Response: size
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## location   2 146.700  73.350 18.4252 7.101e-07 ***
## year       1  14.017  14.017  3.5209   0.06581 .  
## Residuals 56 222.933   3.981                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here again, size is significantly affected by the factor location, but not by year. Note that the p-values differ slightly from those obtained with the first model: the degrees of freedom and sum of squares previously attributed to the interaction are now pooled into the residuals.
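
Though not part of the original analysis, one way to formally check that dropping the interaction term does not significantly worsen the fit is to compare the two nested models with anova():

# F-test comparing the additive model (model2) with the full model (model);
# a large p-value indicates that the interaction term can safely be dropped
anova(model2, model)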

However, the ANOVA table does not tell us which group means are significantly different from each other.

Indeed, this test informs us about main effects and interactions, but it does not identify which groups differ significantly from the others. For that, we will need post hoc tests. We can run multiple comparisons in linear models [factorial design].
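
As one possible illustration (among several approaches, including the multiple comparisons procedure referred to above), Tukey’s HSD applied to an equivalent aov() fit gives pairwise comparisons for the levels of each factor; the object name posthoc is just chosen here for the example:

# refit the additive model with aov() and run Tukey's HSD pairwise comparisons
posthoc <- TukeyHSD(aov(size ~ location + year, data = my.dataframe))
posthoc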



Alternative

If you need an alternative to the factorial ANOVA because, for example, you suspect a violation of the assumption of normality, you may use a non-parametric test called the Friedman rank sum test (or simply Friedman’s test).
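
As a rough sketch only (not part of the original text): in R, this test is available as friedman.test(), but it expects an unreplicated complete block design (one observation per treatment-block combination), so with replicated data such as ours one would typically aggregate first, for example to cell means:

# aggregate to one value (the mean) per location-by-year cell, then test the effect
# of location using year as the blocking factor (illustration only)
cell.means <- aggregate(size ~ location + year, data = my.dataframe, FUN = mean)
friedman.test(size ~ location | year, data = cell.means)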