When designing a boxplot for a data set with two or more categorical variables, one may need to group/cluster some of the boxes by category. Such a clustered (grouped) boxplot is very easy to create if you already know how to draw boxplots.
Before going any further, if you are not so familiar with boxplots, have a quick look at this page:
Here we will use an example where values is the response variable, and category1 and category2 are the categorical predictor variables. The dataframe for this tutorial is as follows:
# dataframe
df <- data.frame(values, category1, category2)
# structure of the dataframe
str(df)
## 'data.frame': 400 obs. of 3 variables:
## $ values : num 15.5 23.5 31.9 29.1 23.5 ...
## $ category1: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
## $ category2: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
As you may guess from the structure of the dataframe above, category1 has 4 levels (A, B, C and D) and category2 has only 2 levels (1 and 2).
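The code that generated values, category1 and category2 is not shown here. As a minimal sketch, a dataframe with the same structure could be simulated as follows. The seed, the normal distribution and its mean and standard deviation are assumptions made for illustration, so the numbers will not match the output above:

```r
# Hypothetical data: the original simulation is not shown, so the
# distribution parameters below are assumptions for illustration only.
set.seed(42)

# 4 levels, 100 consecutive observations each
category1 <- factor(rep(c("A", "B", "C", "D"), each = 100))
# 2 levels, alternating from one observation to the next
category2 <- factor(rep(c("1", "2"), times = 200))
# one numeric response per observation
values <- rnorm(400, mean = 25, sd = 5)

df <- data.frame(values, category1, category2)
str(df)
```

With this layout, every combination of category1 and category2 holds the same number of observations, which keeps the boxes comparable.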
As for any boxplot, the function to use for drawing the boxes is geom_boxplot(). Since we have two categorical variables and the response variable to map, the function aes() will look more or less like this: aes(values, category1, category2). However, we have to order the variables properly and ask ggplot to group and color the boxes according to one of the categorical variables. We will use fill= to do so. Our plan is to map values to the Y-axis, map category1 to the X-axis, distinguish the category2 levels with fill=, and draw the boxes with geom_boxplot(). Here is the code:
ggplot(df, aes(x = category1, y = values, fill = category2)) +
geom_boxplot()
Alternatively, we may replace fill= with color=. While fill= colors the entire boxes, color= changes the color of the box frames and lines only:
ggplot(df, aes(x = category1, y = values, color = category2)) +
geom_boxplot()
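To check what the two mappings actually do, the sketch below builds both plots from simulated data and inspects the computed boxes with layer_data() from ggplot2. The toy data and the object names p_fill, p_color, boxes_fill and boxes_color are assumptions made for this example, not part of the original tutorial:

```r
library(ggplot2)

# Toy data (assumed): 4 x 2 categories with 50 values per combination
set.seed(1)
df <- data.frame(
  values    = rnorm(400, mean = 25, sd = 5),
  category1 = factor(rep(c("A", "B", "C", "D"), each = 100)),
  category2 = factor(rep(c("1", "2"), times = 200))
)

# Same plot twice, once with fill= and once with color=
p_fill <- ggplot(df, aes(x = category1, y = values, fill = category2)) +
  geom_boxplot()
p_color <- ggplot(df, aes(x = category1, y = values, color = category2)) +
  geom_boxplot()

# layer_data() returns the data computed for a layer, one row per box
boxes_fill  <- layer_data(p_fill)
boxes_color <- layer_data(p_color)

nrow(boxes_fill)                    # number of boxes drawn
length(unique(boxes_fill$fill))     # distinct fill colors
length(unique(boxes_color$colour))  # distinct outline colors
```

In both cases ggplot draws one box per combination of category1 and category2. Only the aesthetic that carries the category2 information differs.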
In this section, you will learn how to set/modify all the necessary elements that make a plot complete and comprehensible. Such elements are: