A discussion of ggplot2 terminology, and an example of iteratively refining a simple scatterplot.
ggplot2 is an R implementation of the Grammar of Graphics. The idea is to define the basic “words” from which visualizations are built, and then let users compose them in original ways. This is in contrast to systems with prespecified chart types, where the user is forced to pick from a limited dropdown menu of plots. Just like in ordinary language, the creative combination of simple building blocks can support a very wide range of expression.
These are libraries we’ll use in this lecture.
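Judging from the functions called later in these notes (geom_text_repel, unit_format) and the murders data frame, a plausible set of libraries is sketched below. The choice of dslabs as the source of murders is an assumption based on the column names and the plot title.

```r
library(ggplot2) # ggplot(), geoms, scales, and themes
library(ggrepel) # geom_text_repel() for non-overlapping labels
library(scales)  # unit_format() for readable axis labels
library(dslabs)  # assumed source of the murders data frame
```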
We’re going to create this plot in these notes.
Every ggplot2 plot is made from three components: the data, the geometry layers, and the aesthetic mappings.
The data is the data.frame that we want to visualize. Let’s load up the data. Each row is an observation, and each column is an attribute that describes the observation. This is important because each mark that you see on a ggplot – a line, a point, a tile, … – had to start out as a row within an R data.frame. The visual properties of the mark (e.g., color) are determined by the values along columns. This type of data is often referred to as tidy data, and we’ll have a full week discussing this topic.
Here’s an example of the data above in tidy format,
       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65
This is one example of how the same information might be stored in a non-tidy way, making visualization much harder.
non_tidy <- data.frame(t(murders))
colnames(non_tidy) <- non_tidy[1, ]
non_tidy <- non_tidy[-1, ]
non_tidy[, 1:6]
           Alabama  Alaska Arizona Arkansas California Colorado
abb             AL      AK      AZ       AR         CA       CO
region       South    West    West    South       West     West
population 4779736  710231 6392017  2915918   37253956  5029196
total          135      19     232       93       1257       65
Often, one of the hardest parts in making a ggplot2 plot is not coming up with the right ggplot2 commands, but reshaping the data so that it’s in a tidy format.
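As a small illustration of that reshaping step, here is a sketch that undoes the transposition shown earlier using only base R. The six-row murders_small data frame is a hypothetical stand-in typed in from the listing above, and tidy_again is a name introduced here for illustration.

```r
# First six rows of the data, copied from the tidy listing above
murders_small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  abb = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  region = c("South", "West", "West", "South", "West", "West"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total = c(135, 19, 232, 93, 1257, 65)
)

# Build the non-tidy version in the same way as before
non_tidy <- data.frame(t(murders_small))
colnames(non_tidy) <- non_tidy[1, ]
non_tidy <- non_tidy[-1, ]

# Transpose back so that each state is a row again
tidy_again <- data.frame(state = colnames(non_tidy), t(non_tidy))
rownames(tidy_again) <- NULL

# Transposing turned every value into text, so restore the numeric columns
tidy_again$population <- as.numeric(tidy_again$population)
tidy_again$total <- as.numeric(tidy_again$total)

tidy_again
```

The type conversion at the end is the easy part to forget: after the round trip, population and total are character strings, and ggplot2 would treat them as discrete categories rather than numbers.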
The words in the grammar of graphics are the geometry layers. We can associate each row of a data frame with points, lines, tiles, etc., just by referring to the appropriate geom in ggplot2. A typical plot will compose a chain of layers on top of a dataset,
ggplot(data) + [layer 1] + [layer 2] + …
For example, by deconstructing the plot above, we would expect to have point and text layers. For now, let’s just tell the plot to put all the geoms at the origin.
ggplot(murders) +
  geom_point(x = 0, y = 0) +
  geom_text(x = 0, y = 0, label = "test")
You can see all the types of geoms in the cheat sheet. We’ll be experimenting with a few of these in a later lecture.
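Aesthetic mappings make the connection between the data and the geometry. It’s the piece that translates abstract data fields into visual properties. Analyzing the original graph, we recognize these specific mappings: state population to the \(x\)-axis position, total murders to the \(y\)-axis position, and region to point color.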
To establish these mappings, we need to use the aes function. Notice that column names don’t have to be quoted – ggplot2 knows to refer back to the murders data frame in ggplot(murders).
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region))
The original plot used a log-scale. To transform the x and y axes, we can use scales.
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  scale_x_log10() +
  scale_y_log10()
One nuance is that scales aren’t limited to \(x\) and \(y\) transformations. They can be applied to modify any relationship between a data field and its appearance on the page. For example, this changes the mapping between the region field and circle color.
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_manual(values = c("#6a4078", "#aa1518", "#9ecaf8", "#50838c")) # exercise: find better colors using https://imagecolorpicker.com/
A problem with this graph is that it doesn’t tell us which state each point corresponds to. For that, we’ll need text labels. We can encode the coordinates for these marks again using aes, but this time within a geom_text layer.
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  geom_text(
    aes(x = population, y = total, label = abb),
    nudge_x = 0.08 # what would happen if I remove this?
  ) +
  scale_x_log10() +
  scale_y_log10()
Note that each type of layer uses different visual properties to encode the data – the argument label is only available for the geom_text layer. You can see which aesthetic mappings are required for each type of geom by checking that geom’s documentation page, under the Aesthetics heading.
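Besides the documentation page, the required aesthetics can also be inspected from the R console. This is a small sketch that relies on the GeomPoint and GeomText objects exported by ggplot2 and their required_aes field.

```r
library(ggplot2)

# Required aesthetics for the two geoms used so far
GeomPoint$required_aes
GeomText$required_aes

# geom_text needs a label, while geom_point does not
"label" %in% GeomText$required_aes
"label" %in% GeomPoint$required_aes
```

This only lists the required aesthetics. Optional ones, such as color or size, are still easiest to find under the Aesthetics heading of the documentation.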
It’s usually a good thing to make your code as concise as possible. For ggplot2, we can achieve this by sharing elements across aes calls (e.g., not having to type population and total twice). This can be done by defining a “global” aesthetic, putting it inside the initial ggplot call.
ggplot(murders, aes(x = population, y = total)) +
  geom_point(aes(col = region)) +
  geom_text(aes(label = abb), nudge_x = 0.08) +
  scale_x_log10() +
  scale_y_log10()
How can we improve the readability of this plot? You might already have ideas,
For 1., the ggrepel package finds better state name positions, drawing links when necessary.
ggplot(murders, aes(x = population, y = total)) +
  geom_text_repel(aes(label = abb), segment.size = 0.2) + # I moved it up so that the geom_point's appear on top of the lines
  geom_point(aes(col = region)) +
  scale_x_log10() +
  scale_y_log10()
For 2., let’s first compute the national murder rate,
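A minimal sketch of that computation, assuming the national rate is defined as total murders divided by total population (murders per person) and that murders comes from the dslabs package. The name r is chosen to match the log10(r) used in the plotting code in these notes.

```r
library(dslabs) # assumed source of the murders data frame
data(murders)

# National rate: all murders divided by all residents
r <- sum(murders$total) / sum(murders$population)
r
```

Note that this is not the average of the per-state rates: summing first weights each state by its population.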
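Now, we can use this rate to position a geom_abline layer, which encodes a slope and intercept as a line on a graph. Because both axes are log-transformed, the rate is passed as the intercept, log10(r), and the slope is left at its default of 1.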
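To see why the rate appears in the intercept argument of the call that follows, note that both axes are on a \(\log_{10}\) scale. A state whose murder rate matches the national rate \(r\) satisfies

\[
\text{total} = r \cdot \text{population}
\quad\Longleftrightarrow\quad
\log_{10}(\text{total}) = \log_{10}(r) + 1 \cdot \log_{10}(\text{population}),
\]

so in the transformed coordinates the reference line has intercept \(\log_{10}(r)\) and slope 1. States above the line have a higher murder rate than the country as a whole, and states below it have a lower one.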
ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), size = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10() +
  scale_y_log10()
For 3., we can add a labs layer to write labels and a theme to reposition the legend. I used unit_format from the scales package to change the scientific notation in the \(x\)-axis labels to something more readable.
ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), size = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10(labels = unit_format(unit = "million", scale = 1e-6)) + # used to convert scientific notation to readable labels
  scale_y_log10() +
  labs(
    x = "Population (log scale)",
    y = "Total number of murders (log scale)",
    color = "region",
    title = "US Gun Murders in 2010"
  ) +
  theme(legend.position = "top")
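To preview what the unit_format labeller does without drawing a plot, it can be called directly on a few numbers. This is a small sketch. The name to_millions and the example values are made up for illustration.

```r
library(scales)

# Build the same labeller that is passed to scale_x_log10() above
to_millions <- unit_format(unit = "million", scale = 1e-6)

# Apply it to a few population-sized values to preview the axis labels
to_millions(c(1e6, 1e7, 3e7))
```

The labeller is just a function from numbers to strings, so any function with that shape can be passed to the labels argument of a scale.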
For 5., I find the gray background with reference lines a bit distracting. We can simplify the appearance using theme_bw. I also like the ColorBrewer palettes, which can be used by calling a different color scale.
ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), size = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10(labels = unit_format(unit = "million", scale = 1e-6)) +
  scale_y_log10() +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "Population (log scale)",
    y = "Total number of murders (log scale)",
    color = "Region",
    title = "US Gun Murders in 2010"
  ) +
  theme_bw() +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank()
  )
Some bonus exercises, which will train you to look at your graphics more carefully, as well as build your familiarity with ggplot2.
size argument in geom_text_repel.
override.aes within a guide.
region field of the murders data.frame.
murders data.frame, and use a data field specific to the geom_text_repel layer.