Introduction to ggplot2

A discussion of ggplot2 terminology, and an example of iteratively refining a simple scatterplot.

Kris Sankaran
01-10-2023

Reading, Recording, Rmarkdown

ggplot2 is an R implementation of the Grammar of Graphics. The idea is to define the basic “words” from which visualizations are built, and then let users compose them in original ways. This is in contrast to systems with prespecified chart types, where the user is forced to pick from a limited dropdown menu of plots. Just like in ordinary language, the creative combination of simple building blocks can support a very wide range of expression.

These are libraries we’ll use in this lecture.

Components of a Graph

We’re going to create this plot in these notes.

Every ggplot2 plot is made from three components,

The Data

Let’s load up the data. Each row is an observation, and each column is an attribute that describes the observation. This is important because each mark that you see on a ggplot – a line, a point, a tile, … – had to start out as a row within an R data.frame. The visual properties of the mark (e.g., color) are determined by the values along columns. These type of data are often referred to as tidy data, and we’ll have a full week discussing this topic.

Here’s an example of the data above in tidy format,

data(murders)
head(murders)
       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65

This is one example of how the same information might be stored in a non-tidy way, making visualization much harder.

non_tidy <- data.frame(t(murders))
colnames(non_tidy) <- non_tidy[1, ]
non_tidy <- non_tidy[-1, ]
non_tidy[, 1:6]
            Alabama   Alaska  Arizona Arkansas California Colorado
abb              AL       AK       AZ       AR         CA       CO
region        South     West     West    South       West     West
population  4779736   710231  6392017  2915918   37253956  5029196
total           135       19      232       93       1257       65

Often, one of the hardest parts in making a ggplot2 plot is not coming up with the right ggplot2 commands, but reshaping the data so that it’s in a tidy format.

Geometry

The words in the grammar of graphics are the geometry layers. We can associate each row of a data frame with points, lines, tiles, etc., just by referring to the appropriate geom in ggplot2. A typical plot will compose a chain of layers on top of a dataset,

ggplot(data) + [layer 1] + [layer 2] + …

For example, by deconstructing the plot above, we would expect to have point and text layers. For now, let’s just tell the plot to put all the geom’s at the origin.

ggplot(murders) +
  geom_point(x = 0, y = 0) +
  geom_text(x = 0, y = 0, label = "test")

You can see all the types of geoms in the cheat sheet. We’ll be experimenting with a few of these in a later lecture.

Aesthetic mappings

Aesthetic mappings make the connection between the data and the geometry. It’s the piece that translates abstract data fields into visual properties. Analyzing the original graph, we recognize these specific mappings,

To establish these mappings, we need to use the aes function. Notice that column names don’t have to be quoted – ggplot2 knows to refer back to the murders data frame in ggplot(murders).

ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region))

The original plot used a log-scale. To transform the x and y axes, we can use scales.

ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  scale_x_log10() +
  scale_y_log10()

Once nuance is that scales aren’t limited to \(x\) and \(y\) transformations. They can be applied to modify any relationship between a data field and its appearance on the page. For example, this changes the mapping between the region field and circle color.

ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_manual(values = c("#6a4078", "#aa1518", "#9ecaf8", "#50838c")) # exercise: find better colors using https://imagecolorpicker.com/

A problem with this graph is that it doesn’t tell us which state each point corresponds to. For that, we’ll need text labels. We can encode the coordinates for these marks again using aes, but this time within a geom_text layer.

ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  geom_text(
    aes(x = population, y = total, label = abb),
    nudge_x = 0.08 # what would happen if I remove this?
  ) +
  scale_x_log10() +
  scale_y_log10()

Note that each type of layer uses different visual properties to encode the data – the argument label is only available for the geom_text layer. You can see which aesthetic mappings are required for each type of geom by checking that geom’s documentation page, under the Aesthetics heading.

It’s usually a good thing to make your code as concise as possible. For ggplot2, we can achieve this by sharing elements across aes calls (e.g., not having to type population and total twice). This can be done by defining a “global” aesthetic, putting it inside the initial ggplot call.

ggplot(murders, aes(x = population, y = total)) +
  geom_point(aes(col = region)) +
  geom_text(aes(label = abb), nudge_x = 0.08) +
  scale_x_log10() +
  scale_y_log10()

Finishing touches

How can we improve the readability of this plot? You might already have ideas,

  1. Prevent labels from overlapping. It’s impossible to read some of the state names.
  2. Add a line showing the national rate. This serves as a point of reference, allowing us to see whether an individual state is above or below the national murder rate.
  3. Give meaningful axis / legend labels and a title.
  4. Move the legend to the top of the figure. Right now, we’re wasting a lot of visual real estate in the right hand side, just to let people know what each color means.
  5. Use a better color theme.

For 1., the ggrepel package find better state name positions, drawing links when necessary.

ggplot(murders, aes(x = population, y = total)) +
  geom_text_repel(aes(label = abb), segment.size = 0.2) + # I moved it up so that the geom_point's appear on top of the lines
  geom_point(aes(col = region)) +
  scale_x_log10() +
  scale_y_log10()

For 2., let’s first compute the national murder rate,

r <- murders %>% 
  summarize(rate = sum(total) /  sum(population)) %>%
  pull(rate)
r
[1] 3.034555e-05

Now, we can use this as the slope in a geom_abline layer, which encodes a slope and intercept as a line on a graph.

ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), linewidth = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10() +
  scale_y_log10()

For 3., we can add a labs layer to write labels and a theme to reposition the legend. I used unit_format from the scales package to change the scientific notation in the \(x\)-axis labels to something more readable.

ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), linewidth = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10(labels = unit_format(unit = "million", scale = 1e-6)) + # used to convert scientific notation to readable labels
  scale_y_log10() +
  labs(
    x = "Population (log scale)",
    y = "Total number of murders (log scale)",
    color = "region",
    title = "US Gun Murders in 2010"
  ) +
  theme(legend.position = "top")

For 5., I find the gray background with reference lines a bit distracting. We can simplify the appearance using theme_bw. I also like the colorbrewer palette, which can be used by calling a different color scale.

ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), linewidth = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10(labels = unit_format(unit = "million", scale = 1e-6)) +
  scale_y_log10() +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "Population (log scale)",
    y = "Total number of murders (log scale)",
    color = "Region",
    title = "US Gun Murders in 2010"
  ) +
  theme_bw() +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank()
  )

Some bonus exercises, which will train you to look at your graphics more carefully, as well as build your familiarity with ggplot2.

Citation

For attribution, please cite this work as

Sankaran (2023, Jan. 10). STAT 436 (Spring 2023): Introduction to ggplot2. Retrieved from https://krisrs1128.github.io/stat436_s23/website/stat436_s23/posts/2022-12-27-week01-01/

BibTeX citation

@misc{sankaran2023introduction,
  author = {Sankaran, Kris},
  title = {STAT 436 (Spring 2023): Introduction to ggplot2},
  url = {https://krisrs1128.github.io/stat436_s23/website/stat436_s23/posts/2022-12-27-week01-01/},
  year = {2023}
}