Tying together the introductions to ggplot2 and vega-lite, using the common language of encodings.
The ggplot2 and vega-lite packages feel similar because they are based on the same underlying idea: each row of a dataset is drawn as a mark, and each column is encoded as a visual property of that mark.
This encoding of rows as marks and columns as properties of the marks is illustrated in the toy diagram below.
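As a rough sketch of this idea in code, the snippet below writes a Vega-Lite specification as a plain JavaScript object, so no library is needed to run it. The dataset and its field names (`name`, `height`, `weight`, `group`) are invented for illustration. Each row becomes one point mark, and three of the columns are mapped to visual properties of that mark.

```javascript
// Toy dataset (hypothetical values): each object is one row, i.e. one observation.
const rows = [
  { name: "a", height: 1.2, weight: 10, group: "x" },
  { name: "b", height: 2.5, weight: 14, group: "y" },
  { name: "c", height: 1.9, weight: 9, group: "x" },
];

// A Vega-Lite specification: one mark per row, columns mapped to mark properties.
const spec = {
  data: { values: rows },
  mark: "point",
  encoding: {
    x: { field: "height", type: "quantitative" },
    y: { field: "weight", type: "quantitative" },
    color: { field: "group", type: "nominal" },
  },
};

// List which column drives which visual property.
const mapping = Object.entries(spec.encoding).map(
  ([channel, def]) => `${def.field} -> ${channel}`
);
console.log(mapping.join(", "));
```

Pasting the object into a Vega-Lite renderer such as the online Vega editor should draw three points, one per row.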
In fact, we can roughly translate the function names between the two packages (ggplot2 on the left, vega-lite on the right):
geom_* → mark*
aes → vl.encode()
scale_* → scale()
labs → title()
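To illustrate the correspondence, here is a minimal sketch that places a ggplot2 call (in comments) next to a roughly equivalent Vega-Lite specification. It uses the JSON form of a Vega-Lite specification rather than the `vl.*` function calls so that it runs without any library. The data frame `df`, the field names, and the values are assumptions made for the example.

```javascript
// ggplot2 version, shown only for comparison (assumes a data frame `df`
// with positive numeric columns `height` and `weight`):
//   ggplot(df, aes(x = height, y = weight)) +
//     geom_point() +
//     scale_x_log10() +
//     labs(title = "Weight vs. height")

// Roughly equivalent Vega-Lite JSON specification (hypothetical data).
const translated = {
  title: "Weight vs. height", // plays the role of labs(title = ...)
  data: {
    values: [
      { height: 1.2, weight: 10 },
      { height: 2.5, weight: 14 },
    ],
  },
  mark: "point", // plays the role of geom_point()
  encoding: {
    // The encoding block plays the role of aes(...).
    // The scale entry plays the role of scale_x_log10().
    x: { field: "height", type: "quantitative", scale: { type: "log" } },
    y: { field: "weight", type: "quantitative" },
  },
};

console.log(JSON.stringify(translated, null, 2));
```

The translation is only approximate. For instance, the scale is attached to an individual encoding channel here, whereas ggplot2 adds it as a separate layer of the plot.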
A good visualization makes it easy to visually compare the relevant attributes across observations. A challenge is that there are often many possible marks and encodings for any given comparison. It’s also difficult to know which comparisons are actually of interest. For these reasons, we’re going to want to build up a vocabulary of marks and encodings.
To identify good encodings, it often helps to first determine the type of each field.
This is not an exhaustive list, and there are subtleties.
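As one way to make field types concrete, Vega-Lite asks for a `type` on each encoded field, with values such as `"quantitative"`, `"temporal"`, `"ordinal"`, and `"nominal"`. The helper below is hypothetical (it is not part of Vega-Lite or ggplot2) and uses a deliberately crude heuristic to guess a type from sample values. The sample fields are also made up.

```javascript
// Hypothetical helper (not a Vega-Lite function): guess a field type from sample values.
function guessFieldType(values) {
  if (values.length === 0) {
    return "nominal"; // nothing to inspect, fall back to the least structured type
  }
  if (values.every((v) => typeof v === "number" && Number.isFinite(v))) {
    return "quantitative";
  }
  if (values.every((v) => v instanceof Date)) {
    return "temporal";
  }
  // Strings and mixed values default to nominal. Ordered categories cannot be
  // detected from the values alone and would have to be marked "ordinal" by hand.
  return "nominal";
}

// Made-up sample columns.
const samples = {
  height: [1.2, 2.5, 1.9],
  date: [new Date("2020-01-01"), new Date("2020-02-01")],
  group: ["x", "y", "x"],
};

const guessed = Object.fromEntries(
  Object.entries(samples).map(([field, values]) => [field, guessFieldType(values)])
);
console.log(guessed);
```

A heuristic like this is only a starting point. Numeric codes for categories, for example, would be misread as quantitative, which is one of the subtleties worth checking by hand.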
It’s worth highlighting that, even if a particular encoding could be used for a given data type, different encodings have different effectiveness. When trying to encode several data fields at once, a choice of one encoding over another will implicitly prioritize certain comparisons over others.
For example, in the figure below (Figure 5.8 here), the same two numbers are encoded using many different visual properties – position on a shared \(y\)-axis, position on distinct \(y\)-axes, etc. – and study participants were asked to gauge the difference in the numbers. People were best at comparing positions on a common scale, and worst at distinguishing differences in areas.
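To make this trade-off concrete, the sketch below builds two Vega-Lite specifications for the same made-up pair of numbers. One encodes the value as position on a shared y-axis, the other as the area of a circle (in Vega-Lite, the `size` channel of a circle mark controls its area). The helper `encodeAmountAs`, the field names, and the values are all hypothetical.

```javascript
// Two made-up measurements whose difference we want to judge.
const measurements = [
  { label: "A", amount: 40 },
  { label: "B", amount: 46 },
];

// Hypothetical helper: same data and mark, but the quantitative field
// is assigned to whichever channel is requested.
function encodeAmountAs(channel) {
  return {
    data: { values: measurements },
    mark: "circle",
    encoding: {
      x: { field: "label", type: "nominal" },
      [channel]: { field: "amount", type: "quantitative" },
    },
  };
}

const byPosition = encodeAmountAs("y"); // position on a common scale
const byArea = encodeAmountAs("size"); // area of the circle

console.log(Object.keys(byPosition.encoding), Object.keys(byArea.encoding));
```

Rendering both specifications side by side gives a small version of the comparison described above: the gap between 40 and 46 is typically easier to read from the vertical positions than from the circle areas.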
Here is a similar example. In the left four panels, a nominal field is encoded using color (left two) and shape (middle two). The red circles are easiest to find in the left two. In the right two panels, two nominal fields are encoded, using both shape and color. It’s much harder to find the red circle – you have to scan over the entire image to find it, which you didn’t have to do at all for the first two panels.
Implicit in this discussion is that there is never any “perfect” visualization for a given dataset – the quality of a visualization is a function of its intended purpose. What comparisons do you want to facilitate? The choice of encoding will strongly affect the types of comparisons that are easy to perform.