Statistical Data Visualization: Introduction to Vega-Lite

Reading, Recording, Rmarkdown, Observable

Vega-Lite is the javascript equivalent of ggplot2. It provides a grammar of graphics, where layers can be composed on top of one another. The main reason for knowing vega-lite (opposed to only ggplot2) is that it can support flexible interaction. Also, it’s a good gateway toward understanding richer data visualization packages, like d3.

Components of a Graph

We’re going to construct this graph step-by-step, just like in the ggplot2 lecture. I realize that this is a bit repetitive, but I want to make it obvious just how similar ggplot2 and vega-lite are.

Data and Geometry

The first thing to do is import the necessary libraries. The arquero library (aq) makes it easier to manage data in javascript – the filtering and printing steps would be more difficult without it.

import {vl} from '@vega/vega-lite-api'
import { aq, op } from '@uwdata/arquero'

Now we can preview and filter the raw data.

data = aq.fromCSV(await FileAttachment("gapminder.csv").text()) // I used the file upload button to add this file
data2000 = data.filter(d => d.year == 2000) // filter to only 2000
data2000.view()

Let’s create a visual mark, without any encodings. Like in our first ggplot2 figure, the mark is collapsed into a single point at the origin. Notice the analogy betwen vega-lite’s markPoint() and ggplot2’s geom_point().

vl.markPoint()
  .data(data2000)
  .render()

Encodings

We can deconstruct the original figure to see data fields are mapped to visual properties,

Fertility → \(x\)-axis coordinate
Life expectancy → \(y\)-axis coordinate
Country cluster → Point color
Population → Point size

We can build this up one step at a time. To encode the fertility and life expectancy fields, we use vl.x() and vl.y() in an encode() call. One subtlety is that it’s necessary to specify the data type of each field. fieldQ means that the field is “Quantitative”. We’ll review data types in the next lecture.

vl.markPoint()
  .data(data2000)
  .encode(
    vl.x().fieldQ("fertility"), // what happens if you change this to fieldN()?
    vl.y().fieldQ("life_expect")
  )
  .render()

The same idea can be used to encode the country cluster and population fields. fieldN specifies that the cluster field is a Nominal (i.e., categorical) variable.

Hint: How do you know which commands encode which fields? It’s often good to keep the vega-lite-api and vega-lite documentation close at hand.

vl.markPoint()
  .data(data2000)
  .encode(
    vl.x().fieldQ("fertility"),
    vl.y().fieldQ("life_expect"),
    vl.color().fieldN("cluster"), // what happens if you change this to fieldO()?
    vl.size().fieldQ("pop")
  )
  .render()

Finishing touches

We’ve made headway in our visualization task, but a more critical eye will help us improve the display. It will also give us an initial look at interactivity. The main things we ought to improve in this display are,

The open rings are a bit distracting. They violate the principle of Smallest Effective Difference.
Some of the points are impossibly small.
The axis labels are not proper English words.
There are distractingly many values in the legend for population size.
We can’t actually tell which country each point corresponds to. It would be nice to add a tooltip (a label that appears when you mouseover the point).

For (1), let’s replace the open rings with filled circles. We can pass options to markPoint in the same way that we passed options to ggplot geom layers. For (2), we can modify the default size range by adding on a .scale() to the vl.size() encoding – this is similar to adding a custom scale in ggplot2.

vl.markPoint({filled: true})
.data(data2000)
.encode(
  vl.x().fieldQ("fertility"),
  vl.y().fieldQ("life_expect"),
  vl.color().fieldN("cluster"),
  vl.size().fieldQ("pop").scale({range: [25, 1000]}) // try changing the ranges
)
.render()

For (3), we can add a .title() to fields for which we want to customize titles, and for (4), we can modify the tick count in the .legend() for size.

vl.markPoint({filled: true})
.data(data2000)
.encode(
  vl.x().fieldQ("fertility").title("Fertility"),
  vl.y().fieldQ("life_expect").title("Life Expectancy"),
  vl.color().fieldN("cluster"),
  vl.size().fieldQ("pop").scale({range: [25, 1000]}).legend({tickCount: 3}).title("Population"),
)
.render()

Now to (5). The idea is to think of a tooltip as another type of encoding – just like coordinate positions and colors, a label is a way of transforming a field in an abstract dataset into a visual property we can observe on a screen.

vl.markPoint({filled: true})
.data(data2000)
.encode(
  vl.x().fieldQ("fertility").title("Fertility"),
  vl.y().fieldQ("life_expect").title("Life Expectancy"),
  vl.color().fieldN("cluster"),
  vl.size().fieldQ("pop").scale({range: [25, 1000]}).legend({tickCount: 3}).title("Population"),
  vl.tooltip().fieldN("country"),
  vl.order().fieldQ('pop').sort('descending') // if you remove this, large points prevent mouseover on the points below them
)
.render()

One point I want you to take away from both this and the ggplot2 introductions is that building a plot is an iterative process. Once you develop your eye for good visual design, you will be able to take any initial plot and make it more effective.

Finally, for a taste of what’s to come, here’s a version of the same plot that let’s us compare many years over time.

Don’t worry about the code implementation, but in case you’re curious, it’s only a few extra lines over what we’ve already implemented.

{
  // define a slider selector object
  let selectYear = vl.selectSingle("select").fields("year")
    .init({year: 1955})
    .bind(vl.slider().min(1955).max(2005).step(5).name("Year"))
  
  // update the plot depending on the value of the slider
  return vl.markPoint({filled: true})
    .data(data)
    .select(selectYear) // refers to the slider bar
    .transform(vl.filter(selectYear))
    .encode(
      vl.x().fieldQ("fertility").title("Fertility").scale({domain: [0, 9]}), // hard code the axes ranges
      vl.y().fieldQ("life_expect").title("Life Expectancy").scale({domain: [0, 90]}),
      vl.color().fieldN("cluster"),
      vl.size().fieldQ("pop").scale({domain: [0, 1200000000], range: [25, 1000]}).legend({tickCount: 3}).title("Population"),
      vl.tooltip().fieldN("country"),
      vl.order().fieldQ('pop').sort('descending')
    )
    .render()
}