STAT 436 (Spring 2023): tsibble Objects

Kris Sankaran

library(tidyverse)
library(tsibble)
library(feasts)
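library(tsibbledata)
library(lubridate) # as_date() and year(); loaded explicitly in case the installed tidyverse does not attach it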

Tsibbles are data structures that are designed specifically for storing time series data. They are useful because they create a unified interface to various time series visualization and modeling tasks. This removes the friction of having to transform back and forth between data.frames, lists, and matrices, depending on the particular task of interest.
The main difference between a tsibble and an ordinary data.frame is that a tsibble requires a temporal index variable, which records when each observation was collected. The sampling interval is inferred from this index. For example, the code below generates a tsibble with yearly observations.

tsibble(
  Year = 2015:2019,
  Observation = c(123, 39, 78, 52, 110),
  index = Year
)

# A tsibble: 5 x 2 [1Y]
   Year Observation
  <int>       <dbl>
1  2015         123
2  2016          39
3  2017          78
4  2018          52
5  2019         110

We can also create a tsibble from an ordinary data.frame by calling the as_tsibble function. The only subtlety is that we have to specify an index.

x <- data.frame(
  Year = 2015:2019,
  Observation = c(123, 39, 78, 52, 110)
)

as_tsibble(x, index = Year)

# A tsibble: 5 x 2 [1Y]
   Year Observation
  <int>       <dbl>
1  2015         123
2  2016          39
3  2017          78
4  2018          52
5  2019         110
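The metadata attached to a tsibble can also be queried directly. The sketch below rebuilds the same toy data (the name ts_x is only for illustration) and uses tsibble's index_var, interval, and is_regular helpers to inspect what was inferred from the index.

```r
library(tsibble)

x <- data.frame(
  Year = 2015:2019,
  Observation = c(123, 39, 78, 52, 110)
)
ts_x <- as_tsibble(x, index = Year)

index_var(ts_x)   # name of the index column
interval(ts_x)    # sampling interval inferred from the index
is_regular(ts_x)  # whether the observations are evenly spaced
```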

The index is useful because it creates a data consistency check. If a few days are missing from a daily dataset, the index makes it easy to detect and fill in these gaps. Notice that when we print a tsibble object, the header shows the dimensions followed by the guessed sampling interval in brackets (for example, [1D] for daily data).

days <- seq(as_date("2021-01-01"), as_date("2021-01-31"), by = "day")
days <- days[-5] # Skip January 5

x <- tsibble(day = days, value = rnorm(30), index = day)
fill_gaps(x)

# A tsibble: 31 x 2 [1D]
   day          value
   <date>       <dbl>
 1 2021-01-01 -1.13  
 2 2021-01-02 -0.0655
 3 2021-01-03  0.707 
 4 2021-01-04  0.940 
 5 2021-01-05 NA     
 6 2021-01-06 -0.150 
 7 2021-01-07 -0.272 
 8 2021-01-08  0.267 
 9 2021-01-09  1.28  
10 2021-01-10  0.521 
# … with 21 more rows
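Before filling gaps, it can help to see where they are. The following is a small sketch on the same simulated data, using tsibble's has_gaps, scan_gaps, and count_gaps functions. It uses base R's as.Date so that it does not depend on lubridate being attached.

```r
library(tsibble)

days <- seq(as.Date("2021-01-01"), as.Date("2021-01-31"), by = "day")
days <- days[-5] # Skip January 5
x <- tsibble(day = days, value = rnorm(30), index = day)

has_gaps(x)    # logical flag (column .gaps) for whether any time points are missing
scan_gaps(x)   # the missing time points themselves
count_gaps(x)  # where each run of missing points starts and ends, and its length
```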

Tsibbles can store more than one time series at a time. In this case, we have to specify key columns that distinguish between the separate time series. For example, in the Olympic running times dataset,

olympic_running

# A tsibble: 312 x 4 [4Y]
# Key:       Length, Sex [14]
    Year Length Sex    Time
   <int>  <int> <chr> <dbl>
 1  1896    100 men    12  
 2  1900    100 men    11  
 3  1904    100 men    11  
 4  1908    100 men    10.8
 5  1912    100 men    10.8
 6  1916    100 men    NA  
 7  1920    100 men    10.8
 8  1924    100 men    10.6
 9  1928    100 men    10.8
10  1932    100 men    10.3
# … with 302 more rows

the keys are running distance and sex. If we were creating a tsibble from a data.frame containing these multiple time series, we would need to specify the keys. This protects against accidentally having duplicate observations at given times.

olympic_df <- as.data.frame(olympic_running)
as_tsibble(olympic_df, index = Year, key = c("Sex", "Length")) # what happens if we remove key?

# A tsibble: 312 x 4 [4Y]
# Key:       Sex, Length [14]
    Year Length Sex    Time
   <int>  <int> <chr> <dbl>
 1  1896    100 men    12  
 2  1900    100 men    11  
 3  1904    100 men    11  
 4  1908    100 men    10.8
 5  1912    100 men    10.8
 6  1916    100 men    NA  
 7  1920    100 men    10.8
 8  1924    100 men    10.6
 9  1928    100 men    10.8
10  1932    100 men    10.3
# … with 302 more rows
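To explore the question in the comment above, the sketch below checks for duplicates when no key is given. It relies on tsibble's is_duplicated and duplicates helpers. The tryCatch wrapper and the name result are only there so the script keeps running after the expected error.

```r
library(tsibble)
library(tsibbledata)

olympic_df <- as.data.frame(olympic_running)

# Without keys, each Year appears in several series, so the index alone is not unique
is_duplicated(olympic_df, index = Year)

# A few of the rows that share an index value
head(duplicates(olympic_df, index = Year))

# as_tsibble() refuses to build the object in that case; capture the error message
result <- tryCatch(
  as_tsibble(olympic_df, index = Year),
  error = function(e) conditionMessage(e)
)
result
```

With key = c(Sex, Length) supplied, every key and index combination is unique and the conversion succeeds.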

The usual data tidying functions from dplyr are implemented for tsibbles. Filtering rows, selecting columns, deriving variables using mutate, and summarizing groups using group_by and summarise all work as expected. One distinction to be careful about is that summarise always keeps the index, so results are implicitly grouped by time point even when no group_by is specified.
For example, this computes the total cost of Australian pharmaceuticals per month for a particular type of script. We simply filter to the script type and take the sum of costs.

PBS %>%
  filter(ATC2 == "A10") %>%
  summarise(TotalC = sum(Cost))

# A tsibble: 204 x 2 [1M]
      Month  TotalC
      <mth>   <dbl>
 1 1991 Jul 3526591
 2 1991 Aug 3180891
 3 1991 Sep 3252221
 4 1991 Oct 3611003
 5 1991 Nov 3565869
 6 1991 Dec 4306371
 7 1992 Jan 5088335
 8 1992 Feb 2814520
 9 1992 Mar 2985811
10 1992 Apr 3204780
# … with 194 more rows
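The implicit grouping by index combines with explicit groups. As a sketch, grouping by Concession (one of the key columns in PBS) before summarising should give one monthly series per concession type, with Concession kept as the key of the result. The name by_concession is only for illustration.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)

by_concession <- PBS %>%
  filter(ATC2 == "A10") %>%
  group_by(Concession) %>%      # explicit grouping variable becomes the key
  summarise(TotalC = sum(Cost)) # Month is retained as the index

by_concession
```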

If we want the total cost by year instead, one approach is to add a year variable and convert to an ordinary tibble before grouping. Simply treating Year as the index of a tsibble would not work at this stage, because there are multiple measurements per year within each series, and this would violate tsibble’s requirement that each combination of key and index be unique.

PBS %>%
  filter(ATC2 == "A10") %>%
  mutate(Year = year(Month)) %>%
  as_tibble() %>%
  group_by(Year) %>%
  summarise(TotalC = sum(Cost))

# A tibble: 18 × 2
    Year     TotalC
   <dbl>      <dbl>
 1  1991  21442946 
 2  1992  45686946.
 3  1993  55532688.
 4  1994  60816080.
 5  1995  67326599.
 6  1996  77397927.
 7  1997  85131672.
 8  1998  93310626.
 9  1999 105959043.
10  2000 122496586.
11  2001 136467442.
12  2002 149066136.
13  2003 156464261.
14  2004 183798935.
15  2005 199655595 
16  2006 220354676 
17  2007 265718966.
18  2008 135036513
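An alternative that stays within the tsibble framework is index_by, which tsibble provides for regrouping the index at a coarser resolution. The sketch below should produce the same yearly totals, but as a tsibble indexed by Year rather than a plain tibble. The name yearly is only for illustration.

```r
library(dplyr)
library(lubridate)
library(tsibble)
library(tsibbledata)

yearly <- PBS %>%
  filter(ATC2 == "A10") %>%
  index_by(Year = year(Month)) %>% # regroup the monthly index into years
  summarise(TotalC = sum(Cost))    # one row per year

yearly
```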
