Navigating across related time series.
We have seen ways of visualizing a single time series (seasonal plots, ACF) and small numbers of time series (Cross Correlation). In practice, it’s also common to encounter large collections of time series. These datasets tend to require more sophisticated analysis techniques, but we will review one useful approach, based on extracted features.
The high-level idea is to represent each time series by a vector of summary statistics, like the maximum value, the slope, and so on. These vector summaries can then be used to create an overview of variation seen across all time series. For example, just looking at the first few regions in the Australian tourism dataset, we can see that there might be useful features related to the overall level (Coral Coast is larger than Barkly), recent trends (increased business in North West), and seasonality (South West is especially seasonal).
tourism <- as_tsibble(tourism, index = Quarter) %>%
mutate(key = str_c(Region, Purpose, sep="-")) %>%
update_tsibble(key = c("Region", "State", "Purpose", "key"))
regions <- tourism %>%
distinct(Region) %>%
pull(Region)
ggplot(tourism %>% filter(Region %in% regions[1:9])) +
geom_line(aes(x = time(Quarter), y = Trips, col = Purpose)) +
scale_color_brewer(palette = "Set2") +
facet_wrap(~Region, scale = "free_y") +
theme(legend.position = "bottom")
tourism_features <- tourism %>%
features(Trips, feature_set(pkgs = "feasts"))
tourism_features
# A tibble: 304 x 52
Region State Purpose key trend_strength seasonal_streng…
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Adela… Sout… Busine… Adel… 0.451 0.380
2 Adela… Sout… Holiday Adel… 0.541 0.601
3 Adela… Sout… Other Adel… 0.743 0.189
4 Adela… Sout… Visiti… Adel… 0.433 0.446
5 Adela… Sout… Busine… Adel… 0.453 0.140
6 Adela… Sout… Holiday Adel… 0.512 0.244
7 Adela… Sout… Other Adel… 0.584 0.374
8 Adela… Sout… Visiti… Adel… 0.481 0.228
9 Alice… Nort… Busine… Alic… 0.526 0.224
10 Alice… Nort… Holiday Alic… 0.377 0.827
# … with 294 more rows, and 46 more variables:
# seasonal_peak_year <dbl>, seasonal_trough_year <dbl>,
# spikiness <dbl>, linearity <dbl>, curvature <dbl>,
# stl_e_acf1 <dbl>, stl_e_acf10 <dbl>, acf1 <dbl>, acf10 <dbl>,
# diff1_acf1 <dbl>, diff1_acf10 <dbl>, diff2_acf1 <dbl>,
# diff2_acf10 <dbl>, season_acf1 <dbl>, pacf5 <dbl>,
# diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, season_pacf <dbl>,
# zero_run_mean <dbl>, nonzero_squared_cv <dbl>,
# zero_start_prop <dbl>, zero_end_prop <dbl>,
# lambda_guerrero <dbl>, kpss_stat <dbl>, kpss_pvalue <dbl>,
# pp_stat <dbl>, pp_pvalue <dbl>, ndiffs <int>, nsdiffs <int>,
# bp_stat <dbl>, bp_pvalue <dbl>, lb_stat <dbl>, lb_pvalue <dbl>,
# var_tiled_var <dbl>, var_tiled_mean <dbl>, shift_level_max <dbl>,
# shift_level_index <dbl>, shift_var_max <dbl>,
# shift_var_index <dbl>, shift_kl_max <dbl>, shift_kl_index <dbl>,
# spectral_entropy <dbl>, n_crossing_points <int>,
# longest_flat_spot <int>, coef_hurst <dbl>, stat_arch_lm <dbl>
This PCA makes it very clear that the different travel purposes have different time series, likely due to the heavy seasonality of holiday travel (Melbourne seems to be an interesting exception).
ggplot(pcs, aes(x = .fittedPC1, y = .fittedPC2)) +
geom_point(aes(col = Purpose)) +
geom_text_repel(
data = outliers,
aes(label = Region),
size = 2.5
) +
scale_color_brewer(palette = "Set2") +
labs(x = "PC1", y = "PC2") +
coord_fixed() +
theme(legend.position = "bottom")
outlier_series <- tourism %>%
filter(key %in% outliers$key)
ggplot(outlier_series) +
geom_line(aes(x = time(Quarter), y = Trips, col = Purpose)) +
scale_color_brewer(palette = "Set2") +
facet_wrap(~Region, scale = "free_y") +
theme(legend.position = "bottom")