Navigating across related time series.
We have seen ways of visualizing a single time series (seasonal plots, ACF) and small numbers of time series (cross-correlation). In practice, it’s also common to encounter large collections of time series. These datasets often call for more sophisticated analysis techniques; here we review one useful approach, based on extracted features.
The high-level idea is to represent each time series by a vector of summary statistics, like the maximum value, the slope, and so on. These vector summaries can then be used to create an overview of variation seen across all time series. For example, just looking at the first few regions in the Australian tourism dataset, we can see that there might be useful features related to the overall level (Coral Coast is larger than Barkly), recent trends (increased business in North West), and seasonality (South West is especially seasonal).
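# Packages used below; `tourism` is assumed to be the dataset shipped with tsibble
library(tidyverse)
library(lubridate)  # date()
library(tsibble)
library(feasts)     # feature_set(); also attaches fabletools for features()
library(ggrepel)    # geom_text_repel()

# Add a combined Region-Purpose identifier and include it in the key
tourism <- as_tsibble(tourism, index = Quarter) %>%
  mutate(key = str_c(Region, Purpose, sep = "-")) %>%
  update_tsibble(key = c("Region", "State", "Purpose", "key"))
regions <- tourism %>%
  distinct(Region) %>%
  pull(Region)

# Trips over time for the first nine regions, one panel per region
ggplot(tourism %>% filter(Region %in% regions[1:9])) +
  geom_line(aes(x = date(Quarter), y = Trips, col = Purpose)) +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(~Region, scales = "free") +
  theme(legend.position = "bottom")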
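To make the idea of "one vector of summary statistics per series" concrete, here is a minimal base-R sketch. The three toy series are simulated, and the helper name `summarize_series` is hypothetical. Neither comes from the tourism data or from any package.

```r
# Toy example with made-up data: three short quarterly series.
set.seed(1)
quarter <- 1:40
series <- list(
  flat_low = 10 + rnorm(40),
  trending = 10 + 0.5 * quarter + rnorm(40),
  seasonal = 50 + 10 * sin(2 * pi * quarter / 4) + rnorm(40)
)

# Hypothetical helper: reduce one series to a few summary statistics.
summarize_series <- function(y) {
  time <- seq_along(y)
  c(
    mean  = mean(y),
    max   = max(y),
    slope = unname(coef(lm(y ~ time))[2])  # linear trend per time step
  )
}

# One row per series, one column per feature.
feature_table <- t(sapply(series, summarize_series))
round(feature_table, 2)
```

Each row of `feature_table` is the kind of vector summary described above. Rows can be compared with one another even though the underlying series differ in level, trend, and seasonality.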
tourism_features <- tourism %>%
  features(Trips, feature_set(pkgs = "feasts"))
tourism_features
# A tibble: 304 × 52
   Region    State Purpose key   trend_strength seasonal_strength_year
   <chr>     <chr> <chr>   <chr>          <dbl>                  <dbl>
 1 Adelaide  Sout… Busine… Adel…          0.464                  0.407
 2 Adelaide  Sout… Holiday Adel…          0.554                  0.619
 3 Adelaide  Sout… Other   Adel…          0.746                  0.202
 4 Adelaide  Sout… Visiti… Adel…          0.435                  0.452
 5 Adelaide… Sout… Busine… Adel…          0.464                  0.179
 6 Adelaide… Sout… Holiday Adel…          0.528                  0.296
 7 Adelaide… Sout… Other   Adel…          0.593                  0.404
 8 Adelaide… Sout… Visiti… Adel…          0.488                  0.254
 9 Alice Sp… Nort… Busine… Alic…          0.534                  0.251
10 Alice Sp… Nort… Holiday Alic…          0.381                  0.832
# ℹ 294 more rows
# ℹ 46 more variables: seasonal_peak_year <dbl>,
#   seasonal_trough_year <dbl>, spikiness <dbl>, linearity <dbl>,
#   curvature <dbl>, stl_e_acf1 <dbl>, stl_e_acf10 <dbl>, acf1 <dbl>,
#   acf10 <dbl>, diff1_acf1 <dbl>, diff1_acf10 <dbl>,
#   diff2_acf1 <dbl>, diff2_acf10 <dbl>, season_acf1 <dbl>,
#   pacf5 <dbl>, diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, …
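The plotting code that follows expects a data frame `pcs` with columns `.fittedPC1` and `.fittedPC2`, and a subset `outliers` of the series to label. Neither is defined in the text. The sketch below shows one way such objects could be built with base R's `prcomp()`. The helpers `pca_scores` and `flag_outliers` are hypothetical, and the distance-from-origin rule for choosing outliers is an assumption, not necessarily the rule used for the original figure.

```r
# Hypothetical helper: PCA on the numeric feature columns of a feature table.
# `id_cols` names the identifier columns to carry along untouched.
pca_scores <- function(features, id_cols) {
  features <- as.data.frame(features)
  x <- features[, setdiff(names(features), id_cols), drop = FALSE]
  # Keep numeric columns with no missing values and nonzero variance;
  # prcomp() cannot rescale constant columns.
  keep <- vapply(
    x,
    function(v) is.numeric(v) && !anyNA(v) && sd(v) > 0,
    logical(1)
  )
  fit <- prcomp(x[, keep, drop = FALSE], center = TRUE, scale. = TRUE)
  scores <- as.data.frame(fit$x[, 1:2])
  names(scores) <- c(".fittedPC1", ".fittedPC2")
  cbind(features[, id_cols, drop = FALSE], scores)
}

# Hypothetical helper: the n series farthest from the origin in the PC1-PC2 plane.
flag_outliers <- function(scores, n = 10) {
  d <- sqrt(scores$.fittedPC1^2 + scores$.fittedPC2^2)
  top <- order(d, decreasing = TRUE)[seq_len(min(n, nrow(scores)))]
  scores[top, , drop = FALSE]
}
```

With the feature table above, `pcs <- pca_scores(tourism_features, c("Region", "State", "Purpose", "key"))` followed by `outliers <- flag_outliers(pcs, n = 10)` would give inputs of the shape the plot needs. The cutoff of 10 is arbitrary. The column names mirror those that `broom::augment()` produces for a `prcomp` fit, which is another route to the same table.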
Running PCA on these features and plotting the first two components makes it clear that series for different travel purposes behave differently, likely because holiday travel is so strongly seasonal (Melbourne seems to be an interesting exception).
# pcs: one row per series with PCA scores .fittedPC1 and .fittedPC2
# outliers: the subset of pcs whose regions should be labeled
ggplot(pcs, aes(x = .fittedPC1, y = .fittedPC2)) +
  geom_point(aes(col = Purpose)) +
  geom_text_repel(
    data = outliers,
    aes(label = Region),
    size = 2.5
  ) +
  scale_color_brewer(palette = "Set2") +
  labs(x = "PC1", y = "PC2") +
  coord_fixed() +
  theme(legend.position = "bottom")
# Plot the raw series behind the labeled outliers
outlier_series <- tourism %>%
  filter(key %in% outliers$key)
ggplot(outlier_series) +
  geom_line(aes(x = date(Quarter), y = Trips, col = Purpose)) +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(~Region, scales = "free_y") +
  theme(legend.position = "bottom")
For attribution, please cite this work as
Sankaran (2024, Feb. 25). STAT 436 (Spring 2024): Collections of Time Series. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week06-05/
BibTeX citation
@misc{sankaran2024collections,
  author = {Sankaran, Kris},
  title = {STAT 436 (Spring 2024): Collections of Time Series},
  url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week06-05/},
  year = {2024}
}