Which columns might help us understand extreme values?
If we uncover extreme values, it’s natural to ask why they arise. The most reliable way to find out is to study how the data were collected. Often, though, through a little detective work based on the rest of the available data, we can form reasonable hypotheses for the source of extreme values.
The Profiler paper approaches this by coordinating views. On the left, a binned scatterplot shows the relationship between Global (\(x\)-axis) and US (\(y\)-axis) gross product for movies in the IMDB dataset. There seem to be a few outliers along the bottom of the scatterplot – movies with high global gross but low revenue in the US. At first, we might suspect that these are errors. However, linking the scatterplot with the map on the right gives an explanation: these outlier movies were released outside of the US. In the full system, selecting bins in the left-hand plot highlights the countries that contribute to those bins in orange.
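Profiler itself is an interactive system, so the linked highlighting cannot be reproduced in a static document. As a rough stand-in, here is a sketch of a binned scatterplot with the suspicious movies overlaid in orange; the `movies` data frame and its columns are simulated placeholders, not the actual IMDB extract used in the paper:

library(tidyverse)

# simulated placeholder data; the IMDB extract from the Profiler paper is not loaded here
set.seed(1)
movies <- tibble(
  gross_global = rlnorm(5000, meanlog = 16, sdlog = 1.5),
  gross_us = gross_global * runif(5000)
)

# binned scatterplot of global vs. US gross, with high-global / low-US movies
# overlaid in orange to mimic the linked highlighting described above
ggplot(movies, aes(x = gross_global, y = gross_us)) +
  geom_bin2d(bins = 40) +
  geom_point(
    data = filter(movies, gross_us / gross_global < 0.01),
    color = "orange", size = 0.4
  ) +
  scale_x_log10() +
  scale_y_log10()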
How can we know which views to coordinate? Ideally, there would be a systematic way to search through promising views, without relying solely on intuition. The Profiler paper suggests a criterion based on mutual information. We will not discuss this specific proposal, since I do not want us to get side-tracked learning about information theory. Instead, we will use a simpler (and I think more practical) alternative based on predictability.
We will illustrate the idea with a dataset of rail tickets in Spain¹. Each row shows a train trip, along with its date, the available seats, and an associated fare. A histogram of trip duration reveals some outliers – some trips take more than 10 hours.
library(tidyverse)   # read_csv() and ggplot() come from the tidyverse

# give it a minute, it's a large file
trips <- read_csv("https://uwmadison.box.com/shared/static/fmag4vyhqad1sdb76aphukkumq2z2yp2.csv")

ggplot(trips, aes(x = duration)) +
  geom_histogram(bins = 100)
To look for an explanation, let's skim the other columns in the dataset. Note that there is one column (`seats`) that is entirely missing.

library(skimr)   # skim() gives a quick per-column summary
skim(trips)
Name | trips
---|---
Number of rows | 1000000
Number of columns | 14
Column type frequency: |
POSIXct | 3
character | 7
logical | 1
numeric | 3
Group variables | None
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
departure | 0 | 1 | 2019-04-12 05:50:00 | 2019-08-03 21:15:00 | 2019-05-29 16:45:00 | 11509 |
arrival | 0 | 1 | 2019-04-12 08:38:00 | 2019-08-04 00:02:00 | 2019-05-29 19:30:00 | 15100 |
insert_date | 0 | 1 | 2019-04-11 21:49:46 | 2019-06-05 15:06:25 | 2019-05-08 19:34:03 | 230569 |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company | 0 | 1 | 5 | 5 | 0 | 1 | 0 |
origin | 0 | 1 | 6 | 10 | 0 | 5 | 0 |
destination | 0 | 1 | 6 | 10 | 0 | 5 | 0 |
vehicle_type | 0 | 1 | 2 | 9 | 0 | 16 | 0 |
vehicle_class | 3957 | 1 | 7 | 22 | 0 | 8 | 0 |
fare | 3957 | 1 | 4 | 23 | 0 | 9 | 0 |
meta | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
seats | 1e+06 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 5000665.58 | 2887593.85 | 19.00 | 2500853.25 | 5004586.00 | 7504431.75 | 9999990.00 | ▇▇▇▇▇ |
duration | 0 | 1.00 | 3.12 | 1.58 | 1.63 | 2.50 | 2.63 | 3.17 | 12.42 | ▇▁▁▁▁ |
price | 87446 | 0.91 | 63.07 | 25.88 | 0.00 | 43.25 | 60.30 | 78.80 | 235.30 | ▅▇▂▁▁ |
At this point, we could try relating trip duration to each of the available columns, evaluating a few of our initial hypotheses. While this strategy would work in this particular case, it would become untenable for datasets with many more columns. Instead, we'll consider a more generally useful strategy, based on predictability.
Intuitively, we should try to find features that help predict whether an observation will be an outlier. The features that are most predictive determine the conditions within which outliers are expected to arise, suggesting potential explanations. In the movie revenue example, for instance, you might have found that the country label is strongly predictive of whether a movie had high global, but low US, revenue.
Applying this idea to the train trips example, let's extract some features that might be predictive of whether a trip took more than 6 hours. We put these features in the `x` data.frame and an indicator of whether the trip took 6 or more hours in the `y` vector.
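The feature-extraction step itself is not shown in the output above. A minimal sketch of how `x` and `y` might be constructed – keeping the categorical trip descriptors and deriving `hour` and `month` from `departure`, to match the variables that appear in the importance table below – could look like this:

library(lubridate)   # hour() and month() helpers

# hypothetical reconstruction of the feature matrix; the column choices are assumptions
x <- trips %>%
  transmute(
    origin = factor(origin),
    destination = factor(destination),
    vehicle_type = factor(vehicle_type),
    vehicle_class = factor(vehicle_class),
    fare = factor(fare),
    hour = hour(departure),
    month = month(departure)
  ) %>%
  as.data.frame()

# outcome: did the trip take 6 or more hours?
y <- factor(trips$duration >= 6, labels = c("short", "long"))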
Next, we train a model to predict `y` based on `x`. We'll subsample rows so that it doesn't take too long – our goal is visualization, not prediction performance. We are using the `caret` package to run the model, because it provides an interface to many potential prediction models. Finally, the `varImp` command summarizes how important each variable was for the eventual prediction.

library(caret)   # train() and varImp()
ix <- sample(seq_len(nrow(x)), 1000)
fit <- train(x = x[ix, ], y = y[ix], method = "gbm", verbose = FALSE)
varImp(fit)
gbm variable importance
Overall
vehicle_type 100.0000
origin 2.5829
vehicle_class 1.9751
destination 1.1214
fare 0.6559
hour 0.1539
month 0.0000
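Scanning the printed scores works fine here, but with many more columns a quick plot is easier to read. A small sketch, building the bar chart directly from the importance scores rather than relying on any particular plotting method in `caret`:

imp <- varImp(fit)$importance %>%
  rownames_to_column("variable")   # rownames_to_column() comes from the tibble package

ggplot(imp, aes(x = Overall, y = reorder(variable, Overall))) +
  geom_col() +
  labs(x = "Importance", y = "Variable")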
`vehicle_type` is by far the most predictive variable. To see its relationship with trip duration, let's use ridge lines. We create a separate density curve for each `vehicle_type`. We'll also split the curves within each vehicle type according to origin city, since this was the second most important variable.

# reorder the vehicle types by average trip duration
vehicle_order <- trips %>%
  group_by(vehicle_type) %>%
  summarise(mduration = mean(duration)) %>%
  arrange(mduration) %>%
  pull(vehicle_type)

trips <- trips %>%
  mutate(vehicle_type = factor(vehicle_type, levels = vehicle_order))
# make the ridge line plot
library(ggridges)   # provides geom_density_ridges()
ggplot(trips, aes(x = duration, y = vehicle_type, fill = origin)) +
  geom_density_ridges(alpha = 0.8) +
  labs(y = "Vehicle Type", x = "Duration", fill = "Origin City") +
  scale_fill_brewer(palette = "Set2")
This is a nice explanation. A few of the `vehicle_type` values, like AVE and ALVIA, correspond to high-speed express trains. The regional lines, like MD and R Express, tend to take longer to reach their destination. Within a particular type of train, the duration is related to the origin city, likely because that determines the distance between origin and destination.
While you likely would have discovered this pattern by exploring the dataset on your own, in higher-dimensional settings it is useful to have mechanisms that partially automate the process. Our strategy of using a predictive model in R to sort through variables is practically useful, but be aware that there is modern research (in the spirit of Profiler) that streamlines the process even further, putting all computation in the background and using interactivity to quickly compare views.
¹ I sampled 1 million rows from 2019; the complete dataset is quite a bit larger.