Techniques to identify extreme values.
Extreme values are a common data quality issue. They may be genuine observations or measurement errors. In either case, it’s important to identify them and, where appropriate, account for them.
Visualization can support the detection and characterization of anomalies. We’ll review the ideas in the Profiler paper, which aims to lay the foundation for more complex data cleaning systems. We’ll look at a few of the more interesting, self-contained ideas from the paper and demonstrate them on simple examples.
How can we detect anomalies in numerical data? A first idea is to use \(z\)-scores. For column \(j\), estimate the mean \(\hat{\mu}_{j}\) and standard deviation \(\hat{\sigma}_{j}\). For each observation \(i\), its anomalousness can be summarized by \(z_{ij} := \frac{x_{ij} - \hat{\mu}_{j}}{\hat{\sigma}_{j}}\). This is illustrated below on a normally distributed dataset that has been contaminated with a few outliers from a \(t\) distribution.
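library(ggplot2)
library(dplyr)

n <- 1000
p <- 0.95 # fraction of uncontaminated (normal) observations
# round() guards against floating-point error in n * p and n * (1 - p)
x <- c(rnorm(round(n * p)), rt(round(n * (1 - p)), df = 3))
z <- (x - mean(x)) / sd(x)
x <- data.frame(x = x, z = z, z_ = abs(z))

# histogram of all points, with a rug marking those where |z| > 2.5
ggplot(x, aes(x = x)) +
  geom_histogram(binwidth = 0.5) +
  geom_rug(
    data = x %>% filter(z_ > 2.5),
    aes(col = z_),
    linewidth = 1.5 # ggplot2 >= 3.4.0; use `size` on older versions
  ) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_color_gradient2(low = "#fffbfc", high = "#c50f2c")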
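library(purrr)
# Simulate one contaminated sample and return classical (mean, sd) and robust
# (median, IQR) summaries. The t degrees of freedom are chosen so that its
# variance, df / (df - 2), equals sigma^2 and matches the normal component.
make_dataset <- function(n = 2000, p = 0.95, sigma = 2) {
  # round() avoids truncation: 100 * (1 - 0.9) evaluates to 9.999..., not 10
  n_clean <- round(n * p)
  x <- c(rnorm(n_clean, 0, sigma), rt(n - n_clean, df = 2 * sigma ^ 2 / (sigma ^ 2 - 1)))
  data.frame(mu_hat = mean(x), sigma_hat = sd(x), med_hat = median(x), iqr_hat = IQR(x))
}
# summaries across 1000 replicate datasets, each of size 100
x_sim <- map_dfr(1:1000, ~ make_dataset(100, 0.9), .id = "replicate")
ggplot(x_sim) +
  geom_point(aes(x = mu_hat, y = sigma_hat))
ggplot(x_sim) +
  geom_point(aes(x = med_hat, y = iqr_hat))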
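One way to act on the comparison above is to replace the mean and standard deviation in the \(z\)-score with the median and a rescaled IQR. The sketch below is an illustration, not something taken from the Profiler paper: the toy vector, the 2.5 cutoff (borrowed from the earlier plot), and the names `z_classic` and `z_robust` are all assumptions. Dividing the IQR by `2 * qnorm(0.75)` (about 1.349) puts it on the scale of a standard deviation when the bulk of the data is normal.

```r
# Toy data (hypothetical): ten well-behaved values and two gross outliers
x <- c(-1.2, -0.8, -0.5, -0.1, 0, 0.2, 0.4, 0.7, 1.1, 1.3, 25, 30)

# Classical z-scores: the outliers inflate both mean(x) and sd(x)
z_classic <- (x - mean(x)) / sd(x)

# Robust z-scores: median for location, rescaled IQR for scale
robust_scale <- IQR(x) / (2 * qnorm(0.75))
z_robust <- (x - median(x)) / robust_scale

# Indices flagged by each rule at the same cutoff
flag_classic <- which(abs(z_classic) > 2.5)
flag_robust <- which(abs(z_robust) > 2.5)

print(flag_classic) # no points flagged: the outliers mask themselves
print(flag_robust)  # only the last two points are flagged
```

In this toy example the two outliers inflate the standard deviation enough that neither one crosses the classical cutoff, while the median and IQR are barely affected by them.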
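# 500 strongly correlated bivariate normal points, plus one planted point
# that goes against the correlation
Sigma <- matrix(c(1, 0.9, 0.9, 1), ncol = 2)
# MASS::mvrnorm() is called with its namespace so that attaching MASS
# does not mask dplyr::select()
x <- MASS::mvrnorm(500, mu = c(0, 0), Sigma = Sigma) %>%
  rbind(c(1.5, -1.5)) %>%
  data.frame() %>%
  mutate(type = c(rep("normal", 500), "anomaly"))
ggplot(x, aes(x = X1, y = X2, col = type)) +
  geom_point() +
  scale_color_manual(values = c("red", "black")) +
  coord_fixed()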
However, if we had only looked at one-dimensional \(z\)-scores, we would have completely missed the anomaly.
# reshape so that each dimension gets its own histogram facet
x_long <- x %>%
  tidyr::pivot_longer(X1:X2, names_to = "dimension")
ggplot(x_long, aes(x = value)) +
  geom_histogram(binwidth = 0.4) +
  geom_rug(
    data = x_long %>% filter(type == "anomaly"),
    col = "red", linewidth = 2 # ggplot2 >= 3.4.0; use `size` on older versions
  ) +
  scale_y_continuous(expand = c(0, 0)) +
  facet_wrap(~ dimension) +
  theme(panel.border = element_rect(fill = NA, linewidth = 1))
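# estimate the mean vector and covariance matrix from both coordinates
mu_hat <- colMeans(x[, 1:2])
sigma_hat <- cov(x[, 1:2])
# mahalanobis() returns the squared distance, hence the name D2
x <- x %>%
  mutate(D2 = mahalanobis(x[, 1:2], mu_hat, sigma_hat))
ggplot(x) +
  geom_point(
    aes(x = X1, y = X2, col = sqrt(D2))
  ) +
  scale_color_gradient2(low = "#fffbfc", high = "#c50f2c")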
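The quantity computed above is the squared Mahalanobis distance, \(D^2 = (x - \hat{\mu})^{\top}\hat{\Sigma}^{-1}(x - \hat{\mu})\), which rescales distances by the covariance between coordinates. As a rough check, the planted point can be scored by hand. The sketch below assumes the population parameters used to generate the data (mean zero, correlation 0.9) rather than the estimates, and the 99% chi-squared cutoff is a common convention added here for illustration, not part of the original analysis.

```r
# Population parameters used to simulate the data, and the planted anomaly
Sigma <- matrix(c(1, 0.9, 0.9, 1), ncol = 2)
mu <- c(0, 0)
anomaly <- c(1.5, -1.5)

# One-dimensional z-scores: each coordinate is only 1.5 sd from its mean
z_marginal <- (anomaly - mu) / sqrt(diag(Sigma))

# Squared Mahalanobis distance, via stats::mahalanobis() and by hand
D2 <- mahalanobis(anomaly, center = mu, cov = Sigma)
D2_manual <- as.numeric(t(anomaly - mu) %*% solve(Sigma) %*% (anomaly - mu))

# With known parameters, D2 for bivariate normal data is chi-squared with 2 df
cutoff <- qchisq(0.99, df = 2)

print(z_marginal)
print(c(D2 = D2, D2_manual = D2_manual, cutoff = cutoff))
```

Neither coordinate looks unusual on its own, but the squared distance works out to 45, well past the cutoff of about 9.2, because the point lies along the direction in which the data vary least.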