A crash course on entity resolution, plus some other tips.
library("cluster")
library("ggplot2")
library("readr")
library("stringr")
library("dplyr")
theme_set(theme_minimal())
Entity resolution
One simple way to detect candidates for unification is to cluster all the strings using an appropriate distance. The Profiler paper uses the string edit distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, to decide whether two strings are similar. For example, two insertions to the first string below would give us the second string, so the edit distance is 2.
     [,1] [,2]
[1,]    0    2
[2,]    2    0
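As a minimal sketch of how a distance matrix like this can be produced, the snippet below calls base R's `adist()` on two made-up strings. The strings `"cat"` and `"cats!"` are hypothetical and are not the ones used in the original example; they were chosen only because the second is the first plus two inserted characters.

```r
# Hypothetical example strings (not taken from the original notes).
s <- c("cat", "cats!")

# adist() (utils package, loaded by default) returns the matrix of
# pairwise generalized Levenshtein (edit) distances.
D_example <- adist(s)
print(D_example)
```

The diagonal is zero because each string is at distance 0 from itself, and the matrix is symmetric because the default costs for insertions and deletions are equal.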
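# Read the movies data and drop rows without a title
movies <- read_csv("https://uwmadison.box.com/shared/static/txa56mux3ca2f8w2zmc9dq7yunljd1ak.csv") %>%
  filter(!is.na(Title))
# Pairwise edit distances between all titles
D <- adist(movies$Title)
# Complete-linkage clustering, cut at height 3, so any two titles
# in the same cluster are at most 3 edits apart
tree <- hclust(as.dist(D), method = "complete")
movies$cluster <- cutree(tree, h = 3)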
movies <- movies %>%
  arrange(cluster) %>%
  group_by(cluster) %>%
  mutate(
    cluster_ix = seq_len(n()), # indexes movies in each cluster from 1 .. cluster_size
    cluster_size = n()
  )
movies %>%
  filter(cluster_size > 1, cluster < 100) %>%
  ggplot() +
  geom_text(
    aes(y = reorder(cluster, -cluster_size), x = cluster_ix, label = Title),
    size = 4
  ) +
  scale_x_discrete(expand = c(0.1, 0.1))
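To see what the clustering step above does on a scale small enough to check by hand, here is a sketch on four made-up titles. The titles and the names `titles`, `D_toy`, `tree_toy`, and `clusters_toy` are hypothetical and exist only for illustration; the calls mirror the ones used on the full dataset.

```r
# Hypothetical titles, chosen only to illustrate the mechanics.
titles <- c("The Ring", "The Ring Two", "The Rock", "Casablanca")

# Pairwise edit distances, then complete-linkage clustering.
D_toy <- adist(titles)
tree_toy <- hclust(as.dist(D_toy), method = "complete")

# Cut the tree at height 3, as in the movies example.
clusters_toy <- cutree(tree_toy, h = 3)

# List the titles that ended up in each cluster.
split(titles, clusters_toy)
```

"The Ring" and "The Rock" differ by three substitutions, so they are merged at height 3, even though they are different films. "The Ring Two" is four insertions away from "The Ring" and stays on its own. Short titles are especially prone to this kind of spurious grouping, which is why the clusters are only candidates for unification.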
movies %>%
  filter(Title == "Alice in Wonderland") %>%
  select(Title, Release_Date, IMDB_Rating, IMDB_Votes)
# A tibble: 2 x 5
# Groups:   cluster [1]
  cluster Title               Release_Date IMDB_Rating IMDB_Votes
    <int> <chr>               <chr>              <dbl>      <dbl>
1      47 Alice in Wonderland Jul 28 1951          6.7      63458
2      47 Alice in Wonderland Mar 05 2010          6.7      63458
movies <- movies %>%
  mutate(year = str_extract(Release_Date, "[0-9]+$"))
movies %>%
  filter(!is.na(year)) %>%
  group_by(cluster, year) %>%
  mutate(conditional_count = n()) %>%
  filter(conditional_count > 1) %>%
  select(Title, year, cluster)
# A tibble: 43 x 3
# Groups:   cluster, year [20]
   Title             year  cluster
   <chr>             <chr>   <int>
 1 Ray               2004        5
 2 Saw               2004        5
 3 Rent              2005       84
 4 Venom             2005       84
 5 Bobby             2006       92
 6 Borat             2006       92
 7 Day of the Dead   2008      230
 8 Diary of the Dead 2008      230
 9 The Box           2009      324
10 The Road          2009      324
# … with 33 more rows
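There are two other nice tricks in the Profiler paper that are worth knowing. The first is the binning trick. If you have a very large dataset, scatterplots can be misleading, because overlapping points hide differences in density. A better alternative is to use 2D binning, which counts the points falling in each cell of a grid and maps that count to color.
For example, here is a mixture of two Gaussian blobs, a wide one and a much narrower one centered at the same point. In a scatterplot, they look like a single blob.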
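n <- 1e6
x <- matrix(rnorm(n), ncol = 2)                   # 500,000 points from a wide blob (sd = 1)
z <- matrix(rnorm(0.05 * n, sd = 0.2), ncol = 2)  # 25,000 points from a narrow blob (sd = 0.2)
df <- data.frame(rbind(x, z))
# Scatterplot: overplotting hides the narrow blob
ggplot(df) +
  geom_point(aes(x = X1, y = X2))
# 2D binning: fill shows the number of points in each 0.1 x 0.1 bin
ggplot(df) +
  geom_bin2d(aes(x = X1, y = X2), binwidth = 0.1) +
  scale_fill_viridis_b()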
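The counting that `geom_bin2d` performs can be sketched by hand in base R, which makes the binning trick concrete. This is an illustration under stated assumptions rather than ggplot2's actual implementation: the sample is much smaller than the one above, the seed is arbitrary, bin edges are placed at multiples of the bin width, and the names `pts`, `bin_x`, `bin_y`, and `counts` are hypothetical.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# A smaller version of the two-blob mixture (assumed size, for speed).
n_small <- 1e4
pts <- rbind(
  matrix(rnorm(n_small), ncol = 2),                  # wide blob, sd = 1
  matrix(rnorm(0.05 * n_small, sd = 0.2), ncol = 2)  # narrow blob, sd = 0.2
)

# Assign each point to a grid cell via an integer bin index per axis.
binwidth <- 0.1
bin_x <- floor(pts[, 1] / binwidth)
bin_y <- floor(pts[, 2] / binwidth)

# Count the points in each cell and keep only the occupied cells.
counts <- as.data.frame(table(bin_x, bin_y), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Convert bin indices back to the coordinates of the cell centers.
counts$x <- (as.numeric(counts$bin_x) + 0.5) * binwidth
counts$y <- (as.numeric(counts$bin_y) + 0.5) * binwidth

# The densest cells sit near the origin, where the narrow blob lives.
head(counts[order(-counts$Freq), c("x", "y", "Freq")])
```

Because the plot then needs only one tile per occupied cell instead of one glyph per point, the density of the narrow blob shows up in the counts rather than being hidden by overplotting.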