A crash course on entity resolution, plus some other tips.
Entity resolution is the process of resolving field values that refer to the same thing but are stored under different names. For example, a column might include entries for both “UW Madison” and “University of Wisconsin - Madison.” This often arises when a dataset is made by linking a few different sources, combining different time periods of the same source, or collecting manually entered text.
One simple way to detect candidates for unification is to cluster all the strings using an appropriate distance. The Profiler paper uses the string edit distance, the minimum number of single-character edits (such as insertions or deletions) needed to turn one string into the other, to decide whether two strings are similar. For example, two insertions to the first string below would give us the second string, so the edit distance is 2.
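To make the idea concrete, here is a minimal sketch using only base R. It computes edit distances with `adist()` and then groups titles with hierarchical clustering. The clustering method and the cut height of 2.5 are assumptions chosen for illustration, not necessarily what the Profiler paper does, and the five titles are just a small hand-picked sample from the movies data.

```r
# A small hand-picked sample of movie titles (for illustration only).
titles <- c(
  "Day of the Dead", "Diary of the Dead",
  "The Box", "The Road", "Alice in Wonderland"
)

# Pairwise edit (Levenshtein) distances between all titles.
d <- adist(titles)
dimnames(d) <- list(titles, titles)

# "Day" becomes "Diary" after inserting "i" and "r", so this entry is 2.
d["Day of the Dead", "Diary of the Dead"]

# Hierarchical clustering on the distance matrix. Cutting the tree at an
# (assumed) height of 2.5 groups titles that are within 2 edits of each other.
hc <- hclust(as.dist(d), method = "complete")
clusters <- cutree(hc, h = 2.5)

# Titles that share a cluster are candidates for being the same entity.
split(titles, clusters)
```

Sharing a cluster only makes two strings candidates for unification. Whether they really refer to the same entity still has to be checked against the other fields.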
# movies is grouped by cluster, so select() keeps the cluster column as well
movies %>%
  filter(Title == "Alice in Wonderland") %>%
  select(Title, Release_Date, IMDB_Rating, IMDB_Votes)
# A tibble: 2 x 5
# Groups:   cluster [1]
  cluster Title               Release_Date IMDB_Rating IMDB_Votes
    <int> <chr>               <chr>              <dbl>      <dbl>
1      47 Alice in Wonderland Jul 28 1951          6.7      63458
2      47 Alice in Wonderland Mar 05 2010          6.7      63458
# Extract the trailing digits of Release_Date as the release year
movies <- movies %>%
  mutate(year = str_extract(Release_Date, "[0-9]+$"))
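# Keep titles that share both a cluster and a release year with another title
movies %>%
  filter(!is.na(year)) %>%
  group_by(cluster, year) %>%
  mutate(conditional_count = n()) %>%  # rows per (cluster, year) group
  filter(conditional_count > 1) %>%
  select(Title, year, cluster)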
# A tibble: 43 x 3
# Groups:   cluster, year [20]
   Title             year  cluster
   <chr>             <chr>   <int>
 1 Ray               2004        5
 2 Saw               2004        5
 3 Rent              2005       84
 4 Venom             2005       84
 5 Bobby             2006       92
 6 Borat             2006       92
 7 Day of the Dead   2008      230
 8 Diary of the Dead 2008      230
 9 The Box           2009      324
10 The Road          2009      324
# … with 33 more rows
There are two other nice tricks in the Profiler paper that are worth knowing. The first is the binning trick. If you have a very large dataset, scatterplots can be misleading, since the points overlap too much. A better alternative is to use 2D binning, which counts the points falling in each cell of a grid and encodes that count with color.
For example, here are two Gaussian blobs. In a scatterplot, they look like a single blob.
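The data frame `df` used in the plots is not defined in these notes. A minimal sketch of how such data could be simulated, and of the counting that 2D binning performs, is given here. The blob centers, sample size, and seed are assumptions chosen for illustration. The manual binning is only meant to show the idea and is not how `geom_bin2d()` is implemented internally.

```r
set.seed(1)   # assumed seed, only for reproducibility
n <- 5000     # assumed number of points per blob

# Two Gaussian blobs with (assumed) centers at (-1, -1) and (1, 1).
df <- data.frame(
  X1 = c(rnorm(n, mean = -1), rnorm(n, mean = 1)),
  X2 = c(rnorm(n, mean = -1), rnorm(n, mean = 1))
)

# Manual 2D binning: snap each point to the lower-left corner of its
# 0.1 x 0.1 grid cell, then count how many points land in each cell.
binwidth <- 0.1
bins <- data.frame(
  x_bin = floor(df$X1 / binwidth) * binwidth,
  y_bin = floor(df$X2 / binwidth) * binwidth,
  count = 1
)
bin_counts <- aggregate(count ~ x_bin + y_bin, data = bins, FUN = sum)

# Cells with the most points; color in a binned plot encodes this count.
head(bin_counts[order(-bin_counts$count), ])
```

Because every point contributes to exactly one cell, the counts sum to the number of rows in `df`, and dense regions stay distinguishable even when individual points would overlap.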
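# Scatterplot: with this many points, overplotting hides the structure
ggplot(df) +
  geom_point(aes(x = X1, y = X2))
# 2D binning: fill encodes the number of points in each 0.1 x 0.1 bin
ggplot(df) +
  geom_bin2d(aes(x = X1, y = X2), binwidth = 0.1) +
  scale_fill_viridis_b()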