library(tidyverse)
library(data.table)
library(superheat)
library(viridis)
library(patchwork)
library(rnaturalearth)
library(rnaturalearthdata)
theme_set(theme_minimal())
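# Read the daily state-level CCI table; dates become row names, states become columns
CCI = read.csv("2021_state_CCI.csv", header = TRUE, row.names = "date")
# Flip the table so that rows are states and columns are dates
t_CCI = data.table::transpose(CCI)
colnames(t_CCI) <- rownames(CCI)
rownames(t_CCI) <- colnames(CCI)
# Replace every dot in the state names with a space
rownames(t_CCI) = rownames(t_CCI) %>%
  str_replace_all("\\.", " ")
# Daily CCI averaged over all states
day_avg = colMeans(t_CCI)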
# Heatmap of states by dates, with rows ordered by hierarchical clustering
superheat(t_CCI,
          left.label.text.size = 1.5,
          pretty.order.rows = TRUE,
          # line plot of the daily average above the heatmap
          yt = day_avg,
          yt.plot.type = "line",
          yt.axis.name = "CCI average",
          row.dendrogram = TRUE,
          heat.pal = viridis_pal(direction = -1)(50),
          title = "Heatmap of daily CCI for all US states")
set.seed(16)  # make the random k-means starts reproducible
p = list()
# Cluster the state time series into four groups, as the heatmap suggested
fit = t_CCI %>%
  kmeans(centers = 4, nstart = 20)
# Reshape the centroids to long format: one row per cluster and date
centroids = data.frame(fit$centers) %>%
  rownames_to_column("cluster") %>%
  pivot_longer(-cluster, names_to = "date", values_to = "value")
# data.frame() prefixed the date column names with "X"; parse them back into dates
p[["facet"]] = centroids %>%
  mutate(date = str_remove(date, "X"),
         date = as.Date(date, "%Y.%m.%d")) %>%
  ggplot() +
  geom_point(aes(date, value), size = 0.6) +
  facet_wrap(. ~ cluster, labeller = "label_both") +
  scale_x_date(date_breaks = "3 months", date_labels = "%m-%d") +
  ggtitle("Centroids of Close Contact Index 2021 Time Series Clusters") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x = "Date", y = "Close Contact Index Value")
# State geometries from rnaturalearth
us = ne_states(country = 'United States of America', returnclass = "sf") %>%
  select(name, geometry)
# Cluster membership per state; the membership column is named fit.cluster
mem = data.frame(fit$cluster) %>%
  rownames_to_column("name")
p[["map"]] = us %>%
  left_join(mem, by = "name") %>%
  # keep the contiguous states only
  filter(!name %in% c("Alaska", "Hawaii")) %>%
  ggplot() +
  geom_sf(aes(fill = factor(fit.cluster))) +
  scale_fill_manual(values = c("#6baed6", "#2171b5", "#eff3ff", "#bdd7e7")) +
  labs(fill = "cluster") +
  ggtitle("Map of clusters of states") +
  theme(plot.title = element_text(hjust = 0.5, vjust = 0.1))
# Stack the two plots vertically; heights (not widths) sets their relative sizes
p[["facet"]] / p[["map"]] + plot_layout(heights = c(0.5, 1))
The visualizations try to answer the question: how did interaction between people in the United States change over 2021 in terms of social distancing? I quantify interaction with the Close Contact Index (CCI), which measures whether two or more devices come within 50 feet of each other within a five-minute period.
The heatmap shows the general trend of CCI through 2021, and the line plot on top shows the daily CCI averaged over all states. CCI increased steadily in all states, which is not surprising as people took fewer precautions against COVID and reduced social distancing. Another key finding, which I would not have noticed from plotting the time series alone, is that there seem to be four clusters of states. I think this is the trade-off between a line plot of 51 time series and a heatmap of 51 rows: although the line plot could show every state's CCI individually, it makes clusters harder to identify. The dendrogram on the right shows the hierarchical clustering of the states.
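One rough way to check a visual impression like "four clusters" is to cut a hierarchical clustering tree at four groups and inspect the membership. The sketch below does this on made-up data, not on the CCI file: `toy`, `group_level`, and the `state_` labels are hypothetical, and it assumes the default settings of `dist()` and `hclust()`, which may not match what superheat uses internally.

```r
set.seed(1)

# Hypothetical data: 12 made-up "states" on 5 dates, built around four levels
group_level <- rep(c(0, 5, 10, 15), each = 3)
toy <- matrix(rnorm(12 * 5, mean = rep(group_level, times = 5), sd = 0.1),
              nrow = 12)
rownames(toy) <- paste0("state_", 1:12)

# Hierarchical clustering on Euclidean distances between rows
tree <- hclust(dist(toy))

# Cut the tree into four groups and count the members of each
groups <- cutree(tree, k = 4)
table(groups)
```

On the real matrix, the same `cutree()` call on the state-by-date table would give a membership vector that can be compared against the blocks visible in the heatmap.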
To create this heatmap, I first transposed the data frame so that rows are states and columns are dates. I also calculated the daily average CCI over all states by taking the column sums and dividing by the number of states. I generated the plot with the superheat package, reordered the rows by hierarchical clustering, and added the line plot and dendrogram by setting the corresponding arguments.
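The reshaping and averaging steps can be sketched in base R on a tiny table. The values and the three state columns below are invented for illustration, and `t()` stands in for the `transpose()` call used in the script.

```r
# Hypothetical input: rows are dates, columns are states (dots in place of spaces)
cci <- data.frame(
  Alabama = c(1, 2, 3),
  New.York = c(3, 4, 5),
  District.of.Columbia = c(5, 6, 7),
  row.names = c("2021-01-01", "2021-01-02", "2021-01-03")
)

# Flip the table so that rows are states and columns are dates
t_cci <- t(as.matrix(cci))

# Replace every dot in the state names with a space
rownames(t_cci) <- gsub(".", " ", rownames(t_cci), fixed = TRUE)

# Daily average over states: column sums divided by the number of states
day_avg <- colSums(t_cci) / nrow(t_cci)

# colMeans() computes the same quantity in one step
all.equal(day_avg, colMeans(t_cci))
```

Replacing all dots at once matters for names with more than one space, such as District of Columbia.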
The second plot shows the centroids of four k-means clusters of the CCI time series; I chose four because the heatmap suggested it. The map below it shows the cluster membership of the states. Each centroid is an imaginary representative of the states in its cluster and resembles the behavior of those states. The key findings are: 1) clusters 1, 3, and 4 seem to share the same trend, but states in cluster 1 generally have higher values than states in clusters 3 and 4; 2) states in cluster 3 have the smallest increase in CCI, which suggests that people there were more cautious about social distancing; 3) cluster 2 displays the most distinctive pattern, with a rapid change in CCI at the end of August that might be related to the start of the fall semester. One advantage of a scatter plot over a line plot here is that, when CCI increases suddenly, the gap between consecutive points shows how much it rose within a day, which would be much harder to tell from a line plot. The trade-off is that when points are close to each other it is hard to tell which comes first, which would not be a problem in a line plot. In the map, the members of each cluster are mostly spatially contiguous, which lends some support to the k-means clustering. The states in cluster 2, the cluster with the highest CCI and the darkest color, are mostly in the South.
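To make the idea of a centroid as an "imaginary representative" concrete, the sketch below runs `kmeans()` on made-up data and checks that a centroid is the column-wise mean of the states assigned to that cluster. The matrix `toy` and its `state_` and `day_` labels are hypothetical; only the `kmeans()` arguments mirror the script.

```r
set.seed(16)

# Hypothetical data: 12 made-up "states" on 5 dates, built around four levels
group_level <- rep(c(0, 5, 10, 15), each = 3)
toy <- matrix(rnorm(12 * 5, mean = rep(group_level, times = 5), sd = 0.1),
              nrow = 12)
rownames(toy) <- paste0("state_", 1:12)
colnames(toy) <- paste0("day_", 1:5)

# Same call shape as in the script: four centers, 20 random starts
fit <- kmeans(toy, centers = 4, nstart = 20)

# fit$centers has one row per cluster; fit$cluster has one label per state
dim(fit$centers)
head(fit$cluster)

# The centroid of cluster 1 is the mean time series of its member states
members <- toy[fit$cluster == 1, , drop = FALSE]
centroid_1 <- colMeans(members)
all.equal(unname(fit$centers[1, ]), unname(centroid_1))
```

Because a centroid is an average, it need not coincide with any single state's series.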
To create this plot, I obtained the centroids by running the k-means algorithm on the CCI matrix and pivoting the resulting data frame to long format. I drew the plot with geom_point() and faceted it by cluster. I also removed the grey background by using theme_minimal(). I read in the US state geometries from the rnaturalearth package and drew the map with geom_sf().
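The pivoting step also explains why the script strips an "X" and parses dates with dots: `data.frame()` rewrites column names that are not valid R names. The sketch below reproduces this on a made-up 2 x 2 centroid matrix (the values and the names `centers`, `wide`, and `long` are hypothetical) and uses base R in place of `pivot_longer()`, so the row order differs from the tidyr result.

```r
# Hypothetical centroid matrix: two clusters, two dates
centers <- matrix(c(1, 2, 3, 4), nrow = 2,
                  dimnames = list(c("1", "2"), c("2021-01-01", "2021-01-02")))

# data.frame() makes the date column names syntactically valid
wide <- data.frame(centers)
names(wide)  # "X2021.01.01" "X2021.01.02"

# Long format: one row per (cluster, date) pair, filled column by column
long <- data.frame(
  cluster = rep(rownames(wide), times = ncol(wide)),
  date = rep(names(wide), each = nrow(wide)),
  value = unlist(wide, use.names = FALSE)
)

# Drop the leading "X" and parse the dotted string back into a Date
long$date <- as.Date(sub("^X", "", long$date), format = "%Y.%m.%d")
long
```

With real dates on the x-axis, `scale_x_date()` can then place breaks every three months as in the script.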