library(tidyverse)
source("https://github.com/krisrs1128/stat479_notes/raw/refs/heads/master/activities/03-helpers.R")
set.seed(20251227)
CART Get-out-the-Vote Analysis
Motivation
Campaigns spend significant effort trying to get their voter base to turn out during elections, and they often run randomized trials to see which interventions work (e.g., online ads? phone banks?). The authors of (Gerber, Green, and Larimer 2008) were interested in how social pressure influences voter turnout. They defined four types of mailers and sent them randomly to potential voters in the Michigan 2006 general election.
The neighbor intervention increased voter turnout on average. But the researchers thought that some groups were more strongly influenced than others. Intuitively, there are some people who are already so motivated (or jaded) that the intervention is unlikely to change their behavior. Statistically, this variation is called “treatment effect heterogeneity.” The goal is to find subgroups where the treatment works especially well (or poorly).
Tree methods are well-suited to this task because of their interpretability. When a leaf shows a strong treatment effect, we can describe that subgroup with simple rules, like “age < A”, “voted in last N elections,” etc. This makes voter outreach straightforward.
Dataset
We’ll follow the overall approach in (Künzel et al. 2022), but instead of using linear aggregation CART (a method proposed in that paper), we’ll use plain CART (they also have a slightly more nuanced causal analysis). Let’s load some libraries and helper functions.
The block below reads in the data and renames the columns. The vector treatment_vars refers to the four mailers.
f <- tempfile()
download.file("https://zenodo.org/records/18371236/files/gotv_processed.rds?download=1", f)
social <- readRDS(f)
treatment_vars <- c("civic_duty", "hawthorne", "household", "neighbor")
social
# A tibble: 344,084 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 No 1 0 0 0 yes
2 No 1 0 0 0 yes
3 Yes 0 1 0 0 yes
4 Yes 0 1 0 0 yes
5 Yes 0 1 0 0 yes
6 No 0 0 0 0 no
7 Yes 0 0 0 0 yes
8 Yes 0 0 0 0 yes
9 No 0 0 0 0 no
10 No 0 0 0 0 yes
# ℹ 344,074 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
Classification Tree
The mailers were sent to roughly 340,000 households. The outcome y indicates whether anyone in the household voted in the referenced primary: 32% did and 68% did not. The first split is on whether the targeted individual had previously voted in a primary.
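A class balance like this can be checked with prop.table(table(...)); a minimal sketch on a toy outcome vector (the notebook would run the same call on social$y):

```r
# Toy outcome vector mirroring the reported 68% / 32% split
y <- factor(rep(c("No", "Yes"), times = c(68, 32)))

# Proportion of each outcome level
tab <- round(prop.table(table(y)), 2)
tab
```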
Later on, we’ll test for associations between the treatment assignments and y. We want to avoid “double dipping,” which uses the same data both for modeling and testing and can lead to overoptimistic p-values. This is why the model only uses the samples in ix.
library(rpart)
ix <- sample(nrow(social), 0.75 * nrow(social))
fit <- rpart(y ~ ., data = social[ix, ])
fit
n= 258063
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 258063 81489 No (0.6842283 0.3157717)
2) total_primary_elections< 1.5 187034 48295 No (0.7417849 0.2582151) *
3) total_primary_elections>=1.5 71029 33194 No (0.5326698 0.4673302)
6) total_elections< 4.5 60686 26438 No (0.5643476 0.4356524) *
7) total_elections>=4.5 10343 3587 Yes (0.3468046 0.6531954) *
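The (yprob) values in the printout are also available programmatically through predict(). A minimal sketch on the kyphosis data that ships with rpart (not the GOTV data):

```r
library(rpart)

# Fit a small tree on the bundled kyphosis data
fit_demo <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Per-row class probabilities, i.e. the (yprob) values of the leaf
# each sample lands in
probs <- predict(fit_demo, type = "prob")
head(probs)

# Hard class labels, matching the yval column of the printout
preds <- predict(fit_demo, kyphosis[1:3, ], type = "class")
```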
Hyperparameter Tuning
The fit above uses the default complexity penalty \alpha. It’s better to tune this hyperparameter using holdout samples. Rather than implementing cross validation ourselves, let’s use the mlr3tuning package. There are a few new functions here,
- as_task_classif: Wraps the data and specifies the response y. This allows us to handle \{x_n, y_n\} in a unified object.
- lrn: Defines an rpart classifier and marks cp as a hyperparameter that we want to tune.
- ti: Defines the approach to hyperparameter tuning. Here, we’re using 3-fold cross validation and measuring the misclassification error.
library(mlr3)
library(mlr3tuning)
task <- as_task_classif(social[ix, ], target = "y", id = "social")
learner <- lrn("classif.rpart", cp = to_tune(0.0005, 0.5))
instance <- ti(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
terminator = trm("none")
)
Now that we’ve specified the hyperparameter tuning approach, we can execute it. resolution = 5 means we’ll test out 5 candidate values of cp.
tuner <- tnr("grid_search", resolution = 5)
tuner$optimize(instance)
      cp learner_param_vals x_domain classif.ce
<num> <list> <list> <num>
1: 5e-04 <list[2]> <list[1]> 0.3002523
The previous step internally modifies the instance object so that it includes the tuned cp value. Let’s refit the classification tree with this choice.
fit <- rpart(y ~ ., data = social[ix, ], cp=instance$result$cp)
fit
n= 258063
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 258063 81489 No (0.6842283 0.3157717)
2) total_primary_elections< 1.5 187034 48295 No (0.7417849 0.2582151) *
3) total_primary_elections>=1.5 71029 33194 No (0.5326698 0.4673302)
6) total_elections< 4.5 60686 26438 No (0.5643476 0.4356524)
12) voted_2000_primary=yes 36073 14651 No (0.5938514 0.4061486) *
13) voted_2000_primary=no 24613 11787 No (0.5211067 0.4788933)
26) age< 51.5 12341 5410 No (0.5616239 0.4383761)
52) neighbor< 0.5 10899 4637 No (0.5745481 0.4254519) *
53) neighbor>=0.5 1442 669 Yes (0.4639390 0.5360610) *
27) age>=51.5 12272 5895 Yes (0.4803618 0.5196382)
54) voted_2000_general=yes 10715 5276 Yes (0.4923938 0.5076062)
108) neighbor< 0.5 9586 4766 No (0.5028166 0.4971834)
216) age< 57.5 3300 1531 No (0.5360606 0.4639394) *
217) age>=57.5 6286 3051 Yes (0.4853643 0.5146357)
434) age>=78.5 745 323 No (0.5664430 0.4335570) *
435) age< 78.5 5541 2629 Yes (0.4744631 0.5255369) *
109) neighbor>=0.5 1129 456 Yes (0.4038973 0.5961027) *
55) voted_2000_general=no 1557 619 Yes (0.3975594 0.6024406) *
7) total_elections>=4.5 10343 3587 Yes (0.3468046 0.6531954) *
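As an aside, rpart also runs its own internal cross validation while fitting, so the cptable component offers a lighter-weight way to pick cp than a full mlr3 pipeline. A sketch on the kyphosis data bundled with rpart (not the GOTV data):

```r
library(rpart)

# Grow a deliberately large tree on the bundled kyphosis data
fit_demo <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0.0005)

# cptable records the cross-validated error (xerror) at each cp value
printcp(fit_demo)

# Prune back to the cp with the smallest cross-validated error
best_cp <- fit_demo$cptable[which.min(fit_demo$cptable[, "xerror"]), "CP"]
pruned <- prune(fit_demo, cp = best_cp)
```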
Interpretation
Visualization
First let’s orient ourselves to the visualization. In each node,
- Color and the Yes/No label give the prediction of whether descending samples will vote.
- The fraction (like 0.32 in the root) gives the proportion of descendants who vote.
- The percentage (like 100% in the root) gives the proportion of the dataset that descends from that node.
- The split rule appears just below a node. If the condition is true, we proceed down the right branch.
library(rpart.plot)
rpart.plot(fit, type = 2, extra = 106, cex = .8, left = FALSE)
There are two questions we can answer by looking at the resulting tree:
- Who is most likely to vote? For example, the first split suggests that study participants who had voted in at least two primary elections were more likely to vote in this one.
- How did the treatment affect different subgroups? For example, in the bottom left, it looks like there is a split where neighbor = 1 changed the voting probability from 0.43 to 0.54.
Even if a leaf doesn’t have any of the treatment types as an ancestor node, we can still test whether there is an association between treatment status and voting within the subgroup defined by the leaf. To do this, we first divide the full dataset into subsets depending on the leaf each sample falls into. This is a bit tricky, so let’s look at the full set of nodes.
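The mechanics of dividing a dataset by leaf can be sketched on the kyphosis data bundled with rpart (not the GOTV data):

```r
library(rpart)

# Fit a small tree on the bundled kyphosis data
fit_demo <- rpart(Kyphosis ~ ., data = kyphosis)

# fit_demo$where maps each training row to its row in fit_demo$frame,
# so indexing the frame's rownames recovers each sample's leaf node ID
leaf_ids <- rownames(fit_demo$frame)[fit_demo$where]
groups <- split(kyphosis, leaf_ids)

# Group sizes sum back to the full dataset
sapply(groups, nrow)
```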
Subgroup Analysis
In this last section, we’ll test whether there is an association between the treatment (mailer) assignment and the response (voting in the primary). Rather than testing an overall association, we’re going to test within each of the partition elements induced by the optimized classification tree. This allows us to see whether there are treatment effects that are stronger/weaker within particular subpopulations.
Let’s revisit the learned tree structure.
fit$frame[, 1:4]
                       var      n     wt   dev
1 total_primary_elections 258063 258063 81489
2 <leaf> 187034 187034 48295
3 total_elections 71029 71029 33194
6 voted_2000_primary 60686 60686 26438
12 <leaf> 36073 36073 14651
13 age 24613 24613 11787
26 neighbor 12341 12341 5410
52 <leaf> 10899 10899 4637
53 <leaf> 1442 1442 669
27 voted_2000_general 12272 12272 5895
54 neighbor 10715 10715 5276
108 age 9586 9586 4766
216 <leaf> 3300 3300 1531
217 age 6286 6286 3051
434 <leaf> 745 745 323
435 <leaf> 5541 5541 2629
109 <leaf> 1129 1129 456
55 <leaf> 1557 1557 619
7 <leaf> 10343 10343 3587
We only need the leaf nodes from this set. A trick is to use fit$where to isolate the appropriate IDs.
fit$frame[unique(fit$where), 1:4]
       var      n     wt   dev
2 <leaf> 187034 187034 48295
7 <leaf> 10343 10343 3587
12 <leaf> 36073 36073 14651
52 <leaf> 10899 10899 4637
216 <leaf> 3300 3300 1531
55 <leaf> 1557 1557 619
435 <leaf> 5541 5541 2629
434 <leaf> 745 745 323
109 <leaf> 1129 1129 456
53 <leaf> 1442 1442 669
Now we divide samples according to these leaf IDs. Notice we’re only using samples that hadn’t been used during model fitting, to avoid the double dipping problem.
leaf_groups <- split(social[-ix, ], rownames(fit$frame)[fit$where])
head(leaf_groups, 3)
$`109`
# A tibble: 407 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 Yes 0 0 1 0 yes
2 No 0 0 0 0 yes
3 No 0 0 0 0 no
4 No 0 0 1 0 yes
5 No 0 0 0 1 no
6 Yes 1 0 0 0 yes
7 Yes 0 0 0 0 yes
8 Yes 1 0 0 0 yes
9 No 0 0 0 0 yes
10 Yes 0 0 0 0 yes
# ℹ 397 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
$`12`
# A tibble: 11,997 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 No 0 0 1 0 yes
2 Yes 0 0 0 0 yes
3 No 0 0 0 0 yes
4 Yes 0 0 0 1 yes
5 No 1 0 0 0 yes
6 Yes 0 0 0 0 yes
7 No 0 0 0 0 yes
8 Yes 0 0 0 0 yes
9 Yes 0 0 0 0 yes
10 Yes 0 0 0 1 yes
# ℹ 11,987 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
$`2`
# A tibble: 62,281 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 No 1 0 0 0 yes
2 No 0 0 0 0 no
3 No 0 0 0 0 yes
4 No 0 0 0 0 yes
5 Yes 0 0 0 1 no
6 No 0 0 0 0 yes
7 No 1 0 0 0 yes
8 No 0 0 0 0 yes
9 Yes 1 0 0 0 yes
10 No 0 0 0 0 yes
# ℹ 62,271 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
In each of those leaves, we can test for an association between treatment status and voting. The leaf_pvalue helper function is defined in the .R file sourced at the start of this notebook (it’s just a chi-square test).
p_values <- map_dbl(leaf_groups, leaf_pvalue) |> sort()
p_values
           2           12          435            7          216           53
4.531800e-35 2.377988e-06 1.670487e-04 2.619645e-02 6.714801e-02 1.027085e-01
52 109 55 434
1.061176e-01 1.973461e-01 3.182416e-01 9.770038e-01
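For reference, here is a minimal sketch of what such a helper might look like; the actual leaf_pvalue in 03-helpers.R may differ, and treating the neighbor mailer as the treatment indicator is an assumption made only for this illustration:

```r
# Hypothetical sketch of a chi-square leaf test; the real leaf_pvalue
# (from 03-helpers.R) may use a different treatment definition
leaf_pvalue_sketch <- function(df, treatment = "neighbor") {
  chisq.test(table(df[[treatment]], df$y))$p.value
}

# Toy subgroup: 100 control then 100 treated households, with turnout
# rising from 30% to 45%
toy <- data.frame(
  neighbor = rep(c(0, 1), each = 100),
  y = factor(rep(c("No", "Yes", "No", "Yes"), times = c(70, 30, 55, 45)))
)
pv <- leaf_pvalue_sketch(toy)
pv
```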
There’s a strong association for the leaf with ID = 2. In this subgroup, it looks like the treatment is associated with an increase in voting probability from 0.31 to 0.38 (of course, these are just estimates – can you think of how to add error bars?).
target_node <- "2"
leaf_props(leaf_groups[[target_node]])
No Yes
0 0.69 0.31
1 0.62 0.38
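As one answer to the error-bar question: a two-sample proportion test gives a confidence interval for the difference in voting rates. The counts below are illustrative only, not taken from this leaf:

```r
# Illustrative counts (NOT the actual leaf counts): 380 of 1000 treated
# households voted vs 310 of 1000 control households
test_out <- prop.test(x = c(380, 310), n = c(1000, 1000))

# Confidence interval for the treated-minus-control difference
ci <- test_out$conf.int
round(ci, 3)
```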
What were the characteristics of participants in this leaf? Apparently, these were the participants who had never voted in a primary before.
path.rpart(fit, nodes = as.integer(target_node))
node number: 2
root
total_primary_elections< 1.5
Here is the next most significant group. The voting probability increased from 0.29 to 0.36 among participants with some prior political engagement.
target_node <- "12"
print(leaf_props(leaf_groups[[target_node]]))
No Yes
0 0.71 0.29
1 0.64 0.36
path.rpart(fit, nodes = as.integer(target_node))
node number: 12
root
total_primary_elections>=1.5
total_elections< 4.5
voted_2000_primary=yes