library(tidyverse)
source("https://github.com/krisrs1128/stat479_notes/raw/refs/heads/master/activities/03-helpers.R")
set.seed(20251227)
CART Get-out-the-Vote Analysis
Motivation
Campaigns spend significant effort trying to get their voter base to turn out during elections, and they often run randomized trials to see which interventions work (e.g., online ads? phone banks?). The authors of (Gerber, Green, and Larimer 2008) were interested in how social pressure influences voter turnout. They defined four types of mailers and sent them randomly to potential voters in the Michigan 2006 general election.
The neighbor intervention increased voter turnout on average. But the researchers thought that some groups were more strongly influenced than others. Intuitively, there are some people who are already so motivated (or jaded) that the intervention is unlikely to change their behavior. Statistically, this variation is called “treatment effect heterogeneity.” The goal is to find subgroups where the treatment works especially well (or poorly).
Tree methods are well-suited to this task because of their interpretability. When a leaf shows a strong treatment effect, we can describe that subgroup with simple rules, like “age < A”, “voted in last N elections,” etc. This makes voter outreach straightforward.
Dataset
We’ll follow the overall approach in (Künzel et al. 2022), but instead of using linear aggregation CART (a method proposed in that paper), we’ll use plain CART (they also have a slightly more nuanced causal analysis). Let’s load some libraries and helper functions.
The block below reads in the data and renames the columns. The vector treatment_vars refers to the four mailers.
f <- tempfile()
download.file("https://zenodo.org/records/18371236/files/gotv_processed.rds?download=1", f)
social <- readRDS(f)
treatment_vars <- c("civic_duty", "hawthorne", "household", "neighbor")
social
# A tibble: 344,084 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 No 1 0 0 0 yes
2 No 1 0 0 0 yes
3 Yes 0 1 0 0 yes
4 Yes 0 1 0 0 yes
5 Yes 0 1 0 0 yes
6 No 0 0 0 0 no
7 Yes 0 0 0 0 yes
8 Yes 0 0 0 0 yes
9 No 0 0 0 0 no
10 No 0 0 0 0 yes
# ℹ 344,074 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
Classification Tree
The mailers were sent to roughly 340,000 households. The outcome y indicates whether anyone in the household voted in the referenced primary: 32% did and 68% did not. The first split is on whether the targeted individual had previously voted in a primary.
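A class balance like this can be checked with prop.table(table(...)); a minimal sketch on a toy outcome vector (the notebook would run the same call on social$y):

```r
# Toy outcome vector mirroring the reported 68% / 32% split
y <- factor(rep(c("No", "Yes"), times = c(68, 32)))

# Proportion of each outcome level
tab <- round(prop.table(table(y)), 2)
tab
```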
Later on, we’ll test for associations between the treatment assignments and y. We want to avoid “double dipping,” which uses the same data both for modeling and testing and can lead to overoptimistic p-values. This is why the model only uses the samples in ix.
library(rpart)
ix <- sample(nrow(social), 0.75 * nrow(social))
fit <- rpart(y ~ ., data = social[ix, ])
fit
n= 258063
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 258063 81489 No (0.6842283 0.3157717)
2) total_primary_elections< 1.5 187034 48295 No (0.7417849 0.2582151) *
3) total_primary_elections>=1.5 71029 33194 No (0.5326698 0.4673302)
6) total_elections< 4.5 60686 26438 No (0.5643476 0.4356524) *
7) total_elections>=4.5 10343 3587 Yes (0.3468046 0.6531954) *
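The (yprob) values in the printout are also available programmatically through predict(). A minimal sketch on the kyphosis data that ships with rpart (not the GOTV data):

```r
library(rpart)

# Fit a small tree on the bundled kyphosis data
fit_demo <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Per-row class probabilities, i.e. the (yprob) values of the leaf
# each sample lands in
probs <- predict(fit_demo, type = "prob")
head(probs)

# Hard class labels, matching the yval column of the printout
preds <- predict(fit_demo, kyphosis[1:3, ], type = "class")
```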
Hyperparameter Tuning
The fit above uses the default complexity penalty \alpha. It’s better to tune this hyperparameter using holdout samples. Rather than implementing cross validation ourselves, let’s use the mlr3tuning package. There are a few new functions here,
- as_task_classif: Wraps the data and specifies the response y. This allows us to handle \{x_n, y_n\} in a unified object.
- lrn: Defines an rpart classifier and marks cp as a hyperparameter that we want to tune.
- ti: Defines the approach to hyperparameter tuning. Here, we’re using 3-fold cross validation and measuring the misclassification error.
library(mlr3)
library(mlr3tuning)
task <- as_task_classif(social[ix, ], target = "y", id = "social")
learner <- lrn("classif.rpart", cp = to_tune(0.0005, 0.5))
instance <- ti(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
terminator = trm("none")
)
Now that we’ve specified the hyperparameter tuning approach, we can execute it. resolution = 5 means we’ll test out 5 candidate values of cp.
tuner <- tnr("grid_search", resolution = 5)
tuner$optimize(instance)
      cp learner_param_vals x_domain classif.ce
<num> <list> <list> <num>
1: 5e-04 <list[2]> <list[1]> 0.3002523
The previous step internally modifies the instance object so that it includes the tuned cp value. Let’s refit the classification tree with this choice.
fit <- rpart(y ~ ., data = social[ix, ], cp=instance$result$cp)
fit
n= 258063
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 258063 81489 No (0.6842283 0.3157717)
2) total_primary_elections< 1.5 187034 48295 No (0.7417849 0.2582151) *
3) total_primary_elections>=1.5 71029 33194 No (0.5326698 0.4673302)
6) total_elections< 4.5 60686 26438 No (0.5643476 0.4356524)
12) voted_2000_primary=yes 36073 14651 No (0.5938514 0.4061486) *
13) voted_2000_primary=no 24613 11787 No (0.5211067 0.4788933)
26) age< 51.5 12341 5410 No (0.5616239 0.4383761)
52) neighbor< 0.5 10899 4637 No (0.5745481 0.4254519) *
53) neighbor>=0.5 1442 669 Yes (0.4639390 0.5360610) *
27) age>=51.5 12272 5895 Yes (0.4803618 0.5196382)
54) voted_2000_general=yes 10715 5276 Yes (0.4923938 0.5076062)
108) neighbor< 0.5 9586 4766 No (0.5028166 0.4971834)
216) age< 57.5 3300 1531 No (0.5360606 0.4639394) *
217) age>=57.5 6286 3051 Yes (0.4853643 0.5146357)
434) age>=78.5 745 323 No (0.5664430 0.4335570) *
435) age< 78.5 5541 2629 Yes (0.4744631 0.5255369) *
109) neighbor>=0.5 1129 456 Yes (0.4038973 0.5961027) *
55) voted_2000_general=no 1557 619 Yes (0.3975594 0.6024406) *
7) total_elections>=4.5 10343 3587 Yes (0.3468046 0.6531954) *
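As an aside, rpart also runs its own internal cross validation while fitting, so the cptable component offers a lighter-weight way to pick cp than a full mlr3 pipeline. A sketch on the kyphosis data bundled with rpart (not the GOTV data):

```r
library(rpart)

# Grow a deliberately large tree on the bundled kyphosis data
fit_demo <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0.0005)

# cptable records the cross-validated error (xerror) at each cp value
printcp(fit_demo)

# Prune back to the cp with the smallest cross-validated error
best_cp <- fit_demo$cptable[which.min(fit_demo$cptable[, "xerror"]), "CP"]
pruned <- prune(fit_demo, cp = best_cp)
```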
Interpretation
Visualization
First let’s orient ourselves to the visualization. In each node,
- Color and the Yes/No label give the prediction of whether descending samples will vote.
- The fraction (like 0.32 in the root) gives the proportion of descendants who vote.
- The percentage (like 100% in the root) gives the proportion of the dataset that descends from that node.
- The split rule appears just below a node. If the condition is true, we proceed down the right branch.
library(rpart.plot)
rpart.plot(fit, type = 2, extra = 106, cex = .8, left = FALSE)
There are two questions we can answer by looking at the resulting tree:
- Who is most likely to vote? For example, the first split suggests that study participants who had voted in at least two primary elections were more likely to vote in this one.
- How did the treatment affect different subgroups? For example, in the bottom left, it looks like there is a split where neighbor = 1 changed the voting probability from 0.43 to 0.54.
Even if a leaf doesn’t have any of the treatment types as an ancestor node, we can still test whether there is an association between treatment status and voting within the subgroup defined by the leaf. To do this, we first divide the full dataset into subsets depending on the leaf each sample falls into. This is a bit tricky, so let’s look at the full set of nodes.
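The mechanics of dividing a dataset by leaf can be sketched on the kyphosis data bundled with rpart (not the GOTV data):

```r
library(rpart)

# Fit a small tree on the bundled kyphosis data
fit_demo <- rpart(Kyphosis ~ ., data = kyphosis)

# fit_demo$where maps each training row to its row in fit_demo$frame,
# so indexing the frame's rownames recovers each sample's leaf node ID
leaf_ids <- rownames(fit_demo$frame)[fit_demo$where]
groups <- split(kyphosis, leaf_ids)

# Group sizes sum back to the full dataset
sapply(groups, nrow)
```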
Subgroup Analysis
In this last section, we’ll test whether there is an association between the treatment (mailer) assignment and the response (voting in the primary). Rather than testing an overall association, we’re going to test within each of the partition elements induced by the optimized classification tree. This allows us to see whether there are treatment effects that are stronger/weaker within particular subpopulations.
Let’s revisit the learned tree structure.
fit$frame[, 1:4]
                       var      n     wt   dev
1 total_primary_elections 258063 258063 81489
2 <leaf> 187034 187034 48295
3 total_elections 71029 71029 33194
6 voted_2000_primary 60686 60686 26438
12 <leaf> 36073 36073 14651
13 age 24613 24613 11787
26 neighbor 12341 12341 5410
52 <leaf> 10899 10899 4637
53 <leaf> 1442 1442 669
27 voted_2000_general 12272 12272 5895
54 neighbor 10715 10715 5276
108 age 9586 9586 4766
216 <leaf> 3300 3300 1531
217 age 6286 6286 3051
434 <leaf> 745 745 323
435 <leaf> 5541 5541 2629
109 <leaf> 1129 1129 456
55 <leaf> 1557 1557 619
7 <leaf> 10343 10343 3587
We only need the leaf nodes from this set. A trick is to use fit$where to isolate the appropriate IDs.
fit$frame[unique(fit$where), 1:4]
       var      n     wt   dev
2 <leaf> 187034 187034 48295
7 <leaf> 10343 10343 3587
12 <leaf> 36073 36073 14651
52 <leaf> 10899 10899 4637
216 <leaf> 3300 3300 1531
55 <leaf> 1557 1557 619
435 <leaf> 5541 5541 2629
434 <leaf> 745 745 323
109 <leaf> 1129 1129 456
53 <leaf> 1442 1442 669
Now we divide samples according to these leaf IDs. Notice we’re only using samples that hadn’t been used during model fitting, to avoid the double dipping problem.
leaf_groups <- split(social[-ix, ], rownames(fit$frame)[fit$where])
head(leaf_groups, 3)
$`109`
# A tibble: 407 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 Yes 0 0 1 0 yes
2 No 0 0 0 0 yes
3 No 0 0 0 0 no
4 No 0 0 1 0 yes
5 No 0 0 0 1 no
6 Yes 1 0 0 0 yes
7 Yes 0 0 0 0 yes
8 Yes 1 0 0 0 yes
9 No 0 0 0 0 yes
10 Yes 0 0 0 0 yes
# ℹ 397 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
$`12`
# A tibble: 11,997 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 No 0 0 1 0 yes
2 Yes 0 0 0 0 yes
3 No 0 0 0 0 yes
4 Yes 0 0 0 1 yes
5 No 1 0 0 0 yes
6 Yes 0 0 0 0 yes
7 No 0 0 0 0 yes
8 Yes 0 0 0 0 yes
9 Yes 0 0 0 0 yes
10 Yes 0 0 0 1 yes
# ℹ 11,987 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
$`2`
# A tibble: 62,281 × 14
y civic_duty hawthorne household neighbor voted_2000_general
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 No 1 0 0 0 yes
2 No 0 0 0 0 no
3 No 0 0 0 0 yes
4 No 0 0 0 0 yes
5 Yes 0 0 0 1 no
6 No 0 0 0 0 yes
7 No 1 0 0 0 yes
8 No 0 0 0 0 yes
9 Yes 1 0 0 0 yes
10 No 0 0 0 0 yes
# ℹ 62,271 more rows
# ℹ 8 more variables: voted_2002_general <fct>, voted_2000_primary <fct>,
# voted_2002_primary <fct>, voted_2004_primary <fct>, age <dbl>,
# total_elections <int>, total_general_elections <int>,
# total_primary_elections <int>
In each of those leaves, we can test for an association between treatment status and voting. The leaf_pvalue helper function is defined in the .R file sourced at the start of this notebook (it’s just a chi-square test).
p_values <- map_dbl(leaf_groups, leaf_pvalue) |> sort()
p_values
           2           12          435            7          216           53
4.531800e-35 2.377988e-06 1.670487e-04 2.619645e-02 6.714801e-02 1.027085e-01
52 109 55 434
1.061176e-01 1.973461e-01 3.182416e-01 9.770038e-01
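For reference, here is a minimal sketch of what such a helper might look like; the actual leaf_pvalue in 03-helpers.R may differ, and treating the neighbor mailer as the treatment indicator is an assumption made only for this illustration:

```r
# Hypothetical sketch of a chi-square leaf test; the real leaf_pvalue
# (from 03-helpers.R) may use a different treatment definition
leaf_pvalue_sketch <- function(df, treatment = "neighbor") {
  chisq.test(table(df[[treatment]], df$y))$p.value
}

# Toy subgroup: 100 control then 100 treated households, with turnout
# rising from 30% to 45%
toy <- data.frame(
  neighbor = rep(c(0, 1), each = 100),
  y = factor(rep(c("No", "Yes", "No", "Yes"), times = c(70, 30, 55, 45)))
)
pv <- leaf_pvalue_sketch(toy)
pv
```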
There’s a strong association for the leaf with ID = 2. In this subgroup, it looks like the treatment is associated with an increase in voting probability from 0.31 to 0.38 (of course, these are just estimates – can you think of how to add error bars?).
target_node <- "2"
leaf_props(leaf_groups[[target_node]])
No Yes
0 0.69 0.31
1 0.62 0.38
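As one answer to the error-bar question: a two-sample proportion test gives a confidence interval for the difference in voting rates. The counts below are illustrative only, not taken from this leaf:

```r
# Illustrative counts (NOT the actual leaf counts): 380 of 1000 treated
# households voted vs 310 of 1000 control households
test_out <- prop.test(x = c(380, 310), n = c(1000, 1000))

# Confidence interval for the treated-minus-control difference
ci <- test_out$conf.int
round(ci, 3)
```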
What were the characteristics of participants in this leaf? Apparently, these were the participants who had never voted in a primary before.
path.rpart(fit, nodes = as.integer(target_node))
node number: 2
root
total_primary_elections< 1.5
Here is the next most significant group. The voting probability increased from 0.29 to 0.36 among participants with some prior political engagement.
target_node <- "12"
print(leaf_props(leaf_groups[[target_node]]))
No Yes
0 0.71 0.29
1 0.64 0.36
path.rpart(fit, nodes = as.integer(target_node))
node number: 12
root
total_primary_elections>=1.5
total_elections< 4.5
voted_2000_primary=yes