Readings: 1 (required), 2 (optional), Code
Items marked \(^{\dagger}\) are not in the required reading and will not be tested. The required reading covers some methods beyond this handout; those also won’t be tested.
Goal. Given a model \(f\) and a sample \(x\in\reals^{D}\), efficiently compute the SHAP value \(\varphi_d(x, f)\), the contribution of feature \(d\) to the prediction \(f(x)\).
Approach.
There are two general strategies for efficient Shapley value computation: stochastic approximation, which replaces the exact sum with a sampling estimate, and model-specific methods, which exploit the structure of \(f\). We start with the stochastic route.
Let’s first see why naive SHAP computation is slow. Recall the key quantities,
Feature \(d\)’s attribution, \[ \varphi_d(f, x) = \sum_{S \subseteq \mathcal{D} - \{d\}} \frac{1}{D {D - 1 \choose \left|S\right|}} C(d \vert S) \tag{1}\] where \(C(d \mid S) = v(S \cup \{d\}) - v(S)\) is the marginal contribution of \(d\) to coalition \(S\). Both \(C\) and \(v\) depend on \(f\) and \(x\), but we suppress this from our notation. We also write \(\varphi_d(x)\) instead of \(\varphi_d(f, x)\), since \(f\) is fixed in these notes.
The value function \(v(S)\) evaluates \(f\) on \(x\) with the features outside of \(S\) “removed.” We covered three removal strategies in earlier notes.
To compute all \(D\) attributions for one sample, we need \(v(S)\) for every \(S \subseteq \mathcal{D}\). Using marginal feature removal, each \(v(S)\) costs \(B\) evaluations of \(f\). Let’s try a Fermi calculation. Take \(N = 100, D = 10, B = 100\). Attribution across all samples and features requires \[10^2 \text{ samples} \times 2^{10} \text{ subsets} \times 10^2 \text{ calls per subset} \approx 10^7 \text{ evaluations}.\] Supposing 1ms per call, this takes \(\approx 3\) hours (and this is only 100 samples with 10 features). More generally, marginal removal costs \(NB\,2^D\) evaluations.
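The arithmetic is easy to double-check in a couple of lines (the 1 ms/call figure is the assumption from the text):

```python
N, D, B = 100, 10, 100          # samples, features, background size
evals = N * (2 ** D) * B        # marginal removal: N x 2^D subsets x B calls each
hours = evals * 1e-3 / 3600     # assuming 1 ms per model evaluation
print(f"{evals:.1e} evaluations, about {hours:.1f} hours")
```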
Exercise: What are the relative costs of marginal vs. baseline feature removal?
The weights in Equation 1 define a probability distribution, \[ \mathbb{P}(S) = \frac{1}{D \binom{D-1}{|S|}}, \] because they are nonnegative and sum to one. This lets us rewrite \(\varphi_d(x)\) as an expectation, \[\begin{align} \label{eq-distribution} \varphi_d(x) &= \sum_{S \subseteq \mathcal{D} \setminus \{d\}} \mathbb{P}(S)\, C(d \mid S) = \mathbb{E}_{\mathbb{P}(S)}[C(d \mid S)]. \end{align}\] i.e., the Shapley value is the expected contribution of feature \(d\) to a random coalition.
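We can confirm numerically that the weights sum to one. A small enumeration for \(D = 4\) and \(d = 1\) (the specific indices are illustrative):

```python
from itertools import combinations
from math import comb

D, d = 4, 1
others = [j for j in range(1, D + 1) if j != d]   # the features in D \ {d}
total = 0.0
for k in range(D):                                # |S| ranges over 0, ..., D - 1
    for S in combinations(others, k):
        total += 1 / (D * comb(D - 1, k))
print(total)   # ≈ 1.0
```

Note that for each size \(k\), the \(\binom{D-1}{k}\) subsets contribute \(1/D\) in total, so \(|S|\) is uniform over \(\{0, \dots, D-1\}\).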
Instead of enumerating all \(2^{D-1}\) subsets to get the exact expectation, we sample \(S_1, \dots, S_M \sim \mathbb{P}(S)\) and estimate \[ \hat{\varphi}_d(x) = \frac{1}{M}\sum_{m=1}^{M} C(d \mid S_m). \tag{2}\] This is unbiased (\(\mathbb{E}[\hat{\varphi}_d(x)] = \varphi_d(x)\)) and increasing \(M\) improves the approximation.
Exercise: Suppose \(\mathcal{D} = \{1, \dots, 4\}\) and \(d = 1\). What are the \(\mathbb{P}(S)\) for the sets in Equation 1?
To draw \(S_{m} \sim \mathbb{P}(S)\) in practice, we use random permutations. Draw \(\pi : \{1, \dots, D\} \to \{1, \dots, D\}\) uniformly at random. For example, for 10 features we could use,
pi <- sample(1:10)
pi
[1] 9 1 6 5 3 4 7 10 8 2
Let \(\text{Pre}^{d}\left(\pi\right)\) denote the features appearing before \(d\) in \(\pi\). E.g., for \(d = 4\),
pi[seq_len(match(4, pi) - 1)]
[1] 9 1 6 5 3
It’s not obvious, but \(\text{Pre}^d\left(\pi\right)\) gives a sample from \(\mathbb{P}\left(S\right)\): the position of \(d\) in \(\pi\) is uniform, so \(\left|\text{Pre}^d(\pi)\right|\) is uniform over \(\{0, \dots, D - 1\}\), and conditional on its size the prefix is uniform among subsets of that size, which is exactly \(\mathbb{P}(S)\).
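One way to convince yourself is empirically: sample many permutations and compare the prefix-set frequencies with \(\mathbb{P}(S)\). A sketch with \(D = 4\) and \(d = 1\):

```python
import random
from math import comb
from collections import Counter

random.seed(0)
D, d, M = 4, 1, 200_000
counts = Counter()
for _ in range(M):
    perm = random.sample(range(1, D + 1), D)   # a uniformly random permutation
    S = frozenset(perm[:perm.index(d)])        # Pre^d(pi)
    counts[S] += 1

# empirical frequency vs. P(S) = 1 / (D * C(D-1, |S|))
for S, c in sorted(counts.items(), key=lambda t: len(t[0])):
    print(sorted(S), round(c / M, 3), round(1 / (D * comb(D - 1, len(S))), 3))
```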
To summarize, the procedure is,
for m = 1, ..., M:
    sample a random permutation π_m over D features
    set S_m = Pre^d(π_m)
    compute C(d | S_m)
return average of C(d | S_m) over m
This is often called IME (“Interactions-based Method for Explanation”). Larger \(M\) gives better approximations, but at higher computational cost.
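The procedure translates directly into Python. Everything below is illustrative: the toy model with an interaction, the Gaussian background data, and marginal removal implemented by overwriting background columns are assumptions for the sketch, not part of any package.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(X):
    # toy model with an interaction between features 1 and 2
    return X[:, 0] + 2 * X[:, 1] * X[:, 2]

def v(S, x, background):
    """Marginal removal: features outside S are replaced by background draws."""
    Z = background.copy()
    Z[:, S] = x[S]                  # pin features in S to their observed values
    return model(Z).mean()

def ime(x, d, background, M=1000):
    """Permutation-sampling (IME) estimate of phi_d(x)."""
    D = x.shape[0]
    contribs = np.empty(M)
    for m in range(M):
        perm = rng.permutation(D)
        pos = int(np.where(perm == d)[0][0])
        S = perm[:pos]              # Pre^d(pi_m)
        contribs[m] = v(np.append(S, d).astype(int), x, background) - v(S, x, background)
    return contribs.mean()

background = rng.normal(size=(200, 3))
x = np.array([1.0, 0.5, -2.0])
phis = [ime(x, d, background) for d in range(3)]
print([round(p, 2) for p in phis])
```

The estimates satisfy efficiency approximately: they sum to roughly \(f(x) - \frac{1}{B}\sum_i f(x_i)\).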
Adaptive sampling. The terms \(C\left(d \vert S_{m}\right)\) are random variables with mean \(\varphi_{d}\left(x\right)\).
(picture of their histograms across \(d\))
For features that are irrelevant, the contributions are tightly clustered near zero, so even small values of \(M\) are enough. Inspired by this observation, adaptive sampling tracks the variance of \(C(d \mid S_m)\) across features and allocates more samples to high-variance features.
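A minimal sketch of such an allocation rule. The callable `draw_contribution`, the warm-up size, and the toy illustration below are all hypothetical:

```python
import numpy as np

def adaptive_allocate(draw_contribution, D, budget, warmup=10):
    """Spend `budget` draws of C(d | S_m), favoring high-variance features."""
    samples = [[draw_contribution(d) for _ in range(warmup)] for d in range(D)]
    for _ in range(budget - warmup * D):
        # variance of each feature's running-mean estimate
        sem = [np.var(s, ddof=1) / len(s) for s in samples]
        d = int(np.argmax(sem))     # the most uncertain feature gets the next draw
        samples[d].append(draw_contribution(d))
    return [np.mean(s) for s in samples], [len(s) for s in samples]

# Illustration: feature 0 is irrelevant (contributions exactly 0),
# feature 1 has noisy contributions centered at 1.
rng = np.random.default_rng(0)
draw = lambda d: 0.0 if d == 0 else rng.normal(1.0, 1.0)
est, n_draws = adaptive_allocate(draw, D=2, budget=200)
print(est, n_draws)   # nearly all draws go to feature 1
```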
Antithetic sampling. Recall that \[ \text{Var}(Z_1 + Z_2) = \text{Var}(Z_1) + \text{Var}(Z_2) + 2\,\text{Cov}(Z_1, Z_2), \] so negatively correlated terms reduce the variance of the sum.
Exercise: If \(\text{Cor}(Z_1, Z_2) = -1\) and \(Z_1, Z_2\) have the same variance, what is \(\text{Var}(Z_1 + Z_2)\)?
More generally, \[\text{Var}\!\left(\sum_{m=1}^M Z_m\right) = \sum_m \text{Var}(Z_m) + \sum_{m \neq m'} \text{Cov}(Z_m, Z_{m'}),\] so again negative pairwise correlations reduce variance.
Applying this to Equation 2, we can induce negative correlation between \(C\left(d \mid S_m\right)\) and \(C\left(d \mid S_{m'}\right)\) by pairing each \(S_m\) with its complement \(\mathcal{D} \setminus \left(S_m \cup \{d\}\right)\), i.e., drawing each permutation together with its reversal.
Despite its simplicity, this yields state-of-the-art variance reductions (Mitchell et al. 2022).
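Concretely, reversing a permutation turns the prefix before \(d\) into the complementary suffix. A sketch, where `contribution` is a caller-supplied stand-in for \(C(d \mid S)\):

```python
import numpy as np

def antithetic_estimate(contribution, d, D, M, seed=0):
    """Average C(d | S_m) over M/2 antithetic pairs of permutations.

    `contribution(S)` returns C(d | S) for a set S of feature indices.
    """
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(M // 2):
        perm = rng.permutation(D)
        for p in (perm, perm[::-1]):            # a permutation and its reversal
            pos = int(np.where(p == d)[0][0])
            draws.append(contribution(set(p[:pos].tolist())))
    return float(np.mean(draws))

# Illustration: if C(d | S) = 1 + |S|, the two prefixes in each pair have sizes
# summing to D - 1, so every pair averages to exactly 1 + (D - 1) / 2.
phi_hat = antithetic_estimate(lambda S: 1 + len(S), d=2, D=5, M=10)
print(phi_hat)   # 3.0, with zero sampling variance
```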
A classical result in game theory (Charnes et al. 1988) shows that SHAP values can be characterized through a weighted least squares problem, \[ \varphi\left(x\right) = \arg \min_{\beta \in \reals^{D + 1}} \sum_{S \subset \mathcal{D}} w(S) \left(v(S) - \beta_0 - \sum_{d \in S}\beta_{d}\right)^2 \tag{3}\] where \(w(S) = \left[\binom{D}{|S|} |S|(D - |S|)\right]^{-1}\) and \(\varphi(x) = (\varphi_1(x), \dots, \varphi_D(x))^\top\).
This characterization still enumerates \(2^D\) subsets, but it suggests an approximation: sample random subsets \(S_1, \dots, S_M\) and solve the weighted regression on the reduced design. This is “KernelSHAP.”
Let \(z_{md} = \mathbf{1}[d \in S_m]\) be the associated design matrix. The regression suggests \(v(\emptyset) \approx \beta_0\) and \[ v(S_m) - v(\emptyset) \approx \sum_{d = 1}^{D} z_{md}\beta_d. \tag{4}\] The efficiency constraint becomes, \[ v(\mathcal{D}) - v(\emptyset) \approx \sum_{d = 1}^{D} \beta_d =: \Delta. \] We can improve the quality of our approximation by enforcing these constraints, substituting \(v(\emptyset) = \Esubarg{p(x)}{f(X)} \approx \frac{1}{N}\sum_{i = 1}^{N} f(x_i)\) and \(v(\mathcal{D}) = f(x).\)
Exercise: In fact, the regression objective ensures that \(v\left(\emptyset\right) = \beta_0\) and \(v(\mathcal{D}) - v(\emptyset) = \Delta\) hold exactly. Can you see why?
\(^\dagger\) To see this concretely, let’s try a code implementation (validated against the shap package below). First, initialize the binary mask matrix \(z_{md}\), kernel weights, and feature-removed inputs:
mask_matrix = np.zeros((M, D))
kernel_weights = np.zeros(M)
synth_data = np.tile(data, (M, 1))

\(^\dagger\) Next we draw random subsets \(S_m\) and the corresponding feature-removed data \((x_{S_m}, x_{i,\bar{S}_m})\):
for i in range(M):
    mask = np.random.choice([0, 1], size=D)
    mask_matrix[i] = mask
    synth_data[i * N:(i + 1) * N, mask == 1] = x[0, mask == 1]  # explaining x[0]
    # weights for S = emptyset and S = D are handled by the constraints
    if 0 < mask.sum() < D:
        kernel_weights[i] = 1 / (comb(D, mask.sum()) * mask.sum() * (D - mask.sum()))

We evaluate \(f\) on the feature-removed data. Here f_bar estimates \(v(\emptyset)\).
f_bar = np.mean(model(data))
f = model(synth_data)
v = np.mean(f.reshape(M, N, -1), axis=1)

Exercise: In the line synth_data[i, ...], how many entries of synth_data get modified? What is the interpretation?
\(^\dagger\) Finally, we run the weighted regression, incorporating the efficiency constraints above. Since \(\beta_D = \Delta - \sum_{d= 1}^{D - 1}\beta_d\), we can substitute into Equation 4, \[ v(S_m) - v(\emptyset) - z_{mD}\Delta \approx \sum_{d = 1}^{D - 1}\left(z_{md}- z_{mD}\right)\beta_{d}. \] The left-hand side is y below; the columns \(z_{md} - z_{mD}\) on the right-hand side form X.
y = (v - f_bar).flatten() - mask_matrix[:, -1] * (model(x) - f_bar)
X = mask_matrix - mask_matrix[:, -1][:, None]
lm = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=kernel_weights)
# final shapley values
phi = np.zeros(D)
phi[:-1] = lm.coef_[:-1]
phi[-1] = (model(x) - f_bar) - np.sum(lm.coef_)

\(^\dagger\) So far, we’ve only explained a single observation \(x\). In practice, we want attributions for many samples \(\{x_i\}_{i =1}^{N}\). An important insight is that attributions across observations are often related, so we can pool our estimates across all samples. This lets us tolerate rougher estimates for each individual sample.
\(^\dagger\) Amortization frames parameter estimation as supervised learning. For concreteness, consider noisy KernelSHAP estimates using small \(M\), \[ \hat{\varphi}_d^{M}(x_i) = \arg \min_{\beta \in \reals^{D + 1}} \sum_{m = 1}^{M} w(S_m) \left(v(S_m) - \beta_0 - \sum_{d \in S_m}\beta_{d}\right)^{2} \] Then train a model \(a_\theta : \reals^D \to \reals^D\) (e.g. linear or neural network) to predict these estimates, \[ \hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{N} \left\|\hat{\varphi}^M(x_i) - a_\theta(x_i)\right\|^2. \] This is now like a supervised learning problem where the “labels” are \(\hat{\varphi}^{M}(x_i)\) and the input features are \(x_i\).
\(^\dagger\) Once \(\hat{\theta}\) has been found, we can very quickly get attributions for a new sample \(x^\ast\) by just calling \(a_{\hat{\theta}}(x^\ast)\). We don’t even need to sample any subsets \(S_m\) or evaluate any \(v(S_m)\). Moreover, the learned model can smooth over the noise in the per-sample estimates, leading to \(a_{\hat{\theta}}(x^*)\) that approximates \(\varphi(x^*)\) more accurately than \(\hat{\varphi}^M(x^*)\).
\(^\dagger\) Some caveats. Training \(a_{\theta}\) can itself be computationally intensive. Also, if the labels \(\hat{\varphi}^M(x_i)\) are too noisy, the learned mapping may be inaccurate. Nonetheless, early research suggests that even moderate \(M\) and practical \(a_{\theta}\) can learn \(\varphi(x)\) with high fidelity.
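\(^\dagger\) To make the idea concrete, here is a sketch of the amortization loop. The linear ground truth, the synthetic noisy labels standing in for \(\hat{\varphi}^M(x_i)\), and the choice of a linear \(a_\theta\) are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Illustrative ground truth: for a linear model with baseline x' = 0,
# phi(x) = beta * x elementwise (see the model-specific section).
beta = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
phi_exact = X * beta

# Stand-in for noisy small-M KernelSHAP labels phi_hat^M(x_i)
phi_hat = phi_exact + rng.normal(scale=0.5, size=phi_exact.shape)

# Amortization: fit a_theta (here linear) to predict the noisy labels from x
a_theta = LinearRegression().fit(X, phi_hat)

# Pooling across samples smooths over the per-sample noise
err_labels = np.abs(phi_hat - phi_exact).mean()
err_amortized = np.abs(a_theta.predict(X) - phi_exact).mean()
print(round(err_labels, 3), round(err_amortized, 3))
```

At a new \(x^\ast\), `a_theta.predict` returns attributions with no further sampling, illustrating the amortized speedup.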
Within specific model classes, we can derive closed-form SHAP formulas that bypass enumeration over subsets. For example, in your HW3, you will show that linear models \(f(x) = \beta_0 + \sum_{d} \beta_{d}x_{d}\) with baseline removal satisfy, \[ \varphi_{d}(x) = \beta_{d}\left(x_{d} - x_{d}'\right) \] The attribution only depends on feature \(d\)’s deviation from the baseline and coefficient \(\beta_d\). A similar derivation shows that for marginal feature removal, \[ \varphi_{d}(x) = \beta_{d}\left(x_{d} - \Esubarg{p(x)}{X_{d}}\right) \] so using the marginal removal is the same as using the sample mean as a baseline.
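This closed form is easy to verify against brute-force enumeration of Equation 1 with baseline removal (the coefficients and baseline below are arbitrary):

```python
import numpy as np
from itertools import combinations
from math import comb

beta0, beta = 1.0, np.array([2.0, -1.0, 0.5])
x  = np.array([1.0, 2.0, 3.0])
xp = np.array([0.5, 0.0, -1.0])               # baseline x'

def f(z):
    return beta0 + beta @ z

def v(S):
    """Baseline removal: features outside S are set to the baseline."""
    z = xp.copy()
    z[list(S)] = x[list(S)]
    return f(z)

# brute-force Shapley values via Equation 1
D = len(x)
phi = np.zeros(D)
for d in range(D):
    others = [j for j in range(D) if j != d]
    for k in range(D):
        for S in combinations(others, k):
            phi[d] += (v(set(S) | {d}) - v(S)) / (D * comb(D - 1, k))

print(phi.round(6))   # matches beta * (x - xp)
```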
There are two other important model-specific methods,

- TreeSHAP for tree ensembles, implemented as TreeExplainer in the shap python package and the treeshap R package.
- DeepSHAP for deep networks, implemented as the DeepExplainer class in shap and the DeepSHAP class in the innsight R package.

This example shows that the KernelSHAP sketch above indeed agrees with the official implementation. We’ve copied our code into the explain method in the class below. shap_values simply loops explain over many samples.
import numpy as np
from math import comb
from sklearn.linear_model import LinearRegression

class SimpleKernelExplainer:
    def __init__(self, model, data):
        self.model = model
        self.data = data
        self.N, self.D = data.shape
        self.f_bar = np.mean(model(data))

    def shap_values(self, X, M):
        explanations = []
        for i in range(X.shape[0]):
            explanations.append(self.explain(X[i:(i + 1)], M))
        return np.array(explanations)

    def explain(self, x, M):
        # initialize data structures
        mask_matrix = np.zeros((M, self.D))
        kernel_weights = np.zeros(M)
        synth_data = np.tile(self.data, (M, 1))

        # sample random subsets
        for i in range(M):
            mask = np.random.choice([0, 1], size=self.D)
            mask_matrix[i] = mask
            synth_data[i * self.N:(i + 1) * self.N, mask == 1] = x[0, mask == 1]
            # edge cases |S| = 0 and |S| = D are handled by the constraints
            if 0 < mask.sum() < self.D:
                kernel_weights[i] = 1 / (comb(self.D, mask.sum()) * mask.sum() * (self.D - mask.sum()))

        # model predictions on the synthetic data
        model_out = self.model(synth_data)
        v = np.mean(model_out.reshape(M, self.N, -1), axis=1)

        # weighted linear regression on the adjusted values
        X = mask_matrix - mask_matrix[:, -1][:, None]
        y = (v - self.f_bar).flatten() - mask_matrix[:, -1] * (self.model(x) - self.f_bar)
        lm = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=kernel_weights)

        # return shapley approximation
        phi = np.zeros(self.D)
        phi[:-1] = lm.coef_[:-1]
        phi[-1] = (self.model(x) - self.f_bar) - np.sum(lm.coef_)
        return phi

We’ll test this implementation using a random forest model trained on the make_friedman1 dataset (Friedman 1991).
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_friedman1
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.1)
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)
We run both our simple and the official explainer in the block below.
import shap
explainer = SimpleKernelExplainer(model.predict, X)
official_explainer = shap.KernelExplainer(model.predict, X)
shap_values = explainer.shap_values(X, M=1000)
official_shap_values = official_explainer.shap_values(X, nsamples=1000, silent=True)

There’s randomness in the estimates, but they’re close.
print("Package: ", np.round(official_shap_values[:5], 4))

Package: [[ 0.2954 -4.0058 -0.366 -2.3923 0.3805]
[ 0.9994 -2.1396 -0.303 -2.932 0.3207]
[-0.2326 -1.7868 0.6939 3.8895 0.6953]
[ 1.2317 -0.24 0.5055 5.1771 0.7091]
[-0.364 -0.1141 0.0555 1.9114 0.6093]]
print("Ours: ", np.round(shap_values[:5], 4))

Ours: [[ 0.3099 -3.9995 -0.3584 -2.4307 0.3906]
[ 0.9947 -2.1426 -0.3155 -2.9009 0.3098]
[-0.2011 -1.7675 0.6592 3.867 0.7018]
[ 1.2755 -0.2798 0.5157 5.1571 0.7149]
[-0.3609 -0.107 0.0509 1.9142 0.6008]]
Compared to our earlier notes, we’re allowing for a background subsample of size \(B < N\).