SHAP Computation

Author

Kris Sankaran

Published

March 2, 2026

\[ \newcommand{\bs}[1]{\mathbf{#1}} \newcommand{\reals}{\mathbb{R}} \newcommand{\widebar}[1]{\overline{#1}} \newcommand{\E}{\mathbb{E}} \newcommand{\Earg}[1]{\mathbb{E}\left[{#1}\right]} \newcommand{\Esubarg}[2]{\mathbb{E}_{#1}\left[{#2}\right]} \]

Readings: 1 (required), 2 (optional), Code

Items marked \(^{\dagger}\) are not in the required reading and will not be tested. The required reading covers some methods beyond this handout; those also won’t be tested.

Setup

Goal. Given a model \(f\) and a sample \(x\in\reals^{D}\), efficiently compute the SHAP value \(\varphi_d(x, f)\), the contribution of feature \(d\) to the prediction \(f(x)\).

Approach.

There are two general strategies for efficient Shapley value computation,

  • Model-agnostic: Avoid assumptions about the model \(f\). Approximate \(\varphi_{d}(x)\) in a way that trades off statistical accuracy for computation speed.
  • Model-specific: Assume a particular form for \(f\) (e.g., linear or tree ensemble) and derive a closed-form formula that bypasses costly enumerations. This is often more accurate but is less general.

Challenge

  1. Let’s first see why naive SHAP computation is slow. Recall the key quantities,

    • Feature \(d\)’s attribution, \[ \varphi_d(f, x) = \sum_{S \subseteq \mathcal{D} - \{d\}} \frac{1}{D {D - 1 \choose \left|S\right|}} C(d \vert S) \tag{1}\] where \(C(d \mid S) = v(S \cup \{d\}) - v(S)\) is the marginal contribution of \(d\) to coalition \(S\). Both \(C\) and \(v\) depend on \(f\) and \(x\), but we suppress this from our notation. We also write \(\varphi_d(x)\) instead of \(\varphi_d(f, x)\), since \(f\) is fixed in these notes.

    • The value function \(v(S)\) evaluates \(f\) on \(x\) after “removing” the features outside of \(S\). We covered three removal strategies:

      • Baseline \(v(S) = f(x_S, x'_{\bar{S}})\)
      • Marginal\(^{1}\): \(v(S) = \mathbb{E}_{p(X_{\bar{S}})}[f(x_S, X_{\bar{S}})] \approx \frac{1}{B}\sum_{i=1}^{B} f(x_S, x_{i,\bar{S}})\)
      • Conditional: \(v(S) = \mathbb{E}_{p(X_{\bar{S}} \mid X_S = x_S)}[f(x_S, X_{\bar{S}})]\)
  2. To compute all \(D\) attributions for one sample, we need \(v(S)\) for every \(S \subseteq \mathcal{D}\). Using marginal feature removal, each \(v(S)\) costs \(B\) evaluations of \(f\). Let’s try a Fermi calculation. Take \(N = 100, D = 10, B = 100\). Attribution across all samples and features requires \[10^2 \text{ samples} \times 2^{10} \text{ subsets} \times 10^2 \text{ calls per subset} \approx 10^7 \text{ evaluations}.\] Supposing 1ms per call, this takes \(\approx 3\) hours (and this is only 100 samples with 10 features). More generally, marginal removal costs \(NB\,2^D\) evaluations.

    Exercise: What are the relative costs of marginal vs. baseline feature removal?
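The Fermi estimate above is easy to check directly. A quick sketch (the 1 ms/call figure is the same assumption as in the text):

```python
# Back-of-the-envelope cost of naive SHAP with marginal removal:
# N samples x 2^D subsets x B background points per subset.
N, D, B = 100, 10, 100
n_evals = N * 2**D * B           # total model evaluations
hours = n_evals * 1e-3 / 3600    # assuming 1 ms per model call
print(f"{n_evals:.1e} evaluations, about {hours:.1f} hours")
```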

Sampling Perspective

  1. The weights in Equation 1 define a probability distribution, \[ \mathbb{P}(S) = \frac{1}{D \binom{D-1}{|S|}}, \] because they are nonnegative and sum to one. This lets us rewrite \(\varphi_d(x)\) as an expectation, \[\begin{align} \label{eq-distribution} \varphi_d(x) &= \sum_{S \subseteq \mathcal{D} \setminus \{d\}} \mathbb{P}(S)\, C(d \mid S) = \mathbb{E}_{\mathbb{P}(S)}[C(d \mid S)]. \end{align}\] i.e., the Shapley value is the expected contribution of feature \(d\) to a random coalition.

    Instead of enumerating all \(2^{D-1}\) subsets to get the exact expectation, we sample \(S_1, \dots, S_M \sim \mathbb{P}(S)\) and estimate \[ \hat{\varphi}_d(x) = \frac{1}{M}\sum_{m=1}^{M} C(d \mid S_m). \tag{2}\] This is unbiased (\(\mathbb{E}[\hat{\varphi}_d(x)] = \varphi_d(x)\)) and increasing \(M\) improves the approximation.

    Exercise: Suppose \(\mathcal{D} = \{1, \dots, 4\}\) and \(d = 1\). What are the \(\mathbb{P}(S)\) for the sets in Equation 1?

  2. To draw \(S_{m} \sim \mathbb{P}(S)\) in practice, we use random permutations. Draw \(\pi : \{1, \dots, D\} \to \{1, \dots, D\}\) uniformly at random. For example, for 10 features we could use,

    pi <- sample(1:10)
    pi
     [1]  9  1  6  5  3  4  7 10  8  2

    Let \(\text{Pre}^{d}\left(\pi\right)\) denote the set of features appearing before \(d\) in \(\pi\). E.g., for \(d = 4\),

     pi[seq_len(match(4, pi) - 1)]
    [1] 9 1 6 5 3

    It’s not obvious, but \(\text{Pre}^d\left(\pi\right)\) gives a sample from \(\mathbb{P}\left(S\right)\).
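One way to make this plausible is to check a necessary condition numerically. Under \(\mathbb{P}(S)\), each coalition size \(k\) has total probability \(\binom{D-1}{k} \cdot \frac{1}{D\binom{D-1}{k}} = \frac{1}{D}\), so \(|S|\) should be uniform on \(\{0, \dots, D-1\}\). A sketch in Python (the helper name `pre` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 4, 1

def pre(pi, d):
    """Features appearing before d in the permutation pi."""
    pos = np.flatnonzero(pi == d)[0]
    return pi[:pos]

# Empirical distribution of |Pre^d(pi)| over many random permutations.
sizes = [len(pre(rng.permutation(D) + 1, d)) for _ in range(20_000)]
freqs = np.bincount(sizes, minlength=D) / len(sizes)
print(np.round(freqs, 3))   # each entry close to 1/D = 0.25
```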

  3. To summarize, the procedure is,

    for m = 1, ..., M:
      sample a random permutation π_m over D features
      set S_m = Pre^d(π_m)
      compute C(d | S_m)
    return average of C(d | S_m) over m

    This is often called IME (“Interactions-based Method for Explanation”). Larger \(M\) gives better approximations, at higher computational cost.
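The loop above takes only a few lines of Python. In this sketch the toy model, baseline, and helper names are my own; I use baseline removal with a small interaction model so the exact answers are known (\(\varphi_0 = 0.5\) and \(\varphi_2 = 2\)):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
x, x_base = np.ones(D), np.zeros(D)
f = lambda z: z[0] * z[1] + 2 * z[2]      # toy model with an interaction

def v(S):
    # baseline removal: features outside S are set to the baseline value
    z = x_base.copy()
    z[list(S)] = x[list(S)]
    return f(z)

def ime(d, M=2_000):
    total = 0.0
    for _ in range(M):
        pi = rng.permutation(D)
        S = pi[:np.flatnonzero(pi == d)[0]].tolist()
        total += v(S + [d]) - v(S)        # C(d | S_m)
    return total / M

print(ime(0), ime(2))   # exact values are 0.5 and 2.0
```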

  4. Adaptive sampling. The terms \(C\left(d \vert S_{m}\right)\) are random variables with mean \(\varphi_{d}\left(x\right)\).

    (picture of their histograms across \(d\))

    For features that are irrelevant, the contributions are tightly clustered near zero, so even small values of \(M\) are enough. Inspired by this observation, adaptive sampling tracks the variance of \(C(d \mid S_m)\) across features and allocates more samples to high-variance features.
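One way to make this concrete (entirely a sketch; the allocation rule and function names are my own): track a running variance of the contributions per feature and give each new sample to the feature with the largest estimated standard error.

```python
import numpy as np

def adaptive_shap(draw_C, D, budget, warmup=10, seed=0):
    """Allocate `budget` draws of C(d | S) across features, favoring
    features whose contributions have the largest standard error."""
    rng = np.random.default_rng(seed)
    n, mean, M2 = np.zeros(D), np.zeros(D), np.zeros(D)

    def update(d):
        c = draw_C(d, rng)
        n[d] += 1
        delta = c - mean[d]
        mean[d] += delta / n[d]              # Welford running mean/variance
        M2[d] += delta * (c - mean[d])

    for d in range(D):                       # a few warmup draws per feature
        for _ in range(warmup):
            update(d)
    for _ in range(budget - warmup * D):
        se2 = M2 / np.maximum(n - 1, 1) / n  # estimated squared standard error
        update(int(np.argmax(se2)))
    return mean, n

# Toy contributions: feature 0 is noisy, the rest are constant (irrelevant).
draw = lambda d, rng: rng.normal(1.0, 1.0) if d == 0 else 0.0
mean, n = adaptive_shap(draw, D=4, budget=400)
print(n)   # most of the budget goes to feature 0
```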

  5. Antithetic sampling. Recall that \[ \text{Var}(Z_1 + Z_2) = \text{Var}(Z_1) + \text{Var}(Z_2) + 2\,\text{Cov}(Z_1, Z_2), \] so negatively correlated terms reduce the variance of the sum.

    Exercise: If \(\text{Cor}(Z_1, Z_2) = -1\) and \(Z_1, Z_2\) have the same variance, what is \(\text{Var}(Z_1 + Z_2)\)?

    More generally, \[\text{Var}\!\left(\sum_{m=1}^M Z_m\right) = \sum_m \text{Var}(Z_m) + \sum_{m \neq m'} \text{Cov}(Z_m, Z_{m'}),\] so again negative pairwise correlations reduce variance.

  6. Applying this to Equation 2, we can induce negative correlation between \(C\left(d \vert S_m\right)\) and \(C\left(d \vert S_{m'}\right)\) by pairing each \(S_m\) with its complement,

    • Sample \(S_m \sim \mathbb{P}(S)\)
    • Set \(S_{m'} = \bar{S}_{m}\), the complement of \(S_m\) in \(\mathcal{D} \setminus \{d\}\).

    Despite its simplicity, this yields state-of-the-art variance reductions (Mitchell et al. 2022).
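A sketch of the pairing (the toy model and helper names are my own). With \(f(z) = z_0 z_1\), baseline removal, and \(d = 0\), the contribution is \(C(0 \mid S) = \mathbf{1}[1 \in S]\); exactly one of \(S\) and its complement contains feature 1, so each antithetic pair sums to one and the paired estimator has zero variance:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 0
x = np.ones(D)
f = lambda z: z[0] * z[1]                  # toy model; baseline x' = 0

def C(S):
    z0, z1 = np.zeros(D), np.zeros(D)
    z0[list(S)] = x[list(S)]
    z1[list(S) + [d]] = x[list(S) + [d]]
    return f(z1) - f(z0)                   # C(d | S)

def estimate(M, antithetic):
    total = 0.0
    for _ in range(M // 2):
        pi = rng.permutation(D)
        S = pi[:np.flatnonzero(pi == d)[0]].tolist()
        total += C(S)
        if antithetic:                     # pair S with its complement in D \ {d}
            total += C(np.setdiff1d(np.arange(D), S + [d]).tolist())
        else:                              # independent second draw
            pi = rng.permutation(D)
            total += C(pi[:np.flatnonzero(pi == d)[0]].tolist())
    return total / M

print(estimate(100, False), estimate(100, True))  # both near 0.5; antithetic is exact here
```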

Weighted Least Squares

  1. A classical result in game theory (Charnes et al. 1988) shows that SHAP values can be characterized through a weighted least squares problem, \[ \varphi\left(x\right) = \arg \min_{\beta_1, \dots, \beta_D} \min_{\beta_0 \in \reals} \sum_{S \subseteq \mathcal{D}} w(S) \left(v(S) - \beta_0 - \sum_{d \in S}\beta_{d}\right)^2 \tag{3}\] where \(w(S) = \left[\binom{D}{|S|}\, |S|\,(D - |S|)\right]^{-1}\), interpreted as an infinite weight at \(S = \emptyset\) and \(S = \mathcal{D}\), and \(\varphi(x) = (\varphi_1(x), \dots, \varphi_D(x))^\top\).

  2. This characterization still enumerates \(2^D\) subsets, but it suggests an approximation: sample random subsets \(S_1, \dots, S_M\) and solve the weighted regression on the reduced design. This is “KernelSHAP.”

  3. Let \(z_{md} = \mathbf{1}[d \in S_m]\) define the associated design matrix. The regression suggests \(v(\emptyset) \approx \beta_0\) and \[ v(S_m) - v(\emptyset) \approx \sum_{d = 1}^{D} z_{md}\beta_d. \tag{4}\] The efficiency constraint becomes, \[ v(\mathcal{D}) - v(\emptyset) \approx \sum_{d = 1}^{D} \beta_d =: \Delta. \] We can improve the quality of our approximation by enforcing these constraints, substituting \(v(\emptyset) = \Esubarg{p(x)}{f(X)} \approx \frac{1}{N}\sum_{i = 1}^{N} f(x_i)\) and \(v(\mathcal{D}) = f(x).\)

    Exercise: In fact, the regression objective ensures that \(v\left(\emptyset\right) = \beta_0\) and \(v(\mathcal{D}) - v(\emptyset) = \Delta\) hold exactly. Can you see why?

  4. \(^\dagger\) To see this concretely, let’s try a code implementation (validated against the shap package below). First, initialize the binary mask matrix \(z_{md}\), kernel weights, and feature-removed inputs:

    mask_matrix = np.zeros((M, D))
    kernel_weights = np.zeros(M)
    synth_data = np.tile(data, (M, 1))
  5. \(^\dagger\) Next we draw random subsets \(S_m\) and the corresponding feature-removed data \((x_{S_m}, x_{i,\bar{S}_m})\):

    for i in range(M):
        mask = np.random.choice([0, 1], size=D)
        mask_matrix[i] = mask
        synth_data[i * N:(i + 1) * N, mask == 1] = x[0, mask == 1]  # explaining x[0]
        # S = empty and S = full are handled by the constraints below
        if 0 < mask.sum() < D:
            kernel_weights[i] = 1 / (comb(D, mask.sum()) * mask.sum() * (D - mask.sum()))

    We evaluate \(f\) on the feature-removed data. Here f_bar estimates \(v(\emptyset)\).

    f_bar = np.mean(model(data))
    preds = model(synth_data)
    v = np.mean(preds.reshape(M, N, -1), axis=1)

    Exercise: In the line synth_data[i * N:(i + 1) * N, mask == 1] = x[0, mask == 1], how many entries of synth_data get modified? What is the interpretation?

  6. \(^\dagger\) Finally, we run the weighted regression, incorporating the efficiency constraints from point (3). Since \(\beta_D = \Delta - \sum_{d= 1}^{D - 1} \beta_d\), we can substitute into Equation 4, \[ v(S_m) - v(\emptyset) - z_{mD}\Delta \approx \sum_{d = 1}^{D - 1}\left(z_{md}- z_{mD}\right)\beta_{d}. \] The left-hand side is y below and the right-hand side is X.

    y = (v - f_bar).flatten() - mask_matrix[:, -1] * (model(x) - f_bar)
    X = mask_matrix - mask_matrix[:, -1][:, None]
    lm = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=kernel_weights)
    
    # final shapley values
    phi = np.zeros(D)
    phi[:-1] = lm.coef_[:-1]
    phi[-1] = (model(x) - f_bar) - np.sum(lm.coef_)
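As a sanity check on this construction (a sketch; the toy model and enumeration helpers are my own), running the same regression over every nonempty proper subset with the exact weights recovers the exact Shapley values from Equation 1:

```python
import numpy as np
from itertools import combinations
from math import comb
from sklearn.linear_model import LinearRegression

D = 4
rng = np.random.default_rng(0)
x, x_base = rng.normal(size=D), rng.normal(size=D)
f = lambda z: z[0] * z[1] + z[2] ** 2 + z[3]    # toy nonlinear model, baseline removal

def v(S):
    z = x_base.copy()
    z[list(S)] = x[list(S)]
    return f(z)

# exact Shapley values by enumerating Equation 1
def exact_phi(d):
    rest = [j for j in range(D) if j != d]
    return sum(
        (v(list(S) + [d]) - v(S)) / (D * comb(D - 1, k))
        for k in range(D) for S in combinations(rest, k)
    )

# the same constrained regression, over all nonempty proper subsets
subsets = [S for k in range(1, D) for S in combinations(range(D), k)]
Z = np.array([[d in S for d in range(D)] for S in subsets], dtype=float)
w = np.array([1 / (comb(D, len(S)) * len(S) * (D - len(S))) for S in subsets])
vals = np.array([v(S) for S in subsets])
delta = v(range(D)) - v([])
y = vals - v([]) - Z[:, -1] * delta
X = Z - Z[:, [-1]]
lm = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=w)
phi = np.append(lm.coef_[:-1], delta - lm.coef_[:-1].sum())
print(np.allclose(phi, [exact_phi(d) for d in range(D)]))  # True
```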

Amortization

  1. \(^\dagger\) So far, we’ve only explained a single observation \(x\). In practice, we want attributions for many \(\{x_i\}_{i =1}^{N}\). An important insight is that attributions across observations are often related, so we can pool our estimates across all samples. This lets us tolerate rougher estimates for each individual sample.

    Notice that many samples have similar attributions.
  2. \(^\dagger\) Amortization frames parameter estimation as supervised learning. For concreteness, consider noisy KernelSHAP estimates using small \(M\): let \(\hat{\varphi}^{M}(x_i) = (\hat{\beta}_1, \dots, \hat{\beta}_D)\), where \[ \hat{\beta} = \arg \min_{\beta \in \reals^{D + 1}} \sum_{m = 1}^{M} w(S_m) \left(v(S_m) - \beta_0 - \sum_{d \in S_m}\beta_{d}\right)^{2}. \] Then train a model \(a_\theta : \reals^D \to \reals^D\) (e.g. linear or neural network) to predict these estimates, \[ \hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{N} \left\|\hat{\varphi}^M(x_i) - a_\theta(x_i)\right\|^2. \] This is now like a supervised learning problem where the “labels” are \(\hat{\varphi}^{M}(x_i)\) and the input features are \(x_i\).

  3. \(^\dagger\) Once \(\hat{\theta}\) has been found, we can very quickly get attributions for a new sample \(x^\ast\) by just calling \(a_{\hat{\theta}}(x^\ast)\). We don’t even need to sample any subsets \(S_m\) or evaluate any \(v(S_m)\). Moreover, the learned model can smooth over the noise in the per-sample estimates, leading to \(a_{\hat{\theta}}(x^*)\) that approximates \(\varphi(x^*)\) more accurately than \(\hat{\varphi}^M(x^*)\).

  4. \(^\dagger\) Some caveats. Training \(a_{\theta}\) can itself be computationally intensive. Also, if the labels \(\hat{\varphi}^M(x_i)\) are too noisy, the learned mapping may be inaccurate. Nonetheless, early research suggests that even moderate \(M\) and practical \(a_{\theta}\) can learn \(\varphi(x)\) with high fidelity.
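A minimal illustration of the idea (entirely synthetic: I stand in for the noisy small-\(M\) KernelSHAP labels with exact linear-model SHAP values plus Gaussian noise). The amortized model averages away the label noise, so its predictions end up closer to the truth than the labels it was trained on:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, D = 500, 5
beta = rng.normal(size=D)
X = rng.normal(size=(N, D))

# Exact SHAP for a linear model with marginal removal: beta_d * (x_d - mean_d)
phi_true = beta * (X - X.mean(axis=0))
# Stand-in for noisy small-M KernelSHAP estimates
phi_noisy = phi_true + rng.normal(scale=0.5, size=phi_true.shape)

a_theta = LinearRegression().fit(X, phi_noisy)   # amortized explainer
phi_hat = a_theta.predict(X)

err_labels = np.abs(phi_noisy - phi_true).mean()
err_amortized = np.abs(phi_hat - phi_true).mean()
print(err_amortized < err_labels)   # smoothing beats the raw labels
```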

Model-Specific Versions

  1. Within specific model classes, we can derive closed-form SHAP formulas that bypass enumeration over subsets. For example, in your HW3, you will show that linear models \(f(x) = \beta_0 + \sum_{d} \beta_{d}x_{d}\) with baseline removal satisfy, \[ \varphi_{d}(x) = \beta_{d}\left(x_{d} - x_{d}'\right) \] The attribution only depends on feature \(d\)’s deviation from the baseline and coefficient \(\beta_d\). A similar derivation shows that for marginal feature removal, \[ \varphi_{d}(x) = \beta_{d}\left(x_{d} - \Esubarg{p(x)}{X_{d}}\right) \] so using the marginal removal is the same as using the sample mean as a baseline.
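The second closed form is easy to verify numerically by brute-force enumeration of Equation 1 (a sketch; the random coefficients and helper names are mine). With marginal removal over a finite background sample, the linear model's attributions reduce to \(\beta_d\) times the deviation from the background mean:

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(0)
N, D = 50, 4
beta0, beta = rng.normal(), rng.normal(size=D)
f = lambda Z: beta0 + Z @ beta
bg = rng.normal(size=(N, D))        # background sample for marginal removal
x = rng.normal(size=D)

def v(S):
    # marginal removal: average f over background values of the removed features
    Z = np.tile(x, (N, 1))
    out = [j for j in range(D) if j not in S]
    Z[:, out] = bg[:, out]
    return f(Z).mean()

def phi(d):
    # enumerate Equation 1 exactly
    rest = [j for j in range(D) if j != d]
    return sum(
        (v(list(S) + [d]) - v(list(S))) / (D * comb(D - 1, k))
        for k in range(D) for S in combinations(rest, k)
    )

# matches beta_d * (x_d - background mean of feature d)
print(np.allclose([phi(d) for d in range(D)], beta * (x - bg.mean(axis=0))))  # True
```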

  2. There are two other important model-specific methods,

    • TreeSHAP: This gives exact SHAP values for tree-based models (CART, bagging, boosting, random forests). It’s available as TreeExplainer in the shap python package and the treeshap R package.
    • DeepSHAP: This is an approximation for deep learning models. It’s not exact, but has lower variance than model-agnostic approaches. It is available through the DeepExplainer class in shap and the DeepSHAP class in the innsight R package.

Code Example

  1. This example shows that the KernelSHAP sketch above indeed agrees with the official implementation. We’ve copied our code into the explain method in the class below. shap_values simply loops explain over many samples.

    import numpy as np
    from math import comb
    from sklearn.linear_model import LinearRegression
    
    class SimpleKernelExplainer:
        def __init__(self, model, data):
            self.model = model
            self.data = data
            self.N, self.D = data.shape
            self.f_bar = np.mean(model(data))
    
        def shap_values(self, X, M):
            explanations = []
            for i in range(X.shape[0]):
                explanations.append(self.explain(X[i:(i + 1)], M))
            return np.array(explanations)
    
        def explain(self, x, M):
            # initialize data structures
            mask_matrix = np.zeros((M, self.D))
            kernel_weights = np.zeros(M)
            synth_data = np.tile(self.data, (M, 1))
    
            # sample random subsets
            for i in range(M):
                mask = np.random.choice([0, 1], size=self.D)
                mask_matrix[i] = mask
                synth_data[i * self.N:(i + 1) * self.N, mask == 1] = x[0, mask == 1]
    
                # edge 0 and self.D handled by constraints
                if 0 < mask.sum() < self.D:
                    kernel_weights[i] = 1 / (comb(self.D, mask.sum()) * mask.sum() * (self.D - mask.sum()))
    
            # model predictions on the synthetic data
            model_out = self.model(synth_data)
            v = np.mean(model_out.reshape(M, self.N, -1), axis=1)
    
            # weighted linear regression on adjusted f
            X = mask_matrix - mask_matrix[:, -1][:, None]
            y = (v - self.f_bar).flatten() - mask_matrix[:, -1] * (self.model(x) - self.f_bar)
            lm = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=kernel_weights)
    
            # return shapley approximation
            phi = np.zeros(self.D)
            phi[:-1] = lm.coef_[:-1]
            phi[-1] = (self.model(x) - self.f_bar) - np.sum(lm.coef_)
            return phi
  2. We’ll test this implementation using a random forest model trained on the make_friedman1 dataset (Friedman 1991).

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.datasets import make_friedman1
    
    X, y = make_friedman1(n_samples=100, n_features=5, noise=0.1)
    model = RandomForestRegressor(n_estimators=100)
    model.fit(X, y)
  3. We run both our simple and the official explainer in the block below.

    import shap
    
    explainer = SimpleKernelExplainer(model.predict, X)
    official_explainer = shap.KernelExplainer(model.predict, X)
    
    shap_values = explainer.shap_values(X, M=1000)
    official_shap_values = official_explainer.shap_values(X, nsamples=1000, silent=True)

    There’s randomness in the estimates, but they’re close.

    print("Package: ", np.round(official_shap_values[:5], 4))
    Package:  [[ 0.2954 -4.0058 -0.366  -2.3923  0.3805]
     [ 0.9994 -2.1396 -0.303  -2.932   0.3207]
     [-0.2326 -1.7868  0.6939  3.8895  0.6953]
     [ 1.2317 -0.24    0.5055  5.1771  0.7091]
     [-0.364  -0.1141  0.0555  1.9114  0.6093]]
    print("Ours: ", np.round(shap_values[:5], 4))
    Ours:  [[ 0.3099 -3.9995 -0.3584 -2.4307  0.3906]
     [ 0.9947 -2.1426 -0.3155 -2.9009  0.3098]
     [-0.2011 -1.7675  0.6592  3.867   0.7018]
     [ 1.2755 -0.2798  0.5157  5.1571  0.7149]
     [-0.3609 -0.107   0.0509  1.9142  0.6008]]

References

Charnes, A., B. Golany, M. Keane, and J. Rousseau. 1988. “Extremal Principle Solutions of Games in Characteristic Function Form: Core, Chebychev and Shapley Value Generalizations.” In Econometrics of Planning and Efficiency, 123–33. Springer Netherlands. https://doi.org/10.1007/978-94-009-3677-5_7.
Friedman, Jerome H. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics 19 (1). https://doi.org/10.1214/aos/1176347963.
Mitchell, Rory, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. 2022. “Sampling Permutations for Shapley Value Estimation.” J. Mach. Learn. Res. 23 (1).

Footnotes

  1. Compared to our earlier notes, we’re allowing for background subsamples B < N.↩︎