Readings: 1 (required), 2 (optional), Code
Items marked \(^{\dagger}\) are not in the required reading and will not be tested. The required reading covers some methods beyond this handout; those also won’t be tested.
Goal. Given a model \(f\) and a sample \(x\in\reals^{D}\), efficiently compute the SHAP value \(\varphi_d(x, f)\), the contribution of feature \(d\) to the prediction \(f(x)\).
Approach.
There are two general strategies for efficient Shapley value computation: stochastic approximation, which replaces the exact sum with a sampling estimate, and model-specific methods, which exploit the structure of \(f\). We start with the stochastic route.
Let’s first see why naive SHAP computation is slow. Recall the key quantities,
Feature \(d\)’s attribution, \[ \varphi_d(f, x) = \sum_{S \subseteq \mathcal{D} - \{d\}} \frac{1}{D {D - 1 \choose \left|S\right|}} C(d \vert S) \tag{1}\] where \(C(d \mid S) = v(S \cup \{d\}) - v(S)\) is the marginal contribution of \(d\) to coalition \(S\). Both \(C\) and \(v\) depend on \(f\) and \(x\), but we suppress this from our notation. We also write \(\varphi_d(x)\) instead of \(\varphi_d(f, x)\), since \(f\) is fixed in these notes.
The value function \(v(S)\) evaluates \(f\) on \(x\) with the features outside of \(S\) “removed.” We covered three removal strategies in earlier notes.
To compute all \(D\) attributions for one sample, we need \(v(S)\) for every \(S \subseteq \mathcal{D}\). Using marginal feature removal, each \(v(S)\) costs \(B\) evaluations of \(f\). Let’s try a Fermi calculation. Take \(N = 100, D = 10, B = 100\). Attribution across all samples and features requires \[10^2 \text{ samples} \times 2^{10} \text{ subsets} \times 10^2 \text{ calls per subset} \approx 10^7 \text{ evaluations}.\] Supposing 1ms per call, this takes \(\approx 3\) hours (and this is only 100 samples with 10 features). More generally, marginal removal costs \(NB\,2^D\) evaluations.
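The arithmetic is easy to double-check in a couple of lines (the 1 ms/call figure is the assumption from the text):

```python
N, D, B = 100, 10, 100          # samples, features, background size
evals = N * (2 ** D) * B        # marginal removal: N x 2^D subsets x B calls each
hours = evals * 1e-3 / 3600     # assuming 1 ms per model evaluation
print(f"{evals:.1e} evaluations, about {hours:.1f} hours")
```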
Exercise: What are the relative costs of marginal vs. baseline feature removal?
The weights in Equation 1 define a probability distribution, \[ \mathbb{P}(S) = \frac{1}{D \binom{D-1}{|S|}}, \] because they are nonnegative and sum to one. This lets us rewrite \(\varphi_d(x)\) as an expectation, \[\begin{align} \label{eq-distribution} \varphi_d(x) &= \sum_{S \subseteq \mathcal{D} \setminus \{d\}} \mathbb{P}(S)\, C(d \mid S) = \mathbb{E}_{\mathbb{P}(S)}[C(d \mid S)]. \end{align}\] i.e., the Shapley value is the expected contribution of feature \(d\) to a random coalition.
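We can confirm numerically that the weights sum to one. A small enumeration for \(D = 4\) and \(d = 1\) (the specific indices are illustrative):

```python
from itertools import combinations
from math import comb

D, d = 4, 1
others = [j for j in range(1, D + 1) if j != d]   # the features in D \ {d}
total = 0.0
for k in range(D):                                # |S| ranges over 0, ..., D - 1
    for S in combinations(others, k):
        total += 1 / (D * comb(D - 1, k))
print(total)   # ≈ 1.0
```

Note that for each size \(k\), the \(\binom{D-1}{k}\) subsets contribute \(1/D\) in total, so \(|S|\) is uniform over \(\{0, \dots, D-1\}\).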
Instead of enumerating all \(2^{D-1}\) subsets to get the exact expectation, we sample \(S_1, \dots, S_M \sim \mathbb{P}(S)\) and estimate \[ \hat{\varphi}_d(x) = \frac{1}{M}\sum_{m=1}^{M} C(d \mid S_m). \tag{2}\] This is unbiased (\(\mathbb{E}[\hat{\varphi}_d(x)] = \varphi_d(x)\)) and increasing \(M\) improves the approximation.
Exercise: Suppose \(\mathcal{D} = \{1, \dots, 4\}\) and \(d = 1\). What are the \(\mathbb{P}(S)\) for the sets in Equation 1?
To draw \(S_{m} \sim \mathbb{P}(S)\) in practice, we use random permutations. Draw \(\pi : \{1, \dots, D\} \to \{1, \dots, D\}\) uniformly at random. For example, for 10 features we could use,
pi <- sample(1:10)
pi
[1] 9 1 6 5 3 4 7 10 8 2
Let \(\text{Pre}^{d}\left(\pi\right)\) denote the features appearing before \(d\) in \(\pi\). E.g., for \(d = 4\),
pi[seq_len(match(4, pi) - 1)]
[1] 9 1 6 5 3
It’s not obvious, but \(\text{Pre}^d\left(\pi\right)\) gives a sample from \(\mathbb{P}\left(S\right)\): the position of \(d\) in \(\pi\) is uniform, so \(\left|\text{Pre}^d(\pi)\right|\) is uniform over \(\{0, \dots, D - 1\}\), and conditional on its size the prefix is uniform among subsets of that size, which is exactly \(\mathbb{P}(S)\).
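One way to convince yourself is empirically: sample many permutations and compare the prefix-set frequencies with \(\mathbb{P}(S)\). A sketch with \(D = 4\) and \(d = 1\):

```python
import random
from math import comb
from collections import Counter

random.seed(0)
D, d, M = 4, 1, 200_000
counts = Counter()
for _ in range(M):
    perm = random.sample(range(1, D + 1), D)   # a uniformly random permutation
    S = frozenset(perm[:perm.index(d)])        # Pre^d(pi)
    counts[S] += 1

# empirical frequency vs. P(S) = 1 / (D * C(D-1, |S|))
for S, c in sorted(counts.items(), key=lambda t: len(t[0])):
    print(sorted(S), round(c / M, 3), round(1 / (D * comb(D - 1, len(S))), 3))
```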
To summarize, the procedure is,
for m = 1, ..., M:
    sample a random permutation π_m over D features
    set S_m = Pre^d(π_m)
    compute C(d | S_m)
return average of C(d | S_m) over m
This is often called IME (“Interactions-based Method for Explanation”). Larger \(M\) gives better approximations, but at higher computational cost.
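The procedure translates directly into Python. Everything below is illustrative: the toy model with an interaction, the Gaussian background data, and marginal removal implemented by overwriting background columns are assumptions for the sketch, not part of any package.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(X):
    # toy model with an interaction between features 1 and 2
    return X[:, 0] + 2 * X[:, 1] * X[:, 2]

def v(S, x, background):
    """Marginal removal: features outside S are replaced by background draws."""
    Z = background.copy()
    Z[:, S] = x[S]                  # pin features in S to their observed values
    return model(Z).mean()

def ime(x, d, background, M=1000):
    """Permutation-sampling (IME) estimate of phi_d(x)."""
    D = x.shape[0]
    contribs = np.empty(M)
    for m in range(M):
        perm = rng.permutation(D)
        pos = int(np.where(perm == d)[0][0])
        S = perm[:pos]              # Pre^d(pi_m)
        contribs[m] = v(np.append(S, d).astype(int), x, background) - v(S, x, background)
    return contribs.mean()

background = rng.normal(size=(200, 3))
x = np.array([1.0, 0.5, -2.0])
phis = [ime(x, d, background) for d in range(3)]
print([round(p, 2) for p in phis])
```

The estimates satisfy efficiency approximately: they sum to roughly \(f(x) - \frac{1}{B}\sum_i f(x_i)\).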
Adaptive sampling. The terms \(C\left(d \vert S_{m}\right)\) are random variables with mean \(\varphi_{d}\left(x\right)\).
(picture of their histograms across \(d\))
For features that are irrelevant, the contributions are tightly clustered near zero, so even small values of \(M\) are enough. Inspired by this observation, adaptive sampling tracks the variance of \(C(d \mid S_m)\) across features and allocates more samples to high-variance features.
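A minimal sketch of such an allocation rule. The callable `draw_contribution`, the warm-up size, and the toy illustration below are all hypothetical:

```python
import numpy as np

def adaptive_allocate(draw_contribution, D, budget, warmup=10):
    """Spend `budget` draws of C(d | S_m), favoring high-variance features."""
    samples = [[draw_contribution(d) for _ in range(warmup)] for d in range(D)]
    for _ in range(budget - warmup * D):
        # variance of each feature's running-mean estimate
        sem = [np.var(s, ddof=1) / len(s) for s in samples]
        d = int(np.argmax(sem))     # the most uncertain feature gets the next draw
        samples[d].append(draw_contribution(d))
    return [np.mean(s) for s in samples], [len(s) for s in samples]

# Illustration: feature 0 is irrelevant (contributions exactly 0),
# feature 1 has noisy contributions centered at 1.
rng = np.random.default_rng(0)
draw = lambda d: 0.0 if d == 0 else rng.normal(1.0, 1.0)
est, n_draws = adaptive_allocate(draw, D=2, budget=200)
print(est, n_draws)   # nearly all draws go to feature 1
```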
Antithetic sampling. Recall that \[ \text{Var}(Z_1 + Z_2) = \text{Var}(Z_1) + \text{Var}(Z_2) + 2\,\text{Cov}(Z_1, Z_2), \] so negatively correlated terms reduce the variance of the sum.
Exercise: If \(\text{Cor}(Z_1, Z_2) = -1\) and \(Z_1, Z_2\) have the same variance, what is \(\text{Var}(Z_1 + Z_2)\)?
More generally, \[\text{Var}\!\left(\sum_{m=1}^M Z_m\right) = \sum_m \text{Var}(Z_m) + \sum_{m \neq m'} \text{Cov}(Z_m, Z_{m'}),\] so again negative pairwise correlations reduce variance.
Applying this to Equation 2, we can induce negative correlation between \(C\left(d \mid S_m\right)\) and \(C\left(d \mid S_{m'}\right)\) by pairing each \(S_m\) with its complement \(\mathcal{D} \setminus \left(S_m \cup \{d\}\right)\), i.e., drawing each permutation together with its reversal.
Despite its simplicity, this yields state-of-the-art variance reductions (Mitchell et al. 2022).
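Concretely, reversing a permutation turns the prefix before \(d\) into the complementary suffix. A sketch, where `contribution` is a caller-supplied stand-in for \(C(d \mid S)\):

```python
import numpy as np

def antithetic_estimate(contribution, d, D, M, seed=0):
    """Average C(d | S_m) over M/2 antithetic pairs of permutations.

    `contribution(S)` returns C(d | S) for a set S of feature indices.
    """
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(M // 2):
        perm = rng.permutation(D)
        for p in (perm, perm[::-1]):            # a permutation and its reversal
            pos = int(np.where(p == d)[0][0])
            draws.append(contribution(set(p[:pos].tolist())))
    return float(np.mean(draws))

# Illustration: if C(d | S) = 1 + |S|, the two prefixes in each pair have sizes
# summing to D - 1, so every pair averages to exactly 1 + (D - 1) / 2.
phi_hat = antithetic_estimate(lambda S: 1 + len(S), d=2, D=5, M=10)
print(phi_hat)   # 3.0, with zero sampling variance
```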
A classical result in game theory (Charnes et al. 1988) shows that SHAP values can be characterized through a weighted least squares problem, \[ \varphi\left(x\right) = \arg \min_{\beta \in \reals^{D + 1}} \sum_{S \subset \mathcal{D}} w(S) \left(v(S) - \beta_0 - \sum_{d \in S}\beta_{d}\right)^2 \tag{3}\] where \(w(S) = \left[\binom{D}{|S|} |S|(D - |S|)\right]^{-1}\) and \(\varphi(x) = (\varphi_1(x), \dots, \varphi_D(x))^\top\).
This characterization still enumerates \(2^D\) subsets, but it suggests an approximation: sample random subsets \(S_1, \dots, S_M\) and solve the weighted regression on the reduced design. This is “KernelSHAP.”
Let \(z_{md} = \mathbf{1}[d \in S_m]\) be the associated design matrix. The regression suggests \(v(\emptyset) \approx \beta_0\) and \[ v(S_m) - v(\emptyset) \approx \sum_{d = 1}^{D} z_{md}\beta_d. \tag{4}\] The efficiency constraint becomes, \[ v(\mathcal{D}) - v(\emptyset) \approx \sum_{d = 1}^{D} \beta_d =: \Delta. \] We can improve the quality of our approximation by enforcing these constraints, substituting \(v(\emptyset) = \Esubarg{p(x)}{f(X)} \approx \frac{1}{N}\sum_{i = 1}^{N} f(x_i)\) and \(v(\mathcal{D}) = f(x).\)
Exercise: In fact, the regression objective ensures that \(v\left(\emptyset\right) = \beta_0\) and \(v(\mathcal{D}) - v(\emptyset) = \Delta\) hold exactly. Can you see why?
\(^\dagger\) To see this concretely, let’s try a code implementation (validated against the shap package below). First, initialize the binary mask matrix \(z_{md}\), kernel weights, and feature-removed inputs:
mask_matrix = np.zeros((M, D))
kernel_weights = np.zeros(M)
synth_data = np.tile(data, (M, 1))

\(^\dagger\) Next we draw random subsets \(S_m\) and the corresponding feature-removed data \((x_{S_m}, x_{i,\bar{S}_m})\):
for i in range(M):
    mask = np.random.choice([0, 1], size=D)
    mask_matrix[i] = mask
    synth_data[i * N:(i + 1) * N, mask == 1] = x[0, mask == 1]  # explaining x[0]
    # weights for S = emptyset and S = D are handled by the constraints
    if 0 < mask.sum() < D:
        kernel_weights[i] = 1 / (comb(D, mask.sum()) * mask.sum() * (D - mask.sum()))

We evaluate \(f\) on the feature-removed data. Here f_bar estimates \(v(\emptyset)\).
f_bar = np.mean(model(data))
f = model(synth_data)
v = np.mean(f.reshape(M, N, -1), axis=1)

Exercise: In the line synth_data[i, ...], how many entries of synth_data get modified? What is the interpretation?
\(^\dagger\) Finally, we run the weighted regression, incorporating the efficiency constraints above. Since \(\beta_D = \Delta - \sum_{d= 1}^{D - 1}\beta_d\), we can substitute into Equation 4, \[ v(S_m) - v(\emptyset) - z_{mD}\Delta \approx \sum_{d = 1}^{D - 1}\left(z_{md}- z_{mD}\right)\beta_{d}. \] The left-hand side is y below; the columns \(z_{md} - z_{mD}\) on the right-hand side form X.
y = (v - f_bar).flatten() - mask_matrix[:, -1] * (model(x) - f_bar)
X = mask_matrix - mask_matrix[:, -1][:, None]
lm = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=kernel_weights)
# final shapley values
phi = np.zeros(D)
phi[:-1] = lm.coef_[:-1]
phi[-1] = (model(x) - f_bar) - np.sum(lm.coef_)

\(^\dagger\) So far, we’ve only explained a single observation \(x\). In practice, we want attributions for many samples \(\{x_i\}_{i =1}^{N}\). An important insight is that attributions across observations are often related, so we can pool our estimates across all samples. This lets us tolerate rougher estimates for each individual sample.
\(^\dagger\) Amortization frames parameter estimation as supervised learning. For concreteness, consider noisy KernelSHAP estimates using small \(M\), \[ \hat{\varphi}_d^{M}(x_i) = \arg \min_{\beta \in \reals^{D + 1}} \sum_{m = 1}^{M} w(S_m) \left(v(S_m) - \beta_0 - \sum_{d \in S_m}\beta_{d}\right)^{2} \] Then train a model \(a_\theta : \reals^D \to \reals^D\) (e.g. linear or neural network) to predict these estimates, \[ \hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{N} \left\|\hat{\varphi}^M(x_i) - a_\theta(x_i)\right\|^2. \] This is now like a supervised learning problem where the “labels” are \(\hat{\varphi}^{M}(x_i)\) and the input features are \(x_i\).
\(^\dagger\) Once \(\hat{\theta}\) has been found, we can very quickly get attributions for a new sample \(x^\ast\) by just calling \(a_{\hat{\theta}}(x^\ast)\). We don’t even need to sample any subsets \(S_m\) or evaluate any \(v(S_m)\). Moreover, the learned model can smooth over the noise in the per-sample estimates, leading to \(a_{\hat{\theta}}(x^*)\) that approximates \(\varphi(x^*)\) more accurately than \(\hat{\varphi}^M(x^*)\).
\(^\dagger\) Some caveats. Training \(a_{\theta}\) can itself be computationally intensive. Also, if the labels \(\hat{\varphi}^M(x_i)\) are too noisy, the learned mapping may be inaccurate. Nonetheless, early research suggests that even moderate \(M\) and practical \(a_{\theta}\) can learn \(\varphi(x)\) with high fidelity.
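\(^\dagger\) To make the idea concrete, here is a sketch of the amortization loop. The linear ground truth, the synthetic noisy labels standing in for \(\hat{\varphi}^M(x_i)\), and the choice of a linear \(a_\theta\) are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Illustrative ground truth: for a linear model with baseline x' = 0,
# phi(x) = beta * x elementwise (see the model-specific section).
beta = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
phi_exact = X * beta

# Stand-in for noisy small-M KernelSHAP labels phi_hat^M(x_i)
phi_hat = phi_exact + rng.normal(scale=0.5, size=phi_exact.shape)

# Amortization: fit a_theta (here linear) to predict the noisy labels from x
a_theta = LinearRegression().fit(X, phi_hat)

# Pooling across samples smooths over the per-sample noise
err_labels = np.abs(phi_hat - phi_exact).mean()
err_amortized = np.abs(a_theta.predict(X) - phi_exact).mean()
print(round(err_labels, 3), round(err_amortized, 3))
```

At a new \(x^\ast\), `a_theta.predict` returns attributions with no further sampling, illustrating the amortized speedup.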
Within specific model classes, we can derive closed-form SHAP formulas that bypass enumeration over subsets. For example, in your HW3, you will show that linear models \(f(x) = \beta_0 + \sum_{d} \beta_{d}x_{d}\) with baseline removal satisfy, \[ \varphi_{d}(x) = \beta_{d}\left(x_{d} - x_{d}'\right) \] The attribution only depends on feature \(d\)’s deviation from the baseline and coefficient \(\beta_d\). A similar derivation shows that for marginal feature removal, \[ \varphi_{d}(x) = \beta_{d}\left(x_{d} - \Esubarg{p(x)}{X_{d}}\right) \] so using the marginal removal is the same as using the sample mean as a baseline.
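This closed form is easy to verify against brute-force enumeration of Equation 1 with baseline removal (the coefficients and baseline below are arbitrary):

```python
import numpy as np
from itertools import combinations
from math import comb

beta0, beta = 1.0, np.array([2.0, -1.0, 0.5])
x  = np.array([1.0, 2.0, 3.0])
xp = np.array([0.5, 0.0, -1.0])               # baseline x'

def f(z):
    return beta0 + beta @ z

def v(S):
    """Baseline removal: features outside S are set to the baseline."""
    z = xp.copy()
    z[list(S)] = x[list(S)]
    return f(z)

# brute-force Shapley values via Equation 1
D = len(x)
phi = np.zeros(D)
for d in range(D):
    others = [j for j in range(D) if j != d]
    for k in range(D):
        for S in combinations(others, k):
            phi[d] += (v(set(S) | {d}) - v(S)) / (D * comb(D - 1, k))

print(phi.round(6))   # matches beta * (x - xp)
```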
There are two other important model-specific methods,

- TreeSHAP for tree ensembles, implemented as TreeExplainer in the shap python package and the treeshap R package.
- DeepSHAP for deep networks, implemented as the DeepExplainer class in shap and the DeepSHAP class in the innsight R package.

This example shows that the KernelSHAP sketch above indeed agrees with the official implementation. We’ve copied our code into the explain method in the class below. shap_values simply loops explain over many samples.
import numpy as np
from math import comb
from sklearn.linear_model import LinearRegression

class SimpleKernelExplainer:
    def __init__(self, model, data):
        self.model = model
        self.data = data
        self.N, self.D = data.shape
        self.f_bar = np.mean(model(data))

    def shap_values(self, X, M):
        explanations = []
        for i in range(X.shape[0]):
            explanations.append(self.explain(X[i:(i + 1)], M))
        return np.array(explanations)

    def explain(self, x, M):
        # initialize data structures
        mask_matrix = np.zeros((M, self.D))
        kernel_weights = np.zeros(M)
        synth_data = np.tile(self.data, (M, 1))

        # sample random subsets
        for i in range(M):
            mask = np.random.choice([0, 1], size=self.D)
            mask_matrix[i] = mask
            synth_data[i * self.N:(i + 1) * self.N, mask == 1] = x[0, mask == 1]
            # edge cases |S| = 0 and |S| = D are handled by the constraints
            if 0 < mask.sum() < self.D:
                kernel_weights[i] = 1 / (comb(self.D, mask.sum()) * mask.sum() * (self.D - mask.sum()))

        # model predictions on the synthetic data
        model_out = self.model(synth_data)
        v = np.mean(model_out.reshape(M, self.N, -1), axis=1)

        # weighted linear regression on the adjusted values
        X = mask_matrix - mask_matrix[:, -1][:, None]
        y = (v - self.f_bar).flatten() - mask_matrix[:, -1] * (self.model(x) - self.f_bar)
        lm = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=kernel_weights)

        # return shapley approximation
        phi = np.zeros(self.D)
        phi[:-1] = lm.coef_[:-1]
        phi[-1] = (self.model(x) - self.f_bar) - np.sum(lm.coef_)
        return phi

We’ll test this implementation using a random forest model trained on the make_friedman1 dataset (Friedman 1991).
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_friedman1
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.1)
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)
We run both our simple and the official explainer in the block below.
import shap
explainer = SimpleKernelExplainer(model.predict, X)
official_explainer = shap.KernelExplainer(model.predict, X)
shap_values = explainer.shap_values(X, M=1000)
official_shap_values = official_explainer.shap_values(X, nsamples=1000, silent=True)

There’s randomness in the estimates, but they’re close.
print("Package: ", np.round(official_shap_values[:5], 4))

Package: [[ 0.2954 -4.0058 -0.366 -2.3923 0.3805]
[ 0.9994 -2.1396 -0.303 -2.932 0.3207]
[-0.2326 -1.7868 0.6939 3.8895 0.6953]
[ 1.2317 -0.24 0.5055 5.1771 0.7091]
[-0.364 -0.1141 0.0555 1.9114 0.6093]]
print("Ours: ", np.round(shap_values[:5], 4))

Ours: [[ 0.3099 -3.9995 -0.3584 -2.4307 0.3906]
[ 0.9947 -2.1426 -0.3155 -2.9009 0.3098]
[-0.2011 -1.7675 0.6592 3.867 0.7018]
[ 1.2755 -0.2798 0.5157 5.1571 0.7149]
[-0.3609 -0.107 0.0509 1.9142 0.6008]]
Compared to our earlier notes, we’re allowing for a background subsample of size \(B < N\).