| Title: | SHAP Plots for 'XGBoost' |
|---|---|
| Description: | Aid in visual data investigations using SHAP (SHapley Additive exPlanation) visualization plots for 'XGBoost' and 'LightGBM'. It provides summary plot, dependence plot, interaction plot, and force plot and relies on the SHAP implementation provided by 'XGBoost' and 'LightGBM'. |
| Authors: | Yang Liu [aut, cre] (ORCID: <https://orcid.org/0000-0001-6557-6439>), Allan Just [aut, ctb] (ORCID: <https://orcid.org/0000-0003-4312-5957>), Michael Mayer [ctb] |
| Maintainer: | Yang Liu <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-05-12 08:01:23 UTC |
| Source: | https://github.com/liuyanguu/shapforxgboost |
A data.table with 9 features and about 10,000 observations.
dataXY_df
An object of class data.table (inherits from data.frame) with 10148 rows and 10 columns.
label.feature helps to modify labels. If a list named new_labels exists in the
global environment (!is.null(new_labels)), the plots use that list in place of
the default list of labels, labels_within_package.
label.feature(x)
| x | variable names |
|---|---|
a character, e.g. "date", "Time Trend", etc.
A list that matches each feature to its label. It is used by the function label.feature.
labels_within_package
An object of class list of length 20.
labels_within_package <- list(
  dayint = "Time trend",
  diffcwv = "delta CWV (cm)",
  date = "",
  Column_WV = "MAIAC CWV (cm)",
  AOT_Uncertainty = "Blue band uncertainty",
  elev = "Elevation (m)",
  aod = "Aerosol optical depth",
  RelAZ = "Relative azimuth angle",
  DevAll_P1km = expression(paste("Proportion developed area in 1",km^2)),
  dist_water_km = "Distance to water (km)",
  forestProp_1km = expression(paste("Proportion of forest in 1",km^2)),
  Aer_optical_depth = "DSCOVR EPIC MAIAC AOD400nm",
  aer_aod440 = "AERONET AOD440nm",
  aer_aod500 = "AERONET AOD500nm",
  diff440 = "DSCOVR MAIAC - AERONET AOD",
  diff440_pred = "Predicted Error",
  aer_aod440_hat = "Predicted AERONET AOD440nm",
  AOD_470nm = "AERONET AOD470nm",
  Optical_Depth_047_t = "MAIAC AOD470nm (Terra)",
  Optical_Depth_047_a = "MAIAC AOD470nm (Aqua)"
)
If supplied as a list, it lets the user rename labels.
new_labels
An object of class NULL of length 0.
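The relabelling mechanism described above can be sketched in base R. The feature names and label strings below are made up for illustration, and lookup_label is a hypothetical stand-in, not the package's label.feature; in particular, falling back to the raw name when no label is found is an assumption.

```r
# Hypothetical custom labels: names are feature names, values are display labels.
new_labels <- list(
  Sepal.Length = "Sepal length (cm)",
  Petal.Length = "Petal length (cm)"
)

# Hypothetical helper mimicking a label lookup (not the package's own function).
# Assumption: features without an entry keep their raw name.
lookup_label <- function(x, labels = new_labels) {
  hit <- labels[[x]]          # NULL when the list has no element named x
  if (is.null(hit)) x else hit
}

lookup_label("Petal.Length")  # found in new_labels
lookup_label("Species")       # not found, raw name is returned
```

In the package itself, defining new_labels in the global environment is what the text above describes; the helper only illustrates the lookup idea.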
This function further fine-tunes the format of each feature.
## S3 method for class 'label'
plot(plot1, show_feature)
| plot1 | ggplot2 object |
|---|---|
| show_feature | feature to plot |
Returns a ggplot2 object with further modified layers based on the feature.
Makes a customized scatter plot with a diagonal line and R2 printed.
scatter.plot.diagonal(data, x, y, size0 = 0.2, alpha0 = 0.3, dilute = FALSE, add_abline = FALSE, add_hist = TRUE, add_stat_cor = TRUE)
| data | dataset |
|---|---|
| x | x |
| y | y |
| size0 | point size, default to 1 if nobs<1000, 0.4 if nobs>1000 |
| alpha0 | alpha of point |
| dilute | a number or logical, default to FALSE, will plot |
| add_abline | default to FALSE, add a diagonal line |
| add_hist | optional to add marginal histogram using |
| add_stat_cor | add correlation and p-value from |
ggplot2 object if add_hist = FALSE
scatter.plot.diagonal(data = iris, x = "Sepal.Length", y = "Petal.Length")
Simple scatter plot, adding marginal histogram by default.
scatter.plot.simple(data, x, y, size0 = 0.2, alpha0 = 0.3, dilute = FALSE, add_hist = TRUE, add_stat_cor = FALSE)
| data | dataset |
|---|---|
| x | x |
| y | y |
| size0 | point size, default to 1 if nobs<1000, 0.4 if nobs>1000 |
| alpha0 | alpha of point |
| dilute | a number or logical, default to FALSE, will plot |
| add_hist | optional to add marginal histogram using |
| add_stat_cor | add correlation and p-value from |
ggplot2 object if add_hist = FALSE
scatter.plot.simple(data = iris, x = "Sepal.Length", y = "Petal.Length")
The interaction effect SHAP values example using iris dataset.
shap_int_iris
An object of class array of dimension 150 x 5 x 5.
The long-format SHAP values example using iris dataset.
shap_long_iris
An object of class data.table (inherits from data.frame) with 600 rows and 6 columns.
SHAP values example from dataXY_df.
shap_score
An object of class data.table (inherits from data.frame) with 10148 rows and 9 columns.
SHAP values example using iris dataset.
shap_values_iris
An object of class data.table (inherits from data.frame) with 150 rows and 4 columns.
Variable importance as measured by mean absolute SHAP value.
shap.importance(data_long, names_only = FALSE, top_n = Inf)
| data_long | a long format data of SHAP values from |
|---|---|
| names_only | If |
| top_n | How many variables to be returned? |
returns data.table with average absolute SHAP
values per variable, sorted in decreasing order of importance.
shap.importance(shap_long_iris)
shap.importance(shap_long_iris, names_only = 1)
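The importance measure itself, mean absolute SHAP value per variable sorted in decreasing order, can be sketched in base R on a toy long-format table. The data and the object names below are invented for illustration; this is not the package's implementation, which returns a data.table.

```r
# Toy long-format SHAP data: one row per (observation, variable) pair.
shap_long_toy <- data.frame(
  variable = rep(c("f1", "f2", "f3"), each = 4),
  value = c(0.1, -0.2, 0.3, -0.4,    # SHAP values for f1
            1.0, -1.0, 0.5, -0.5,    # SHAP values for f2
            0.0, 0.05, -0.05, 0.0)   # SHAP values for f3
)

# Mean absolute SHAP value per variable.
mean_abs <- vapply(
  split(abs(shap_long_toy$value), shap_long_toy$variable),
  mean,
  numeric(1)
)

# Sort so the most important variable comes first.
importance <- sort(mean_abs, decreasing = TRUE)

# Keep only the top two variables, in the spirit of the top_n argument.
top_two <- head(importance, 2)
names(importance)
```

Taking absolute values before averaging matters: positive and negative contributions would otherwise cancel, and a strongly influential feature such as f2 here would look unimportant.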
Creates scatter plots showing the relationship between feature values (x-axis) and SHAP values (y-axis). Can display:
- Simple dependence: how feature values affect predictions
- Colored by another feature: to explore interactions
- Interaction effects: when data_int is provided, shows pairwise SHAP interaction values
shap.plot.dependence(data_long, x, y = NULL, color_feature = NULL, data_int = NULL, dilute = FALSE, smooth = TRUE, size0 = NULL, add_hist = FALSE, add_stat_cor = FALSE, alpha = NULL, jitter_height = 0, jitter_width = 0, ...)
| data_long | the long format SHAP values from |
|---|---|
| x | which feature to show on the x-axis; the feature value is plotted |
| y | which SHAP values to show on the y-axis; the SHAP value of that feature is plotted. y defaults to x: if y is not provided, the SHAP values of x are plotted on the y-axis |
| color_feature | which feature value to use for coloring. If "auto", selects the feature "c" that minimizes the variance of the SHAP value given x and c, which can be viewed as a heuristic for the strongest interaction. |
| data_int | the 3-dimensional SHAP interaction values array. if |
| dilute | a number or logical, default to FALSE, will plot |
| smooth | optional to add a loess smooth line, default to TRUE. |
| size0 | point size, default to 1 if nobs<1000, 0.4 if nobs>1000 |
| add_hist | whether to add histogram using |
| add_stat_cor | add correlation and p-value from |
| alpha | point transparency, default to 1 if nobs<1000 else 0.6 |
| jitter_height | amount of vertical jitter (see height in |
| jitter_width | amount of horizontal jitter (see width in |
| ... | additional parameters passed to |
By default, returns a ggplot2 object, to which you can add more geom layers.
# Example: SHAP dependence plots

# 1. Simple dependence plot: SHAP values vs feature values
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length",
                     add_hist = TRUE, add_stat_cor = TRUE)

# 2. Show different SHAP values on y-axis
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length",
                     y = "Petal.Width")

# 3. Color by another feature's values
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length",
                     color_feature = "Petal.Width")

# 4. Customize x, y, and color features
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length",
                     y = "Petal.Width", color_feature = "Petal.Width")

# 5. Additional options: histogram, smooth line, data dilution
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length",
                     y = "Petal.Width", color_feature = "Petal.Width",
                     add_hist = TRUE, smooth = FALSE, dilute = 3)

# Create multiple plots at once
plot_list <- lapply(names(iris)[2:3], shap.plot.dependence, data_long = shap_long_iris)

# SHAP interaction effect plot
# First, prepare the model and interaction data
X_iris = as.matrix(iris[,1:4])
y_iris = as.numeric(iris[[5]]) - 1
dtrain = xgboost::xgb.DMatrix(data = X_iris, label = y_iris)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Get interaction SHAP values (two methods):
data_int <- shap.prep.interaction(xgb_model = mod1, X_train = X_iris)
# Or directly: shap_int <- predict(mod1, X_iris, predinteraction = TRUE)

# Plot interaction effects (y-axis shows interaction values)
shap.plot.dependence(data_long = shap_long_iris, data_int = shap_int_iris,
                     x = "Petal.Length", y = "Petal.Width",
                     color_feature = "Petal.Width")
Displays feature contributions as stacked bars for individual predictions. Each bar shows how features push the prediction above or below the baseline. Supports optional zoom-in for detailed inspection of observation clusters.
shap.plot.force_plot(shapobs, id = "sorted_id", zoom_in_location = NULL, y_parent_limit = NULL, y_zoomin_limit = NULL, zoom_in = TRUE, zoom_in_group = NULL)
| shapobs | The dataset obtained by |
|---|---|
| id | the id variable. |
| zoom_in_location | where to zoom in, default at place of 60 percent of the data. |
| y_parent_limit | set y-axis limits. |
| y_zoomin_limit | |
| zoom_in | default to TRUE, zoom in by |
| zoom_in_group | optional to zoom in certain cluster. |
# Example: SHAP force plots (stacked bar charts)
# Shows contribution of each feature to individual predictions
plot_data <- shap.prep.stack.data(shap_contrib = shap_values_iris, n_groups = 4)
shap.plot.force_plot(plot_data)
shap.plot.force_plot(plot_data, zoom_in_group = 2)

# Plot all clusters separately
shap.plot.force_plot_bygroup(plot_data)
Creates a faceted display with one force plot per observation cluster, allowing comparison of prediction patterns across different groups.
shap.plot.force_plot_bygroup(shapobs, id = "sorted_id", y_parent_limit = NULL)
| shapobs | The dataset obtained by |
|---|---|
| id | the id variable. |
| y_parent_limit | set y-axis limits. |
# Example: SHAP force plots (stacked bar charts)
# Shows contribution of each feature to individual predictions
plot_data <- shap.prep.stack.data(shap_contrib = shap_values_iris, n_groups = 4)
shap.plot.force_plot(plot_data)
shap.plot.force_plot(plot_data, zoom_in_group = 2)

# Plot all clusters separately
shap.plot.force_plot_bygroup(plot_data)
Creates a beeswarm/sina plot or bar chart showing feature importance.
The sina plot shows SHAP value distributions for each feature, colored by
feature values. For a simpler workflow, use shap.plot.summary.wrap1
(directly from model) or shap.plot.summary.wrap2 (from SHAP matrix).
shap.plot.summary(data_long, x_bound = NULL, dilute = FALSE, scientific = FALSE, my_format = NULL, min_color_bound = "#FFCC33", max_color_bound = "#6600CC", kind = c("sina", "bar"))
| data_long | a long format data of SHAP values from |
|---|---|
| x_bound | sets the horizontal axis limit in the plot |
| dilute | numeric or logical (TRUE/FALSE); speeds up test plots on large data. If dilute = 5, plots 1/5 of the data. If dilute = TRUE or a number, plots at most half of the points per feature, so plotting won't be too slow. If dilute is set too high, at least 10 points per feature are kept. If the dataset is too small after dilution, all the data are plotted |
| scientific | show the mean absolute SHAP value in scientific format. If TRUE, the label format is 0.0E-0; default is FALSE, and the format is 0.000 |
| my_format | supply your own number format if you really want |
| min_color_bound | min color hex code for colormap. Color gradient is scaled between min_color_bound and max_color_bound. Default is "#FFCC33". |
| max_color_bound | max color hex code for colormap. Color gradient is scaled between min_color_bound and max_color_bound. Default is "#6600CC". |
| kind | By default, a "sina" plot is shown. As an alternative, set |
To customize feature labels, define new_labels in the global environment
as a named list (see labels_within_package for examples).
Returns a ggplot2 object, to which further layers can be added.
# Example: Basic workflow for SHAP summary plot
# Note: For xgboost 3.x, use xgb.DMatrix + xgb.train, and convert factor labels to numeric
data("iris")
X1 = as.matrix(iris[,1:4])
y1 = as.numeric(iris[[5]]) - 1  # Convert factor to numeric
dtrain = xgboost::xgb.DMatrix(data = X1, label = y1)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Get SHAP values and feature importance
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score  # Ranked features by mean|SHAP|
shap_values_iris <- shap_values$shap_score

# Prepare long-format data for plotting
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# Alternative: use pre-computed SHAP values
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)

# SHAP summary plot
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)

# Alternative options:
# Option 1: directly from xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,1:4]), top_n = 3)
# Option 2: from pre-computed SHAP values (useful for cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
shap.plot.summary.wrap1 wraps the functions shap.prep and shap.plot.summary.
shap.plot.summary.wrap1(model, X, top_n, dilute = FALSE)
| model | the model |
|---|---|
| X | the dataset of predictors used for calculating SHAP |
| top_n | how many predictors you want to show in the plot (ranked) |
| dilute | numeric or logical (TRUE/FALSE); speeds up test plots on large data. If dilute = 5, plots 1/5 of the data. If dilute = TRUE or a number, plots at most half of the points per feature, so plotting won't be too slow. If dilute is set too high, at least 10 points per feature are kept. If the dataset is too small after dilution, all the data are plotted |
# Example: Basic workflow for SHAP summary plot
# Note: For xgboost 3.x, use xgb.DMatrix + xgb.train, and convert factor labels to numeric
data("iris")
X1 = as.matrix(iris[,1:4])
y1 = as.numeric(iris[[5]]) - 1  # Convert factor to numeric
dtrain = xgboost::xgb.DMatrix(data = X1, label = y1)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Get SHAP values and feature importance
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score  # Ranked features by mean|SHAP|
shap_values_iris <- shap_values$shap_score

# Prepare long-format data for plotting
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# Alternative: use pre-computed SHAP values
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)

# SHAP summary plot
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)

# Alternative options:
# Option 1: directly from xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,1:4]), top_n = 3)
# Option 2: from pre-computed SHAP values (useful for cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
shap.plot.summary.wrap2 wraps the functions shap.prep and
shap.plot.summary. Because a SHAP matrix may come from cross-validation
rather than from a single model, the wrapped shap.prep takes the SHAP
score matrix shap_score as input.
shap.plot.summary.wrap2(shap_score, X, top_n, dilute = FALSE)
| shap_score | the SHAP values dataset, could be obtained by |
|---|---|
| X | the dataset of predictors used for calculating SHAP values |
| top_n | how many predictors you want to show in the plot (ranked) |
| dilute | numeric or logical (TRUE/FALSE); speeds up test plots on large data. If dilute = 5, plots 1/5 of the data. If dilute = TRUE or a number, plots at most half of the points per feature, so plotting won't be too slow. If dilute is set too high, at least 10 points per feature are kept. If the dataset is too small after dilution, all the data are plotted |
# Example: Basic workflow for SHAP summary plot
# Note: For xgboost 3.x, use xgb.DMatrix + xgb.train, and convert factor labels to numeric
data("iris")
X1 = as.matrix(iris[,1:4])
y1 = as.numeric(iris[[5]]) - 1  # Convert factor to numeric
dtrain = xgboost::xgb.DMatrix(data = X1, label = y1)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Get SHAP values and feature importance
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score  # Ranked features by mean|SHAP|
shap_values_iris <- shap_values$shap_score

# Prepare long-format data for plotting
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# Alternative: use pre-computed SHAP values
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)

# SHAP summary plot
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)

# Alternative options:
# Option 1: directly from xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,1:4]), top_n = 3)
# Option 2: from pre-computed SHAP values (useful for cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
Produces a data.table with 6 columns: ID (observation identifier), variable
(feature name), value (SHAP value), rfvalue (raw feature value), stdfvalue
(standardized feature value for coloring), and mean_value (mean absolute SHAP
value per feature). See shap_long_iris for an example.
shap.prep(xgb_model = NULL, shap_contrib = NULL, X_train, top_n = NULL, var_cat = NULL)
| xgb_model | an XGBoost or LightGBM model object (will derive SHAP values from it) |
|---|---|
| shap_contrib | optional: a matrix of SHAP values (without baseline column). If supplied, will use these values instead of computing from |
| X_train | the predictor matrix used to calculate SHAP values (required) |
| top_n | number of top features to include, ranked by mean absolute SHAP value (default: all) |
| var_cat | optional: name of a categorical variable for grouped/faceted plots |
The ID variable (1:nrow(shap_contrib)) is added to track each
observation before converting to long format.
Returns a long-format data.table, named shap_long.
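The six-column long format described above can be sketched in base R. The SHAP values below are invented, the object names are hypothetical, and the result is a plain data.frame rather than a data.table. Treating stdfvalue as a min-max rescaling of the raw feature value within each feature is an assumption made for illustration.

```r
# Toy predictor matrix (first five iris rows, two features) and
# a made-up SHAP matrix with the same shape and column names.
X_toy <- as.matrix(iris[1:5, 1:2])
shap_toy <- matrix(
  c(0.10, -0.05, 0.02, -0.08, 0.04,    # Sepal.Length contributions
    0.30, -0.20, 0.05, -0.10, 0.25),   # Sepal.Width contributions
  nrow = 5, dimnames = list(NULL, colnames(X_toy))
)

n <- nrow(X_toy)

# Stack columns: as.vector() reads a matrix column by column, which
# lines up with repeating each feature name n times.
shap_long_toy <- data.frame(
  ID = rep(seq_len(n), times = ncol(X_toy)),
  variable = rep(colnames(X_toy), each = n),
  value = as.vector(shap_toy),
  rfvalue = as.vector(X_toy)
)

# Assumption: rescale raw feature values to [0, 1] within each feature.
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
shap_long_toy$stdfvalue <- ave(shap_long_toy$rfvalue,
                               shap_long_toy$variable, FUN = min_max)

# Mean absolute SHAP value per feature, repeated on every row of that feature.
shap_long_toy$mean_value <- ave(abs(shap_long_toy$value),
                                shap_long_toy$variable, FUN = mean)
```

With 5 observations and 2 features the result has 10 rows, which mirrors how shap_long_iris has 600 rows for 150 observations and 4 features.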
# Example: Basic workflow for SHAP summary plot
# Note: For xgboost 3.x, use xgb.DMatrix + xgb.train, and convert factor labels to numeric
data("iris")
X1 = as.matrix(iris[,1:4])
y1 = as.numeric(iris[[5]]) - 1  # Convert factor to numeric
dtrain = xgboost::xgb.DMatrix(data = X1, label = y1)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Get SHAP values and feature importance
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score  # Ranked features by mean|SHAP|
shap_values_iris <- shap_values$shap_score

# Prepare long-format data for plotting
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# Alternative: use pre-computed SHAP values
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)

# SHAP summary plot
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)

# Alternative options:
# Option 1: directly from xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,1:4]), top_n = 3)
# Option 2: from pre-computed SHAP values (useful for cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)

# Example: Using var_cat to group by a categorical variable
# The var_cat argument creates grouped long-format data for faceted plots
library("data.table")
data("iris")
set.seed(123)
iris$Group <- 0
iris[sample(1:nrow(iris), nrow(iris)/2), "Group"] <- 1
data.table::setDT(iris)
X_train = as.matrix(iris[,c(colnames(iris)[1:4], "Group"), with = FALSE])
y_train = as.numeric(iris$Species) - 1  # Convert factor to numeric
dtrain = xgboost::xgb.DMatrix(data = X_train, label = y_train)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Use var_cat to create faceted plots by Group
shap_long2 <- shap.prep(xgb_model = mod1, X_train = X_train, var_cat = "Group")
shap.plot.summary(shap_long2, scientific = TRUE) + ggplot2::facet_wrap(~ Group)
This is a convenience wrapper for
predict(xgb_model, X_train, predinteraction = TRUE).
See xgboost::predict.xgb.Booster for details.
Note: This functionality is only available for XGBoost models, not LightGBM.
shap.prep.interaction(xgb_model, X_train)
| xgb_model | an xgboost model object |
|---|---|
| X_train | the dataset of predictors used for the xgboost model |
Returns a 3-dimensional array: #obs x #features x #features
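To make the shape of the returned array concrete, here is a base-R sketch with a tiny hand-built array of 3 observations and 2 features. All numbers and names are invented; a real array would come from the model. The last step relies on the general property of SHAP interaction values that, for one feature, summing over the second feature dimension recovers that feature's SHAP value.

```r
feats <- c("f1", "f2")

# Toy interaction array: observations x features x features.
int_toy <- array(0, dim = c(3, 2, 2),
                 dimnames = list(NULL, feats, feats))

# Main effects sit on the diagonal slices.
int_toy[, "f1", "f1"] <- c(0.5, -0.2, 0.1)
int_toy[, "f2", "f2"] <- c(0.3, 0.0, -0.4)

# The pairwise interaction is split symmetrically across the two
# off-diagonal slices.
int_toy[, "f1", "f2"] <- c(0.05, -0.1, 0.2)
int_toy[, "f2", "f1"] <- c(0.05, -0.1, 0.2)

# Interaction values of f1 with f2, one per observation.
pair_f1_f2 <- int_toy[, "f1", "f2"]

# Summing f1's slice over the second feature dimension gives the
# per-observation SHAP value of f1 in this toy example.
shap_f1 <- rowSums(int_toy[, "f1", ])
```

Indexing by feature name as above only works when the array carries dimnames; otherwise integer positions are needed.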
# Example: SHAP interaction plots
# Shows how feature interactions affect predictions
X_iris = as.matrix(iris[,1:4])
y_iris = as.numeric(iris[[5]]) - 1  # Convert factor to numeric
dtrain = xgboost::xgb.DMatrix(data = X_iris, label = y_iris)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Get interaction SHAP values (two methods):
data_int <- shap.prep.interaction(xgb_model = mod1, X_train = X_iris)
# Or directly: shap_int <- predict(mod1, X_iris, predinteraction = TRUE)

# Plot interaction effects
shap.plot.dependence(data_long = shap_long_iris, data_int = shap_int_iris,
                     x = "Petal.Length", y = "Petal.Width",
                     color_feature = "Petal.Width")
Transforms SHAP values into a format suitable for force plots, which show how features contribute to individual predictions. The function:
Ranks features by importance
Optionally combines less important features into 'rest_variables'
Clusters observations for better visualization
Assigns group labels for faceted plots
shap.prep.stack.data(shap_contrib, top_n = NULL, data_percent = 1, cluster_method = "ward.D", n_groups = 10L)
shap_contrib |
shap_contrib is the SHAP value data returned from
predict, here an ID variable is added for each observation in
the |
top_n |
integer, optional to show only top_n features, combine the rest |
data_percent |
proportion of the data to plot (useful for speeding up test plots). The accepted input range is (0, 1]; if too few observations remain, the clustering function will throw an error |
cluster_method |
defaults to ward.D; please refer to |
n_groups |
an integer, how many groups to plot in
|
a dataset for stack plot
# Example: SHAP force plots (stacked bar charts)
# Shows contribution of each feature to individual predictions
plot_data <- shap.prep.stack.data(shap_contrib = shap_values_iris, n_groups = 4)
shap.plot.force_plot(plot_data)
shap.plot.force_plot(plot_data, zoom_in_group = 2)

# Plot all clusters separately
shap.plot.force_plot_bygroup(plot_data)
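The four steps listed in the description (rank, combine, cluster, group) can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's actual implementation: `shap_toy` is a hypothetical matrix of random numbers standing in for SHAP values, and the clustering uses `stats::hclust` with the `"ward.D"` method on unscaled distances, which may differ from what `shap.prep.stack.data` does internally.

```r
set.seed(42)
# Toy SHAP matrix: 20 observations x 4 features (synthetic values, not model output)
shap_toy <- matrix(rnorm(80), nrow = 20,
                   dimnames = list(NULL, c("a", "b", "c", "d")))
shap_toy[, "a"] <- shap_toy[, "a"] * 3  # inflate one feature's contributions

# 1. Rank features by mean absolute SHAP value
imp <- sort(colMeans(abs(shap_toy)), decreasing = TRUE)

# 2. Keep the top_n features and sum the remainder into 'rest_variables'
top_n <- 2
keep <- names(imp)[seq_len(top_n)]
rest <- setdiff(colnames(shap_toy), keep)
stacked <- cbind(shap_toy[, keep, drop = FALSE],
                 rest_variables = rowSums(shap_toy[, rest, drop = FALSE]))

# 3. Cluster observations hierarchically (assumed: Ward clustering on Euclidean distance)
hc <- stats::hclust(stats::dist(stacked), method = "ward.D")

# 4. Cut the tree into n_groups and reorder rows so similar observations sit together
n_groups <- 4L
group <- stats::cutree(hc, k = n_groups)
plot_order <- hc$order
stacked_sorted <- data.frame(ID = plot_order,
                             stacked[plot_order, , drop = FALSE],
                             group = group[plot_order])
head(stacked_sorted)
```

Summing the less important features into a single column keeps each observation's total contribution unchanged, so the stacked bars still add up to the same value per observation.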
shap.values returns a list of three objects from an XGBoost or LightGBM
model: 1. a dataset (data.table) of SHAP scores, with the same dimensions as
X_train; 2. the vector of variables ranked by mean absolute SHAP value,
which orders the predictors by their importance in the model; and 3.
the baseline value (intercept), which is stored in the last column of the
SHAP contribution matrix (named "BIAS" in older xgboost versions or
"(Intercept)" in newer versions). The row sum of the SHAP values plus the
baseline generally equals the predicted value (y_hat).
shap.values(xgb_model, X_train)
xgb_model |
an XGBoost or LightGBM model object |
X_train |
the data supplied to the |
a list of three elements:
shap_score |
A data.table of SHAP values (without the baseline column) |
mean_shap_score |
Ranked features by mean absolute SHAP value |
BIAS0 |
The baseline/intercept value (from the '(Intercept)' column in xgboost 3.x) |
# Example: Basic workflow for SHAP summary plot
# Note: For xgboost 3.x, use xgb.DMatrix + xgb.train, and convert factor labels to numeric
data("iris")
X1 = as.matrix(iris[,1:4])
y1 = as.numeric(iris[[5]]) - 1 # Convert factor to numeric
dtrain = xgboost::xgb.DMatrix(data = X1, label = y1)
params = list(learning_rate = 1, min_split_loss = 0, reg_lambda = 0,
              objective = 'reg:squarederror', nthread = 1)
mod1 = xgboost::xgb.train(params = params, data = dtrain, nrounds = 1, verbose = 0)

# Get SHAP values and feature importance
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score # Ranked features by mean|SHAP|
shap_values_iris <- shap_values$shap_score

# Prepare long-format data for plotting
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# Alternative: use pre-computed SHAP values
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)

# SHAP summary plot
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)

# Alternative options:
# Option 1: directly from xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,1:4]), top_n = 3)
# Option 2: from pre-computed SHAP values (useful for cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
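The additivity property mentioned in the description (baseline plus the row sum of SHAP values equals the prediction) can be checked by hand on a model where SHAP values have a closed form. The sketch below swaps in a plain linear model fitted with `lm` instead of XGBoost or LightGBM: for a linear model with features treated as independent, the SHAP value of feature j is its coefficient times the feature's deviation from its mean, and the baseline is the average prediction. The names `shap_lin` and `bias0` are illustrative and are not objects returned by `shap.values`.

```r
# Linear model on iris as a stand-in for a tree model (assumption: features
# are treated as independent, so SHAP values have a closed form).
X <- as.matrix(iris[, 1:3])
y <- iris$Petal.Width
fit <- lm(y ~ X)
beta <- coef(fit)[-1]                      # drop the intercept

centered <- sweep(X, 2, colMeans(X))       # x_j - mean(x_j)
shap_lin <- sweep(centered, 2, beta, `*`)  # one SHAP value per observation and feature
colnames(shap_lin) <- colnames(X)
bias0 <- mean(fitted(fit))                 # baseline = average prediction

# Additivity: baseline + row sum of SHAP values should match the prediction
max(abs(bias0 + rowSums(shap_lin) - fitted(fit)))

# Ranking by mean absolute SHAP value, analogous to mean_shap_score
sort(colMeans(abs(shap_lin)), decreasing = TRUE)
```

The first printed value should be zero up to floating-point error. The same check can be applied to real output by comparing `rowSums(shap_score) + BIAS0` with the model's predictions on the margin scale.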