Title: | SHAP Plots for 'XGBoost' |
---|---|
Description: | Aid in visual data investigations using SHAP (SHapley Additive exPlanation) visualization plots for 'XGBoost' and 'LightGBM'. It provides summary plot, dependence plot, interaction plot, and force plot and relies on the SHAP implementation provided by 'XGBoost' and 'LightGBM'. Please refer to 'slundberg/shap' for the original implementation of SHAP in 'Python'. |
Authors: | Yang Liu [aut, cre] , Allan Just [aut, ctb] , Michael Mayer [ctb] |
Maintainer: | Yang Liu <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.3 |
Built: | 2024-11-17 04:40:56 UTC |
Source: | https://github.com/liuyanguu/shapforxgboost |
A data.table containing 9 features and about 10,000 observations.
dataXY_df
An object of class data.table (inherits from data.frame) with 10148 rows and 10 columns.
label.feature helps to modify labels. If a list named new_labels is created in the global environment (that is, !is.null(new_labels)), the plots will use that list in place of the default list of labels, labels_within_package.
label.feature(x)
x | variable names |
a character, e.g. "date", "Time Trend", etc.
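The lookup described above can be pictured with a small base R sketch. The helper name lookup_label, the toy label lists, and the fallback to the raw variable name are assumptions made for illustration; this is not the package's own code.

```r
# Hypothetical stand-ins for the package objects of the same names.
labels_within_package <- list(dayint = "Time trend", elev = "Elevation (m)")
new_labels <- NULL

# Hypothetical helper: prefer `custom` when it is not NULL, otherwise use
# the default list; names without a label are returned unchanged.
lookup_label <- function(x, defaults = labels_within_package, custom = new_labels) {
  labels <- if (!is.null(custom)) custom else defaults
  vapply(x, function(v) {
    lab <- labels[[v]]
    if (is.null(lab)) v else as.character(lab)
  }, character(1), USE.NAMES = FALSE)
}

lookup_label(c("dayint", "unknown_var"))
lookup_label("elev", custom = list(elev = "Altitude"))
```

Passing a custom list mirrors what defining new_labels in the global environment is described to do: the custom labels win over the defaults.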
A list that matches each feature to its label. It is used by the function label.feature.
labels_within_package
An object of class list of length 20.
labels_within_package <- list(
  dayint = "Time trend",
  diffcwv = "delta CWV (cm)",
  date = "",
  Column_WV = "MAIAC CWV (cm)",
  AOT_Uncertainty = "Blue band uncertainty",
  elev = "Elevation (m)",
  aod = "Aerosol optical depth",
  RelAZ = "Relative azimuth angle",
  DevAll_P1km = expression(paste("Proportion developed area in 1",km^2)),
  dist_water_km = "Distance to water (km)",
  forestProp_1km = expression(paste("Proportion of forest in 1",km^2)),
  Aer_optical_depth = "DSCOVR EPIC MAIAC AOD400nm",
  aer_aod440 = "AERONET AOD440nm",
  aer_aod500 = "AERONET AOD500nm",
  diff440 = "DSCOVR MAIAC - AERONET AOD",
  diff440_pred = "Predicted Error",
  aer_aod440_hat = "Predicted AERONET AOD440nm",
  AOD_470nm = "AERONET AOD470nm",
  Optical_Depth_047_t = "MAIAC AOD470nm (Terra)",
  Optical_Depth_047_a = "MAIAC AOD470nm (Aqua)"
)
If supplied as a list, it lets the user rename labels.
new_labels
An object of class NULL of length 0.
This function further fine-tunes the format of each feature.
## S3 method for class 'label'
plot(plot1, show_feature)
plot1 | ggplot2 object |
show_feature | feature to plot |
returns a ggplot2 object with further modified layers based on the feature
Make customized scatter plot with diagonal line and R2 printed.
scatter.plot.diagonal(data, x, y, size0 = 0.2, alpha0 = 0.3, dilute = FALSE,
  add_abline = FALSE, add_hist = TRUE, add_stat_cor = TRUE)
data | dataset |
x | x |
y | y |
size0 | point size, default 0.2 |
alpha0 | alpha (transparency) of the points, default 0.3 |
dilute | a number or logical, default FALSE. If a number, only a fraction of the data is plotted (e.g. dilute = 5 plots 1/5 of the data) |
add_abline | whether to add a diagonal line, default FALSE |
add_hist | whether to add a marginal histogram using ggExtra::ggMarginal, default TRUE |
add_stat_cor | whether to add correlation and p-value from ggpubr::stat_cor, default TRUE |
returns a ggplot2 object if add_hist = FALSE
scatter.plot.diagonal(data = iris, x = "Sepal.Length", y = "Petal.Length")
Simple scatter plot, adding marginal histogram by default.
scatter.plot.simple(data, x, y, size0 = 0.2, alpha0 = 0.3, dilute = FALSE,
  add_hist = TRUE, add_stat_cor = FALSE)
data | dataset |
x | x |
y | y |
size0 | point size, default 0.2 |
alpha0 | alpha (transparency) of the points, default 0.3 |
dilute | a number or logical, default FALSE. If a number, only a fraction of the data is plotted (e.g. dilute = 5 plots 1/5 of the data) |
add_hist | whether to add a marginal histogram using ggExtra::ggMarginal, default TRUE |
add_stat_cor | whether to add correlation and p-value from ggpubr::stat_cor, default FALSE |
returns a ggplot2 object if add_hist = FALSE
scatter.plot.simple(data = iris, x = "Sepal.Length", y = "Petal.Length")
The interaction effect SHAP values example using iris dataset.
shap_int_iris
An object of class array of dimension 150 x 5 x 5.
The long-format SHAP values example using iris dataset.
shap_long_iris
An object of class data.table (inherits from data.frame) with 600 rows and 6 columns.
SHAP values example from dataXY_df.
shap_score
An object of class data.table (inherits from data.frame) with 10148 rows and 9 columns.
SHAP values example using iris dataset.
shap_values_iris
An object of class data.table (inherits from data.frame) with 150 rows and 4 columns.
Variable importance as measured by mean absolute SHAP value.
shap.importance(data_long, names_only = FALSE, top_n = Inf)
data_long | long format data of SHAP values from shap.prep |
names_only | If TRUE, only the variable names are returned |
top_n | How many variables to be returned? |
returns a data.table with average absolute SHAP values per variable, sorted in decreasing order of importance.
shap.importance(shap_long_iris)
shap.importance(shap_long_iris, names_only = TRUE)
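As a minimal sketch of what "mean absolute SHAP value" means, the ranking can be reproduced in base R on a toy long-format table. The column names variable and value and all numbers below are assumptions made for illustration, not output of the package.

```r
# Toy long-format SHAP data (made-up numbers).
shap_long <- data.frame(
  variable = rep(c("a", "b", "c"), each = 4),
  value    = c(0.5, -0.5, 1.0, -1.0,   # a
               0.1, -0.1, 0.1, -0.1,   # b
               2.0, -2.0, 0.0,  0.0)   # c
)

# Mean absolute SHAP value per variable.
mean_abs <- vapply(split(abs(shap_long$value), shap_long$variable),
                   mean, numeric(1))

# Sort in decreasing order of importance.
imp <- sort(mean_abs, decreasing = TRUE)
imp

# Keeping only the names of the two most important variables is
# analogous to names_only = TRUE together with top_n = 2.
head(names(imp), 2)
```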
This function by default makes a simple dependence plot with feature values on the x-axis and SHAP values on the y-axis, optionally colored by another feature. A different variable can be used for the SHAP values on the y-axis, and the points can be colored by the feature value of a designated variable. Points are not colored if color_feature is not supplied. If data_int (the SHAP interaction values dataset) is supplied, the interaction effect between y and x is plotted on the y-axis. The dependence plot is easy to make if you have the SHAP values dataset from predict.xgb.Booster or predict.lgb.Booster.
It is not necessary to start with the long format data, but since that is used for the summary plot, we just continue to use it here.
shap.plot.dependence(data_long, x, y = NULL, color_feature = NULL, data_int = NULL,
  dilute = FALSE, smooth = TRUE, size0 = NULL, add_hist = FALSE, add_stat_cor = FALSE,
  alpha = NULL, jitter_height = 0, jitter_width = 0, ...)
data_long | the long format SHAP values from shap.prep |
x | which feature to show on the x-axis; its feature value is plotted |
y | which SHAP values to show on the y-axis; the SHAP value of that feature is plotted. y defaults to x: if y is not provided, the SHAP values of x are plotted on the y-axis |
color_feature | which feature value to use for coloring. If "auto", selects the feature "c" minimizing the variance of the SHAP value given x and c, which can be viewed as a heuristic for the strongest interaction |
data_int | the 3-dimensional SHAP interaction values array. If data_int is supplied, the y-axis plots the interaction values of y (vs. x) |
dilute | a number or logical, default FALSE. If a number, only a fraction of the data is plotted (e.g. dilute = 5 plots 1/5 of the data) |
smooth | whether to add a loess smooth line, default TRUE |
size0 | point size, defaults to 1 if nobs<1000, 0.4 if nobs>1000 |
add_hist | whether to add a histogram using ggExtra::ggMarginal, default FALSE |
add_stat_cor | whether to add correlation and p-value from ggpubr::stat_cor, default FALSE |
alpha | point transparency, defaults to 1 if nobs<1000, else 0.6 |
jitter_height | amount of vertical jitter (see height in ggplot2::geom_jitter) |
jitter_width | amount of horizontal jitter (see width in ggplot2::geom_jitter) |
... | additional parameters passed to ggplot2::geom_jitter |
by default a ggplot2 object, to which you can add more geom layers.
# **SHAP dependence plot**
# 1. simple dependence plot with SHAP values of x on the y axis
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length", add_hist = TRUE, add_stat_cor = TRUE)
# 2. can choose different SHAP values on the y axis
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length", y = "Petal.Width")
# 3. color by another feature's feature values
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length", color_feature = "Petal.Width")
# 4. set x, y, and color_feature explicitly
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length", y = "Petal.Width", color_feature = "Petal.Width")
# Optional to add hist or remove smooth line, optional to plot fewer data (make plot quicker)
shap.plot.dependence(data_long = shap_long_iris, x = "Petal.Length", y = "Petal.Width", color_feature = "Petal.Width", add_hist = TRUE, smooth = FALSE, dilute = 3)
# to make a list of plots
plot_list <- lapply(names(iris)[2:3], shap.plot.dependence, data_long = shap_long_iris)
# **SHAP interaction effect plot**
# To get the interaction SHAP dataset for plotting, need to get `shap_int` first:
mod1 = xgboost::xgboost(
  data = as.matrix(iris[,-5]), label = iris$Species,
  gamma = 0, eta = 1, lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
# Use either:
data_int <- shap.prep.interaction(xgb_model = mod1, X_train = as.matrix(iris[,-5]))
# or:
shap_int <- predict(mod1, as.matrix(iris[,-5]), predinteraction = TRUE)
# if data_int is supplied, y axis will plot the interaction values of y (vs. x)
shap.plot.dependence(data_long = shap_long_iris, data_int = shap_int_iris, x = "Petal.Length", y = "Petal.Width", color_feature = "Petal.Width")
The force/stack plot, optionally zoomed in at a certain x-axis location or on a specific cluster of observations.
shap.plot.force_plot(shapobs, id = "sorted_id", zoom_in_location = NULL,
  y_parent_limit = NULL, y_zoomin_limit = NULL, zoom_in = TRUE, zoom_in_group = NULL)
shapobs | The dataset obtained by shap.prep.stack.data |
id | the id variable |
zoom_in_location | where to zoom in, defaults to the place at 60 percent of the data |
y_parent_limit | set y-axis limits |
y_zoomin_limit | |
zoom_in | default TRUE, zoom in by ggforce::facet_zoom |
zoom_in_group | optional, zoom in on a certain cluster |
# **SHAP force plot**
plot_data <- shap.prep.stack.data(shap_contrib = shap_values_iris, n_groups = 4)
shap.plot.force_plot(plot_data)
shap.plot.force_plot(plot_data, zoom_in_group = 2)
# plot all the clusters:
shap.plot.force_plot_bygroup(plot_data)
A collective display of zoom-in plots: one plot for every group of the clustered observations.
shap.plot.force_plot_bygroup(shapobs, id = "sorted_id", y_parent_limit = NULL)
shapobs | The dataset obtained by shap.prep.stack.data |
id | the id variable |
y_parent_limit | set y-axis limits |
# **SHAP force plot**
plot_data <- shap.prep.stack.data(shap_contrib = shap_values_iris, n_groups = 4)
shap.plot.force_plot(plot_data)
shap.plot.force_plot(plot_data, zoom_in_group = 2)
# plot all the clusters:
shap.plot.force_plot_bygroup(plot_data)
The summary plot (a sina plot) uses long format data of SHAP values. The SHAP values can be obtained from either an XGBoost/LightGBM model or a SHAP value matrix using shap.values. So this summary plot function normally follows the long format dataset obtained using shap.prep. If you want to start with a model and data_X, use shap.plot.summary.wrap1. If you want to use a self-derived dataset of SHAP values, use shap.plot.summary.wrap2. If a list named new_labels is provided in the global environment (new_labels is pre-loaded by the package as NULL), the plots will use that list to label the variables; an example of such a list is the default labels, labels_within_package.
shap.plot.summary(data_long, x_bound = NULL, dilute = FALSE, scientific = FALSE,
  my_format = NULL, min_color_bound = "#FFCC33", max_color_bound = "#6600CC",
  kind = c("sina", "bar"))
data_long | long format data of SHAP values from shap.prep |
x_bound | sets the horizontal axis limit of the plot |
dilute | numeric or logical (TRUE/FALSE); helps draw a test plot faster for large amounts of data. dilute = 5 plots 1/5 of the data. If dilute = TRUE or a number, at most half of the points per feature are plotted, so plotting won't be too slow. If dilute is set too high, at least 10 points per feature are kept. If the dataset would be too small after dilution, all the data are plotted |
scientific | show the mean|SHAP| in scientific format. If TRUE, the label format is 0.0E-0; the default is FALSE, with format 0.000 |
my_format | supply your own number format if you really want |
min_color_bound | min color hex code for colormap. Color gradient is scaled between min_color_bound and max_color_bound. Default is "#FFCC33". |
max_color_bound | max color hex code for colormap. Color gradient is scaled between min_color_bound and max_color_bound. Default is "#6600CC". |
kind | By default, a "sina" plot is shown. As an alternative, set kind = "bar" for a bar plot |
returns a ggplot2 object, to which further layers can be added.
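The dilute rules above can be read as a small piece of arithmetic. The sketch below is one interpretation of those stated rules in base R; the helper name points_per_feature is hypothetical and the package's internal logic may differ in detail.

```r
# Hypothetical helper: how many points per feature remain after dilution,
# following the rules stated for `dilute` (an interpretation, not package code).
points_per_feature <- function(n_obs, dilute = FALSE) {
  if (identical(dilute, FALSE)) return(n_obs)   # no dilution requested
  k <- abs(as.numeric(dilute))                  # TRUE counts as 1
  target <- min(n_obs / k, n_obs / 2)           # 1/k of the data, at most half
  target <- max(target, 10)                     # keep at least 10 points
  if (target >= n_obs) return(n_obs)            # too small: plot everything
  floor(target)
}

points_per_feature(150)                 # all observations
points_per_feature(150, dilute = TRUE)  # at most half
points_per_feature(150, dilute = 5)     # one fifth
points_per_feature(150, dilute = 100)   # floor of 10 points applies
points_per_feature(8, dilute = 5)       # small dataset: everything is kept
```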
data("iris")
X1 = as.matrix(iris[,-5])
mod1 = xgboost::xgboost(
  data = X1, label = iris$Species, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
# shap.values(model, X_dataset) returns the SHAP
# data matrix and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score
shap_values_iris <- shap_values$shap_score
# shap.prep() returns the long-format SHAP data, either from the model
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# or, equivalently, from a given shap_contrib
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)
# **SHAP summary plot**
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)
# Alternative options to make the same plot:
# option 1: from the xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,-5]), top_n = 3)
# option 2: supply a self-made SHAP values dataset
# (e.g. sometimes as output from cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
shap.plot.summary.wrap1 wraps up the functions shap.prep and shap.plot.summary.
shap.plot.summary.wrap1(model, X, top_n, dilute = FALSE)
model | the model |
X | the dataset of predictors used for calculating SHAP |
top_n | how many predictors to show in the plot (ranked) |
dilute | numeric or logical (TRUE/FALSE); helps draw a test plot faster for large amounts of data. dilute = 5 plots 1/5 of the data. If dilute = TRUE or a number, at most half of the points per feature are plotted, so plotting won't be too slow. If dilute is set too high, at least 10 points per feature are kept. If the dataset would be too small after dilution, all the data are plotted |
data("iris")
X1 = as.matrix(iris[,-5])
mod1 = xgboost::xgboost(
  data = X1, label = iris$Species, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
# shap.values(model, X_dataset) returns the SHAP
# data matrix and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score
shap_values_iris <- shap_values$shap_score
# shap.prep() returns the long-format SHAP data, either from the model
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# or, equivalently, from a given shap_contrib
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)
# **SHAP summary plot**
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)
# Alternative options to make the same plot:
# option 1: from the xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,-5]), top_n = 3)
# option 2: supply a self-made SHAP values dataset
# (e.g. sometimes as output from cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
shap.plot.summary.wrap2 wraps up the functions shap.prep and shap.plot.summary. Since a SHAP matrix could be returned from cross-validation instead of only one model, here the wrapped shap.prep takes the SHAP score matrix shap_score as input.
shap.plot.summary.wrap2(shap_score, X, top_n, dilute = FALSE)
shap_score | the SHAP values dataset, could be obtained by shap.values |
X | the dataset of predictors used for calculating SHAP values |
top_n | how many predictors to show in the plot (ranked) |
dilute | numeric or logical (TRUE/FALSE); helps draw a test plot faster for large amounts of data. dilute = 5 plots 1/5 of the data. If dilute = TRUE or a number, at most half of the points per feature are plotted, so plotting won't be too slow. If dilute is set too high, at least 10 points per feature are kept. If the dataset would be too small after dilution, all the data are plotted |
data("iris")
X1 = as.matrix(iris[,-5])
mod1 = xgboost::xgboost(
  data = X1, label = iris$Species, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
# shap.values(model, X_dataset) returns the SHAP
# data matrix and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score
shap_values_iris <- shap_values$shap_score
# shap.prep() returns the long-format SHAP data, either from the model
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# or, equivalently, from a given shap_contrib
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)
# **SHAP summary plot**
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)
# Alternative options to make the same plot:
# option 1: from the xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,-5]), top_n = 3)
# option 2: supply a self-made SHAP values dataset
# (e.g. sometimes as output from cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
Produces a dataset of 6 columns: ID of each observation, variable name, SHAP value, variable value (feature value), deviation of the feature value for each observation (for coloring the point), and the mean SHAP value for each variable. You can view the example dataset included in the package: shap_long_iris.
shap.prep(xgb_model = NULL, shap_contrib = NULL, X_train, top_n = NULL, var_cat = NULL)
xgb_model | an XGBoost (or LightGBM) model object, will derive the SHAP values from it |
shap_contrib | optional, to directly supply a SHAP values dataset. If supplied, it overrides xgb_model when xgb_model is also supplied |
X_train | the dataset of predictors used to calculate SHAP values; it provides feature values to the plot and must be supplied |
top_n | to choose the top_n variables ranked by mean|SHAP| if needed |
var_cat | if supplied, will provide long format data, grouped by this categorical variable |
The ID variable is added for each observation in the shap_contrib dataset for better tracking; it is created as 1:nrow(shap_contrib) before melting shap_contrib into long format.
a long-format data.table, named as shap_long
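To make the six columns concrete, here is a base R sketch that builds such a long table from toy inputs. The column names (ID, variable, value, rfvalue, stdfvalue, mean_value) and the min-max scaling used for the coloring column are assumptions for illustration; inspect shap_long_iris for the package's actual layout.

```r
# Toy inputs: SHAP values and feature values for 3 observations, 2 features.
shap_contrib <- data.frame(f1 = c(0.2, -0.1, 0.4), f2 = c(-0.3, 0.0, 0.3))
X_train      <- data.frame(f1 = c(1, 2, 3),        f2 = c(10, 30, 20))

# Min-max scaling to [0, 1], assumed here as the per-feature colour scale.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Stack one block of rows per feature ("melting" into long format).
shap_long <- do.call(rbind, lapply(names(shap_contrib), function(v) {
  data.frame(
    ID         = seq_len(nrow(shap_contrib)),   # 1:nrow(shap_contrib)
    variable   = v,
    value      = shap_contrib[[v]],             # SHAP value
    rfvalue    = X_train[[v]],                  # raw feature value
    stdfvalue  = rescale01(X_train[[v]]),       # scaled feature value
    mean_value = mean(abs(shap_contrib[[v]]))   # mean |SHAP| of the variable
  )
}))
shap_long
```

With 3 observations and 2 features the result has 6 rows, in the same way that the 150 iris observations and 4 features give the 600 rows of shap_long_iris.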
data("iris")
X1 = as.matrix(iris[,-5])
mod1 = xgboost::xgboost(
  data = X1, label = iris$Species, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
# shap.values(model, X_dataset) returns the SHAP
# data matrix and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score
shap_values_iris <- shap_values$shap_score
# shap.prep() returns the long-format SHAP data, either from the model
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# or, equivalently, from a given shap_contrib
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)
# **SHAP summary plot**
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)
# Alternative options to make the same plot:
# option 1: from the xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,-5]), top_n = 3)
# option 2: supply a self-made SHAP values dataset
# (e.g. sometimes as output from cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)
####
# use `var_cat` to add a categorical variable, output the long-format data differently:
library("data.table")
data("iris")
set.seed(123)
iris$Group <- 0
iris[sample(1:nrow(iris), nrow(iris)/2), "Group"] <- 1
data.table::setDT(iris)
X_train = as.matrix(iris[,c(colnames(iris)[1:4], "Group"), with = FALSE])
mod1 = xgboost::xgboost(
  data = X_train, label = iris$Species, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
shap_long2 <- shap.prep(xgb_model = mod1, X_train = X_train, var_cat = "Group")
# **SHAP summary plot**
shap.plot.summary(shap_long2, scientific = TRUE) + ggplot2::facet_wrap(~ Group)
shap.prep.interaction just runs shap_int <- predict(xgb_model, X_train, predinteraction = TRUE), thus it may not be necessary. Read more about the xgboost predict function at xgboost::predict.xgb.Booster. Note that this functionality is unavailable for LightGBM models.
shap.prep.interaction(xgb_model, X_train)
xgb_model | an xgboost model object |
X_train | the dataset of predictors used for the xgboost model |
a 3-dimensional array: #obs x #features x #features
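The shape of such an array, and how a single interaction slice is read from it, can be sketched in base R. The array below is filled with made-up numbers, and the extra "BIAS" entry is an assumption suggested by the 150 x 5 x 5 dimension of shap_int_iris (4 features plus one); it is not output from xgboost.

```r
feat <- c("f1", "f2", "BIAS")
# 2 observations x 3 x 3, all zeros to start with.
shap_int <- array(0, dim = c(2, 3, 3), dimnames = list(NULL, feat, feat))

# Observation 1: main effects on the diagonal, the f1-f2 interaction
# split symmetrically between the two off-diagonal cells.
shap_int[1, "f1", "f1"] <- 0.30
shap_int[1, "f2", "f2"] <- -0.10
shap_int[1, "f1", "f2"] <- shap_int[1, "f2", "f1"] <- 0.05
shap_int[, "BIAS", "BIAS"] <- 1.5

dim(shap_int)
# Interaction values of f1 with f2, one per observation.
shap_int[, "f1", "f2"]
# Summing over the last dimension gives one total per observation and
# feature, which is how interaction values relate to plain SHAP values.
totals <- apply(shap_int, c(1, 2), sum)
totals
```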
# To get the interaction SHAP dataset for plotting:
# fit the xgboost model
# options("Ncup" = 1)
mod1 = xgboost::xgboost(
  data = as.matrix(iris[,-5]), label = iris$Species,
  gamma = 0, eta = 1, lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
# Use either:
data_int <- shap.prep.interaction(xgb_model = mod1, X_train = as.matrix(iris[,-5]))
# or:
shap_int <- predict(mod1, as.matrix(iris[,-5]), predinteraction = TRUE)
# **SHAP interaction effect plot**
shap.plot.dependence(data_long = shap_long_iris, data_int = shap_int_iris, x = "Petal.Length", y = "Petal.Width", color_feature = "Petal.Width")
Prepares the data for the force plot of the top_n features; optionally plots a random portion of the data in case the dataset is large.
shap.prep.stack.data(shap_contrib, top_n = NULL, data_percent = 1,
  cluster_method = "ward.D", n_groups = 10L)
shap_contrib | the SHAP value data returned from predict. An ID variable is added for each observation in the shap_contrib dataset for better tracking |
top_n | integer, optional; show only the top_n features and combine the rest |
data_percent | what percent of the data to plot (to speed up the test plot). The accepted input range is (0,1]. If too few observations are left, the clustering function will raise an error |
cluster_method | default "ward.D", please refer to stats::hclust |
n_groups | an integer, how many groups to plot in shap.plot.force_plot_bygroup |
a dataset for stack plot
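The role of cluster_method and n_groups can be illustrated with base R's hierarchical clustering. This is a sketch under stated assumptions: the SHAP values are random toy numbers, and scaling before clustering, the group column, and the dendrogram-based ordering are illustrative choices rather than a transcription of the package's code. The column name sorted_id echoes the default id argument of shap.plot.force_plot.

```r
set.seed(1)
# Toy SHAP contributions: 20 observations, 3 features (made-up numbers).
shap_contrib <- matrix(rnorm(60), nrow = 20,
                       dimnames = list(NULL, c("f1", "f2", "f3")))
n_groups <- 4L

# Hierarchical clustering of observations with the "ward.D" method.
hc <- stats::hclust(stats::dist(scale(shap_contrib)), method = "ward.D")

stack_data <- data.frame(
  ID    = seq_len(nrow(shap_contrib)),       # original row order
  shap_contrib,
  group = stats::cutree(hc, k = n_groups)    # cluster membership
)

# Order rows by the dendrogram so similar observations sit side by side,
# then number them along the x-axis of the stacked plot.
stack_data <- stack_data[hc$order, ]
stack_data$sorted_id <- seq_len(nrow(stack_data))
head(stack_data)
```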
# **SHAP force plot**
plot_data <- shap.prep.stack.data(shap_contrib = shap_values_iris, n_groups = 4)
shap.plot.force_plot(plot_data)
shap.plot.force_plot(plot_data, zoom_in_group = 2)
# plot all the clusters:
shap.plot.force_plot_bygroup(plot_data)
shap.values returns a list of three objects from an XGBoost or LightGBM model: 1. a dataset (data.table) of SHAP scores, which has the same dimensions as X_train; 2. the vector of variables ranked by each variable's mean absolute SHAP value, which ranks the predictors by their importance in the model; and 3. the BIAS, which is like an intercept. Generally speaking, the row sums of the SHAP values plus the BIAS equal the predicted values (y_hat).
shap.values(xgb_model, X_train)
xgb_model | an XGBoost or LightGBM model object |
X_train | the data supplied to the predict function of the model |
a list of three elements: the SHAP values as data.table, ranked mean|SHAP|, and BIAS
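The additivity statement (row sums of SHAP values plus BIAS equal the prediction) can be checked by hand in a setting where SHAP values have a closed form. The sketch below swaps in an ordinary linear model from base R instead of XGBoost, under the assumption of independent features, where the SHAP value of feature j is beta_j * (x_j - mean(x_j)). The names shap_score and BIAS only echo the list elements described above.

```r
set.seed(42)
X <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
y <- 1 + 2 * X$x1 - 3 * X$x2 + rnorm(50, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = X)

beta <- coef(fit)[c("x1", "x2")]
# Closed-form SHAP values of a linear model: centre each feature,
# then multiply column j by its coefficient.
centred    <- scale(as.matrix(X), center = TRUE, scale = FALSE)
shap_score <- sweep(centred, 2, beta, `*`)
# The average prediction plays the role of the intercept-like BIAS.
BIAS <- mean(predict(fit))

# Row sums of the SHAP values plus BIAS reproduce the predictions.
all.equal(unname(rowSums(shap_score) + BIAS), unname(predict(fit)))

# Ranking by mean |SHAP|, as in the second element of the returned list.
sort(colMeans(abs(shap_score)), decreasing = TRUE)
```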
data("iris")
X1 = as.matrix(iris[,-5])
mod1 = xgboost::xgboost(
  data = X1, label = iris$Species, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE, nthread = 1)
# shap.values(model, X_dataset) returns the SHAP
# data matrix and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score
shap_values_iris <- shap_values$shap_score
# shap.prep() returns the long-format SHAP data, either from the model
shap_long_iris <- shap.prep(xgb_model = mod1, X_train = X1)
# or, equivalently, from a given shap_contrib
shap_long_iris <- shap.prep(shap_contrib = shap_values_iris, X_train = X1)
# **SHAP summary plot**
shap.plot.summary(shap_long_iris, scientific = TRUE)
shap.plot.summary(shap_long_iris, x_bound = 1.5, dilute = 10)
# Alternative options to make the same plot:
# option 1: from the xgboost model
shap.plot.summary.wrap1(mod1, X = as.matrix(iris[,-5]), top_n = 3)
# option 2: supply a self-made SHAP values dataset
# (e.g. sometimes as output from cross-validation)
shap.plot.summary.wrap2(shap_score = shap_values_iris, X = X1, top_n = 3)