
Calculate Weight-of-Evidence
woe() builds the Weight-of-Evidence ("WoE")
dictionary of a set of predictor variables upon a given binary outcome.
Usage
woe(
  data,
  outcome,
  predictors,
  Laplace = 1e-06,
  verbose = TRUE,
  method = c("dict", "add", "replace")
)
Arguments
- data
A data frame containing the `outcome` variable and any variables passed to `predictors`
- outcome
The column variable in the `data` data frame representing the outcome (i.e., the dependent variable in `data`); must have exactly 2 distinct values
- predictors
<tidy-select> The column variable(s) in the `data` data frame representing the independent variable(s) (i.e., "predictor" variable(s)); omitting this argument will select all column variables in `data` except `outcome`
- Laplace
(Numeric) The pseudocount parameter of the Laplace Smoothing estimator; default is `1e-6`. This value helps to avoid -Inf/Inf WoE values in situations where an independent variable class has only one outcome class in the data. Set to 0 to allow Inf/-Inf WoE values.
- verbose
(Logical) Should information on the WoE calculation be printed in the console?
- method
(String) One of "dict", "add", or "replace". Default is "dict", which creates a Weight-of-Evidence dictionary for each unique class across each variable in `predictors`. "add" appends new "woe_*" columns to the input data frame (`data`), while "replace" replaces the variables passed to `predictors` with their Weight-of-Evidence equivalents.
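The effect of the `Laplace` pseudocount can be illustrated with a minimal base R sketch. The helper `woe_smoothed()` and its smoothing form (adding the pseudocount to every class count before taking proportions) are assumptions made for illustration only; the exact way woe() applies the pseudocount internally is not documented here and may differ.

```r
# Hypothetical helper for illustration; NOT the internals of woe().
# Assumption: the pseudocount is added to each class count before
# the per-outcome proportions are computed.
woe_smoothed <- function(n_good, n_bad, pseudocount = 1e-06) {
  p_good <- (n_good + pseudocount) / sum(n_good + pseudocount)
  p_bad  <- (n_bad + pseudocount) / sum(n_bad + pseudocount)
  log(p_good / p_bad)
}

# Made-up counts for two classes; the second class has no "bad" outcomes.
n_good <- c(50, 10)
n_bad  <- c(40, 0)

# With a pseudocount of 0 the second class divides by zero inside log()
woe_unsmoothed <- woe_smoothed(n_good, n_bad, pseudocount = 0)
is.infinite(woe_unsmoothed[2])

# With a small positive pseudocount every WoE value stays finite
woe_default <- woe_smoothed(n_good, n_bad)
all(is.finite(woe_default))
```

Under this assumed form, the smoothed value for the second class is large but finite, which is the behaviour the `Laplace` argument describes.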
Details
Negative "WoE" values indicate that a particular class of the independent
variable holds a larger share of the "bads" than of the "goods", while
positive "WoE" values indicate that it holds a larger share of the "goods"
than of the "bads". This hinges on the assumption that the levels of the
outcome variable are ordered c("good", "bad"); otherwise, the inverse is
true.
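The sign convention can be checked by hand in base R. The sketch below takes the counts for two `industry` classes from the example output on this page and computes the log ratio of the class's share of goods to its share of bads. It ignores the `Laplace` pseudocount, which at the default of `1e-6` does not change the rounded values.

```r
# Counts copied from the `industry` rows of the example output on this page
n_good <- c(grain = 218, pork = 28)
n_bad  <- c(grain = 62, pork = 22)

# Totals across all ten `industry` classes in that output
total_good <- 700
total_bad  <- 300

# Share of all goods and of all bads that falls in each class
p_good <- n_good / total_good
p_bad  <- n_bad / total_bad

# WoE as the natural log of the ratio of the two shares (no smoothing)
woe_manual <- log(p_good / p_bad)
round(woe_manual, 3)
#> grain   pork
#> 0.410 -0.606
```

Here "grain" holds a larger share of the goods than of the bads, so its value is positive, while "pork" is the reverse. Swapping the two shares, which is what reversing the outcome levels amounts to, flips every sign.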
References
Siddiqi, Naeem (2017). Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards. 2nd ed., Wiley, p. 184.
Hvitfeldt E, Kuhn M (2022). embed: Extra Recipes for Encoding Predictors. https://embed.tidymodels.org, https://github.com/tidymodels/embed.
Examples
# View the order of levels in `loans$default_status`
levels(loans$default_status)
#> [1] "bad" "good"
# Reverse levels in dependent variable for WoE calculation
df <- loans |>
  dplyr::mutate(
    default_status = factor(default_status, levels = c("good", "bad"))
  )
# Create a WoE dictionary
df |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status)
  )
#>
#> ── `outcome` variable `default_status` has two levels:
#> • Using "good" as *good* level.
#> • Using "bad" as *bad* level.
#>
#> ── You can reverse this by running:
#> df$default_status <- factor(df$default_status, levels = c("bad", "good"))
#>
#> ℹ See ?woe for further detail
#> # A tibble: 12 × 8
#>    variable       class        n_total n_good n_bad p_good   p_bad     woe
#>    <chr>          <chr>          <int>  <int> <int>  <dbl>   <dbl>   <dbl>
#>  1 industry       ""                 9      8     1 0.0114 0.00333  1.23
#>  2 industry       "beef"            97     63    34 0.09   0.113   -0.231
#>  3 industry       "dairy"          181    123    58 0.176  0.193   -0.0956
#>  4 industry       "fruit"          234    145    89 0.207  0.297   -0.359
#>  5 industry       "grain"          280    218    62 0.311  0.207    0.410
#>  6 industry       "greenhouse"      12      7     5 0.01   0.0167  -0.511
#>  7 industry       "nuts"            22     14     8 0.02   0.0267  -0.288
#>  8 industry       "pork"            50     28    22 0.04   0.0733  -0.606
#>  9 industry       "poultry"        103     86    17 0.123  0.0567   0.774
#> 10 industry       "sod"             12      8     4 0.0114 0.0133  -0.154
#> 11 housing_status "own"            713    527   186 0.753  0.62     0.194
#> 12 housing_status "rent"           287    173   114 0.247  0.38    -0.430