Weight of Evidence
woe.Rmd
Once your independent variables are categorical, you may want to
transform them into Weight-of-Evidence (“WoE”) equivalents.
WoE compares the share of goods with the share of bads (with respect
to the dependent variable) that fall in a particular class of an
independent variable, expressed as the log of their ratio. The woe()
function provides three methods for this:
c("dict", "add", "replace").
Create a WoE Dictionary
The default method, “dict”, creates a “dictionary” of the
WoE values (along with additional statistics) for each unique class of
the independent variables supplied in the predictors argument.
# "dict" is the default method
loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status),
    verbose = FALSE
  )
#> # A tibble: 12 × 8
#> variable class n_total n_bad n_good p_bad p_good woe
#> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 industry "" 9 1 8 0.00333 0.0114 -1.23
#> 2 industry "beef" 97 34 63 0.113 0.09 0.231
#> 3 industry "dairy" 181 58 123 0.193 0.176 0.0956
#> 4 industry "fruit" 234 89 145 0.297 0.207 0.359
#> 5 industry "grain" 280 62 218 0.207 0.311 -0.410
#> 6 industry "greenhouse" 12 5 7 0.0167 0.01 0.511
#> 7 industry "nuts" 22 8 14 0.0267 0.02 0.288
#> 8 industry "pork" 50 22 28 0.0733 0.04 0.606
#> 9 industry "poultry" 103 17 86 0.0567 0.123 -0.774
#> 10 industry "sod" 12 4 8 0.0133 0.0114 0.154
#> 11 housing_status "own" 713 186 527 0.62 0.753 -0.194
#> 12 housing_status "rent" 287 114 173 0.38 0.247 0.430
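As a check on the printed values, the woe column for housing_status can be reproduced by hand from the counts in the table. This is a minimal sketch in base R. The formula used here, the natural log of p_bad over p_good, is inferred from the printed columns rather than taken from the package source, and the counts object is a made-up name for illustration.

```r
# Counts copied from the housing_status rows of the dictionary above
counts <- data.frame(
  class  = c("own", "rent"),
  n_bad  = c(186, 114),
  n_good = c(527, 173)
)

# Share of all bads (and of all goods) that falls in each class
counts$p_bad  <- counts$n_bad  / sum(counts$n_bad)
counts$p_good <- counts$n_good / sum(counts$n_good)

# Assumed formula: log of the ratio of the two shares,
# in the same order as the p_bad and p_good columns above
counts$woe <- log(counts$p_bad / counts$p_good)

round(counts$woe, 3)
```

Rounded to three decimals, the results agree with the -0.194 and 0.430 reported for “own” and “rent”.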
The Weight-of-Evidence calculation assumes that the levels of the
outcome variable are ordered c("good", "bad").
Under that ordering, negative WoE values indicate a higher proportion
of “bads” than “goods” in a particular class of the independent
variable, while positive WoE values indicate a higher proportion of
“goods” than “bads”.
Note that the levels of the dependent variable (default_status) in the
loans data frame are instead ordered c("bad", "good"), so the signs in
the table above are the reverse of this interpretation.
# View the original order of the levels in the dependent variable `default_status`
levels(loans$default_status)
#> [1] "bad" "good"
To interpret the Weight-of-Evidence values and downstream calculations
correctly, we can reverse the levels of the default_status column in
the loans data frame.
# Reverse the levels
loans$default_status <- factor(loans$default_status, levels = c("good", "bad"))
# View the new order of the levels in the dependent variable `default_status`
levels(loans$default_status)
#> [1] "good" "bad"
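Reordering levels this way changes only the level order, not the observations themselves. The toy vector below (not taken from loans) illustrates this in base R, and shows rev(levels(x)) as an equivalent way to reverse a two-level factor. The names status, reordered, and reversed are hypothetical.

```r
# Toy outcome, unrelated to the loans data; levels default to alphabetical order
status <- factor(c("good", "bad", "good", "good"))
levels(status)

# Reorder the levels explicitly ...
reordered <- factor(status, levels = c("good", "bad"))
# ... or reverse whatever order is currently in place
reversed <- factor(status, levels = rev(levels(status)))

levels(reordered)

# Both approaches give the same factor
identical(reordered, reversed)

# The underlying observations are unchanged
identical(as.character(status), as.character(reordered))
```

Assigning new labels with levels(x) <- c("good", "bad") would instead relabel the observations, so it is not a substitute for the calls above.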
Now the signs of the values in the woe column are flipped and line up
with the expected interpretation:
# "dict" is the default method
loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status),
    verbose = FALSE
  )
#> # A tibble: 12 × 8
#> variable class n_total n_good n_bad p_good p_bad woe
#> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 industry "" 9 8 1 0.0114 0.00333 1.23
#> 2 industry "beef" 97 63 34 0.09 0.113 -0.231
#> 3 industry "dairy" 181 123 58 0.176 0.193 -0.0956
#> 4 industry "fruit" 234 145 89 0.207 0.297 -0.359
#> 5 industry "grain" 280 218 62 0.311 0.207 0.410
#> 6 industry "greenhouse" 12 7 5 0.01 0.0167 -0.511
#> 7 industry "nuts" 22 14 8 0.02 0.0267 -0.288
#> 8 industry "pork" 50 28 22 0.04 0.0733 -0.606
#> 9 industry "poultry" 103 86 17 0.123 0.0567 0.774
#> 10 industry "sod" 12 8 4 0.0114 0.0133 -0.154
#> 11 housing_status "own" 713 527 186 0.753 0.62 0.194
#> 12 housing_status "rent" 287 173 114 0.247 0.38 -0.430
Add WoE Variables to a Data Frame
If instead we want to add WoE features to our original data
frame, we can do so with method = "add".
loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status),
    method = "add",
    verbose = FALSE
  )
#> # A tibble: 1,000 × 5
#> default_status industry housing_status woe_industry woe_housing_status
#> <fct> <fct> <fct> <dbl> <dbl>
#> 1 good grain own 0.410 0.194
#> 2 bad grain own 0.410 0.194
#> 3 good pork own -0.606 0.194
#> 4 good dairy rent -0.0956 -0.430
#> 5 bad fruit rent -0.359 -0.430
#> 6 good pork rent -0.606 -0.430
#> 7 good dairy own -0.0956 0.194
#> 8 good poultry rent 0.774 -0.430
#> 9 good grain own 0.410 0.194
#> 10 bad fruit own -0.359 0.194
#> # ℹ 990 more rows
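Conceptually, adding a WoE feature amounts to looking up each row's class in the dictionary. The sketch below illustrates that idea with match() in base R, using a hand-built dictionary and three toy rows. It is not necessarily how woe() is implemented, and dict and toy are hypothetical names.

```r
# WoE values for housing_status, copied from the dictionary output above
dict <- data.frame(
  class = c("own", "rent"),
  woe   = c(0.194, -0.430)
)

# Three toy rows standing in for the loans data
toy <- data.frame(housing_status = c("own", "rent", "own"))

# Look up each row's class in the dictionary and attach its WoE value
toy$woe_housing_status <- dict$woe[match(toy$housing_status, dict$class)]

toy
```

A class that is missing from the dictionary would come back as NA from this lookup, which is worth checking for before modelling.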
Replace Variables in a Data Frame with WoE Transformations
Lastly, if we want to replace the original independent
variables in the data frame with their WoE transformations, we can do so
with method = "replace".
loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status),
    method = "replace",
    verbose = FALSE
  )
#> # A tibble: 1,000 × 3
#> default_status woe_industry woe_housing_status
#> <fct> <dbl> <dbl>
#> 1 good 0.410 0.194
#> 2 bad 0.410 0.194
#> 3 good -0.606 0.194
#> 4 good -0.0956 -0.430
#> 5 bad -0.359 -0.430
#> 6 good -0.606 -0.430
#> 7 good -0.0956 0.194
#> 8 good 0.774 -0.430
#> 9 good 0.410 0.194
#> 10 bad -0.359 0.194
#> # ℹ 990 more rows