Once your independent variables are categorical, you may want to transform them into Weight-of-Evidence (“WoE”) equivalents. WoE compares the proportion of goods with the proportion of bads (with respect to the dependent variable) in a particular class of an independent variable, expressed as the log of their ratio. The woe() function provides three methods for this, selected with the method argument: "dict", "add", and "replace".

Create a WoE Dictionary

The default method, “dict”, creates a “dictionary” of the WoE values (along with additional statistics) for each unique class of the independent variables supplied in the predictors argument.

# "dict" is the default method
loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status), 
    verbose = FALSE
  )
#> # A tibble: 12 × 8
#>    variable       class        n_total n_bad n_good   p_bad p_good     woe
#>    <chr>          <chr>          <int> <int>  <int>   <dbl>  <dbl>   <dbl>
#>  1 industry       ""                 9     1      8 0.00333 0.0114 -1.23  
#>  2 industry       "beef"            97    34     63 0.113   0.09    0.231 
#>  3 industry       "dairy"          181    58    123 0.193   0.176   0.0956
#>  4 industry       "fruit"          234    89    145 0.297   0.207   0.359 
#>  5 industry       "grain"          280    62    218 0.207   0.311  -0.410 
#>  6 industry       "greenhouse"      12     5      7 0.0167  0.01    0.511 
#>  7 industry       "nuts"            22     8     14 0.0267  0.02    0.288 
#>  8 industry       "pork"            50    22     28 0.0733  0.04    0.606 
#>  9 industry       "poultry"        103    17     86 0.0567  0.123  -0.774 
#> 10 industry       "sod"             12     4      8 0.0133  0.0114  0.154 
#> 11 housing_status "own"            713   186    527 0.62    0.753  -0.194 
#> 12 housing_status "rent"           287   114    173 0.38    0.247   0.430

The Weight-of-Evidence calculation assumes that the levels of the outcome variable are ordered c("good", "bad"). Under that ordering, a negative WoE value indicates that a class of the independent variable holds a higher proportion of “bads” than “goods”, while a positive WoE value indicates a higher proportion of “goods” than “bads”. Note that the levels of the dependent variable (default_status) in the loans data frame are instead ordered c("bad", "good"), so the signs in the output above are the reverse of this interpretation.

# View the original order of the levels in the dependent variable `default_status`
levels(loans$default_status)
#> [1] "bad"  "good"

To interpret the Weight-of-Evidence values and downstream calculations correctly, we can reverse the levels of the default_status column in the loans data frame.

# Reverse the levels 
loans$default_status <- factor(loans$default_status, levels = c("good", "bad"))

# View the new order of the levels in the dependent variable `default_status`
levels(loans$default_status)
#> [1] "good" "bad"
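The same reversal can be written without typing the level names out. The following is a minimal base R sketch on a toy factor (the `status` vector is made up for illustration and is not part of the loans data); it also checks that reordering the levels leaves the underlying labels untouched.

```r
# Toy factor with the same level order as loans$default_status (illustrative data only)
status <- factor(c("good", "bad", "good", "good"), levels = c("bad", "good"))

# Reverse the level order without naming the levels explicitly
status_rev <- factor(status, levels = rev(levels(status)))

# Inspect the new level order
levels(status_rev)

# Only the level order changes; the label on each observation stays the same
identical(as.character(status), as.character(status_rev))
```

Because only the ordering of the levels changes, the counts per class stay the same; what changes is which level the WoE calculation treats as “good”.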

Now the signs of the values in the woe column are flipped and line up with the expected interpretation:

# "dict" is the default method
loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status),
    verbose = FALSE
  )
#> # A tibble: 12 × 8
#>    variable       class        n_total n_good n_bad p_good   p_bad     woe
#>    <chr>          <chr>          <int>  <int> <int>  <dbl>   <dbl>   <dbl>
#>  1 industry       ""                 9      8     1 0.0114 0.00333  1.23  
#>  2 industry       "beef"            97     63    34 0.09   0.113   -0.231 
#>  3 industry       "dairy"          181    123    58 0.176  0.193   -0.0956
#>  4 industry       "fruit"          234    145    89 0.207  0.297   -0.359 
#>  5 industry       "grain"          280    218    62 0.311  0.207    0.410 
#>  6 industry       "greenhouse"      12      7     5 0.01   0.0167  -0.511 
#>  7 industry       "nuts"            22     14     8 0.02   0.0267  -0.288 
#>  8 industry       "pork"            50     28    22 0.04   0.0733  -0.606 
#>  9 industry       "poultry"        103     86    17 0.123  0.0567   0.774 
#> 10 industry       "sod"             12      8     4 0.0114 0.0133  -0.154 
#> 11 housing_status "own"            713    527   186 0.753  0.62     0.194 
#> 12 housing_status "rent"           287    173   114 0.247  0.38    -0.430
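The woe column can be checked by hand. Here is a minimal sketch, assuming (as the p_good and p_bad columns in the table suggest) that WoE is the natural log of a class's share of all goods divided by its share of all bads. The counts are copied from the housing_status rows above; the object names are illustrative only and are not part of the package.

```r
# Counts copied from the housing_status rows of the dictionary above
n_good <- c(own = 527, rent = 173)
n_bad  <- c(own = 186, rent = 114)

# Share of all goods (and of all bads) that falls in each class
p_good <- n_good / sum(n_good)
p_bad  <- n_bad / sum(n_bad)

# Assumed formula: natural log of the ratio of the two shares
woe_by_hand <- log(p_good / p_bad)

round(woe_by_hand, 3)
```

Rounded to three decimals, the results agree with the 0.194 and -0.430 reported for “own” and “rent” in the table.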

Add WoE Variables to a Data Frame

If instead we want to add WoE features to our original data frame, we can do so with method = "add".

loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status),
    method = "add", 
    verbose = FALSE
  )
#> # A tibble: 1,000 × 5
#>    default_status industry housing_status woe_industry woe_housing_status
#>    <fct>          <fct>    <fct>                 <dbl>              <dbl>
#>  1 good           grain    own                  0.410               0.194
#>  2 bad            grain    own                  0.410               0.194
#>  3 good           pork     own                 -0.606               0.194
#>  4 good           dairy    rent                -0.0956             -0.430
#>  5 bad            fruit    rent                -0.359              -0.430
#>  6 good           pork     rent                -0.606              -0.430
#>  7 good           dairy    own                 -0.0956              0.194
#>  8 good           poultry  rent                 0.774              -0.430
#>  9 good           grain    own                  0.410               0.194
#> 10 bad            fruit    own                 -0.359               0.194
#> # ℹ 990 more rows
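Conceptually, the added columns behave like a lookup of each row's class in the WoE dictionary. The base R sketch below illustrates that idea on toy data; it is not a description of how woe() is implemented, and `woe_dict` and `toy_loans` are hypothetical objects that reuse the housing_status values from the dictionary above.

```r
# Hypothetical mini dictionary, using the housing_status WoE values shown above
woe_dict <- data.frame(
  class = c("own", "rent"),
  woe   = c(0.194, -0.430)
)

# Toy data frame standing in for a few rows of `loans`
toy_loans <- data.frame(
  housing_status = c("own", "own", "rent", "own")
)

# Look up each row's class in the dictionary and append the matching WoE value
toy_loans$woe_housing_status <- woe_dict$woe[
  match(toy_loans$housing_status, woe_dict$class)
]

toy_loans
```

Every row with the same class receives the same WoE value, which is why the woe_housing_status column in the output above takes only two distinct values.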

Replace Variables in a Data Frame with WoE Transformations

Lastly, if we want to replace the original independent variables in the data frame with their WoE transformations, we can do so with method = "replace".

loans |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status),
    method = "replace", 
    verbose = FALSE
  )
#> # A tibble: 1,000 × 3
#>    default_status woe_industry woe_housing_status
#>    <fct>                 <dbl>              <dbl>
#>  1 good                 0.410               0.194
#>  2 bad                  0.410               0.194
#>  3 good                -0.606               0.194
#>  4 good                -0.0956             -0.430
#>  5 bad                 -0.359              -0.430
#>  6 good                -0.606              -0.430
#>  7 good                -0.0956              0.194
#>  8 good                 0.774              -0.430
#>  9 good                 0.410               0.194
#> 10 bad                 -0.359               0.194
#> # ℹ 990 more rows