woe() builds a Weight-of-Evidence ("WoE") dictionary for a set of predictor variables against a given binary outcome.


woe(
  data,
  outcome,
  predictors,
  Laplace = 1e-06,
  verbose = TRUE,
  method = c("dict", "add", "replace")
)



data
A data frame containing the outcome variable and any variables passed to predictors.

outcome
The column in data that represents the outcome (i.e., the dependent variable); it must have exactly 2 distinct values.

predictors
<tidy-select> The column(s) in data that represent the independent variable(s) (i.e., the "predictor" variable(s)); omitting this argument selects every column in data except outcome.

Laplace
(Numeric) The pseudocount parameter of the Laplace Smoothing estimator; default is 1e-6. This value helps to avoid -Inf/Inf WoE values, which arise when a class of an independent variable contains only one outcome class in the data. Set to 0 to allow Inf/-Inf WoE values.

verbose
(Logical) Should information on the WoE calculation be printed in the console?

method
(String) One of "dict", "add", or "replace". Default is "dict", which creates a Weight-of-Evidence dictionary for each unique class across each variable in predictors. "add" appends new "woe_*" columns to the input data frame (data), while "replace" replaces the variables passed to predictors with their Weight-of-Evidence equivalents.


A tibble with calculated Weights-of-Evidence


Negative "WoE" values indicate that a particular class of the independent variable holds a larger share of the "bads" than of the "goods", while positive "WoE" values indicate that it holds a larger share of the "goods" than of the "bads". This assumes that the levels of the outcome variable are ordered c("good", "bad"); otherwise, the inverse is true.
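The sign convention can be checked by hand. The following is a minimal base R sketch, not the package's internal code: it assumes WoE is the natural log of p_good / p_bad and ignores the Laplace pseudocount, which is negligible at its default. The counts are copied from the housing_status rows of the example output, and the `counts` data frame is a name introduced here for illustration.

```r
# Counts copied from the housing_status rows of the example output
counts <- data.frame(
  class  = c("own", "rent"),
  n_good = c(527, 173),
  n_bad  = c(186, 114)
)

# Share of all goods (and of all bads) that falls in each class
counts$p_good <- counts$n_good / sum(counts$n_good)
counts$p_bad  <- counts$n_bad  / sum(counts$n_bad)

# Assumed definition: WoE is the natural log of the ratio of the two shares
counts$woe <- log(counts$p_good / counts$p_bad)

print(counts)
```

Under these assumptions the "own" class comes out positive (relatively more goods) and the "rent" class negative (relatively more bads), in line with the values printed by woe().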


# View the order of levels in `loans$default_status`
levels(loans$default_status)
#> [1] "bad"  "good"

# Reverse levels in dependent variable for WoE calculation
df <- loans |>
  dplyr::mutate(
    default_status = factor(default_status, levels = c("good", "bad"))
  )

# Create a WoE dictionary
df |>
  woe(
    outcome = default_status,
    predictors = c(industry, housing_status)
  )
#> ── `outcome` variable `default_status` has two levels: 
#> • Using "good" as *good* level.
#> • Using "bad" as *bad* level.
#> ── You can reverse this by running: 
#> df$default_status <- factor(df$default_status, levels = c("bad", "good"))
#>  See ?woe for further detail
#> # A tibble: 12 × 8
#>    variable       class        n_total n_good n_bad p_good   p_bad     woe
#>    <chr>          <chr>          <int>  <int> <int>  <dbl>   <dbl>   <dbl>
#>  1 industry       ""                 9      8     1 0.0114 0.00333  1.23  
#>  2 industry       "beef"            97     63    34 0.09   0.113   -0.231 
#>  3 industry       "dairy"          181    123    58 0.176  0.193   -0.0956
#>  4 industry       "fruit"          234    145    89 0.207  0.297   -0.359 
#>  5 industry       "grain"          280    218    62 0.311  0.207    0.410 
#>  6 industry       "greenhouse"      12      7     5 0.01   0.0167  -0.511 
#>  7 industry       "nuts"            22     14     8 0.02   0.0267  -0.288 
#>  8 industry       "pork"            50     28    22 0.04   0.0733  -0.606 
#>  9 industry       "poultry"        103     86    17 0.123  0.0567   0.774 
#> 10 industry       "sod"             12      8     4 0.0114 0.0133  -0.154 
#> 11 housing_status "own"            713    527   186 0.753  0.62     0.194 
#> 12 housing_status "rent"           287    173   114 0.247  0.38    -0.430
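
The effect of the Laplace pseudocount can be sketched in base R as well. This is an illustration under stated assumptions, not the package's implementation: `woe_smoothed()` is a hypothetical helper, the pseudocount is assumed to be added to each class count (with the totals adjusted accordingly), and the counts are made up so that one class has no "bad" outcomes.

```r
# Hypothetical helper: WoE with an additive pseudocount `alpha`.
# Assumption: alpha is added to every class count and the totals grow by alpha * k.
woe_smoothed <- function(n_good, n_bad, alpha = 1e-6) {
  k <- length(n_good)  # number of classes in the predictor
  p_good <- (n_good + alpha) / (sum(n_good) + alpha * k)
  p_bad  <- (n_bad  + alpha) / (sum(n_bad)  + alpha * k)
  log(p_good / p_bad)
}

# Made-up counts: the first class contains no "bad" outcomes at all
n_good <- c(10, 40)
n_bad  <- c(0, 50)

unsmoothed <- woe_smoothed(n_good, n_bad, alpha = 0)     # first element is Inf
smoothed   <- woe_smoothed(n_good, n_bad, alpha = 1e-6)  # all elements finite

print(unsmoothed)
print(smoothed)
```

With a pseudocount of 0 the empty cell produces an infinite WoE, while a small positive pseudocount keeps the value large but finite, which is the behaviour the Laplace argument is described as controlling.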