Calculate Information Value for a Set of Independent Variables
iv.Rd
iv()
calculates the information value for each independent
variable against the dependent variable in a data frame. The formal
equation can be described as:
$$IV = \sum_{c} \big(P(X = c \mid Y = 1) - P(X = c \mid Y = 0)\big) \times WoE_c$$
where the sum runs over each level \(c\) of a given categorical independent variable \(X\), and \(WoE_c\) is the Weight of Evidence of level \(c\).
Information Value is a useful by-product of the Weight-of-Evidence transformation, and can provide insight into the relative importance of each WoE-transformed independent variable for your scorecard model.
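The formula above can be computed by hand in base R. This is an illustrative sketch only: the toy vectors below are made up and are not from the `loans` dataset, and no smoothing is applied.

```r
# Hand-computed IV for one categorical predictor (toy data, no smoothing)
x <- c("A", "A", "B", "B", "B", "B")                  # predictor levels
y <- c("good", "bad", "good", "good", "good", "bad")  # binary outcome

tab <- table(x, y)
p_good <- tab[, "good"] / sum(tab[, "good"])  # P(X = c | Y = 1)
p_bad  <- tab[, "bad"]  / sum(tab[, "bad"])   # P(X = c | Y = 0)
woe <- log(p_good / p_bad)                    # Weight of Evidence per level
iv_value <- sum((p_good - p_bad) * woe)       # Information Value
iv_value
#> [1] 0.2746531
```

With these counts, level A carries WoE = log(0.25/0.5) and level B carries WoE = log(0.75/0.5); the weighted differences sum to roughly 0.275, i.e. "moderately predictive" under the interpretation table in Details.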
Arguments
- data: A data frame containing the potential independent and dependent variables for a scorecard model.
- outcome: The column variable in the data data frame representing the outcome (i.e., the dependent variable in data); must have exactly 2 distinct values.
- predictors: <tidy-select> The column variable(s) in the data data frame representing the independent variable(s) (i.e., "predictor" variable(s)); default is all column variables in data except the outcome variable.
- Laplace: (Numeric) The pseudocount parameter of the Laplace smoothing estimator; default is 1e-6. This value helps to avoid -Inf/Inf WoE values arising when an independent variable class contains only one outcome class in the data. Set to 0 to allow Inf/-Inf WoE values.
- verbose: (Logical) Should information on the WoE calculation be printed to the console?
- labels: (Logical) If TRUE, adds a column called 'label' containing IV interpretation values to the output data frame.
Details
Information Value is intended to help in feature selection prior to fitting a scorecard model. Siddiqi (2017) advocates for interpreting the information value statistic as follows:
| Information Value | Interpretation |
|---|---|
| < 0.02 | Not predictive |
| [0.02, 0.1) | Weakly predictive |
| [0.1, 0.3) | Moderately predictive |
| [0.3, 0.5) | Strongly predictive |
| >= 0.5 | Likely overfit |
References
Siddiqi, Naeem (2017). Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards. 2nd ed., Wiley., pp. 184-185.
Examples
# Reverse levels in dependent variable for WoE calculation
df <- loans |>
dplyr::mutate(
default_status = factor(default_status, levels = c("good", "bad"))
)
df |>
iv(
outcome = default_status,
predictors = dplyr::where(is.factor)
)
#>
#> ── `outcome` variable `default_status` has two levels:
#> • Using "good" as *good* level.
#> • Using "bad" as *bad* level.
#>
#> ── You can reverse this by running:
#> data$default_status <- factor(data$default_status, levels = c("bad", "good"))
#>
#> ℹ See ?woe for further detail
#> # A tibble: 6 × 2
#> variable iv
#> <chr> <dbl>
#> 1 amount_of_existing_debt 0.666
#> 2 collateral_type 0.113
#> 3 housing_status 0.0830
#> 4 industry 0.169
#> 5 other_debtors_guarantors 0.0320
#> 6 years_at_current_address 0.00359
# No Laplace smoothing, add IV labels
df |>
iv(
outcome = default_status,
predictors = c(amount_of_existing_debt, industry),
Laplace = 0,
labels = TRUE
)
#>
#> ── `outcome` variable `default_status` has two levels:
#> • Using "good" as *good* level.
#> • Using "bad" as *bad* level.
#>
#> ── You can reverse this by running:
#> data$default_status <- factor(data$default_status, levels = c("bad", "good"))
#>
#> ℹ See ?woe for further detail
#> # A tibble: 2 × 3
#> variable iv label
#> <chr> <dbl> <chr>
#> 1 amount_of_existing_debt 0.666 Likely overfit
#> 2 industry 0.169 Moderately predictive