Calculate Information Value for a Set of Independent Variables
iv.Rd
iv()
calculates the information value for each independent
variable against the dependent variable in a data frame. The formal
equation can be described as:
$$IV = \sum_{c} \big(P(X = c \mid Y = 1) - P(X = c \mid Y = 0)\big) \times WoE_c$$
where the sum runs over each level \(c\) of a given categorical independent variable \(X\), and \(WoE_c\) is the Weight of Evidence of level \(c\).
Information Value is a useful by-product of the Weight-of-Evidence transformation, and can provide insight into the relative importance of each WoE-transformed independent variable for your scorecard model.
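The formula above can be computed by hand in base R. This is an illustrative sketch only: the toy vectors below are made up and are not from the `loans` dataset, and no smoothing is applied.

```r
# Hand-computed IV for one categorical predictor (toy data, no smoothing)
x <- c("A", "A", "B", "B", "B", "B")                  # predictor levels
y <- c("good", "bad", "good", "good", "good", "bad")  # binary outcome

tab <- table(x, y)
p_good <- tab[, "good"] / sum(tab[, "good"])  # P(X = c | Y = 1)
p_bad  <- tab[, "bad"]  / sum(tab[, "bad"])   # P(X = c | Y = 0)
woe <- log(p_good / p_bad)                    # Weight of Evidence per level
iv_value <- sum((p_good - p_bad) * woe)       # Information Value
iv_value
#> [1] 0.2746531
```

With these counts, level A carries WoE = log(0.25/0.5) and level B carries WoE = log(0.75/0.5); the weighted differences sum to roughly 0.275, i.e. "moderately predictive" under the interpretation table in Details.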
Arguments
- data: A data frame containing the potential independent and dependent variables for a scorecard model.
- outcome: The column variable in the data data frame representing the outcome (i.e., the dependent variable in data); must have exactly 2 distinct values.
- predictors: <tidy-select> The column variable(s) in the data data frame representing the independent variable(s) (i.e., "predictor" variable(s)); default is all column variables in data except the outcome variable.
- Laplace: (Numeric) The pseudocount parameter of the Laplace smoothing estimator; default is 1e-6. This value helps to avoid -Inf/Inf WoE values arising when an independent variable class contains only one outcome class in the data. Set to 0 to allow Inf/-Inf WoE values.
- verbose: (Logical) Should information on the WoE calculation be printed to the console?
- labels: (Logical) If TRUE, adds a column called 'label' containing IV interpretation values to the output data frame.
Details
Information Value is intended to help in feature selection prior to fitting a scorecard model. Siddiqi (2017) advocates for interpreting the information value statistic as follows:
| Information Value | Interpretation |
|---|---|
| < 0.02 | Not predictive |
| [0.02, 0.1) | Weakly predictive |
| [0.1, 0.3) | Moderately predictive |
| [0.3, 0.5) | Strongly predictive |
| >= 0.5 | Likely overfit |
References
Siddiqi, Naeem (2017). Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards. 2nd ed., Wiley., pp. 184-185.
Examples
# Reverse levels in dependent variable for WoE calculation
df <- loans |>
dplyr::mutate(
default_status = factor(default_status, levels = c("good", "bad"))
)
df |>
iv(
outcome = default_status,
predictors = dplyr::where(is.factor)
)
#>
#> ── `outcome` variable `default_status` has two levels:
#> • Using "good" as *good* level.
#> • Using "bad" as *bad* level.
#>
#> ── You can reverse this by running:
#> data$default_status <- factor(data$default_status, levels = c("bad", "good"))
#>
#> ℹ See ?woe for further detail
#> # A tibble: 6 × 2
#> variable iv
#> <chr> <dbl>
#> 1 amount_of_existing_debt 0.666
#> 2 collateral_type 0.113
#> 3 housing_status 0.0830
#> 4 industry 0.169
#> 5 other_debtors_guarantors 0.0320
#> 6 years_at_current_address 0.00359
# No Laplace smoothing, add IV labels
df |>
iv(
outcome = default_status,
predictors = c(amount_of_existing_debt, industry),
Laplace = 0,
labels = TRUE
)
#>
#> ── `outcome` variable `default_status` has two levels:
#> • Using "good" as *good* level.
#> • Using "bad" as *bad* level.
#>
#> ── You can reverse this by running:
#> data$default_status <- factor(data$default_status, levels = c("bad", "good"))
#>
#> ℹ See ?woe for further detail
#> # A tibble: 2 × 3
#> variable iv label
#> <chr> <dbl> <chr>
#> 1 amount_of_existing_debt 0.666 Likely overfit
#> 2 industry 0.169 Moderately predictive