Binning Data
A common step in building a scorecard is calculating “Weights of
Evidence”. This step, however, calls for categorical (not continuous)
independent variables. One way to convert a continuous independent
variable into a categorical feature is to “bin” it. While users could
write a traditional CASE statement (e.g., using dplyr::case_when()) to
specify their own bins manually, another common approach is to bin by
quantile, which places roughly the same number of observations in each
bin (heavily tied values can make the counts uneven). The
bin_quantile() function makes it easy to do so.
# Bin the "loan_amount" variable into 5 quantile-based bins
# (roughly equal counts per bin, not equal interval widths)
bin_quantile(
  x = loans$loan_amount,
  n_bins = 5
) |>
  levels()
#> [1] "[-Inf,12620]" "(12620,19068]" "(19068,28524]" "(28524,47200]"
#> [5] "(47200, Inf]"
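The idea behind quantile binning can be sketched in base R with stats::quantile() and cut(). This is only an illustration under stated assumptions, not necessarily how bin_quantile() is implemented: the helper name quantile_bins_sketch and the toy data are hypothetical.

```r
# Hypothetical helper: an illustration of quantile binning in base R,
# not the package's bin_quantile() implementation.
quantile_bins_sketch <- function(x, n_bins = 5) {
  # Interior cut points at the 1/n, 2/n, ..., (n-1)/n quantiles
  probs <- seq(0, 1, length.out = n_bins + 1)
  inner <- stats::quantile(
    x,
    probs = probs[-c(1, n_bins + 1)],
    na.rm = TRUE,
    names = FALSE
  )
  # Open-ended outer bounds so every non-missing value lands in a bin;
  # unique() guards against duplicate cut points when x has many ties
  breaks <- c(-Inf, unique(inner), Inf)
  cut(x, breaks = breaks, include.lowest = TRUE)
}

set.seed(1)
toy_amounts <- round(runif(1000, min = 1000, max = 100000))  # made-up data
binned <- quantile_bins_sketch(toy_amounts, n_bins = 5)
table(binned)  # counts per bin should be roughly equal
```

Because the cut points come from the data, the intervals have very different widths while holding similar numbers of observations.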
This function can be easily embedded into a dplyr::mutate() call.
loans |>
  dplyr::select(loan_amount) |>
  dplyr::mutate(
    loan_amount_binned = bin_quantile(x = loan_amount, n_bins = 5)
  )
#> # A tibble: 1,000 × 2
#>    loan_amount loan_amount_binned
#>          <dbl> <fct>
#>  1       11690 [-Inf,12620]
#>  2       59510 (47200, Inf]
#>  3       20960 (19068,28524]
#>  4       78820 (47200, Inf]
#>  5       48700 (47200, Inf]
#>  6       90550 (47200, Inf]
#>  7       28350 (19068,28524]
#>  8       69480 (47200, Inf]
#>  9       30590 (28524,47200]
#> 10       52340 (47200, Inf]
#> # ℹ 990 more rows
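For comparison, the manual approach mentioned at the start could look something like the sketch below. The thresholds, the bin labels, and the toy_loans data frame are made up for illustration and are not part of the package or its loans data.

```r
library(dplyr)

# Made-up loan amounts for illustration only
toy_loans <- data.frame(loan_amount = c(5000, 18000, 26000, 60000))

# Hand-picked thresholds (hypothetical); conditions are checked in order,
# so each row gets the label of the first condition it satisfies
toy_binned <- toy_loans |>
  mutate(
    loan_amount_manual = case_when(
      loan_amount <= 15000 ~ "low",
      loan_amount <= 30000 ~ "medium",
      TRUE ~ "high"
    ),
    # Store as a factor with explicit levels so the bins keep their order
    loan_amount_manual = factor(
      loan_amount_manual,
      levels = c("low", "medium", "high")
    )
  )

toy_binned
```

Manual bins give full control over the cut points, but nothing guarantees that each bin receives a similar share of the data.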
If you prefer to set hard outer bounds on your bins (instead of
-Inf/Inf), you can do so with the min_value and max_value arguments.
# Bin into 3 quantile-based bins, capping the lowest and highest
# intervals at min_value and max_value instead of -Inf/Inf
bin_quantile(
  x = loans$loan_amount,
  n_bins = 3,
  min_value = 0,
  max_value = 200000
) |>
  levels()
#> [1] "[0,15540]" "(15540,33680]" "(33680,200000]"
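One way to picture hard outer bounds is as finite outer breaks passed to cut() in place of -Inf and Inf. The base R sketch below assumes that reading; the helper bounded_quantile_bins_sketch and the toy values are hypothetical, and it is not the package's implementation.

```r
# Hypothetical helper: quantile bins with finite outer bounds (base R sketch,
# not the package's bin_quantile() implementation).
bounded_quantile_bins_sketch <- function(x, n_bins, min_value, max_value) {
  # Interior cut points at the 1/n, ..., (n-1)/n quantiles of the data
  probs <- seq(0, 1, length.out = n_bins + 1)
  inner <- stats::quantile(
    x,
    probs = probs[-c(1, n_bins + 1)],
    na.rm = TRUE,
    names = FALSE
  )
  # Finite outer breaks replace -Inf and Inf
  breaks <- c(min_value, unique(inner), max_value)
  # In this sketch, values outside [min_value, max_value] match no
  # interval, so cut() returns NA for them
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# Made-up loan amounts; the last one exceeds the upper bound
toy_amounts <- c(2000, 9000, 15000, 22000, 31000, 48000, 75000, 250000)
binned <- bounded_quantile_bins_sketch(
  toy_amounts,
  n_bins = 3,
  min_value = 0,
  max_value = 200000
)
levels(binned)
sum(is.na(binned))  # counts values that fell outside the bounds
```

In this sketch an out-of-range value silently becomes NA. How bin_quantile() itself treats such values is not shown here, so it is worth checking the function's documentation before relying on hard bounds.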