
A common step in building a scorecard is calculating “Weights of Evidence”. This step requires categorical independent variables, so continuous variables must be converted first. One way to do that is to “bin” the continuous independent variable into intervals. Users can specify their own bins manually with a traditional CASE statement (e.g., using dplyr::case_when()), but another common approach is to bin by quantile, which places approximately the same number of observations in each bin.

The bin_quantile() function makes it easy to do so.

# Bin the "loan_amount" variable into 5 quantile-based bins (roughly equal counts)
bin_quantile(
  x = loans$loan_amount, 
  n_bins = 5
) |> 
  levels()
#> [1] "[-Inf,12620]"  "(12620,19068]" "(19068,28524]" "(28524,47200]"
#> [5] "(47200, Inf]"
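
To make the idea of quantile binning concrete, here is a minimal base R sketch. It is not the package's implementation of bin_quantile(): the helper name quantile_bins_sketch() and the simulated data are hypothetical, and the sketch simply assumes that interior cut points come from stats::quantile() and are passed to cut() with open-ended outer bins.

```r
# Hypothetical helper for illustration only; bin_quantile() may work differently.
quantile_bins_sketch <- function(x, n_bins) {
  # Evenly spaced probabilities: 0, 1/n_bins, ..., 1
  probs <- seq(0, 1, length.out = n_bins + 1)
  # Keep only the interior quantiles as cut points
  inner <- stats::quantile(
    x,
    probs = probs[-c(1, n_bins + 1)],
    na.rm = TRUE,
    names = FALSE
  )
  # Open-ended outer bins; unique() guards against duplicated cut points
  breaks <- c(-Inf, unique(inner), Inf)
  cut(x, breaks = breaks, include.lowest = TRUE, dig.lab = 6)
}

# Simulated values standing in for a continuous variable
set.seed(42)
x <- runif(1000, min = 1000, max = 100000)

binned <- quantile_bins_sketch(x, n_bins = 5)

# Count the observations that fall into each bin
table(binned)
```

With heavily tied data, several quantiles can coincide. In this sketch the duplicated cut points are dropped, so fewer bins are returned and their counts are no longer equal.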

This function can be easily embedded into a dplyr::mutate() call.

loans |> 
  dplyr::select(loan_amount) |> 
  dplyr::mutate(
    loan_amount_binned = bin_quantile(x = loan_amount, n_bins = 5)
  )
#> # A tibble: 1,000 × 2
#>    loan_amount loan_amount_binned
#>          <dbl> <fct>             
#>  1       11690 [-Inf,12620]      
#>  2       59510 (47200, Inf]      
#>  3       20960 (19068,28524]     
#>  4       78820 (47200, Inf]      
#>  5       48700 (47200, Inf]      
#>  6       90550 (47200, Inf]      
#>  7       28350 (19068,28524]     
#>  8       69480 (47200, Inf]      
#>  9       30590 (28524,47200]     
#> 10       52340 (47200, Inf]      
#> # ℹ 990 more rows

If you prefer to set hard outer bounds on your bins (instead of -Inf/Inf), you can do so with the min_value and max_value arguments.

# Specify the number of bins, and cap the lowest and highest intervals
# at min_value and max_value
bin_quantile(
  x = loans$loan_amount,
  n_bins = 3,
  min_value = 0,
  max_value = 200000
) |>
  levels()
#> [1] "[0,15540]"      "(15540,33680]"  "(33680,200000]"
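
One thing to check when setting hard bounds is what happens to values that fall outside them. The base R sketch below shows the behaviour of cut() in that case; it is an assumption-laden illustration, not a description of bin_quantile(). The helper bounded_quantile_bins_sketch() and the example vector are hypothetical, and bin_quantile() may handle out-of-range values differently.

```r
# Hypothetical helper for illustration only; bin_quantile() may work differently.
bounded_quantile_bins_sketch <- function(x, n_bins, min_value, max_value) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  # Interior quantiles serve as the cut points between bins
  inner <- stats::quantile(
    x,
    probs = probs[-c(1, n_bins + 1)],
    na.rm = TRUE,
    names = FALSE
  )
  # Finite outer bounds replace -Inf and Inf
  breaks <- c(min_value, unique(inner), max_value)
  cut(x, breaks = breaks, include.lowest = TRUE, dig.lab = 6)
}

# The last value (250000) lies above the upper bound used below
x <- c(500, 1200, 3000, 4500, 8000, 15000, 250000)

binned <- bounded_quantile_bins_sketch(
  x,
  n_bins = 3,
  min_value = 0,
  max_value = 200000
)

# cut() returns NA for values outside the supplied breaks
data.frame(x = x, bin = binned)
```

In this sketch, any value below min_value or above max_value ends up as NA, so it is worth choosing bounds that cover the full range of the data (or checking for NA values afterwards).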