
bin_quantile() converts a numeric vector into intervals using R's quantile() and cut() functions, so that the observations are distributed approximately equally across the unique intervals. Binning continuous independent variables into categorical representations is an important pre-processing step when building credit scorecards, and it must take place before fitting a logistic regression model.
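The combination of quantile() and cut() described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: the variable names (`probs`, `inner`, `breaks`, `binned`) are hypothetical, and the sketch assumes that only the interior quantiles are rounded and that the outer limits are then attached as the first and last break.

```r
# Hypothetical sketch of quantile-based binning; not the bin_quantile() source.
x <- iris$Sepal.Width * 100
n_bins <- 4L

# Probabilities for the bin edges: 0, 0.25, 0.5, 0.75, 1 when n_bins = 4
probs <- seq(0, 1, length.out = n_bins + 1L)

# Keep only the interior cut points (drop the 0% and 100% quantiles)
inner <- quantile(x, probs = probs[-c(1L, length(probs))], names = FALSE)

# Round to whole numbers, drop duplicates, and attach the outer limits
breaks <- c(-Inf, unique(round(inner, 0L)), Inf)

# include.lowest = TRUE closes the first interval on the left;
# dig.lab = 10 keeps the labels out of scientific notation
binned <- cut(x, breaks = breaks, include.lowest = TRUE, dig.lab = 10L)

table(binned)
```

Because quantiles are rounded and tied values fall into the same interval, the bin counts printed by table() are only roughly equal.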

Usage

bin_quantile(
  x,
  n_bins = 4L,
  min_value = -Inf,
  max_value = Inf,
  decimals = 0L,
  digits = 10L,
  na.rm = FALSE
)

Arguments

x

A numeric vector to convert to a categorical factor vector

n_bins

(Integer) The number of "bins" (unique intervals) to be returned; default is 4

min_value

(Numeric) The lower bound of the lowest interval; default is -Inf (no floor)

max_value

(Numeric) The upper bound of the highest interval; default is Inf (no ceiling)

decimals

(Integer) The number of decimals to round the quantile values to before creating intervals; default is 0 (round to the nearest integer)

digits

(Integer) Number of digits to display in the console for each interval; this helps avoid scientific notation (default is 10)

na.rm

(Logical) Should NA and NaN (missing) values be removed from x before calculating the quantile bins? Default is FALSE.
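The digits and na.rm arguments are easier to picture by looking at the underlying base R behavior. The sketch below uses only cut() and quantile(); it assumes, without confirmation from the function's source, that digits plays a role similar to cut()'s `dig.lab` argument and that na.rm is forwarded to quantile(). The sample vectors `big` and `with_na` are made up for illustration.

```r
# Base R illustration only; the mapping to bin_quantile() arguments is assumed.

# With cut()'s default dig.lab = 3, large breaks are labelled in
# scientific notation; a larger dig.lab prints them in full
big <- c(250000, 1750000, 2900000)
lv_default <- levels(cut(big, breaks = c(0, 1500000, 3000000)))
lv_wide <- levels(cut(big, breaks = c(0, 1500000, 3000000), dig.lab = 10L))
print(lv_default)
print(lv_wide)

# quantile() stops on missing values unless na.rm = TRUE
with_na <- c(1, 2, 3, 4, NA)
res <- tryCatch(quantile(with_na), error = function(e) "error")
q_ok <- quantile(with_na, na.rm = TRUE, names = FALSE)
print(res)
print(q_ok)
```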

Value

A factor vector with the same length as x, in which each element is the interval containing the corresponding value in x
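Since the return value is a factor, it can be used directly as a categorical predictor in a model formula. The following sketch shows that workflow with base R only, using cut() as a stand-in for bin_quantile() and the built-in mtcars data; the choice of `am` as the binary outcome and `mpg` as the binned variable is purely illustrative and has nothing to do with credit scoring.

```r
# Illustrative only: cut() stands in for bin_quantile(), mtcars for real data.
dat <- mtcars

# Two interior quantile cut points give three bins
inner <- quantile(dat$mpg, probs = c(1 / 3, 2 / 3), names = FALSE)
breaks <- c(-Inf, unique(round(inner, 0L)), Inf)
dat$mpg_bin <- cut(dat$mpg, breaks = breaks, include.lowest = TRUE)

# The factor enters the logistic regression as dummy-coded levels
fit <- glm(am ~ mpg_bin, data = dat, family = binomial())
coef(fit)
```

Each non-reference interval gets its own coefficient, which is the form a scorecard typically builds on.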

Examples

# Use the function's defaults
bin_quantile(
  x = iris$Sepal.Width * 100
) |>
  levels()
#> [1] "[-Inf,280]" "(280,300]"  "(300,330]"  "(330, Inf]"

# Specify the number of bins and cap the lowest and highest intervals
# with explicit minimum and maximum values
bin_quantile(
  x = iris$Sepal.Width * 100,
  n_bins = 3,
  min_value = -5,
  max_value = 9999
) |>
  levels()
#> [1] "[-5,290]"   "(290,320]"  "(320,9999]"

# Handle small `x` values by rounding interval boundaries to two decimals
# instead of to the nearest whole number
bin_quantile(
  x = iris$Sepal.Width,
  decimals = 2
) |>
  levels()
#> [1] "[-Inf,2.8]" "(2.8,3]"    "(3,3.3]"    "(3.3, Inf]"

# Remove missing values in `x` before the binning calculation
bin_quantile(
  x = c(iris$Sepal.Width, NA),
  decimals = 2,
  na.rm = TRUE
) |>
  levels()
#> Warning: `bin_quantile()` produced some NA values for input values that were outside of the defined `min_value` or `max_value` specifications
#> [1] "[-Inf,2.8]" "(2.8,3]"    "(3,3.3]"    "(3.3, Inf]"