Tabulate mode estimates with the certainty about them

mode_df() takes a data frame (or another list) of numeric vectors and computes the mode or modes of each element. Where missing values make the true mode unknown, NAs are ignored one at a time until an estimate for the mode is found.

Each estimate is presented along with whether it is known to be the true mode, how many NAs had to be ignored during estimation, and the rates of ignored NAs.

Usage

mode_df(
  x,
  method = c("first", "all", "single"),
  na.rm.from = c("first", "last", "random"),
  accept = FALSE,
  multiple = c("NA", "min", "max", "mean", "median", "first", "last", "random")
)

Arguments

x

List of vectors. Each vector needs to be numeric or similar. Note that data frames are lists, so x can be a data frame.

method

String. How to determine the mode(s)? Options are:

"first" for mode_first(), the default.
"all" for mode_all(). This may return multiple values per estimate.
"single" for mode_single(). The only option that can return NA estimates.

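As a rough illustration of how the three options differ, here is a base R sketch on a vector without missing values. It does not call the package functions; the variable names are made up for this example, and the behavior shown for "single" assumes the default multiple = "NA".

```r
# Two values (3 and 8) are tied for the highest frequency
x <- c(3, 8, 8, 3, 5)

# Count each distinct value, keeping the order of first appearance
values <- unique(x)
counts <- tabulate(match(x, values))

# "all": every value that reaches the maximum frequency
all_modes <- values[counts == max(counts)]

# "first": only the mode that appears first in x
first_mode <- all_modes[1]

# "single": the mode if there is exactly one, else NA (assumed default)
single_mode <- if (length(all_modes) == 1L) all_modes else NA_real_

all_modes
#> [1] 3 8
first_mode
#> [1] 3
single_mode
#> [1] NA
```
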
na.rm.from

String. Only relevant to the default method = "first". Where to start when removing NAs from x? Options are "first", "last", and "random". Default is "first".
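
To make the positional meaning of these options concrete, the sketch below drops a given number of NAs from the start, the end, or random positions of a vector. drop_nas_sketch() is a hypothetical helper written for this illustration, using the option names from the Usage section; it is not part of the documented API and may not match the package's internals.

```r
# Hypothetical helper, not part of the documented API:
# drop `n` missing values from `x`, starting at the chosen end.
drop_nas_sketch <- function(x, n, from = c("first", "last", "random")) {
  from <- match.arg(from)
  na_pos <- which(is.na(x))
  # Never try to drop more NAs than there are
  n <- min(n, length(na_pos))
  if (n == 0L) {
    return(x)
  }
  drop <- switch(
    from,
    first = head(na_pos, n),
    last = tail(na_pos, n),
    # Index into na_pos to avoid sample()'s special case for a single number
    random = na_pos[sample.int(length(na_pos), n)]
  )
  x[-drop]
}

x <- c(NA, 7, 7, NA, 8, NA)
drop_nas_sketch(x, 2, from = "first")
#> [1]  7  7  8 NA
drop_nas_sketch(x, 2, from = "last")
#> [1] NA  7  7  8
```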

accept

Passed on to mode_first() and mode_single(). Default is FALSE.

multiple

Passed on to mode_single(). Default is "NA".

Value

Tibble (data frame) with these columns:

term: the names of x elements. Only present if any are named.
estimate: the modes of x elements, ignoring as many NAs as necessary. List-column if method = "all".
certainty: TRUE if the corresponding estimate is certain to be the true mode, and FALSE if this is unclear due to missing values.
na_ignored: the number of missing values that had to be ignored to arrive at the estimate.
na_total: the total number of missing values.
rate_ignored_na: the proportion of missing values that had to be ignored from among all missing values.
sum_total: the total number of values, missing or not.
rate_ignored_sum: the proportion of missing values that had to be ignored from among all values, missing or not.
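
The two rate columns follow from the count columns by simple division. The sketch below recomputes them in base R from the counts shown for my_list in the Examples section; it assumes the rates are plain ratios as described above, and the variable names merely mirror the column names.

```r
# Counts taken from the `my_list` example in the Examples section
na_ignored <- c(a = 0, b = 0, c = 2, d = 1)
na_total   <- c(a = 0, b = 1, c = 4, d = 1)
sum_total  <- c(a = 15, b = 3, c = 6, d = 4)

# Ignored NAs relative to all NAs.
# For `a`, this is 0 / 0, which R evaluates to NaN.
rate_ignored_na <- na_ignored / na_total

# Ignored NAs relative to all values, missing or not.
# For `d`, this is 1 / 4.
rate_ignored_sum <- na_ignored / sum_total
```

This is why rate_ignored_na is NaN for elements without any missing values, whereas rate_ignored_sum is 0 in that case.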

Details

The function deals with missing values (NAs) by first checking whether they make the true mode unknown. If they do, it removes one NA, then checks again; and so on until an estimate is found.

This strategy is based on the na.rm.amount argument of mode_first(), mode_all(), and mode_single(). It is a middle way between ignoring all NAs and not attempting an estimate at all: only the minimum number of NAs necessary is removed, because some distributions have a known mode (or set of modes) even if some of their values are missing. By keeping track of the removed NAs, mode_df() quantifies the uncertainty about its estimates.
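
The stepwise removal can be sketched in base R as follows. estimate_mode_sketch() is a hypothetical helper, not the package's implementation. It assumes a simplified criterion: the most frequent known value counts as a known mode once no other value could exceed its frequency, even if all remaining NAs stood for that other value. The package's actual checks may differ, for example with respect to the accept argument.

```r
# Hypothetical helper, not part of the documented API
estimate_mode_sketch <- function(x) {
  na_total <- sum(is.na(x))
  known <- x[!is.na(x)]
  if (length(known) == 0L) {
    # Nothing but NAs: no estimate is possible
    return(list(estimate = NA, certainty = FALSE,
                na_ignored = na_total, na_total = na_total))
  }
  # Frequencies of known values, in order of first appearance
  values <- unique(known)
  counts <- tabulate(match(known, values))
  top <- max(counts)
  # Second-highest frequency (equal to `top` if there is a tie)
  runner_up <- if (length(counts) > 1L) {
    sort(counts, decreasing = TRUE)[2L]
  } else {
    0L
  }
  # Ignore one NA at a time until the remaining NAs
  # could no longer lift another value above `top`
  na_ignored <- 0L
  while (top < runner_up + (na_total - na_ignored)) {
    na_ignored <- na_ignored + 1L
  }
  list(
    estimate = values[which.max(counts)],  # first value with the top count
    certainty = na_ignored == 0L,
    na_ignored = na_ignored,
    na_total = na_total
  )
}

str(estimate_mode_sketch(c(4, 4, NA, NA, NA, NA)))
```

Under this simplified criterion, c(4, 4, NA, NA, NA, NA) needs two of its four NAs ignored: with four or three NAs left, another value could still occur more often than 4, but with two left it could at most tie.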

Examples

# Use a list of numeric vectors:
my_list <- list(
  a = 1:15,
  b = c(1, 1, NA),
  c = c(4, 4, NA, NA, NA, NA),
  d = c(96, 24, 3, NA)
)

mode_df(my_list)
#> # A tibble: 4 × 8
#>   term  estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>    <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 a            1 TRUE               0        0           NaN          15
#> 2 b            1 TRUE               0        1             0           3
#> 3 c            4 FALSE              2        4             0.5         6
#> 4 d           96 FALSE              1        1             1           4
#> # ℹ 1 more variable: rate_ignored_sum <dbl>

# Data frames are allowed:
mode_df(iris[1:4])
#> # A tibble: 4 × 8
#>   term         estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>           <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 Sepal.Length      5   TRUE               0        0             NaN       150
#> 2 Sepal.Width       3   TRUE               0        0             NaN       150
#> 3 Petal.Length      1.4 TRUE               0        0             NaN       150
#> 4 Petal.Width       0.2 TRUE               0        0             NaN       150
#> # ℹ 1 more variable: rate_ignored_sum <dbl>