mode_df() takes a data frame (or another list) of numeric
vectors and computes the mode or modes of each element. Where the true mode
is unknown due to missing values, more and more NAs are ignored until an
estimate for the mode is found.
Estimates are presented along with information about whether they are known
to be the true mode, how many NAs had to be ignored during estimation,
the rate of ignored NAs, etc.
Arguments
- x
List of vectors. Each vector needs to be numeric or similar. Note that data frames are lists, so
x can be a data frame.
- method
String. How to determine the mode(s)? Options are:
  - "first" for mode_first(), the default.
  - "all" for mode_all(). This may return multiple values per estimate.
  - "single" for mode_single(). The only option that can return NA estimates.
- na.rm.from
String. Only relevant to the default
method = "first". Where to start when removing NAs from x? Options are "start", "end", and "random". Default is "start".
- accept
Passed on to
mode_first() and mode_single(). Default is FALSE.
- multiple
Passed on to
mode_single(). Default is "NA".
Value
Tibble (data frame) with these columns:
- term: the names of x elements. Only present if any are named.
- estimate: the modes of x elements, ignoring as many NAs as necessary. List-column if method = "all".
- certainty: TRUE if the corresponding estimate is certain to be the true mode, and FALSE if this is unclear due to missing values.
- na_ignored: the number of missing values that had to be ignored to arrive at the estimate.
- na_total: the total number of missing values.
- rate_ignored_na: the proportion of missing values that had to be ignored from among all missing values.
- sum_total: the total number of values, missing or not.
- rate_ignored_sum: the proportion of missing values that had to be ignored from among all values, missing or not.
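The two rate columns follow from the count columns by simple division. The snippet below is a small illustration in base R, assuming that the rates are computed exactly as the column descriptions say; the counts are those shown for element c = c(4, 4, NA, NA, NA, NA) in the Examples section.

```r
# Counts for c(4, 4, NA, NA, NA, NA), as shown in the Examples output
na_ignored <- 2L
na_total <- 4L
sum_total <- 6L

# Share of the missing values that had to be ignored
rate_ignored_na <- na_ignored / na_total    # 0.5

# Share of all values, missing or not, that had to be ignored
rate_ignored_sum <- na_ignored / sum_total  # 2 / 6, about 0.333

# Without any missing values, the first rate is 0 / 0, which R
# evaluates to NaN. This is why NaN appears for vectors without NAs.
rate_without_na <- 0L / 0L
```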
Details
The function deals with missing values (NAs) by first checking
whether they make the true mode unknown. If they do, it removes one NA,
then checks again; and so on until an estimate is found.
This strategy is based on the na.rm.amount argument of mode_first(),
mode_all(), and mode_single(). It represents a middle way between
simply ignoring all NAs and not even trying to compute an estimate:
it removes only the minimum number of NAs necessary, because
some distributions have a known mode (or set of modes) even if some of
their values are missing. By keeping track of the removed NAs,
mode_df() quantifies the uncertainty about its estimates.
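The remove-and-recheck loop can be sketched in base R. This is a simplified illustration, not the package's implementation: mode_sketch() is a hypothetical helper, and its rule for when an estimate counts as known is an assumption. The rule here is that the first-appearing most frequent value must stay at least tied for most frequent even if all remaining NAs equal its closest competitor. The positions of the NAs within the vector are not taken into account.

```r
# Hypothetical helper, not part of the package: estimate the
# first-appearing mode, ignoring one NA at a time until the estimate
# can no longer be overturned by the remaining NAs.
mode_sketch <- function(x) {
  na_total <- sum(is.na(x))
  known <- x[!is.na(x)]
  if (length(known) == 0L) {
    # Nothing to estimate from
    return(list(estimate = NA, certainty = FALSE,
                na_ignored = 0L, na_total = na_total))
  }
  # Frequencies of the known values, in order of first appearance
  values <- unique(known)
  counts <- tabulate(match(known, values))
  top <- max(counts)
  first_mode <- values[which.max(counts)]
  # Highest frequency among all other values (0 if there are none)
  runner_up <- max(c(0L, counts[values != first_mode]))
  # Assumed criterion: as long as the remaining NAs could lift the
  # runner-up above the candidate, ignore one more NA and check again.
  na_ignored <- 0L
  while (top < runner_up + (na_total - na_ignored)) {
    na_ignored <- na_ignored + 1L
  }
  list(estimate = first_mode, certainty = na_ignored == 0L,
       na_ignored = na_ignored, na_total = na_total)
}

mode_sketch(c(1, 1, NA))$na_ignored               # 0
mode_sketch(c(4, 4, NA, NA, NA, NA))$na_ignored   # 2
mode_sketch(c(96, 24, 3, NA))$estimate            # 96
```

Under this assumed criterion, the sketch ignores no NA for c(1, 1, NA) because a single unknown value cannot outnumber two known 1s, whereas c(96, 24, 3, NA) needs its only NA ignored because that value could turn 24 or 3 into the sole mode. Results for other inputs may differ from those of mode_df().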
Examples
# Use a list of numeric vectors:
my_list <- list(
a = 1:15,
b = c(1, 1, NA),
c = c(4, 4, NA, NA, NA, NA),
d = c(96, 24, 3, NA)
)
mode_df(my_list)
#> # A tibble: 4 × 8
#>   term  estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>    <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 a            1 TRUE               0        0             NaN        15
#> 2 b            1 TRUE               0        1             0           3
#> 3 c            4 FALSE              2        4             0.5         6
#> 4 d           96 FALSE              1        1             1           4
#> # ℹ 1 more variable: rate_ignored_sum <dbl>
# Data frames are allowed:
mode_df(iris[1:4])
#> # A tibble: 4 × 8
#>   term         estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>           <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 Sepal.Length      5   TRUE               0        0             NaN       150
#> 2 Sepal.Width       3   TRUE               0        0             NaN       150
#> 3 Petal.Length      1.4 TRUE               0        0             NaN       150
#> 4 Petal.Width       0.2 TRUE               0        0             NaN       150
#> # ℹ 1 more variable: rate_ignored_sum <dbl>