mode_df() takes a data frame (or another list) of numeric vectors and computes the mode or modes of each element. Where the true mode is unknown due to missing values, more and more NAs are ignored until an estimate for the mode is found.
Estimates are presented along with information about whether they are known to be the true mode, how many NAs had to be ignored during estimation, the rate of ignored NAs, etc.
Arguments
- x
List of vectors. Each vector needs to be numeric or similar. Note that data frames are lists, so x can be a data frame.
- method
String. How to determine the mode(s)? Options are:
"first" for mode_first(), the default.
"all" for mode_all(). This may return multiple values per estimate.
"single" for mode_single(). The only option that can return NA estimates.
- na.rm.from
String. Only relevant to the default method = "first". Where to start when removing NAs from x? Options are "start", "end", and "random". Default is "start".
- accept
Passed on to mode_first() and mode_single(). Default is FALSE.
- multiple
Passed on to mode_single(). Default is "NA".
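As a brief illustration of the method argument, the calls below apply each of the three options to the same list. This is only a sketch: it assumes that mode_df() comes from the moder package and that the package is installed, and my_list is a made-up input for this example. Output is not shown.

```r
# Assumption: mode_df() is provided by the moder package.
library(moder)

# Hypothetical input: `x2` has two modes among its known values
my_list <- list(
  x1 = c(7, 7, 8, NA),
  x2 = c(1, 1, 2, 2, NA)
)

# Default method, "first": one estimate per element, as in mode_first()
res_first <- mode_df(my_list)

# "all": every mode per element, so `estimate` is a list-column
res_all <- mode_df(my_list, method = "all")

# "single": the only method that can return NA estimates
res_single <- mode_df(my_list, method = "single")
```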
Value
Tibble (data frame) with these columns:
- term: the names of x elements. Only present if any are named.
- estimate: the modes of x elements, ignoring as many NAs as necessary. List-column if method = "all".
- certainty: TRUE if the corresponding estimate is certain to be the true mode, and FALSE if this is unclear due to missing values.
- na_ignored: the number of missing values that had to be ignored to arrive at the estimate.
- na_total: the total number of missing values.
- rate_ignored_na: the proportion of missing values that had to be ignored from among all missing values.
- sum_total: the total number of values, missing or not.
- rate_ignored_sum: the proportion of missing values that had to be ignored from among all values, missing or not.
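The two rate columns follow directly from the count columns. The base R lines below recompute them by hand, using the counts reported for element c in the Examples section (na_ignored = 2, na_total = 4, sum_total = 6). This is plain arithmetic based on the column descriptions above, not code taken from the function itself.

```r
# Counts as reported for element `c` in the Examples section
na_ignored <- 2L
na_total <- 4L
sum_total <- 6L

# Share of ignored NAs among all NAs
rate_ignored_na <- na_ignored / na_total

# Share of ignored NAs among all values, missing or not
rate_ignored_sum <- na_ignored / sum_total

# Without any missing values, the first rate is 0 / 0,
# which R evaluates to NaN
rate_no_na <- 0L / 0L
```

The last line explains why rate_ignored_na is NaN for elements that contain no NAs at all.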
Details
The function deals with missing values (NAs) by first checking whether they make the true mode unknown. If they do, it removes one NA, then checks again; and so on until an estimate is found.
This strategy is based on the na.rm.amount argument of mode_first(), mode_all(), and mode_single(). It represents a middle way between simply ignoring all NAs and not even trying to compute an estimate. Instead, it only removes the minimum number of NAs necessary, because some distributions have a known mode (or set of modes) even if some of their values are missing. By keeping track of the removed NAs, mode_df() quantifies the uncertainty about its estimates.
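The remove-and-recheck loop described above can be sketched in base R. The helper first_mode_sketch() below is hypothetical and not part of any package. It also relies on a simplifying assumption: an estimate counts as known once the remaining NAs are too few for any other value to become strictly more frequent. The package's own checks may be stricter, for example regarding the position of NAs or the accept argument, so results can differ from mode_df().

```r
# Hypothetical helper, not part of any package: estimate the
# first-appearing mode of `x` while ignoring as few NAs as possible.
first_mode_sketch <- function(x) {
  known <- x[!is.na(x)]
  na_total <- sum(is.na(x))

  # Frequency of each distinct known value, in order of first appearance
  values <- unique(known)
  freqs <- tabulate(match(known, values), nbins = length(values))

  # Lead of the most frequent value over the runner-up
  sorted <- sort(freqs, decreasing = TRUE)
  top <- if (length(sorted) >= 1L) sorted[1L] else 0L
  runner_up <- if (length(sorted) >= 2L) sorted[2L] else 0L
  lead <- top - runner_up

  # Simplifying assumption: while more NAs remain than the lead, they could
  # hide values that change the mode. Ignore one NA, then check again.
  na_ignored <- 0L
  while (na_total - na_ignored > lead) {
    na_ignored <- na_ignored + 1L
  }

  data.frame(
    estimate = if (length(values) >= 1L) values[which.max(freqs)] else NA,
    certainty = na_ignored == 0L,
    na_ignored = na_ignored,
    na_total = na_total
  )
}

# Element `c` from the Examples section
res_c <- first_mode_sketch(c(4, 4, NA, NA, NA, NA))
```

Here, 4 leads by two, so two of the four NAs can stay without putting the estimate in doubt and the other two have to be ignored. Because any NAs had to be ignored, certainty is FALSE.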
Examples
# Use a list of numeric vectors:
my_list <- list(
  a = 1:15,
  b = c(1, 1, NA),
  c = c(4, 4, NA, NA, NA, NA),
  d = c(96, 24, 3, NA)
)
mode_df(my_list)
#> # A tibble: 4 × 8
#>   term  estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>    <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 a            1 TRUE               0        0           NaN          15
#> 2 b            1 TRUE               0        1             0           3
#> 3 c            4 FALSE              2        4             0.5         6
#> 4 d           96 FALSE              1        1             1           4
#> # ℹ 1 more variable: rate_ignored_sum <dbl>
# Data frames are allowed:
mode_df(iris[1:4])
#> # A tibble: 4 × 8
#>   term         estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>           <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 Sepal.Length      5   TRUE               0        0             NaN       150
#> 2 Sepal.Width       3   TRUE               0        0             NaN       150
#> 3 Petal.Length      1.4 TRUE               0        0             NaN       150
#> 4 Petal.Width       0.2 TRUE               0        0             NaN       150
#> # ℹ 1 more variable: rate_ignored_sum <dbl>