Tabulate median estimates with the certainty about them

median_df() takes a data frame (or another list) of numeric vectors and computes the median of each element. Where missing values make the true median unknown, they are ignored one at a time until an estimate for the median is found.

Estimates are presented along with information about whether they are known to be the true median, how many missing values had to be ignored during estimation, the rate of ignored values, etc.

Usage

median_df(x, even = c("mean", "low", "high"), ...)

Arguments

x: List of vectors. Each vector needs to be numeric or similar. Note that data frames are lists, so x can be a data frame.
even: Passed on to median2().
...: Optional further arguments for median2() methods. Not used in its default method.

Value

Data frame with these columns:

term: the names of x elements. Only present if any are named.
estimate: the medians of x elements, ignoring as many NAs as necessary.
certainty: TRUE if the corresponding estimate is certain to be the true median, and FALSE if this is unclear due to missing values.
na_ignored: the number of missing values that had to be ignored to arrive at the estimate.
na_total: the total number of missing values.
rate_ignored_na: the proportion of missing values that had to be ignored from among all missing values.
sum_total: the total number of values, missing or not.
rate_ignored_sum: the proportion of missing values that had to be ignored from among all values, missing or not.
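
As a rough illustration of how the two rate columns relate to the count columns, the sketch below assumes they are plain ratios. That assumption is not confirmed by the package source, but it matches the column descriptions and the NaN shown in the Examples when there are no missing values at all. The counts are copied from rows a, c, and d of the first example.

```r
# Counts taken from rows a, c, and d of the first example below
na_ignored <- c(0, 3, 1)
na_total   <- c(0, 4, 1)
sum_total  <- c(15, 6, 4)

# Assumption: each rate is a simple ratio of the counts
rate_ignored_na  <- na_ignored / na_total   # 0 / 0 is NaN in R
rate_ignored_sum <- na_ignored / sum_total

rate_ignored_na
rate_ignored_sum
```

Under this reading, rate_ignored_na is NaN whenever na_total is 0, while rate_ignored_sum is 0 in that case because sum_total is positive.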

Details

The function deals with missing values (NAs) by first checking whether they make the true median unknown. If they do, it removes one NA, then checks again; and so on until an estimate is found.

This strategy is based on median2() and its na.rm.amount argument, which represents a middle way between simply ignoring all NAs and not even trying to compute an estimate. Only the minimum number of NAs necessary is removed, because some distributions have a known median even if some of their values are missing. By keeping track of the removed NAs, median_df() quantifies the uncertainty about its estimates.
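
The stepwise strategy can be sketched in base R. This is a minimal illustration, not the package's implementation: median_if_known() and median_stepwise() are hypothetical helpers, and the check assumes the behavior of even = "mean". The idea is that a median is known if it comes out the same whether all NAs are smaller than every known value or larger than every known value.

```r
# Hypothetical helper: return the median if it is the same for every
# possible value of the NAs, and NA otherwise.
median_if_known <- function(x) {
  known <- sort(x[!is.na(x)])
  n <- length(x)
  k <- n - length(known)  # number of NAs
  if (k == 0) return(median(known))
  # Central position(s) of the full sorted vector
  pos <- if (n %% 2 == 1) (n + 1) / 2 else c(n / 2, n / 2 + 1)
  # If a central position can fall on an NA, the median is unknown
  if (any(pos - k < 1) || any(pos > length(known))) return(NA_real_)
  low  <- mean(known[pos - k])  # all NAs below the known values
  high <- mean(known[pos])      # all NAs above the known values
  if (low == high) low else NA_real_
}

# Hypothetical helper: ignore one more NA per step until the
# median of the remaining values is known.
median_stepwise <- function(x) {
  na_total <- sum(is.na(x))
  known <- x[!is.na(x)]
  for (na_ignored in 0:na_total) {
    candidate <- c(known, rep(NA, na_total - na_ignored))
    estimate <- median_if_known(candidate)
    if (!is.na(estimate)) break
  }
  list(
    estimate = estimate,
    certainty = !is.na(estimate) && na_ignored == 0,
    na_ignored = na_ignored,
    na_total = na_total
  )
}

median_stepwise(c(4, 4, NA, NA, NA, NA))  # needs 3 of 4 NAs ignored
median_stepwise(c(96, 24, 3, NA))         # needs the single NA ignored
median_stepwise(c(6, 7, 7, 7, NA))        # median known despite the NA
```

In the last call, the middle value is 7 wherever the missing value falls, so no NA needs to be ignored and the estimate is certain.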

Examples

# Use a list of numeric vectors:
my_list <- list(
  a = 1:15,
  b = c(1, 1, NA),
  c = c(4, 4, NA, NA, NA, NA),
  d = c(96, 24, 3, NA)
)

median_df(my_list)
#> # A tibble: 4 × 8
#>   term  estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>    <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 a            8 TRUE               0        0          NaN           15
#> 2 b            1 TRUE               0        1            0            3
#> 3 c            4 FALSE              3        4            0.75         6
#> 4 d           24 FALSE              1        1            1            4
#> # ℹ 1 more variable: rate_ignored_sum <dbl>

# Data frames are allowed:
median_df(iris[1:4])
#> # A tibble: 4 × 8
#>   term         estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>           <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 Sepal.Length     5.8  TRUE               0        0             NaN       150
#> 2 Sepal.Width      3    TRUE               0        0             NaN       150
#> 3 Petal.Length     4.35 TRUE               0        0             NaN       150
#> 4 Petal.Width      1.3  TRUE               0        0             NaN       150
#> # ℹ 1 more variable: rate_ignored_sum <dbl>