median_df() takes a data frame (or another list) of numeric vectors and computes the median of each element. Where the true median is unknown due to missing values, more and more missing values are ignored until an estimate for the median is found.
Estimates are presented along with information about whether they are known to be the true median, how many missing values had to be ignored during estimation, and the rates of ignored values.
Usage
median_df(x, even = c("mean", "low", "high"), ...)
Value
Data frame with these columns:
term: the names of x elements. Only present if any are named.
estimate: the medians of x elements, ignoring as many NAs as necessary.
certainty: TRUE if the corresponding estimate is certain to be the true median, and FALSE if this is unclear due to missing values.
na_ignored: the number of missing values that had to be ignored to arrive at the estimate.
na_total: the total number of missing values.
rate_ignored_na: the proportion of missing values that had to be ignored from among all missing values.
sum_total: the total number of values, missing or not.
rate_ignored_sum: the proportion of missing values that had to be ignored from among all values, missing or not.
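As a rough illustration of how the two rate columns relate to the count columns, the arithmetic below reuses the counts printed for elements a and c in the first example under Examples. That the rates are plain ratios is an assumption inferred from the column descriptions and the printed output, not taken from the package source.

```r
# Counts copied from the documented example output (elements a and c)
na_ignored <- c(a = 0L, c = 3L)
na_total   <- c(a = 0L, c = 4L)
sum_total  <- c(a = 15L, c = 6L)

# Assumed definitions, inferred from the column descriptions
rate_ignored_na  <- na_ignored / na_total
rate_ignored_sum <- na_ignored / sum_total

rate_ignored_na   # a: NaN (0 / 0), c: 0.75
rate_ignored_sum  # a: 0, c: 0.5
```

Under this reading, rate_ignored_na is NaN whenever an element has no missing values at all, because 0 / 0 is NaN in R.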
Details
The function deals with missing values (NAs) by first checking whether they make the true median unknown. If they do, it removes one NA, then checks again; and so on until an estimate is found.
This strategy is based on median2() and its na.rm.amount argument, which represents a middle way between simply ignoring all NAs and not even trying to compute an estimate. Instead, it only removes the minimum number of NAs necessary, because some distributions have a known median even if some of their values are missing. By keeping track of the removed NAs, median_df() quantifies the uncertainty about its estimates.
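The check-and-remove strategy described above can be sketched in base R. This is a minimal illustration, not the package's implementation: median_is_known() and median_sketch() are hypothetical names, the "is the median known?" rule is one possible reading of the description, and the sketch relies on base R's median(), which averages the two middle values of an even-length vector. It may not reproduce median_df() output in every case.

```r
# Hypothetical helper: with `n_na` unknown values (n_na >= 1) added to the
# sorted known values, is the median the same wherever the unknowns fall?
median_is_known <- function(known_sorted, n_na) {
  m  <- length(known_sorted)
  n  <- m + n_na
  lo <- floor((n + 1) / 2)    # lower middle position in the full vector
  hi <- ceiling((n + 1) / 2)  # upper middle position (equals `lo` if n is odd)
  # The unknowns might all lie below or all lie above the known values, so
  # the known values at positions (lo - n_na) through hi must be identical.
  lo - n_na >= 1 && hi <= m && known_sorted[lo - n_na] == known_sorted[hi]
}

# Hypothetical sketch of the loop: drop one NA at a time until the median
# of what remains is known, and count how many NAs were dropped.
median_sketch <- function(x) {
  known    <- sort(x[!is.na(x)])
  na_total <- sum(is.na(x))
  na_left  <- na_total
  while (na_left > 0L && !median_is_known(known, na_left)) {
    na_left <- na_left - 1L
  }
  na_ignored <- na_total - na_left
  list(
    # With no known values at all, there is nothing to estimate from
    estimate   = if (length(known) > 0L) median(known) else NA_real_,
    certainty  = na_ignored == 0L && length(known) > 0L,
    na_ignored = na_ignored
  )
}

median_sketch(c(7, 7, 7, NA))    # estimate 7, certainty TRUE, na_ignored 0
median_sketch(c(96, 24, 3, NA))  # estimate 24, certainty FALSE, na_ignored 1
```

In the first call, the median is 7 no matter what the missing value is, so no NA needs to be ignored. In the second, the missing value could shift the median, so one NA is dropped and the estimate is flagged as uncertain.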
Examples
# Use a list of numeric vectors:
my_list <- list(
  a = 1:15,
  b = c(1, 1, NA),
  c = c(4, 4, NA, NA, NA, NA),
  d = c(96, 24, 3, NA)
)
median_df(my_list)
#> # A tibble: 4 × 8
#>   term  estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>    <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 a            8 TRUE               0        0          NaN           15
#> 2 b            1 FALSE              1        1            1            3
#> 3 c            4 FALSE              3        4            0.75         6
#> 4 d           24 FALSE              1        1            1            4
#> # ℹ 1 more variable: rate_ignored_sum <dbl>
# Data frames are allowed:
median_df(iris[1:4])
#> # A tibble: 4 × 8
#>   term         estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>           <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 Sepal.Length     5.8  TRUE               0        0             NaN       150
#> 2 Sepal.Width      3    TRUE               0        0             NaN       150
#> 3 Petal.Length     4.35 TRUE               0        0             NaN       150
#> 4 Petal.Width      1.3  TRUE               0        0             NaN       150
#> # ℹ 1 more variable: rate_ignored_sum <dbl>