median_df() takes a data frame (or another list) of numeric
vectors and computes the median of each element. Where the true median is
unknown due to missing values, NAs are ignored one at a time until an
estimate for the median is found.
Estimates are presented along with information about whether they are known to be the true median, how many missing values had to be ignored during estimation, the rate of ignored values, etc.
Usage
median_df(x, even = c("mean", "low", "high"), ...)
Value
Data frame with these columns:
term: the names of x elements. Only present if any are named.
estimate: the medians of x elements, ignoring as many NAs as necessary.
certainty: TRUE if the corresponding estimate is certain to be the true median, and FALSE if this is unclear due to missing values.
na_ignored: the number of missing values that had to be ignored to arrive at the estimate.
na_total: the total number of missing values.
rate_ignored_na: the proportion of missing values that had to be ignored from among all missing values.
sum_total: the total number of values, missing or not.
rate_ignored_sum: the proportion of missing values that had to be ignored from among all values, missing or not.
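As a worked illustration of the two rate columns, the snippet below recomputes them by hand for element c of the list used in the Examples section. It assumes the rates are plain ratios of the count columns, which is what the definitions above describe; the variable names are only for illustration.

```r
# Counts for c(4, 4, NA, NA, NA, NA), taken from the example output
na_ignored <- 3L  # NAs that had to be ignored
na_total   <- 4L  # all NAs in the vector
sum_total  <- 6L  # all values, missing or not

# Assumed to be simple ratios, following the column definitions
rate_ignored_na  <- na_ignored / na_total   # 0.75
rate_ignored_sum <- na_ignored / sum_total  # 0.5

# A vector without any NAs has na_ignored = 0 and na_total = 0,
# so the first rate is 0 / 0, which is NaN in R
no_na_rate <- 0L / 0L
```

This is why rate_ignored_na shows NaN for elements that have no missing values at all.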
Details
The function deals with missing values (NAs) by first checking
whether they make the true median unknown. If they do, it removes one NA,
then checks again; and so on until an estimate is found.
This strategy is based on median2() and its na.rm.amount argument,
which represents a middle way between simply ignoring all NAs and not
even trying to compute an estimate. It removes only the minimum
number of NAs necessary, because some distributions have a known median
even if some of their values are missing. By keeping track of the removed
NAs, median_df() quantifies the uncertainty about its estimates.
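The remove-and-recheck loop can be sketched in base R. This is a minimal illustration, not the package's implementation: both helper names are hypothetical, and the check used here (comparing the median when all NAs are replaced by -Inf against the median when all are replaced by Inf) is an assumption. median2() may apply a different criterion, so its counts can differ from this sketch for some inputs.

```r
# Hypothetical helper: return the median if it is the same no matter
# what the NAs stand for, and NA otherwise.
median_if_known <- function(x) {
  known <- x[!is.na(x)]
  n_na <- sum(is.na(x))
  # The median never decreases when a value increases, so the two
  # extreme cases bound every other possible case.
  lowest  <- median(c(rep(-Inf, n_na), known))
  highest <- median(c(known, rep(Inf, n_na)))
  if (isTRUE(lowest == highest)) lowest else NA_real_
}

# Hypothetical wrapper: ignore one more NA per round until the
# median of the remaining values is known.
median_min_na_removed <- function(x) {
  known <- x[!is.na(x)]
  na_total <- sum(is.na(x))
  for (na_ignored in 0:na_total) {
    candidate <- c(known, rep(NA, na_total - na_ignored))
    estimate <- median_if_known(candidate)
    if (!is.na(estimate)) break
  }
  list(estimate = estimate, na_ignored = na_ignored, na_total = na_total)
}

median_min_na_removed(c(4, 4, NA, NA, NA, NA))  # estimate 4, 3 of 4 NAs ignored
median_min_na_removed(c(96, 24, 3, NA))         # estimate 24, 1 of 1 NAs ignored
median_min_na_removed(1:15)                     # estimate 8, nothing ignored
```

Tracking na_ignored next to na_total is what makes it possible to report how much of the missing data an estimate depends on.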
Examples
# Use a list of numeric vectors:
my_list <- list(
  a = 1:15,
  b = c(1, 1, NA),
  c = c(4, 4, NA, NA, NA, NA),
  d = c(96, 24, 3, NA)
)
median_df(my_list)
#> # A tibble: 4 × 8
#>   term  estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>    <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 a            8 TRUE               0        0          NaN           15
#> 2 b            1 FALSE              1        1            1            3
#> 3 c            4 FALSE              3        4            0.75         6
#> 4 d           24 FALSE              1        1            1            4
#> # ℹ 1 more variable: rate_ignored_sum <dbl>
# Data frames are allowed:
median_df(iris[1:4])
#> # A tibble: 4 × 8
#>   term         estimate certainty na_ignored na_total rate_ignored_na sum_total
#>   <chr>           <dbl> <lgl>          <int>    <int>           <dbl>     <int>
#> 1 Sepal.Length     5.8  TRUE               0        0             NaN       150
#> 2 Sepal.Width      3    TRUE               0        0             NaN       150
#> 3 Petal.Length     4.35 TRUE               0        0             NaN       150
#> 4 Petal.Width      1.3  TRUE               0        0             NaN       150
#> # ℹ 1 more variable: rate_ignored_sum <dbl>