Follow these steps to implement naidem’s median algorithm in another language. It is tailored to numeric distributions in which some values are missing. For a proof, see Proving the algorithm.
This article uses pseudo-code to present two distinct algorithms for
distributions of odd and even length (i.e., number of values). The R
implementation, i.e., the default method for median2()
,
condenses both into one. That is because vectorization in R allows for
code that deemphasizes the difference in length. Other languages may
need more explicit distinctions. However, it is of course possible to
combine the two algorithms into a single one that first checks whether
the length is even or odd.
Below, x
is the input distribution.
length(d)
computes the number of values in a distribution
d
, sort(d)
is a sorting algorithm,
keep_missing(d)
returns all missing values in
d
(but no known values), and keep_known(d)
is
the opposite. NA
denotes a single missing value. Unlike R,
these algorithms use zero-indexing. Step 1 of the algorithm for odd
lengths is therefore equivalent to integer division for such
distributions.
For odd-length distributions
# 1. Determine the central index of `x`.
half := length(x) / 2 - 0.5
# 2. Count the missing values in `x`.
nna := length(keep_missing(x))
# 3. If the offset index is negative,
# the median cannot be determined.
if nna > half:
return NA
# 4. Remove all missing values from `x`, then sort `x`.
x := sort(keep_known(x))
# 5. Check values at two specific indices for equality.
# If they are equal, return the central `x` value, i.e.,
# the median. If they are not equal, return a missing value.
if x[half - nna] == x[half]:
return x[half]
else:
return NA
For even-length distributions
# 1. Determine the upper central index of `x`.
half_2 := length(x) / 2
# 2. Determine the lower central index.
half_1 := half_2 - 1
# 3. Count the missing values in `x`.
nna := length(keep_missing(x))
# 4. Neither offset index can be negative.
# If any is, the median cannot be determined.
if nna > half_1:
return NA
# 5. Remove all missing values from `x`, then sort `x`.
x := sort(keep_known(x))
# 6. Check whether both pairs are equal.
# If they are, return the mean of the central
# values (which is equal to both members of the pair).
# If one or both are not equal, return a missing value.
if x[half_1 - nna] == x[half_1] && x[half_2 - nna] == x[half_2]:
return x[half_1]
else:
return NA
Explanation
Missing values may or may not make it impossible to determine the median. Implementations of the median should check whether they do for any given distribution, instead of simply assuming that they do every time.
The algorithms check whether the putative median can be changed by a shift due to missing values. If so, the median depends on the position of the missing values in the sorted distribution — and thus, on their values. Since these are unknown, the median cannot be determined. However, if the median remains the same after the shift, it is indifferent to the missing values: the median is identical for all possible values behind the missing ones. See Proving the algorithm for details.
Subtraction is used for the shift because x
is still
indexed at half
(the midpoint of the input x
)
even though all missing values were already removed from x
by that point. Therefore, x[half]
is not actually the
midpoint of x
. The maximum possible number of steps by
which missing values can shift the median is nna
, so this
number is subtracted.
It is generally easier to determine the median for distributions of
an odd length. The same is true when missing values are present. The
even-length algorithm operates with two central indices instead of one
because even-length distributions don’t have a single central value. The
algorithm needs to compare x
at each of these indices with
the respective index offset by the number of missing values. In this
way, it effectively checks whether the pair of x
values at
the central indices in the offset case is equal to the pair without the
offset. If so, the two values are the same, so there is no need to
compute the average.
The algorithms return NA
if any offset central index is
negative, i.e., if half(_1) - nna < 0
. (This is step 3
for odd lengths and step 4 for even lengths. The pseudo-code in the
algorithm is more efficient.) In the even-length case, only
half_1
needs to be tested because it is less than
half_2
, so it is negative whenever half_2
is.
Why return NA
here? The median is never known in such
cases, and indices must not be negative when testing for equality in the
final condition. Negative indices are not well defined, and programming
languages such as R and Python would handle them in unintended ways.
As an aside, median2()
has an even
argument
that allows users to opt for the value just above or below the median if
the length of the distribution is even. This has no effect if
x
contains any missing values: If the median can be
determined in this case, it is equal to both values around it.
R details
The R implementation in naidem is more succinct than the pseudo-code,
combining both partial algorithms into one. This plays into the
strengths of R, but it does make the code look a little different from
the above. Again, note that the pseudo-code uses zero-indexing but R
uses one-indexing. This is why any(nna + 1L > half)
checks for non-positive indices in general, not just for negative
ones.
The final condition is wrapped into isTRUE(all(.))
, so
that all()
reduces two logical test results to one if the
input has an even length. If any comparison between the possible median
values returns either NA
or FALSE
, the
isTRUE()
wrapping returns FALSE
and the
function returns NA
because the median cannot be determined
with certainty.
isTRUE()
will only return TRUE
if both
comparisons do. Recall that the half
object is either
length 1 (the index of the median of an odd-length distribution) or 2
(the indices of the two values right above and below the median of an
even-length distribution). If the median can be determined in the
even-length scenario, it is equal
to both values around it. This is why the function returns
x[half[1L]]
, i.e., the first of these two values. For an
odd-length distribution, the index is redundant because
half[1L]
equals half
. Thus,
x[half[1L]]
is the correct return value in both cases in
which the median can be determined.