Follow these steps to implement naidem’s median algorithm in another language. It is tailored to numeric arrays in which some values are missing. For a proof, see Proving the algorithm.
This article uses pseudo-code to present two distinct sub-algorithms for arrays of odd and even length (i.e., number of values). Then, it explains the algorithm as a whole. Finally, it discusses the R implementation in naidem.
Below, x
is the input array. length(a)
computes the number of values in an array a
,
sort(a)
is a sorting algorithm, and
keep_known(a)
returns all known values in a
(but no missing values). NA
denotes a single missing value.
Unlike R, this algorithm uses zero-indexing. Step 4 of the sub-algorithm
for odd lengths is therefore equivalent to integer division for such
arrays.
Median pseudo-code
# 1. Count the elements of `x`.
n := length(x)
# 2. Sort `x` after removing missing values from it.
x := sort(keep_known(x))
# 3. Infer the number of missing values.
nna := n - length(x)
# For odd-length arrays:
if n mod 2 == 1:
# 4. Determine the central index of `x` (including missing values).
half := n / 2 - 0.5
# 5. If the missing values could potentially form a block
# that starts at one end and extends into the central position,
# the median is uncertain.
if nna > half:
return NA
# 6. Check values at two specific indices for equality.
# If they are equal, return the central `x` value,
# which is the median in this case.
if x[half - nna] == x[half]:
return x[half]
# 7. If they are not equal, return a missing value.
return NA
# For even-length arrays:
else:
# 4a. Determine the upper central index of `x`.
half_2 := n / 2
# 4b. Determine the lower central index.
half_1 := half_2 - 1
# 5. If the missing values could potentially form a block
# that starts at one end and extends into either central position,
# the median is uncertain.
if nna > half_1:
return NA
# 6. Check whether both pairs are equal.
# If they are, return the mean of the two central
# values (which is equal to each member of the pair).
if x[half_1 - nna] == x[half_1] && x[half_2 - nna] == x[half_2]:
return x[half_1]
# 7. If one or both are not equal, return a missing value.
return NA
Explanation
Missing values may or may not make it impossible to determine the median. Implementations of the median should check whether they do for any given input, rather than simply assuming that they do every time.
The algorithm checks whether the putative median can be changed by a shift due to missing values. If so, the median depends on the positions of the missing elements within the sorted array — and thus, on their values. Since these are unknown, the median cannot be determined. However, if the median remains the same after the shift, it is indifferent to the missing values: the median is identical for all possible values behind the missing ones. See Proving the algorithm for details.
Subtraction is used for the shift because x
is still
indexed at half
(the midpoint of the input x
)
even though all missing values were already removed from x
by that point. Therefore, x[half]
is not necessarily the
midpoint of x
. The maximum possible number of steps by
which missing values can shift the median is the total number of missing
values (nna
), so this quantity is subtracted.
In practice, it might be important to use a “safe” comparison in step
6. Exact equality is too high of a bar for floating-point numbers
because there can be spurious differences between them. For example,
0.1 + 0.2 == 0.3
is false for deep technical reasons that
have no place in this context, so we need functions that disregard such
artifacts. However, this problem does not occur when comparing integers,
so testing for exact equality is safe in this case.
Median algorithms are generally more simple for arrays of an odd
length. This includes cases where missing values are present. The
even-length sub-algorithm operates with two central indices instead of
one because even-length arrays do not have a single central value. The
algorithm needs to compare x
at each of these indices with
the respective value offset by nna
. In this way, it
effectively checks whether the pair of x
values at the
central indices in the offset case is equal to the pair without the
offset. If so, the two values are the same, so there is no need to
compute the average, and x[half_1]
is returned.
In step 5, both sub-algorithms return NA
if
nna
is greater than half
or
half_1
, respectively. In the even-length case, only
half_1
needs to be tested because it is less than
half_2
, so it is less than nna
whenever
half_2
is. Why return NA
here? Imagine an
uninterrupted sequence of missing values at the start or end of the
array. If it extends into the central position(s), at least one central
value is missing. This case cannot be ruled out, so the median is
unknown. Furthermore, not returning here would allow for negative
indices in step 6.
As an aside, median2()
has an even
argument
that allows users to opt for the value just above or below the median if
the length of the array is even. This has no effect if x
contains any missing values: if the median can be determined in this
case, it is equal to both values around it.
R details
The R implementation in naidem::median2()
is more
succinct than the pseudo-code, condensing much of the two sub-algorithms
into one. This is because vectorization in R allows for code that
de-emphasizes the difference in length. Such elegant code plays into the
strengths of R, but it does look a little different from the above. R
connoisseurs are asked for patience with more pedestrian languages that
might need more explicit instructions.
Again, note that the pseudo-code uses zero-indexing but R uses
one-indexing. This is why nna >= half
in
median2()
casts a broader net than
nna > half
here.
The offset comparison near(x[half - nna], x[half])
is
wrapped into isTRUE(all(.))
, so that all()
reduces two logical test results to one if the input has an even length.
(More on near()
below.) If any comparison between the
possible median values returns either NA
or
FALSE
, the isTRUE()
wrapping returns
FALSE
and the function returns NA
because the
median cannot be determined with certainty.
isTRUE()
will only return TRUE
if all
comparisons do. Their number is either 1 or 2: recall that the
half
object is either length 1 (the index of the median of
an odd-length vector) or 2 (the indices of the two values right above
and below the median of an even-length vector). If the median can be
determined in the even-length scenario, it is equal
to both values around it. This is why the function returns
x[half[1L]]
, i.e., the first of these two values. For an
odd-length vector, the index is redundant because half[1L]
equals half
. Thus, x[half[1L]]
is the correct
return value whenever the median can be determined, regardless of the
length of the vector.
Note the use of near()
, which was adapted from
dplyr::near()
but without depending on dplyr. Consistent
with the explanation above, near()
compares two numeric
vectors while disregarding any spurious floating-point differences that
==
would take seriously (see also R for
Data Science on this point). So
near(0.1 + 0.2, 0.3)
is TRUE
even though
0.1 + 0.2 == 0.3
is FALSE
. Nevertheless,
==
is used elsewhere in median2()
when
comparing integers, where there is no such problem.