Skip to contents

Follow these steps to implement naidem’s median algorithm in another language. It is tailored to numeric distributions in which some values are missing. For a proof, see Proving the algorithm.

This article uses pseudo-code to present two distinct algorithms for distributions of odd and even length (i.e., number of values). The R implementation, i.e., the default method for median2(), condenses both into one. That is because vectorization in R allows for code that deemphasizes the difference in length. Other languages may need more explicit distinctions. However, it is of course possible to combine the two algorithms into a single one that first checks whether the length is even or odd.

Below, x is the input distribution. length(d) computes the number of values in a distribution d, sort(d) is a sorting algorithm, keep_missing(d) returns all missing values in d (but no known values), and keep_known(d) is the opposite. NA denotes a single missing value. Unlike R, these algorithms use zero-indexing. Step 1 of the algorithm for odd lengths is therefore equivalent to integer division for such distributions.

For odd-length distributions

# 1. Determine the central index of `x`.
half := length(x) / 2 - 0.5

# 2. Count the missing values in `x`.
nna := length(keep_missing(x))

# 3. If the offset index is negative,
# the median cannot be determined.
if nna > half:
   return NA

# 4. Remove all missing values from `x`, then sort `x`.
x := sort(keep_known(x))

# 5. Check values at two specific indices for equality.
# If they are equal, return the central `x` value, i.e.,
# the median. If they are not equal, return a missing value.
if x[half - nna] == x[half]:
   return x[half]
else:
   return NA

For even-length distributions

# 1. Determine the upper central index of `x`.
half_2 := length(x) / 2

# 2. Determine the lower central index.
half_1 := half_2 - 1

# 3. Count the missing values in `x`.
nna := length(keep_missing(x))

# 4. Neither offset index can be negative.
# If any is, the median cannot be determined.
if nna > half_1:
   return NA

# 5. Remove all missing values from `x`, then sort `x`.
x := sort(keep_known(x))

# 6. Check whether both pairs are equal.
# If they are, return the mean of the central
# values (which is equal to both members of the pair).
# If one or both are not equal, return a missing value.
if x[half_1 - nna] == x[half_1] && x[half_2 - nna] == x[half_2]:
   return x[half_1]
else:
   return NA

Explanation

Missing values may or may not make it impossible to determine the median. Implementations of the median should check whether they do for any given distribution, instead of simply assuming that they do every time.

The algorithms check whether the putative median can be changed by a shift due to missing values. If so, the median depends on the position of the missing values in the sorted distribution — and thus, on their values. Since these are unknown, the median cannot be determined. However, if the median remains the same after the shift, it is indifferent to the missing values: the median is identical for all possible values behind the missing ones. See Proving the algorithm for details.

Subtraction is used for the shift because x is still indexed at half (the midpoint of the input x) even though all missing values were already removed from x by that point. Therefore, x[half] is not actually the midpoint of x. The maximum possible number of steps by which missing values can shift the median is nna, so this number is subtracted.

It is generally easier to determine the median for distributions of an odd length. The same is true when missing values are present. The even-length algorithm operates with two central indices instead of one because even-length distributions don’t have a single central value. The algorithm needs to compare x at each of these indices with the respective index offset by the number of missing values. In this way, it effectively checks whether the pair of x values at the central indices in the offset case is equal to the pair without the offset. If so, the two values are the same, so there is no need to compute the average.

The algorithms return NA if any offset central index is negative, i.e., if half(_1) - nna < 0. (This is step 3 for odd lengths and step 4 for even lengths. The pseudo-code in the algorithm is more efficient.) In the even-length case, only half_1 needs to be tested because it is less than half_2, so it is negative whenever half_2 is. Why return NA here? The median is never known in such cases, and indices must not be negative when testing for equality in the final condition. Negative indices are not well defined, and programming languages such as R and Python would handle them in unintended ways.

As an aside, median2() has an even argument that allows users to opt for the value just above or below the median if the length of the distribution is even. This has no effect if x contains any missing values: If the median can be determined in this case, it is equal to both values around it.

R details

The R implementation in naidem is more succinct than the pseudo-code, combining both partial algorithms into one. This plays into the strengths of R, but it does make the code look a little different from the above. Again, note that the pseudo-code uses zero-indexing but R uses one-indexing. This is why any(nna + 1L > half) checks for non-positive indices in general, not just for negative ones.

The final condition is wrapped into isTRUE(all(.)), so that all() reduces two logical test results to one if the input has an even length. If any comparison between the possible median values returns either NA or FALSE, the isTRUE() wrapping returns FALSE and the function returns NA because the median cannot be determined with certainty.

isTRUE() will only return TRUE if both comparisons do. Recall that the half object is either length 1 (the index of the median of an odd-length distribution) or 2 (the indices of the two values right above and below the median of an even-length distribution). If the median can be determined in the even-length scenario, it is equal to both values around it. This is why the function returns x[half[1L]], i.e., the first of these two values. For an odd-length distribution, the index is redundant because half[1L] equals half. Thus, x[half[1L]] is the correct return value in both cases in which the median can be determined.