Implementing the algorithm • naidem

Follow these steps to implement naidem’s median algorithm in another language. It is tailored to numeric arrays in which some values are missing. For a proof, see Proving the algorithm.

This article uses pseudo-code to present two distinct sub-algorithms for arrays of odd and even length (i.e., number of values). Then, it explains the algorithm as a whole. Finally, it discusses the R implementation in naidem.

Below, x is the input array. length(a) computes the number of values in an array a, sort(a) is a sorting algorithm, and keep_known(a) returns all known values in a (but no missing values). NA denotes a single missing value. Unlike R, this algorithm uses zero-indexing. Step 4 of the sub-algorithm for odd lengths is therefore equivalent to integer division for such arrays.

Median pseudo-code

# 1. Count the elements of `x`.
n := length(x)

# 2. Sort `x` after removing missing values from it.
x := sort(keep_known(x))

# 3. Infer the number of missing values.
nna := n - length(x)

# For odd-length arrays:
if n mod 2 == 1:
   
   # 4. Determine the central index of `x` (including missing values).
   half := n / 2 - 0.5
   
   # 5. If the missing values could potentially form a block
   # that starts at one end and extends into the central position,
   # the median is uncertain.
   if nna > half:
      return NA
   
   # 6. Check values at two specific indices for equality.
   # If they are equal, return the central `x` value,
   # which is the median in this case.
   if x[half - nna] == x[half]:
      return x[half]
   
   # 7. If they are not equal, return a missing value.
   return NA
   
# For even-length arrays:
else:
   
   # 4a. Determine the upper central index of `x`.
   half_2 := n / 2
   
   # 4b. Determine the lower central index.
   half_1 := half_2 - 1
   
   # 5. If the missing values could potentially form a block
   # that starts at one end and extends into either central position,
   # the median is uncertain.
   if nna > half_1:
      return NA
   
   # 6. Check whether both pairs are equal.
   # If they are, return the mean of the two central
   # values (which is equal to each member of the pair).
   if x[half_1 - nna] == x[half_1] && x[half_2 - nna] == x[half_2]:
      return x[half_1]
   
   # 7. If one or both are not equal, return a missing value.
   return NA

Explanation

Missing values may or may not make it impossible to determine the median. Implementations of the median should check whether they do for any given input, rather than simply assuming that they do every time.

The algorithm checks whether the putative median can be changed by a shift due to missing values. If so, the median depends on the positions of the missing elements within the sorted array — and thus, on their values. Since these are unknown, the median cannot be determined. However, if the median remains the same after the shift, it is indifferent to the missing values: the median is identical for all possible values behind the missing ones. See Proving the algorithm for details.

Subtraction is used for the shift because x is still indexed at half (the midpoint of the input x) even though all missing values were already removed from x by that point. Therefore, x[half] is not necessarily the midpoint of x. The maximum possible number of steps by which missing values can shift the median is the total number of missing values (nna), so this quantity is subtracted.

In practice, it might be important to use a “safe” comparison in step 6. Exact equality is too high of a bar for floating-point numbers because there can be spurious differences between them. For example, 0.1 + 0.2 == 0.3 is false for deep technical reasons that have no place in this context, so we need functions that disregard such artifacts. However, this problem does not occur when comparing integers, so testing for exact equality is safe in this case.

Median algorithms are generally more simple for arrays of an odd length. This includes cases where missing values are present. The even-length sub-algorithm operates with two central indices instead of one because even-length arrays do not have a single central value. The algorithm needs to compare x at each of these indices with the respective value offset by nna. In this way, it effectively checks whether the pair of x values at the central indices in the offset case is equal to the pair without the offset. If so, the two values are the same, so there is no need to compute the average, and x[half_1] is returned.

In step 5, both sub-algorithms return NA if nna is greater than half or half_1, respectively. In the even-length case, only half_1 needs to be tested because it is less than half_2, so it is less than nna whenever half_2 is. Why return NA here? Imagine an uninterrupted sequence of missing values at the start or end of the array. If it extends into the central position(s), at least one central value is missing. This case cannot be ruled out, so the median is unknown. Furthermore, not returning here would allow for negative indices in step 6.

As an aside, median2() has an even argument that allows users to opt for the value just above or below the median if the length of the array is even. This has no effect if x contains any missing values: if the median can be determined in this case, it is equal to both values around it.

R details

The R implementation in naidem::median2() is more succinct than the pseudo-code, condensing much of the two sub-algorithms into one. This is because vectorization in R allows for code that de-emphasizes the difference in length. Such elegant code plays into the strengths of R, but it does look a little different from the above. R connoisseurs are asked for patience with more pedestrian languages that might need more explicit instructions.

Again, note that the pseudo-code uses zero-indexing but R uses one-indexing. This is why nna >= half in median2() casts a broader net than nna > half here.

The offset comparison near(x[half - nna], x[half]) is wrapped into isTRUE(all(.)), so that all() reduces two logical test results to one if the input has an even length. (More on near() below.) If any comparison between the possible median values returns either NA or FALSE, the isTRUE() wrapping returns FALSE and the function returns NA because the median cannot be determined with certainty.

isTRUE() will only return TRUE if all comparisons do. Their number is either 1 or 2: recall that the half object is either length 1 (the index of the median of an odd-length vector) or 2 (the indices of the two values right above and below the median of an even-length vector). If the median can be determined in the even-length scenario, it is equal to both values around it. This is why the function returns x[half[1L]], i.e., the first of these two values. For an odd-length vector, the index is redundant because half[1L] equals half. Thus, x[half[1L]] is the correct return value whenever the median can be determined, regardless of the length of the vector.

Note the use of near(), which was adapted from dplyr::near() but without depending on dplyr. Consistent with the explanation above, near() compares two numeric vectors while disregarding any spurious floating-point differences that == would take seriously (see also R for Data Science on this point). So near(0.1 + 0.2, 0.3) is TRUE even though 0.1 + 0.2 == 0.3 is FALSE. Nevertheless, == is used elsewhere in median2() when comparing integers, where there is no such problem.