Proving the algorithm • naidem

This proof begins with the algorithm for odd lengths and then extends it to the one for even lenghts.

Each partial algorithm only contains two steps that need to be proven; all other steps are merely definitional. See Implementing the algorithm, which also contains all definitions used here except for is_missing(a). This predicate tests whether the scalar (length 1) value a is missing.

Proof for odd lengths

half := length(x) / 2 - 0.5
nna := length(keep_missing(x))

The median is the midpoint of the sorted x distribution, which includes missing values — i.e., x[half] after this step:

x := sort(x)

Sorted missing observations are still unknown, but the purpose here is to make some statements about all possible values they might have:

sorted_na := keep_missing(x)
sorted_known := keep_known(x)

These logical values will be important below. They describe two possible ways in which x might be ordered — all known values are less than all missing values, and the reverse:

all_known_values_less := sorted_known[length(sorted_known) - 1] < sorted_na[0]
all_missing_values_less := sorted_na[length(sorted_na) - 1] < sorted_known[0]

Proof of step 3

If there are more missing than known values, the true order of x might be such that one of the missings is the median. Even if not, however, there would be no way to know from such data.

if nna > half:
    length(sorted_na) > length(sorted_known)
    if all_known_values_less || all_missing_values_less:
        is_missing(x[half])

Proof of step 5

Consider two extreme cases:

All known values are less than all missing values, and thus, the median is equal to the vector of known values indexed at the midpoint of the entire input x vector:

if all_known_values_less:
    x[half] == sorted_known[half]

All missing values are less than all known values, and thus, the median is equal to the vector of known values indexed at the midpoint of x offset by the number of missing values:

if all_missing_values_less:
    x[half] == sorted_known[half - nna]

These two points put bounds on the median:

sorted_known[half - nna] <= x[half] <= sorted_known[half]

Thus, if they are equal, they are also equal to the median. If the bounds are not equal, however, either could be the median, in addition to any values that might exist between them. In this case, there are at least two different possible medians, which renders the true median unknown:

if sorted_known[half - nna] == sorted_known[half]:
    sorted_known[half] == x[half]
    set possible_medians == { sorted_known[half] }
  else:
    set possible_medians == { sorted_known[half - nna]; ...; sorted_known[half] }

Whenever the median can be determined, it is equivalent to the median of sorted_known. In other words, if the algorithm returns a non-NA value, that value is equal to the median of x after removing all missing values from it:

sorted_known_median := sorted_known[length(sorted_known) / 2 - 0.5]
sorted_known[half - nna] <= sorted_known_median <= sorted_known[half]

if sorted_known[half - nna] == sorted_known[half]:
    sorted_known[half] == x[half]
    sorted_known[half] == sorted_known_median
    set possible_medians == { sorted_known_median }
  else:
    set possible_medians == { sorted_known[half - nna]; ...; sorted_known[half] }

Proof for even lengths

The definitions are the same as above, except for the fact that there are two midpoints here:

half_2 := length(x) / 2
half_1 := half_2 - 1
nna := length(keep_missing(x))
x := sort(x)
sorted_na := keep_missing(x)
sorted_known := keep_known(x)
all_known_values_less := sorted_known[length(sorted_known) - 1] < sorted_na[0]
all_missing_values_less := sorted_na[length(sorted_na) - 1] < sorted_known[0]

The target quantity is the mean of these two midpoints:

(x[half_1] + x[half_2]) / 2

Proof of step 4

If more than half of the x values are missing, this might include half_1 or half_2. Either has to be missing, again assuming that the sequences of known and missing values don’t alternate. We can’t rule this out, so it’s unknown whether it’s the case. Since it is unknown whether at least one of half_1 and half_2 is missing, it is not possible to determine both of them. Therefore, the median is unknown as well, since half_1 and half_2 jointly form the basis for computing it:

if nna > half_1:
    length(sorted_na) > length(sorted_known)
    if all_known_values_less || all_missing_values_less:
        is_missing(x[half_1]) || is_missing(x[half_2])
        is_missing((x[half_1] + x[half_2]) / 2)

Proof of step 6

Analogous to case (1) in the proof for odd lengths:

if all_known_values_less:
    x[half_1] == sorted_known[half_1]
    x[half_2] == sorted_known[half_2]

Analogous to case (2):

if all_missing_values_less:
    x[half_1] == sorted_known[half_1 - nna]
    x[half_2] == sorted_known[half_2 - nna]

As in the previous proof, the extremes put bounds on the midpoints:

sorted_known[half_1 - nna] <= x[half_1] <= sorted_known[half_1]
sorted_known[half_2 - nna] <= x[half_2] <= sorted_known[half_2]

Both pairs of extremes need to be equal, or else there are at least two possible medians, and the true median is unknown:

if sorted_known[half_1 - nna] == sorted_known[half_1] &&
   sorted_known[half_2 - nna] == sorted_known[half_2]:
      (sorted_known[half_1] + sorted_known[half_2]) / 2 == (x[half_1] + x[half_2]) / 2
      set possible_medians == { (sorted_known[half_1] + sorted_known[half_2]) / 2 }
   else:
      set possible_medians == { 
        (sorted_known[half_1 - nna] + sorted_known[half_2 - nna]) / 2;
        ...;
        (sorted_known[half_1] + sorted_known[half_2]) / 2
      }

The return value can be simplified from (sorted_known[half_1] + sorted_known[half_2]) / 2 to x[half_1]:

# Two general reminders:
nna > 0
sorted_known[half_2 - 1] == sorted_known[half_1]

if sorted_known[half_1 - nna] == sorted_known[half_1] &&
   sorted_known[half_2 - nna] == sorted_known[half_2]:
     # Is this possible?
     if sorted_known[half_1] != sorted_known[half_2]:
          sorted_known[half_2 - 1] != sorted_known[half_2]
          # It isn't, because the median would then depend
          # on the missing values' position within
          # the rank order:
          if all_known_values_less:
               x[half_1] < (x[half_1] + x[half_2]) / 2 < x[half_2]
          if all_missing_values_less:
               x[half_1] <= (x[half_1] + x[half_2]) / 2 <= x[half_2]

As a consequence, more than one value is possible for each of x[half_1] and x[half_2] given the missing values. This, in turn, contradicts the basic premise that each of the two values around the median remains equal when offset by nna. If this premise is false, however, the median cannot be determined, as shown above. This is why half_1 and half_2 must be equal in order for the median to be determined, and since the mean of two equal values is equal to each of them, the median is equal to both x[half_1] and x[half_2].

If the median of an even-length distribution with missing values can be determined at all, the algorithm can simply return x[half_1] instead of (x[half_1] + x[half_2]) / 2. Simplifying the algorithm in this way makes it more efficient.