持续更新中位数 + space 效率

Continually updating the median + space efficiency

也许我 looking/searching 没有找到合适的关键字(我找不到解决方案)。

我正在尝试以 space 有效的方式计算数字列表(不断更新)的中位数。

要计算均值,有一个很好的方法,即记住列表中元素的数量并对旧均值加权。例如(伪代码):

// Initialize values
noList   = [8,10,4,6]
mean     = 0
noItems  = 0

// Now we want to update the mean continually with further values.
for (value : noList) {
  mean    = (noItems / (noItems + 1)) * mean + (1 / (noItems + 1)) * value
  noItems = noItems + 1
}

// After iteration 1: wholeList = [8]       ; mean = 8   ; noItems = 1
// After iteration 2: wholeList = [8,10]    ; mean = 9   ; noItems = 2
// After iteration 3: wholeList = [8,10,4]  ; mean = 7.33; noItems = 3
// After iteration 4: wholeList = [8,10,4,6]; mean = 7   ; noItems = 4

问题: 有没有类似的(space-高效)的方法来计算中位数?

已更新 我更新了问题(感谢@WillemVanOnsem)。我不仅在寻找不断更新中位数,而且还在寻找一种 space 高效的方法。 根据他的提示,我们可以保留两个数据结构。

Example:

// 1) We have a list for which we want to find the median.
noList   = [9,10,4,6,13,12]

// 2) We devide it into two list or datastructures (additionally we sort it).
smallerList = [4,6,9]
biggerList  = [10,12,13]

// 3) Both list have the same length, so the median is between the last element of smallerList und the first element of biggerList.
median = (9 + 10) / 2 = 9.5

// 4) Next, we add a further element and want to update our median.
// We add the number 5 to our datastructures. So the new list is:
noList   = [9,10,4,6,13,12,5]

// 5) Obviously 5 is smaller than our current median of 9.5. So we insert it in a sorted way into smallerList:
smallerList = [4,5,6,9]
biggerList  = [10,12,13]

// 6) Now length(smallerList) > length(biggerList), So, we know, that the updated median should be the last element of smallerList.
median = 9

// 7) Next, we add a further element and want to update our median.
// We add the number 2 to our datastructures. So the new list is:
noList   = [9,10,4,6,13,12,5,2]

// 8) Obviously 2 is smaller than our current median of 9. So we insert it again in a sorted way into smallerList:
smallerList = [2,4,5,6,9]
biggerList  = [10,12,13]

// 9) Now the length of smallerList is much bigger than the length of biggerList and we need to "balance" our list by taking one element from one list and inserting it into the other list.
// We remove the element 9 from smallerList and insert it into biggerList.
smallerList = [2,4,5,6]
biggerList  = [9,10,12,13]

// 10) Both list have the same length, so the median is between the last element of smallerList und the first element of biggerList.
median = (6 + 9) / 2 = 7.5

希望,这说明了一切。我猜,这是你的提示 (@WillemVanOnsem)。

是的,这可能会回答我最初的问题...但此解决方案的问题是,两个列表(smallerList 和 biggerList)可能会增长到相当大的规模。假设我们有一个 10^18 数字流,我们想在不超出内存的情况下找到所有数字的中位数。如何以 space 高效的方式解决这个问题?

如果不记住你见过的所有数字,就无法做到这一点,因为在任何时候,你过去见过的任何数字都可能成为未来的中位数。

如果你到目前为止已经看过 n 个数字,那么对于任何 ii其中第一个最小的可能成为中位数:

  • 如果i > n/2,那么如果下一个2i - n个数字就会发生更大。

  • 如果i <= n/2,那么如果下一个n - 2i + 1 数字较小。