How to find the rank of an element in a stream of numbers efficiently?

Question

Recently I have been trying to find the median of a stream of numbers under the following constraints:

3-pass algorithm
O(n log n) time
O(sqrt(n)) space

The input is repeated 3 times. Each repetition consists of the number of integers n, followed by n integers a_i such that:

n is odd
1≤n≤10^7
|a_i| ≤ 2^{30}

The input data has the following format:

My current code is shown below:

#ifdef STREAMING_JUDGE
#include "io.h"
#define next_token io.next_token
#else
#include<string>
#include<iostream>
using namespace std; 
string next_token()
{
    string s;
    cin >> s;
    return s;
}
#endif

#include<cstdio>
#include<cstdlib>
#include<vector>
#include<algorithm>
#include<iostream>
#include<math.h>
#include<ctime>

using namespace std;

int main()
{
    srand(time(NULL));
    //1st pass: randomly choose sqrt(n) numbers from the given stream of numbers
    int n = atoi(next_token().c_str());
    int p = (int)ceil(sqrt(n));
    vector<int> a;
    for(int i=0; i<n; i++)
    {
        int s=atoi(next_token().c_str());
        if( rand()%p == 0 && (int)a.size() < p )
        {
            a.push_back(s);
        }
    }
    sort(a.begin(), a.end());
    //2nd pass: find the k such that the median lies in a[k] and a[k+1], and find the rank of the median between a[k] and a[k+1]
    next_token();
    vector<int> rank(a.size(),0);
    for( int j = 0; j < (int)a.size(); j++ )
    {
        rank.push_back(0);
    }
    for( int i = 0; i < n; i++ )
    {
        int s=atoi(next_token().c_str());
        for( int j = 0; j < (int)rank.size(); j++ )
        {
            if( s<=a[j] )
            {
                rank[j]++;
            }
        }
    }
    int median = 0;
    int middle = (n+1)/2;
    int k;
    if( (int)a.size() == 1 && rank.front() == middle )
    {
        median=a.front();
        cout << median << endl;
        return 0;
    }
    for( int j = 0; j < (int)rank.size(); j++ )
    {
        if( rank[j] == middle )
        {
            cout << a[j] << endl;  // print the median itself, not its rank
            return 0;
        }
        else if( rank[j] < middle && rank[j+1] > middle )
        {
            k = j;
            break;
        }
    }
    //3rd pass: sort the numbers in (a[k], a[k+1]) to find the median
    next_token();
    vector<int> FinalRun;
    if( rank.empty() )
    {
        for(int i=0; i<n; i++)
        {
            a.push_back(atoi(next_token().c_str()));
        }
        sort(a.begin(), a.end());
        cout << a[n>>1] << endl;
        return 0;
    }
    else if( rank.front() > middle )
    {
        for( int i = 0; i < n; i++ )
        {
            int s = atoi(next_token().c_str());
            if( s < a.front() )  FinalRun.push_back(s);
        }
        sort( FinalRun.begin(), FinalRun.end() );
        cout << FinalRun[middle-1] << endl;
        return 0;
    }
    else if ( rank.back() < middle )
    {
        for( int i = 0; i < n; i++ )
        {
            int s = atoi(next_token().c_str());
            if( s > a.back() )  FinalRun.push_back(s);
        }
        sort( FinalRun.begin(), FinalRun.end() );
        cout << FinalRun[middle-rank.back()-1] << endl;
        return 0;
    }
    else
    {
        for( int i = 0; i < n; i++ )
        {
            int s = atoi(next_token().c_str());
            if( s > a[k] && s < a[k+1] )  FinalRun.push_back(s);
        }
        sort( FinalRun.begin(), FinalRun.end() );
        cout << FinalRun[middle-rank[k]-1] << endl;
        return 0;
    }
}

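However, I still cannot reach O(n log n) time complexity. I suspect the bottleneck is the ranking step in the 2nd pass (i.e., computing the rank of each sampled number within the stream). That part takes O(n sqrt(n)) time in my code.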

But I do not know how to make the ranking more efficient. Any suggestions? Thanks in advance!

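Further explanation of "rank": the rank of a sampled number is the count of numbers in the stream that are less than or equal to it. For example, in the input given above, if the numbers a[0]=2, a[1]=4, a[2]=5 are sampled, then rank[0]=2, because two numbers in the stream (1 and 2) are less than or equal to a[0].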

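Thanks for your help. In particular, @alexeykuzmin0's suggestion does speed up the 2nd pass to O(n*logn) time. But one problem remains: in the 1st pass, I sample each number with probability 1/sqrt(n). When no number is sampled (the worst case), the vector a is empty, so the later passes cannot run (i.e., a segmentation fault (core dumped) occurs). @Aconcagua, what do you mean by "select all remaining elements, if there aren't more than required any more"? Thanks.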
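One possible way to guarantee a non-empty sample is selection sampling (Knuth's Algorithm S). This may be what the quoted comment hints at, but that reading is an assumption. Because n is known before the numbers arrive, each number can be picked with probability (samples still needed) / (numbers still unread). Once those two counts are equal, every remaining number is taken, so the sample always has exactly the requested size. The sketch below is only an illustration: `sample_exactly`, `stream`, and `want` are hypothetical names, and the vector `stream` stands in for the tokens that the real program would read with next_token().

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Selection sampling: picks exactly min(want, stream.size()) items in one pass.
// `stream` is a stand-in for the 1st-pass input (an assumption of this sketch);
// only `picked` would need to be kept in memory in a real streaming program.
std::vector<int> sample_exactly(const std::vector<int>& stream,
                                std::size_t want, std::mt19937& gen) {
    if (want > stream.size()) want = stream.size();
    std::vector<int> picked;
    picked.reserve(want);
    std::size_t remaining = stream.size();  // items not yet visited
    for (int value : stream) {
        std::size_t needed = want - picked.size();
        if (needed == 0) break;
        // Pick with probability needed / remaining. When needed == remaining
        // the test below always succeeds, so the sample cannot come up short.
        std::uniform_int_distribution<std::size_t> dist(0, remaining - 1);
        if (dist(gen) < needed) picked.push_back(value);
        --remaining;
    }
    assert(picked.size() == want);  // postcondition of the technique
    return picked;
}
```

With want = ceil(sqrt(n)), the sample stays within the O(sqrt(n)) space budget and is never empty for n ≥ 1.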

Answer 1

You are right, your second pass runs in O(n√n) time:

for( int i = 0; i < n; i++ )                    // <= n iterations
  ...
    for( int j = 0; j < (int)rank.size(); j++ ) // <= √n iterations

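To fix this, we need to get rid of the inner loop. For example, instead of directly counting, for every threshold, the stream elements that are less than it, we can first count how many stream elements fall into each interval between consecutive thresholds: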

// Same as in your code
for (int i = 0; i < n; ++i) {
    int s = atoi(next_token().c_str());
    // Find index of interval in O(log n) time
    int idx = std::upper_bound(a.begin(), a.end(), s) - a.begin();
    // Increase the rank of only that interval
    ++rank[idx];
}

Then compute the ranks of your threshold elements as prefix sums of those counts:

std::partial_sum(rank.begin(), rank.end(), rank.begin());

The overall complexity is O(n log n) + O(n) = O(n log n).

Here I used two STL algorithms:

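std::upper_bound (from <algorithm>) uses binary search to find, in logarithmic time, the first element of a sorted array that is strictly greater than the given number.
std::partial_sum (from <numeric>) computes the prefix sums of the given array in linear time.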
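The two steps above can be put together in a small self-contained sketch. It makes two assumptions that are not stated in the answer. First, the counting array needs one more slot than there are thresholds, for values above the largest threshold. Second, the sketch uses std::lower_bound rather than std::upper_bound: with std::upper_bound a value equal to a[j] lands in bucket j+1, so the prefix sums count values strictly less than each threshold, whereas std::lower_bound matches the "less than or equal" definition of rank given in the question. The names `ranks_leq`, `thresholds`, and `stream` are hypothetical, and the vector `stream` stands in for the tokens read during the 2nd pass.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Returns rank[j] = number of stream values <= thresholds[j].
// `thresholds` must be sorted (the sampled numbers after std::sort).
std::vector<int> ranks_leq(const std::vector<int>& thresholds,
                           const std::vector<int>& stream) {
    assert(std::is_sorted(thresholds.begin(), thresholds.end()));
    // One bucket per threshold, plus one for values above the largest one.
    std::vector<int> count(thresholds.size() + 1, 0);
    for (int s : stream) {
        // First threshold >= s, so s <= thresholds[j] exactly when idx <= j.
        auto idx = std::lower_bound(thresholds.begin(), thresholds.end(), s)
                   - thresholds.begin();
        ++count[idx];
    }
    // Prefix sums turn per-interval counts into cumulative ranks.
    std::partial_sum(count.begin(), count.end(), count.begin());
    count.pop_back();  // the last entry is just stream.size()
    return count;
}
```

For instance, with thresholds {2, 4, 5} and a made-up stream {1, 2, 3, 4, 5, 6, 7}, the result is {2, 4, 5}, which agrees with rank[0]=2 in the question's explanation of "rank".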

c++

algorithm

ranking

median

data-stream