How to calculate the percentage of values higher than one tenth of the mean of an Int list

Question

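I have a very long list of integers, and I would like to calculate the percentage of numbers that are equal to or above one tenth of the mean. That is, I want to calculate the percentile of the score mean / 10. Here is a naive approach (in Python, but that doesn't matter):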

ls = [35,35,73,23,40,60,5,7,3,4,1,1,1,1,1]
length = 0
summ = 0
for i in ls:
    length += 1
    summ += i

mean = float(summ) / float(length)
print('The input value list is: {}'.format(ls))
print('The mean is: {}'.format(mean))
tenth_mean = mean / 10
print('One tenth of the mean is: {}'.format(tenth_mean))

summ = 0
for i in ls:
    if (i >= tenth_mean):
        summ += 1
result = float(summ) / float(length)
print('The percentage of values equal or above one tenth of the mean is: {}'.format(result))

Output:

The input value list is: [35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1]
The mean is: 19.3333333333
One tenth of the mean is: 1.93333333333
The percentage of values equal or above one tenth of the mean is: 0.666666666667

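The problem with this approach is that I have to loop over the list twice. Is there any clever way to avoid this?

I can't see any, since I first need to calculate the mean in order to know which values to keep in the count (the second loop).

Furthermore, I would like to do this for multiple fractions of the mean (i.e. one tenth of the mean, one fifth of the mean, etc.). This can easily be achieved within the second loop. I just wanted to point this out.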
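
For illustration, the multi-threshold variant of the second loop could look like the minimal sketch below. The function name `fractions_at_or_above` and its parameters are made up for this example; it still makes two passes over the data, so it does not solve the problem, it only shows that extra thresholds cost no extra pass.

```python
def fractions_at_or_above(values, mean_fractions):
    """Return {f: share of values >= f * mean}, using two passes over values."""
    total = 0
    n = 0
    for v in values:                      # first pass: sum and length
        total += v
        n += 1
    mean = total / n

    thresholds = {f: f * mean for f in mean_fractions}
    counts = {f: 0 for f in mean_fractions}
    for v in values:                      # second pass: test every threshold
        for f, t in thresholds.items():
            if v >= t:
                counts[f] += 1
    return {f: counts[f] / n for f in mean_fractions}


ls = [35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1]
result = fractions_at_or_above(ls, [0.1, 0.2])
print(result)
```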

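The input array does not follow any particular distribution.

EDIT: The range of possible values is only a few thousand. The total number of values is around 3 billion.

EDIT: Fixed the usage of the word "percentile" above.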

Answer 1

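If you have many queries on the list, it may help to do some preprocessing that reduces the time complexity of each query to O(log(n)).

If you sort the list and compute its mean (using Python functions), you can then locate the threshold in the sorted list with a binary search. Hence, the time per query is O(log(n)).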
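
A minimal sketch of this idea, assuming the whole list fits in memory; the helper names `preprocess` and `share_at_or_above` are invented for the example, and `bisect_left` from the standard library does the binary search:

```python
from bisect import bisect_left


def preprocess(values):
    """Sort once and compute the mean: O(n log n) preprocessing."""
    ordered = sorted(values)
    mean = sum(ordered) / len(ordered)
    return ordered, mean


def share_at_or_above(ordered, mean, fraction):
    """Share of values >= fraction * mean, found with one binary search."""
    threshold = fraction * mean
    below = bisect_left(ordered, threshold)   # count of values strictly below threshold
    return (len(ordered) - below) / len(ordered)


ordered, mean = preprocess([35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1])
tenth = share_at_or_above(ordered, mean, 0.1)
fifth = share_at_or_above(ordered, mean, 0.2)
print(tenth, fifth)
```

Sorting costs O(n log n), which is more than the two linear passes in the question, so this only pays off when many different fractions are queried against the same data.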

Answer 2

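This is a well-known result in statistics and information science: you cannot get all of that information in a single pass. @OmG has already given you the best complexity. Depending on the distribution of your scores, you may be able to improve the search time (but not the complexity) with an interpolation search.

If you have a huge data set, you might also improve the starting point of the search by keeping a partial estimate of the mean as you progress.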
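
As a rough sketch of what an interpolation search could look like here, the hypothetical helper `count_below` guesses the probe position from the values at both ends of the current range instead of always probing the middle. It assumes an already sorted list of numbers. How much it helps depends on how evenly the values are spread, and in the worst case it degrades to a linear scan.

```python
def count_below(ordered, threshold):
    """Number of elements strictly below threshold in a sorted list,
    located by interpolation search."""
    lo, hi = 0, len(ordered)              # the answer always lies in [lo, hi]
    while lo < hi:
        low_val, high_val = ordered[lo], ordered[hi - 1]
        if threshold <= low_val:
            return lo                     # everything from lo onwards is >= threshold
        if threshold > high_val:
            return hi                     # everything before hi is < threshold
        # Here low_val < threshold <= high_val, so high_val > low_val.
        pos = lo + int((threshold - low_val) * (hi - 1 - lo) / (high_val - low_val))
        if ordered[pos] < threshold:
            lo = pos + 1
        else:
            hi = pos
    return lo


ordered = sorted([35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1])
mean = sum(ordered) / len(ordered)
below = count_below(ordered, mean / 10)
share = (len(ordered) - below) / len(ordered)
print(below, share)
```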

Answer 3

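Based on the other answers, I came up with the following approach to improve the search. The key insight is that, for each possible value x, one can count the occurrences of all values less than or equal to x and keep these counts sorted by x. Independently, the mean can be computed in parallel (i.e. within the same loop). A linear or binary search in the resulting list of tuples then yields the result for any arbitrary fraction. This works very well when the number of distinct possible values is much smaller than the total number of values.

Here is a simple implementation in bash/awk: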

# The "tee >(awk ... > mean.txt)" calculates the mean on the fly
# The second awk ("... value2count ...") counts the occurrences of each value
# The first sort simply sorts the output of awk by value (could be done within awk, too)
# The third awk ("... value2maxline ...") counts the number of lines having value x or less ("prevc" = previous count, "prevv" = previous value)
# The last sort again sorts the output of awk by value (could be done within awk, too)
printf '10\n15\n15\n20\n20\n25\n' |
  tee >(awk '{ sum += $1 } END { print sum / NR }' > mean.txt) |
  awk '{ value2count[$1]++ } END { for (value in value2count) { print value, value2count[value] } }' |
  sort --numeric-sort --stable -k 1,1 |
  awk 'BEGIN { prevc = 0 ; prevv = -1 } { if (prevv != $1) { value2maxline[$1] = prevc + $2 ; prevc += $2 ; prevv = $1 } } END { for (value in value2maxline) { print value, value2maxline[value] } }' |
  sort --numeric-sort --stable -k 1,1 > counts.txt

cat mean.txt
17.5

cat counts.txt
10 1 # one line with value 10
15 3 # 3 lines with value 15 or less
20 5 # 5 lines with value 20 or less
25 6 # 6 lines with value 25 or less, 6 is also the total number of values

In the example above, if I were interested in the percentage of values >= 70% of the mean, I would calculate

int(0.7 * 17.5) = 12

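and then find (with either a linear or a binary search in the list of tuples) that 1 line (out of the 6 lines in total) has a value below 12 ("10 1" is still below, "15 3" is already above). Finally, I calculate (6 - 1) / 6 = 0.83: 83% of the values are greater than or equal to 70% of the mean.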
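
The same idea can be sketched in Python, under the assumption that the number of distinct values is small enough for a `Counter` to fit in memory. The names `build_table` and `share_at_or_above` are invented for this example. Unlike the worked example, the sketch compares against the untruncated threshold (12.25 rather than 12), which gives the same result for this data.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def build_table(values):
    """Single pass over values: count each distinct value and accumulate the sum."""
    counts = Counter()
    total = 0
    n = 0
    for v in values:
        counts[v] += 1
        total += v
        n += 1
    distinct = sorted(counts)                                     # distinct values, ascending
    at_or_below = list(accumulate(counts[v] for v in distinct))   # cumulative counts
    return distinct, at_or_below, total / n, n


def share_at_or_above(distinct, at_or_below, mean, n, fraction):
    """Share of values >= fraction * mean, via binary search over distinct values."""
    threshold = fraction * mean
    i = bisect_left(distinct, threshold)        # number of distinct values < threshold
    below = at_or_below[i - 1] if i > 0 else 0  # number of input values < threshold
    return (n - below) / n


distinct, at_or_below, mean, n = build_table([10, 15, 15, 20, 20, 25])
share = share_at_or_above(distinct, at_or_below, mean, n, 0.7)
print(distinct, at_or_below, mean, share)
```

Only `build_table` touches the full input, and it does so once; every later query works on the small table of distinct values.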


algorithm

math

percentile