How does pandas describe() - top work when multiple elements have the highest count?

Question

Context:

I am trying to understand how the top attribute of describe() works in Python (3.7.3) with pandas (0.24.2).

Effort so far:

I looked at the documentation for pandas.DataFrame.describe. It states:

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

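I tried to work out which part of the code is responsible for the "arbitrary" output.
I stepped into the code that describe calls in turn. My trace is as follows: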

describe()  #pandas.core.generic
describe_1d()  #pandas.core.generic
describe_categorical_1d()  #pandas.core.generic
value_counts()  #pandas.core.base
value_counts()  #pandas.core.algorithms
_value_counts_arraylike()  #pandas.core.algorithms
# In the above step it uses hash-table, to find keys and their counts
# I am not able to step further, as further implementations are in C.

Sample trial:

import pandas as pd
sample = pd.Series(["Down","Up","Up","Down"])
sample.describe()["top"]

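The code above can give either Down or Up, seemingly at random, as expected.

Question:

Which method in the trace above contributes to the randomness of the output?
Is the order in which the keys come out of the hash table the cause?

If yes,

-- Is it the same every time, i.e. does the same key always get the same hash value and get fetched in the same order?

-- How are keys hashed, iterated (to get all the keys), and fetched from the hash table?

Any pointers are much appreciated! Thanks in advance :)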

Answer 1

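As pointed out above, the fact that it gives "Down" is arbitrary, but not random. On the same machine with the same pandas version, running the code above should always produce the same result (although the documentation does not guarantee this; see the comments below).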
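As a quick sanity check of the "arbitrary but not random" claim, here is a minimal sketch that calls describe() repeatedly in one process and collects the reported top values. It only shows repeatability within a single run on one installation; it says nothing about other machines or pandas versions.

```python
import pandas as pd

sample = pd.Series(["Down", "Up", "Up", "Down"])

# Call describe() many times in the same process and collect every "top" seen.
tops = {sample.describe()["top"] for _ in range(20)}

# If the choice is repeatable, the set holds exactly one value.
print(tops)
```

Which of the two values ends up in the set may still differ between pandas versions.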

Let's reproduce what is happening.

Given this series:

abc = pd.Series(list("abcdefghijklmnoppqq"))

the value_counts implementation boils down to:

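import numpy as np
import pandas._libs.hashtable as htable
# value_count_object is an internal helper (as used here with pandas 0.24.2), not a public API
keys, counts = htable.value_count_object(np.asarray(abc), True)
result = pd.Series(counts, index=keys)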

Result:

g    1
e    1
f    1
h    1
o    1
d    1
b    1
q    2
j    1
k    1
i    1
p    2
n    1
l    1
c    1
m    1
a    1
dtype: int64

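The order of the result is determined by the implementation of the hash table. It is the same on every call.

For more details on the hashing, you can look at the implementation of value_count_object, which calls build_count_table_object, which uses the khash implementation.

After computing the table, the value_counts implementation sorts the result with quicksort. This sort is not stable, and with this specially constructed example it reorders "p" and "q":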

result.sort_values(ascending=False)

q    2
p    2
a    1
e    1
f    1
h    1
o    1
d    1
b    1
j    1
m    1
k    1
i    1
n    1
l    1
c    1
g    1
dtype: int64

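So two factors can affect the ordering: first the hashing, and second the non-stable sort.

The top value that is displayed is then simply the first entry of the sorted list, in this case "q".

On my machine, quicksort becomes non-stable at 17 entries, which is why I chose the example above.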
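To make the idea of a stable sort concrete, here is a minimal sketch that uses NumPy directly rather than the pandas internals. The keys and counts arrays are made-up illustration data, not the hash table output above. With kind="stable", entries with equal counts keep their original relative order; the default quicksort makes no such promise.

```python
import numpy as np

# Made-up data: "p" and "q" are tied for the highest count.
keys = np.array(["a", "p", "b", "q", "c"])
counts = np.array([1, 2, 1, 2, 1])

# Sort descending by negating the counts; a stable sort preserves
# the original order of ties ("p" before "q", "a" before "b" before "c").
order = np.argsort(-counts, kind="stable")
stable_keys = keys[order]

print(list(stable_keys))  # ['p', 'q', 'a', 'b', 'c']
```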

We can test the non-stable sort with the following direct comparison:

pd.Series(list("abcdefghijklmnoppqq")).describe().top
'q'

pd.Series(list(               "ppqq")).describe().top
'p'
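If you would rather see every value that is tied for the highest count than rely on whichever one describe() happens to report, something like the following sketch may help. tied_top_values is a hypothetical helper name introduced here for illustration; it is not part of pandas.

```python
import pandas as pd

def tied_top_values(series):
    """Return all values sharing the highest count, sorted by label."""
    counts = series.value_counts()
    # Keep only the labels whose count equals the maximum count.
    return sorted(counts[counts == counts.max()].index)

abc = pd.Series(list("abcdefghijklmnoppqq"))
tied = tied_top_values(abc)
top = abc.describe()["top"]

print(tied)  # ['p', 'q']
print(top)   # one of the tied values; which one is arbitrary
```

Sorting the tied labels gives a well-defined order that does not depend on the hash table or on the sort algorithm.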
