Why aren't pandas "rank" percentiles bounded between 0 and 1?

Question

I use pandas a lot and frequently run code like the following:

df['var_rank'] = df['var'].rank(pct=True)
print( df.var_rank.max() )

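and often get values greater than 1. It happens whether I keep or drop 'na' values. This is obviously easy to work around (just divide by the highest rank), so I am not asking for a fix. I am just curious why it happens, and I have not found any clues online.

Does anyone know why this happens?

Some very simple example data here (dropbox link - pickled pandas Series).

I get a value of 1.0156 from df.rank(pct=True).max(). I have other data where the value goes as high as 4 or 5. I am generally working with very messy data.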

Answer 1

Your data is bad. The Series has dtype object and contains '-' strings mixed in with the numbers:

>>> s.rank(pct=True).max()
1.015625

>>> s.sort_values(inplace=True)  # Series.sort() was removed in later pandas releases
>>> s.tail(7)
8      202512882
6      253661077
102            -
101            -
99             -
58             -
116            -
Name: Total Assets, dtype: object

>>> s[s != u'-'].rank(pct=True).max()
1.0
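To check whether a Series has this problem, one option is to coerce it to numbers and look at what fails to parse. The following is only a sketch: the sample values are made up to stand in for the pickled data, and pd.to_numeric is used here as a diagnostic, not something the original session ran.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: numbers mixed with '-' placeholders.
s = pd.Series([10, 25, '-', 40, '-'], name='Total Assets')

# An object dtype on a column that should be numeric is the first warning sign.
print(s.dtype)

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_numbers = pd.to_numeric(s, errors='coerce')

# Entries that were present but failed to parse are the offenders.
offenders = s[as_numbers.isna() & s.notna()]
print(offenders)
```

If offenders is non-empty, the percentile ranks are being computed over mixed types and should not be trusted.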

In pandas 0.18.0 (released last week), you can specify numeric_only:

s.rank(pct=True, numeric_only=True)

I have tried the above in 0.18.0 but could not get it to work, so you can also do this to rank only the float and int values:

>>> s[s.apply(lambda x: isinstance(x, (int, float)))].rank(pct=True).max()
1.0

It builds a boolean mask that is True wherever the value is an int or a float, then ranks the filtered result.
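An alternative to the isinstance mask, assuming the non-numeric entries can simply be treated as missing, is to convert the Series with pd.to_numeric before ranking. This is a different technique from the one above, shown as a minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical sample: three numbers and two '-' placeholders.
s = pd.Series([10, 25, '-', 40, '-'])

# Turn unparseable entries into NaN so the Series becomes float64.
clean = pd.to_numeric(s, errors='coerce')

# rank() leaves NaN unranked by default, so the percentiles are
# computed over the valid values only.
pct = clean.rank(pct=True)
print(pct.max())
```

Unlike the mask, this keeps the original index length: the rows that held '-' stay in the result as NaN instead of being dropped.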
