How to extract the names of the columns corresponding to the top 20% of values in a dataframe with pandas

Question

I have a dataframe of this type:

   i1  i3  i4  i9  i14  i16  i17  i18  i19  i20  i22  i26  i27  i28
0   4   2   1   4    1    3    3    4    2    2    4    1    4    3

I need to extract the names of the columns whose values fall in the top 20% of all values in the row.

The expected result is:

[i1,i9,i18,i22,i27]

I need to repeat the same process for other rows of the same type.

Answer 1

Use pandas quantile. The top 20% of values are those at or above the row's 0.8 quantile (0.8 = 1 - 0.2):

df.columns[df.loc[0,:] >= df.loc[0,:].quantile(0.8)]

Index(['i1', 'i9', 'i18', 'i22', 'i27'], dtype='object')
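For anyone who wants to check this against the question's data, here is a minimal self-contained sketch. It rebuilds the single-row dataframe by hand; the names `row`, `threshold`, and `top_cols` are introduced only for illustration, and the column names are selected by filtering the row itself, which should be equivalent to indexing `df.columns` with the mask.

```python
import pandas as pd

# Rebuild the single-row dataframe from the question.
cols = ["i1", "i3", "i4", "i9", "i14", "i16", "i17",
        "i18", "i19", "i20", "i22", "i26", "i27", "i28"]
vals = [4, 2, 1, 4, 1, 3, 3, 4, 2, 2, 4, 1, 4, 3]
df = pd.DataFrame([vals], columns=cols)

row = df.loc[0, :]

# 0.8 quantile of the row (pandas interpolates linearly by default).
threshold = row.quantile(0.8)

# Keep the labels of the entries at or above the threshold.
top_cols = row[row >= threshold].index.tolist()

print(threshold)
print(top_cols)
```

Because the comparison is `>=`, every value tied with the threshold is included, so a row can return more (or fewer) than exactly 20% of its columns.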

As @Nko3 suggested, a more general approach is to apply the formula to every row:

df.apply(lambda x: x >= x.quantile(0.8), axis=1).dot(df.columns + ', ').str.strip(', ')

0    i1, i9, i18, i22, i27
dtype: object
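The `.dot(...)` step can look opaque, so here is a small sketch of why it seems to work: multiplying a string by `True` keeps it and by `False` gives an empty string, so the dot product of a boolean mask with the column labels concatenates the labels of the `True` cells. The two-row dataframe below is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
df = pd.DataFrame({"a": [1, 5], "b": [4, 2], "c": [4, 5]})

# Boolean mask: True where a value reaches its own row's 0.8 quantile.
mask = df.apply(lambda x: x >= x.quantile(0.8), axis=1)

# A string times True is unchanged; a string times False is empty.
assert True * "b, " == "b, " and False * "b, " == ""

# The dot product therefore joins the names of the True columns per row;
# the trailing separator is stripped afterwards.
joined = mask.dot(df.columns + ", ").str.strip(", ")

print(joined.tolist())
```

The result is one comma-separated string per row, which is handy for display but needs splitting again if the names are to be used programmatically.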

Answer 2

This works:

top20 = df.ge(df.quantile(0.8, axis=1), axis=0)
top_cols = top20.apply(lambda x: x.index[x], axis=1)

Example output:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(data=np.random.randint(1, 5, (5, 10)),
...                   columns=[f"i{i}" for i in range(10)])
>>> df
   i0  i1  i2  i3  i4  i5  i6  i7  i8  i9
0   4   4   2   4   2   4   4   3   2   4
1   2   3   4   2   2   3   1   1   4   1
2   3   2   2   1   2   2   3   1   2   4
3   3   1   1   3   3   4   2   2   1   3
4   3   2   3   2   4   1   4   2   4   2

>>> top20 = df.ge(df.quantile(0.8, axis=1), axis=0)
>>> top20
      i0     i1     i2     i3     i4     i5     i6     i7     i8     i9
0   True   True  False   True  False   True   True  False  False   True
1  False  False   True  False  False  False  False  False   True  False
2   True  False  False  False  False  False   True  False  False   True
3   True  False  False   True   True   True  False  False  False   True
4  False  False  False  False   True  False   True  False   True  False

>>> top20.apply(lambda x: x.index[x], axis=1)
0    Index(['i0', 'i1', 'i3', 'i5', 'i6', 'i9'], dt...
1                  Index(['i2', 'i8'], dtype='object')
2            Index(['i0', 'i6', 'i9'], dtype='object')
3    Index(['i0', 'i3', 'i4', 'i5', 'i9'], dtype='o...
4            Index(['i4', 'i6', 'i8'], dtype='object')
dtype: object
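The Series above holds pandas Index objects. If plain Python lists are easier to work with downstream, one possible variation is sketched below. It uses fixed data (rows 0 and 1 of the example) instead of random numbers so the result is reproducible; `cols` and `top_lists` are names introduced here, not part of the answer.

```python
import pandas as pd

# Fixed data copied from rows 0 and 1 of the example, for reproducibility.
df = pd.DataFrame(
    [[4, 4, 2, 4, 2, 4, 4, 3, 2, 4],
     [2, 3, 4, 2, 2, 3, 1, 1, 4, 1]],
    columns=[f"i{i}" for i in range(10)],
)

# Same mask as in the answer: compare each row with its own 0.8 quantile.
top20 = df.ge(df.quantile(0.8, axis=1), axis=0)

# Index the column labels with each boolean row to get plain lists.
cols = df.columns.to_numpy()
top_lists = {idx: cols[row].tolist()
             for idx, row in zip(top20.index, top20.to_numpy())}

print(top_lists)
```

This keeps the vectorised comparison from the answer and only replaces the final `apply` with a dictionary comprehension keyed by row label.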
