Is there a way to rank some items in a pandas dataframe and exclude others?

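I have a pandas dataframe called ranks that holds my clusters and their key metrics. I rank them using rank(), but I want two specific clusters to be ranked differently from the others.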

import pandas as pd

ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
                                   '3', '4', '5','6', '7', '8', '9'],
                        'No. Customers': [145118, 
                                        2, 
                                        1236, 
                                        219847, 
                                        9837,
                                        64865,
                                        3855,
                                        219549,
                                        34171,
                                        3924120],  
                        'Ave. Recency': [39.0197, 
                                        47.0, 
                                        15.9716, 
                                        41.9736, 
                                        23.9330,
                                        24.8281,
                                        26.5647,
                                        17.7493,
                                        23.5205,
                                        24.7933],
                        'Ave. Frequency': [1.7264, 
                                        19.0, 
                                        24.9101, 
                                        3.0682, 
                                        3.2735,
                                        1.8599,
                                        3.9304,
                                        3.3356,
                                        9.1703,
                                        1.1684],
                        'Ave. Monetary': [14971.85, 
                                        237270.00, 
                                        126992.79, 
                                        17701.64, 
                                        172642.35,
                                        13159.21,
                                        54333.56,
                                        17570.67,
                                        42136.68,
                                        4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
   Cluster  No. Customers  Ave. Recency  Ave. Frequency  Ave. Monetary  Ave. Spend
0        0         145118       39.0197          1.7264      14,971.85    8,672.07
1        1              2          47.0            19.0     237,270.00   12,487.89
2        2           1236       15.9716         24.9101     126,992.79    5,098.02
3        3         219847       41.9736          3.0682      17,701.64    5,769.23
4        4           9837       23.9330          3.2735     172,642.35   52,738.42
5        5          64865       24.8281          1.8599      13,159.21    7,075.19
6        6           3855       26.5647          3.9304      54,333.56   13,823.64
7        7         219549       17.7493          3.3356      17,570.67    5,267.52
8        8          34171       23.5205          9.1703      42,136.68    4,594.89
9        9        3924120       24.7933          1.1684       4,754.76    4,069.21

Then I apply the rank() method like this:

ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')

This gives me:

   Cluster  No. Customers  Ave. Recency  Ave. Frequency  Ave. Monetary  Ave. Spend  r_rank  f_rank  m_rank  s_rank  overall  overall_rank
0        0         145118       39.0197          1.7264      14,971.85    8,672.07       8       9       8       4       29             9
1        1              2          47.0            19.0     237,270.00   12,487.89      10       2       1       3       16             3
2        2           1236       15.9716         24.9101     126,992.79    5,098.02       1       1       3       8       13             1
3        3         219847       41.9736          3.0682      17,701.64    5,769.23       9       7       6       6       28             7
4        4           9837       23.9330          3.2735     172,642.35   52,738.42       4       6       2       1       13             2
5        5          64865       24.8281          1.8599      13,159.21    7,075.19       6       8       9       5       28             8
6        6           3855       26.5647          3.9304      54,333.56   13,823.64       7       4       4       2       17             4
7        7         219549       17.7493          3.3356      17,570.67    5,267.52       2       5       7       7       21             6
8        8          34171       23.5205          9.1703      42,136.68    4,594.89       3       3       5       9       20             5
9        9        3924120       24.7933          1.1684       4,754.76    4,069.21       5      10      10      10       35            10

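That is what it should do, but the cluster with the highest Ave. Spend always needs to be ranked 1, and the cluster with the highest Ave. Recency always needs to be ranked last.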

So I modified the code above to look like this:

if(ranks['s_rank'].min() == 1):
    ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')

And then I get this:

   Cluster  No. Customers  Ave. Recency  Ave. Frequency  Ave. Monetary  Ave. Spend  r_rank  f_rank  m_rank  s_rank  overall  overall_rank  overall_rank_2
0        0         145118       39.0197          1.7264      14,971.85    8,672.07       8       9       8       4       29             9               1
1        1              2          47.0            19.0     237,270.00   12,487.89      10       2       1       3       16             3               1
2        2           1236       15.9716         24.9101     126,992.79    5,098.02       1       1       3       8       13             1               1
3        3         219847       41.9736          3.0682      17,701.64    5,769.23       9       7       6       6       28             7               1
4        4           9837       23.9330          3.2735     172,642.35   52,738.42       4       6       2       1       13             2               1
5        5          64865       24.8281          1.8599      13,159.21    7,075.19       6       8       9       5       28             8               1
6        6           3855       26.5647          3.9304      54,333.56   13,823.64       7       4       4       2       17             4               1
7        7         219549       17.7493          3.3356      17,570.67    5,267.52       2       5       7       7       21             6               1
8        8          34171       23.5205          9.1703      42,136.68    4,594.89       3       3       5       9       20             5               1
9        9        3924120       24.7933          1.1684       4,754.76    4,069.21       5      10      10      10       35            10               1

Please help me fix the if statement above, or recommend a different approach altogether. This of course needs to be as dynamic as possible.

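So you want a custom ranking on your dataframe, where the cluster (/row) with the highest Ave. Spend is always ranked 1, and the cluster (/row) with the highest Ave. Recency is always ranked last.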
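It may help to see first why your if chain fills overall_rank_2 with 1. Both conditions test a property of a whole column rather than of a single row: barring a tie for the top spot, ranks['s_rank'].min() is always 1, so the first branch always runs and assigns the scalar 1 to every row, and the elif and else branches are never reached. A minimal sketch on a made-up three-row frame (the frame toy and its values are illustrative, not your data):

```python
import pandas as pd

# Toy frame, made up for illustration only
toy = pd.DataFrame({'Ave. Spend': [10.0, 30.0, 20.0]})
toy['s_rank'] = toy['Ave. Spend'].rank(ascending=False)

# The condition is a single True/False for the whole column, not one per row
condition = toy['s_rank'].min() == 1
print(condition)                        # True

# Assigning a scalar to a column broadcasts it to every row
if condition:
    toy['overall_rank_2'] = 1
print(toy['overall_rank_2'].tolist())   # [1, 1, 1]
```

To treat two rows specially you need their row labels, which is what idxmax() returns.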

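The solution is five lines. Notes:

  • You had the right idea with DataFrame.drop(). Just use idxmax() to get the index of each of the two rows that need special treatment, and store it, so you don't need a huge unwieldy logical filter expression inside your drop().
  • There is no need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of drop() on towards rank() ...
  • ... via a .sum(axis=1) on your desired columns. There is no need to define a lambda, or to save its output in the temporary column 'overall'.
  • ... then we just feed those rank-sums into rank(), which gives us the values 1..8 here (ten rows minus the two excluded ones), so we add 1 to shift the results of rank() to 2..9. (You can generalize this.)
  • We manually set 'overall_rank' for the max-Ave. Spend and max-Ave. Recency rows.
  • (Yes, you could also implement all this as a custom function whose input is the four Ave. columns or the four *_rank columns.)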

Code: (see the bottom for boilerplate to read in your dataframe; next time please make your example an MCVE, to help us help you)

# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)

# Find the row labels of the highest Ave. Spend and the highest Ave. Recency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()

# Get the overall ranking for every row other than these two; add 1 because rank 1 is reserved for the max-Ave. Spend row
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[['r_rank', 'f_rank', 'm_rank', 's_rank']].sum(axis=1).rank(method='first')

# Set the two special rows by hand, reusing the stored labels (.loc[] accepts a row label plus a column name)
ranks.loc[ismax, 'overall_rank'] = 1
ranks.loc[irmax, 'overall_rank'] = len(ranks)
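The notes mention generalizing the offset and wrapping the logic in a custom function. Here is one possible sketch. The helper name overall_rank_with_pins and its parameters are hypothetical (they are not part of pandas), and it assumes the two pinned rows are different rows:

```python
import pandas as pd


def overall_rank_with_pins(df, rank_cols, first_idx, last_idx):
    """Rank rows by the sum of rank_cols, pinning first_idx to rank 1 and
    last_idx to rank len(df). Hypothetical helper, not a pandas API."""
    if first_idx == last_idx:
        raise ValueError("the same row cannot be pinned both first and last")
    # Sum the component ranks for every row except the two pinned ones
    rest = df.drop(index=[first_idx, last_idx])[rank_cols].sum(axis=1)
    # Shift by 1 because rank 1 is reserved; reindex restores the dropped rows as NaN
    out = (rest.rank(method='first') + 1).reindex(df.index)
    out.loc[first_idx] = 1
    out.loc[last_idx] = len(df)
    return out.astype(int)


# The question's data, reduced to the columns that feed the ranking
ranks = pd.DataFrame({
    'Ave. Recency': [39.0197, 47.0, 15.9716, 41.9736, 23.9330,
                     24.8281, 26.5647, 17.7493, 23.5205, 24.7933],
    'Ave. Frequency': [1.7264, 19.0, 24.9101, 3.0682, 3.2735,
                       1.8599, 3.9304, 3.3356, 9.1703, 1.1684],
    'Ave. Monetary': [14971.85, 237270.00, 126992.79, 17701.64, 172642.35,
                      13159.21, 54333.56, 17570.67, 42136.68, 4754.76],
})
ranks['Ave. Spend'] = ranks['Ave. Monetary'] / ranks['Ave. Frequency']

ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)

ranks['overall_rank'] = overall_rank_with_pins(
    ranks,
    rank_cols=['r_rank', 'f_rank', 'm_rank', 's_rank'],
    first_idx=ranks['Ave. Spend'].idxmax(),
    last_idx=ranks['Ave. Recency'].idxmax(),
)
print(ranks['overall_rank'].tolist())   # [8, 10, 2, 6, 1, 7, 3, 5, 4, 9]
```

Cluster 4 (highest Ave. Spend) comes out at 1 and cluster 1 (highest Ave. Recency) at 10, with the remaining eight clusters filling 2..9. Raising an error when both maxima fall on the same row is just one way to handle that case; pick whatever fits your data.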

Here is the boilerplate to read in your data:

import pandas as pd

from io import StringIO

# Columns: Cluster, No. Customers, Ave. Recency, Ave. Frequency, Ave. Monetary, Ave. Spend
dat = """
0           145118        39.0197       1.7264         14,971.85     8,672.07
1           2             47.0          19.0          237,270.00    12,487.89
2           1236          15.9716       24.9101       126,992.79     5,098.02
3           219847        41.9736       3.0682         17,701.64     5,769.23
4           9837          23.9330       3.2735        172,642.35    52,738.42
5           64865         24.8281       1.8599         13,159.21     7,075.19
6           3855          26.5647       3.9304         54,333.56    13,823.64
7           219549        17.7493       3.3356         17,570.67     5,267.52
8           34171         23.5205       9.1703         42,136.68     4,594.89
9           3924120       24.7933       1.1684          4,754.76     4,069.21 """

# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')

# Raw string for the regex separator, so '\s' is not treated as a string escape
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))
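As an alternative to stripping the commas by hand, read_csv() has a thousands parameter. The sketch below is a variation on the boilerplate rather than a required change. It uses two rows copied from the sample data; the names cols and df are only for illustration:

```python
import pandas as pd
from io import StringIO

# Two rows copied from the sample data, commas left in place
dat = """
0           145118        39.0197       1.7264         14,971.85     8,672.07
1           2             47.0          19.0          237,270.00    12,487.89
"""

cols = ['Cluster', 'No. Customers', 'Ave. Recency',
        'Ave. Frequency', 'Ave. Monetary', 'Ave. Spend']

# thousands=',' asks the parser to treat commas inside numbers as separators
df = pd.read_csv(StringIO(dat), sep=r'\s+', names=cols, thousands=',')
print(df.dtypes)
```

If the two money columns come back as float64, the separators were handled by the parser and the dat.replace(',', '') step can be skipped.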