Pandas Ranking for String Columns

Question

I have a sample DataFrame like the one below. I want to rank its rows based on item_number and location_id. In SQL I could have done something similar with a window function (dense_rank with over() partition by).

import pandas as pd

df = pd.DataFrame({'item_number': [1029980, 1029980, 1029980, 1029980, 1029980, 
                                   1029980, 1029980, 1029980, 1029980, 1029980],
                   'location_id': ['L3-25-AA-05-B', 'L3-25-AA-05-B', 'L3-25-AA-05-B', 'L3-25-AA-05-B', 'L3-25-AA-05-B', 
                                   'L4-25-AA-05-B', 'L4-25-AA-05-B','L4-25-AA-05-B', 'L4-25-AA-05-B', 'L4-25-AA-05-B'],
                   'Date': ['2021-10-01', '2021-10-02', '2021-10-03', '2021-10-04', '2021-10-05', 
                            '2021-10-01', '2021-10-02', '2021-10-03', '2021-10-04', '2021-10-05']})

item_number	location_id	Date
1029980	L3-25-AA-05-B	2021-10-01
1029980	L3-25-AA-05-B	2021-10-02
1029980	L3-25-AA-05-B	2021-10-03
1029980	L3-25-AA-05-B	2021-10-04
1029980	L3-25-AA-05-B	2021-10-05
1029980	L4-25-AA-05-B	2021-10-01
1029980	L4-25-AA-05-B	2021-10-02
1029980	L4-25-AA-05-B	2021-10-03
1029980	L4-25-AA-05-B	2021-10-04
1029980	L4-25-AA-05-B	2021-10-05

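I want the data to look like this. The rank is computed per group of item_number and location_id: rows with the same item_number and location_id belong to the same group and should be ranked by date within it (in the expected output, the latest date gets rank 1).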

item_number	location_id	Date	Rank
1029980	L3-25-AA-05-B	2021-10-01	5
1029980	L3-25-AA-05-B	2021-10-02	4
1029980	L3-25-AA-05-B	2021-10-03	3
1029980	L3-25-AA-05-B	2021-10-04	2
1029980	L3-25-AA-05-B	2021-10-05	1
1029980	L4-25-AA-05-B	2021-10-01	5
1029980	L4-25-AA-05-B	2021-10-02	4
1029980	L4-25-AA-05-B	2021-10-03	3
1029980	L4-25-AA-05-B	2021-10-04	2
1029980	L4-25-AA-05-B	2021-10-05	1

I tried this code, but it raises an error because the columns are all strings.

test['rank'] = test.groupby(['item_number','location_id']).rank()

The code above gives me this error:

DataError: No numeric types to aggregate

Can anyone help me with this?

Answer 1

You can use:

test.dtypes

to see what type each of the columns you are using has (in case they are not numeric), and then perhaps convert them with astype: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

test.astype({"item_number": "int"}).groupby(['item_number','location_id']).rank()

I'm not sure whether this works for location_id, though.
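As a rough sketch of the dtype idea above: assuming the goal is to rank by Date within each group, it is the Date column (rather than location_id) that needs an orderable dtype, and the grouping keys can stay as strings. The frame below is a shortened copy of the question's data, and the column name rank is just the one used in the question.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only.
df = pd.DataFrame({
    'item_number': [1029980, 1029980, 1029980, 1029980],
    'location_id': ['L3-25-AA-05-B', 'L3-25-AA-05-B',
                    'L4-25-AA-05-B', 'L4-25-AA-05-B'],
    'Date': ['2021-10-01', '2021-10-02', '2021-10-01', '2021-10-02'],
})

# Date starts out as strings (object dtype); parse it into datetimes first.
df['Date'] = pd.to_datetime(df['Date'])

# Select the column to rank explicitly; the grouping keys are left untouched.
# rank() returns floats, so cast to int once there are no missing values.
df['rank'] = (
    df.groupby(['item_number', 'location_id'])['Date']
      .rank(ascending=False)
      .astype(int)
)
print(df)
```

Selecting ['Date'] before calling rank() keeps pandas from trying to rank every remaining column in the frame.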

Answer 2

IIUC, you can group by item_number and location_id and number the rows of each group in reverse order with cumcount:

df['rank'] = df.groupby(['item_number','location_id'], as_index=False)['Date'].cumcount(ascending=False)+1

Output:

   item_number    location_id        Date  rank
0      1029980  L3-25-AA-05-B  2021-10-01     5
1      1029980  L3-25-AA-05-B  2021-10-02     4
2      1029980  L3-25-AA-05-B  2021-10-03     3
3      1029980  L3-25-AA-05-B  2021-10-04     2
4      1029980  L3-25-AA-05-B  2021-10-05     1
5      1029980  L4-25-AA-05-B  2021-10-01     5
6      1029980  L4-25-AA-05-B  2021-10-02     4
7      1029980  L4-25-AA-05-B  2021-10-03     3
8      1029980  L4-25-AA-05-B  2021-10-04     2
9      1029980  L4-25-AA-05-B  2021-10-05     1
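One thing worth keeping in mind with this approach is that cumcount numbers rows by their position within the group, not by the Date values. A minimal sketch, assuming the rows may arrive unsorted, is to sort by Date first and let index alignment put the result back; the shuffled frame below is made up from the question's values.

```python
import pandas as pd

# Shuffled subset of the question's data (row order is deliberately mixed up).
df = pd.DataFrame({
    'item_number': [1029980] * 6,
    'location_id': ['L3-25-AA-05-B', 'L4-25-AA-05-B', 'L3-25-AA-05-B',
                    'L4-25-AA-05-B', 'L3-25-AA-05-B', 'L4-25-AA-05-B'],
    'Date': ['2021-10-02', '2021-10-03', '2021-10-01',
             '2021-10-01', '2021-10-03', '2021-10-02'],
})

# ISO-formatted date strings sort chronologically, so sorting the strings works here.
# Newest first, so position within each group matches the desired rank order.
ordered = df.sort_values('Date', ascending=False)

# cumcount is 0-based, hence the +1. The assignment aligns on the original index,
# so df keeps its row order.
df['rank'] = ordered.groupby(['item_number', 'location_id']).cumcount() + 1
print(df)
```

Unlike a dense rank, this gives duplicate dates within a group different numbers, so it behaves more like SQL's row_number than dense_rank.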

Answer 3

In your case:

df['new'] = df.groupby(['item_number','location_id'])['Date'].rank(ascending=False)
0    5.0
1    4.0
2    3.0
3    2.0
4    1.0
5    5.0
6    4.0
7    3.0
8    2.0
9    1.0
Name: Date, dtype: float64
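To get closer to the SQL dense_rank mentioned in the question, a hedged variant of the line above passes method='dense' and casts the float result to int. The duplicated date in the sample below is hypothetical, added only to show how ties are handled.

```python
import pandas as pd

# Small frame in the question's shape; the repeated 2021-10-02 row is made up
# to demonstrate tie handling.
df = pd.DataFrame({
    'item_number': [1029980] * 5,
    'location_id': ['L3-25-AA-05-B'] * 3 + ['L4-25-AA-05-B'] * 2,
    'Date': ['2021-10-01', '2021-10-02', '2021-10-02',
             '2021-10-01', '2021-10-02'],
})
df['Date'] = pd.to_datetime(df['Date'])

# method='dense': tied dates share a rank and the next rank is not skipped.
# ascending=False: the latest date in each group gets rank 1.
df['rank'] = (
    df.groupby(['item_number', 'location_id'])['Date']
      .rank(method='dense', ascending=False)
      .astype(int)
)
print(df)
```

With the default method='average', the two tied rows would instead both receive 1.5, which is why the cast to int is only safe alongside a method such as 'dense', 'min', 'max', or 'first'.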

python

rank

pandas

ranking-functions

pandas-groupby