Pandas DataFrame GroupBy Rank
DataFrame:
account_id plan_id policy_group_nbr plan_type split_eff_date splits
0 470804 8739131 Conversion732 Onsite Medical Center 1/19/2022 Bob Smith (28.2) | John Doe (35.9) | A...
1 470804 8739131 Conversion732 Onsite Medical Center 1/21/2022 Bob Smith (19.2) | John Doe (34.6) | A...
2 470809 2644045 801790 401(k) 1/18/2022 Jim Jones (100)
3 470809 2644045 801790 401(k) 1/5/2022 Other Name (50) | Jim Jones (50)
4 470809 2738854 801789 401(k) 1/18/2022 Jim Jones (100)
... ... ... ... ... ... ...
1720 3848482 18026734 24794 Accident 1/20/2022 Bill Underwood (50) | Jim Jones (50)
1721 3848482 18026781 BCSC FSA Admin 1/20/2022 Bill Underwood (50) | Jim Jones (50)
1722 3927880 19602958 Consulting Other 1/20/2022 Bill Brown (50) | Tim Scott (50)
1723 3927880 19863300 Producer Expense 5500 Filing 1/20/2022 Bill Brown (50) | Tim Scott (50)
1724 3927880 19863300 Producer Expense 5500 Filing 1/21/2022 Bill Brown (50) | Tim Scott (50)
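I need to group by (account_id, plan_id, policy_group_nbr, plan_type) and sort by split_eff_date (descending), so that I can drop all rows in each group except the one with the most recent date, keeping all columns. I can get the rank, but when I try to pass an argument to the function inside the lambda, I get a TypeError.
Works as expected: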
splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values().rank())
TypeError: incompatible index of inserted column with frame index
splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values(ascending=False).rank())
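Passing the axis argument doesn't seem to help either... Is this a simple syntax problem, or am I not understanding the function correctly?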
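Use .transform(). It is easier, and usually faster, to do it that way.
Easier because when you sort in descending order, the index no longer matches when you try to assign the result back to the original DataFrame. I tried not using the index in .groupby(), but couldn't get that to work.
Link to the documentation for .transform(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.transform.html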
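To illustrate the index point, here is a minimal sketch on toy data. The column names g and d and their values are made up for illustration and are not from the question. The Series that .transform() returns carries the same index as the original frame, so it can be assigned as a new column directly. The sort step is left out in this sketch, since .rank() works out the ordering from the values themselves.

```python
import pandas as pd

# Toy data (hypothetical names and values): two groups keyed by "g".
df = pd.DataFrame(
    {
        "g": ["a", "a", "b"],
        "d": pd.to_datetime(["2022-01-05", "2022-01-18", "2022-01-18"]),
    }
)

# Rank dates within each group, newest first.
ranked = df.groupby("g")["d"].transform(lambda x: x.rank(ascending=False))

# The result is indexed like df, so column assignment lines up row by row.
print(ranked.index.equals(df.index))
df["rank"] = ranked
print(df)
```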
I'd suggest using .transform() like this, and be sure to pass the ascending=False kwarg to .rank():
df["rank2"] = df.groupby(["account_id", "plan_id", "policy_group_nbr", "plan_type"])[
    "split_eff_date"
].transform(
    lambda x: x.sort_values(ascending=False).rank(ascending=False, method="first")
)
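As a possible alternative to the lambda, pandas also has a built-in group-wise .rank() that can be called on the grouped column directly. The sketch below rebuilds the first five sample rows (without the splits column) and assumes the dates arrive as month/day/year strings, so it parses them first. That parsing step is an assumption: it is not shown in the question, though the ISO-formatted dates in the output suggest something like it was done.

```python
import pandas as pd

keys = ["account_id", "plan_id", "policy_group_nbr", "plan_type"]

# First five rows of the question's sample (splits column omitted).
df = pd.DataFrame(
    {
        "account_id": [470804, 470804, 470809, 470809, 470809],
        "plan_id": [8739131, 8739131, 2644045, 2644045, 2738854],
        "policy_group_nbr": ["Conversion732", "Conversion732", "801790", "801790", "801789"],
        "plan_type": ["Onsite Medical Center"] * 2 + ["401(k)"] * 3,
        "split_eff_date": ["1/19/2022", "1/21/2022", "1/18/2022", "1/5/2022", "1/18/2022"],
    }
)

# Assumption: month/day/year strings. Compared as plain strings,
# "1/5/2022" would sort after "1/18/2022", so parse to datetime first.
df["split_eff_date"] = pd.to_datetime(df["split_eff_date"], format="%m/%d/%Y")

# Built-in group-wise rank; 1.0 marks the most recent date in each group.
df["rank2"] = df.groupby(keys)["split_eff_date"].rank(ascending=False, method="first")
print(df)
```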
Results of both rankings. I only took the first 5 rows from your sample data:
In [93]: df
Out[93]:
account_id plan_id policy_group_nbr plan_type split_eff_date rank rank2
3 470809 2644045 801790 401(k) 2022-01-05 1.0 2.0
2 470809 2644045 801790 401(k) 2022-01-18 2.0 1.0
4 470809 2738854 801789 401(k) 2022-01-18 1.0 1.0
0 470804 8739131 Conversion732 Onsite Medical Center 2022-01-19 1.0 2.0
1 470804 8739131 Conversion732 Onsite Medical Center 2022-01-21 2.0 1.0
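For the stated goal of keeping only the most recent row per group, a rough sketch of two ways to finish the job follows. It rebuilds the first five sample rows with the dates already parsed. The variable names latest and latest_alt are made up for illustration, and the second option swaps ranking for a sort followed by drop_duplicates.

```python
import pandas as pd

keys = ["account_id", "plan_id", "policy_group_nbr", "plan_type"]

# First five rows of the question's sample (splits column omitted).
df = pd.DataFrame(
    {
        "account_id": [470804, 470804, 470809, 470809, 470809],
        "plan_id": [8739131, 8739131, 2644045, 2644045, 2738854],
        "policy_group_nbr": ["Conversion732", "Conversion732", "801790", "801790", "801789"],
        "plan_type": ["Onsite Medical Center"] * 2 + ["401(k)"] * 3,
        "split_eff_date": pd.to_datetime(
            ["2022-01-19", "2022-01-21", "2022-01-18", "2022-01-05", "2022-01-18"]
        ),
    }
)

# Option 1: rank within each group (newest first), then keep rank 1.
rank = df.groupby(keys)["split_eff_date"].rank(ascending=False, method="first")
latest = df[rank == 1]

# Option 2: sort newest first, then keep the first row seen per group.
latest_alt = (
    df.sort_values("split_eff_date", ascending=False)
    .drop_duplicates(subset=keys, keep="first")
    .sort_index()
)
print(latest)
```

With method="first", ties on the same date within a group still get distinct ranks, so exactly one row per group survives the filter.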