Pandas DataFrame GroupBy Rank

DataFrame:

        account_id  plan_id     policy_group_nbr    plan_type               split_eff_date  splits
0       470804      8739131     Conversion732       Onsite Medical Center   1/19/2022       Bob Smith (28.2) | John Doe (35.9) | A...
1       470804      8739131     Conversion732       Onsite Medical Center   1/21/2022       Bob Smith (19.2) | John Doe (34.6) | A...
2       470809      2644045     801790              401(k)                  1/18/2022       Jim Jones (100)
3       470809      2644045     801790              401(k)                  1/5/2022        Other Name (50) | Jim Jones (50)
4       470809      2738854     801789              401(k)                  1/18/2022       Jim Jones (100)
... ... ... ... ... ... ...
1720    3848482     18026734    24794               Accident                1/20/2022       Bill Underwood (50) | Jim Jones (50)
1721    3848482     18026781    BCSC                FSA Admin               1/20/2022       Bill Underwood (50) | Jim Jones (50)
1722    3927880     19602958    Consulting          Other                   1/20/2022       Bill Brown (50) | Tim Scott (50)
1723    3927880     19863300    Producer Expense    5500 Filing             1/20/2022       Bill Brown (50) | Tim Scott (50)
1724    3927880     19863300    Producer Expense    5500 Filing             1/21/2022       Bill Brown (50) | Tim Scott (50)

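I need to group by (account_id, plan_id, policy_group_nbr, plan_type) and sort by split_eff_date (descending), so that I can drop every row in each group except the one with the most recent date, keeping all columns. I can get the rank, but when I try to pass an argument to sort_values() inside the lambda function, I get a TypeError.

Works as expected: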

splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values().rank())

TypeError: incompatible index of inserted column with frame index

splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values(ascending=False).rank())

Passing the axis argument doesn't seem to help either... Is this a simple syntax issue, or am I misunderstanding the function?

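Use .transform().

It is easier, and usually faster, to do it that way.

Easier because, when you sort in descending order, the index of the result no longer matches the original DataFrame when you try to assign it back. I tried to get it working without the index in .groupby(), but could not.

Link to the documentation for .transform(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.transform.html

I suggest using .transform() like this, and be sure to pass the ascending=False kwarg to .rank():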

df["rank2"] = df.groupby(["account_id", "plan_id", "policy_group_nbr", "plan_type"])[ 
    "split_eff_date" 
].transform( 
    lambda x: x.sort_values(ascending=False).rank(ascending=False, method="first") 
)     
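
As a possible simplification (not part of the suggestion above), the grouped column also has its own .rank() method, which returns a result aligned to the original index, so the lambda and the sort may not be needed at all. The sketch below rebuilds only the first five rows of the sample data; the column name rank_desc is made up for illustration, and it assumes the dates are parsed with pd.to_datetime first, since ranking the raw strings would order "1/18/2022" before "1/5/2022".

```python
import pandas as pd

# First five rows of the sample data from the question (splits column omitted).
df = pd.DataFrame(
    {
        "account_id": [470804, 470804, 470809, 470809, 470809],
        "plan_id": [8739131, 8739131, 2644045, 2644045, 2738854],
        "policy_group_nbr": ["Conversion732", "Conversion732", "801790", "801790", "801789"],
        "plan_type": ["Onsite Medical Center", "Onsite Medical Center", "401(k)", "401(k)", "401(k)"],
        "split_eff_date": ["1/19/2022", "1/21/2022", "1/18/2022", "1/5/2022", "1/18/2022"],
    }
)

# Parse the dates so they compare chronologically rather than as text.
df["split_eff_date"] = pd.to_datetime(df["split_eff_date"], format="%m/%d/%Y")

keys = ["account_id", "plan_id", "policy_group_nbr", "plan_type"]

# Rank within each group, newest date first; "rank_desc" is a hypothetical column name.
df["rank_desc"] = df.groupby(keys)["split_eff_date"].rank(ascending=False, method="first")

print(df)
```

With method="first", ties inside a group are broken by row order, so each group ends up with exactly one row ranked 1.0.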

Results of both rankings (I only took the first 5 rows from your sample data):

In [93]: df
Out[93]: 
   account_id  plan_id policy_group_nbr              plan_type split_eff_date  rank  rank2
3      470809  2644045           801790                 401(k)     2022-01-05   1.0    2.0
2      470809  2644045           801790                 401(k)     2022-01-18   2.0    1.0
4      470809  2738854           801789                 401(k)     2022-01-18   1.0    1.0
0      470804  8739131    Conversion732  Onsite Medical Center     2022-01-19   1.0    2.0
1      470804  8739131    Conversion732  Onsite Medical Center     2022-01-21   2.0    1.0
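
Since the end goal in the question is to keep only the most recent row per group, here is a minimal sketch of that last step, again on the first five sample rows only. It is one way to do it, not necessarily the only one: sort by date descending and keep the first row seen for each key combination. The variable name latest is made up for illustration, and the dates are assumed to be parsed with pd.to_datetime beforehand.

```python
import pandas as pd

# First five rows of the sample data from the question (splits column omitted).
df = pd.DataFrame(
    {
        "account_id": [470804, 470804, 470809, 470809, 470809],
        "plan_id": [8739131, 8739131, 2644045, 2644045, 2738854],
        "policy_group_nbr": ["Conversion732", "Conversion732", "801790", "801790", "801789"],
        "plan_type": ["Onsite Medical Center", "Onsite Medical Center", "401(k)", "401(k)", "401(k)"],
        "split_eff_date": ["1/19/2022", "1/21/2022", "1/18/2022", "1/5/2022", "1/18/2022"],
    }
)
df["split_eff_date"] = pd.to_datetime(df["split_eff_date"], format="%m/%d/%Y")

keys = ["account_id", "plan_id", "policy_group_nbr", "plan_type"]

# Newest dates first, then keep one row per group; sort_index() restores the original row order.
latest = (
    df.sort_values("split_eff_date", ascending=False)
    .drop_duplicates(subset=keys, keep="first")
    .sort_index()
)

print(latest)
```

Filtering on the rank column (for example, keeping rows where the descending rank equals 1) should give the same rows. If two rows in one group share the same latest date, which of them survives here depends on the sort, so add a tie-breaking column to sort_values if that matters.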