Adding new column based on combined criteria in Pandas Groupby

Question

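Following on from my previous question (thanks to those who responded), I am stuck again trying to achieve something I suspect is possible with groupby in Pandas. Here is what I am trying to do, using the following example dataframe: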

import pandas as pd

data_initial = {
"account_id": ['1001', '1001', '1001', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1003', '1003', '1003', '1003', '1003', '1003',],
"data_type": ['payment', 'payment', 'payment', 'payment', 'payment', 'plan', 'payment', 'plan', 'plan', 'payment', 'payment', 'payment', 'payment', 'payment', 'plan', 'payment', 'payment', 'payment', 'payment',],
"transaction_date": ['2022-04-01', '2022-04-12', '2022-05-02', '2022-02-02', '2022-03-01', '2022-03-15', '2022-04-01', '2022-04-01', '2022-04-13', '2022-04-26', '2022-05-01', '2022-05-04', '2022-05-10', '2022-03-10', '2022-03-25', '2022-04-05', '2022-04-16', '2022-04-24', '2022-05-05',],
"amount": ['-50', '-40', '-60', '-30', '-25', '250', '-50', '200', '200', '-25', '-25', '-25', '-25', '-20', '100', '-25', '-25', '-25', '-25',],}

# Build the dataframe that the rest of the post refers to as df
df = pd.DataFrame(data_initial)

I would like to group by account_id and then apply the following logic:

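IF data_type is "payment" AND {the account_id has no data_type = "plan" record OR the record's transaction_date is BEFORE any data_type = "plan" record} THEN new column classification = "receipt_not_plan_related"
IF data_type is "payment" AND {the account_id has a data_type = "plan" record AND the record's transaction_date is AFTER any data_type = "plan" record} THEN new column classification = "receipt_on_plan"
IF data_type is "plan" AND it is the only instance of "plan" THEN new column classification = "only"
IF data_type is "plan" AND it is the first instance of "plan" THEN new column classification = "initial"
IF data_type is "plan" AND it is neither the first nor the last instance of "plan" THEN new column classification = "expired"
IF data_type is "plan" AND it is the last instance of "plan" THEN new column classification = "current"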

So the result for the example dataframe would be as follows:

Thanks again to anyone who can help. Much appreciated.

Answer 1

You can do this with np.select and a couple of helper columns:

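import numpy as np

# Assumes df = pd.DataFrame(data_initial) from the question, sorted by date within each account.

# Running count of 'plan' rows seen so far within each account
df['plans'] = df.groupby('account_id')['data_type'].transform(lambda x: x.eq('plan').cumsum())
# Total number of 'plan' rows in the account, repeated on every row
df['n_plans'] = df.groupby('account_id')['plans'].transform('max')

is_payment = df['data_type'].eq('payment')
is_plan = df['data_type'].eq('plan')
# Conditions are checked in order; the first one that matches wins,
# so 'only' must come before 'initial' and 'current'.
df['classification'] = np.select([is_payment & df['plans'].eq(0),
                                  is_payment & df['plans'].gt(0),
                                  is_plan & df['n_plans'].eq(1),
                                  is_plan & df['plans'].eq(1),
                                  is_plan & df['plans'].gt(1) & df['plans'].lt(df['n_plans']),
                                  is_plan & df['plans'].eq(df['n_plans'])],
                                 ['receipt_not_plan_related',
                                  'receipt_on_plan',
                                  'only',
                                  'initial',
                                  'expired',
                                  'current'],
                                 # np.select's built-in default is the integer 0; an explicit
                                 # string default keeps the result column purely textual.
                                 default='')

print(df.drop(columns=['plans', 'n_plans']))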

   account_id data_type transaction_date amount            classification
0        1001   payment       2022-04-01    -50  receipt_not_plan_related
1        1001   payment       2022-04-12    -40  receipt_not_plan_related
2        1001   payment       2022-05-02    -60  receipt_not_plan_related
3        1002   payment       2022-02-02    -30  receipt_not_plan_related
4        1002   payment       2022-03-01    -25  receipt_not_plan_related
5        1002      plan       2022-03-15    250                   initial
6        1002   payment       2022-04-01    -50           receipt_on_plan
7        1002      plan       2022-04-01    200                   expired
8        1002      plan       2022-04-13    200                   current
9        1002   payment       2022-04-26    -25           receipt_on_plan
10       1002   payment       2022-05-01    -25           receipt_on_plan
11       1002   payment       2022-05-04    -25           receipt_on_plan
12       1002   payment       2022-05-10    -25           receipt_on_plan
13       1003   payment       2022-03-10    -20  receipt_not_plan_related
14       1003      plan       2022-03-25    100                      only
15       1003   payment       2022-04-05    -25           receipt_on_plan
16       1003   payment       2022-04-16    -25           receipt_on_plan
17       1003   payment       2022-04-24    -25           receipt_on_plan
18       1003   payment       2022-05-05    -25           receipt_on_plan
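To see why those conditions work, it can help to look at the two helper columns before they are dropped. The following is only a minimal sketch on a made-up single account (the `demo` frame and its account id "A" are hypothetical and not part of the question's data); it builds the same `plans` and `n_plans` columns using a grouped cumulative sum.

```python
import pandas as pd

# Hypothetical mini-frame: one account, rows already in date order.
demo = pd.DataFrame({
    "account_id": ["A", "A", "A", "A", "A"],
    "data_type": ["payment", "plan", "payment", "plan", "payment"],
})

# 1 for 'plan' rows, 0 otherwise, then a running total per account.
is_plan_flag = demo["data_type"].eq("plan").astype(int)
demo["plans"] = is_plan_flag.groupby(demo["account_id"]).cumsum()

# Total number of plans in the account, broadcast to every row.
demo["n_plans"] = demo.groupby("account_id")["plans"].transform("max")

print(demo)
```

A payment with `plans == 0` comes before any plan, while a plan row with `plans == n_plans` is the last plan of its account.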

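Note that the records need to be sorted by 'transaction_date' in ascending order within each 'account_id' for this to work, because the "before", "first" and "last" conditions are checked using a cumulative sum computed with GroupBy.transform().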
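If the input is not already in that order, one possible way to prepare it is sketched below. The `unsorted` frame is a hypothetical example that reuses the question's column names. Parsing the dates with `pd.to_datetime` is an assumption on my part: ISO-formatted strings like the ones in the question would also sort correctly as text, but parsing guards against other date formats.

```python
import pandas as pd

# Hypothetical unsorted input using the question's column names.
unsorted = pd.DataFrame({
    "account_id": ["1003", "1001", "1003", "1001"],
    "data_type": ["payment", "payment", "plan", "payment"],
    "transaction_date": ["2022-04-05", "2022-04-12", "2022-03-25", "2022-04-01"],
})

# Parse the strings so the sort is chronological.
unsorted["transaction_date"] = pd.to_datetime(unsorted["transaction_date"])

# Order by account first, then by date within each account.
df_sorted = (
    unsorted.sort_values(["account_id", "transaction_date"])
            .reset_index(drop=True)
)

print(df_sorted)
```

The helper columns and np.select step can then be applied to `df_sorted` unchanged.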


python

conditional-statements

pandas

pandas-groupby