Adding new column based on combined criteria in Pandas Groupby
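Following on from my previous question (thanks to those who responded), I am again stuck using groupby in Pandas to achieve something I suspect is possible. Here is what I am trying to do, using the following sample dataframe: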
import pandas as pd

data_initial = {
    "account_id": ['1001', '1001', '1001', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1003', '1003', '1003', '1003', '1003', '1003'],
    "data_type": ['payment', 'payment', 'payment', 'payment', 'payment', 'plan', 'payment', 'plan', 'plan', 'payment', 'payment', 'payment', 'payment', 'payment', 'plan', 'payment', 'payment', 'payment', 'payment'],
    "transaction_date": ['2022-04-01', '2022-04-12', '2022-05-02', '2022-02-02', '2022-03-01', '2022-03-15', '2022-04-01', '2022-04-01', '2022-04-13', '2022-04-26', '2022-05-01', '2022-05-04', '2022-05-10', '2022-03-10', '2022-03-25', '2022-04-05', '2022-04-16', '2022-04-24', '2022-05-05'],
    "amount": ['-50', '-40', '-60', '-30', '-25', '250', '-50', '200', '200', '-25', '-25', '-25', '-25', '-20', '100', '-25', '-25', '-25', '-25'],
}
df = pd.DataFrame(data_initial)
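I would like to effectively group by account_id and then apply the following logic:
IF data_type is "payment" AND {account_id has no data_type = "plan" OR the record's transaction_date is BEFORE any data_type = "plan" record} THEN new column classification = "receipt_not_plan_related"
IF data_type is "payment" AND {account_id has a data_type = "plan" AND transaction_date is AFTER any data_type = "plan" record} THEN new column classification = "receipt_on_plan"
IF data_type is "plan" AND it is the only instance of "plan" THEN new column classification = "only"
IF data_type is "plan" AND it is the first instance of "plan" THEN new column classification = "initial"
IF data_type is "plan" AND it is neither the first nor the last instance of "plan" THEN new column classification = "expired"
IF data_type is "plan" AND it is the last instance of "plan" THEN new column classification = "current"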
So the result for the sample dataframe would be as follows:
Thanks again to anyone who can help. Much appreciated.
You can do this with np.select and a couple of helper columns:
import numpy as np

# Running count of 'plan' rows within each account (0 before the first plan)
df['plans'] = df.groupby('account_id')['data_type'].transform(lambda x: x.eq('plan').cumsum())
# Total number of plans per account
df['n_plans'] = df.groupby('account_id')['plans'].transform('max')

is_payment = df['data_type'].eq('payment')
is_plan = df['data_type'].eq('plan')

# Conditions are evaluated in order; the first match wins
df['classification'] = np.select(
    [is_payment & df['plans'].eq(0),
     is_payment & df['plans'].gt(0),
     is_plan & df['n_plans'].eq(1),
     is_plan & df['plans'].eq(1),
     is_plan & df['plans'].gt(1) & df['plans'].lt(df['n_plans']),
     is_plan & df['plans'].eq(df['n_plans'])],
    ['receipt_not_plan_related',
     'receipt_on_plan',
     'only',
     'initial',
     'expired',
     'current'],
    # String default for unmatched rows; the implicit default of 0 may not
    # share a dtype with the string choices on newer NumPy versions
    default='')

print(df.drop(columns=['plans', 'n_plans']))
    account_id  data_type  transaction_date  amount            classification
0         1001    payment        2022-04-01     -50  receipt_not_plan_related
1         1001    payment        2022-04-12     -40  receipt_not_plan_related
2         1001    payment        2022-05-02     -60  receipt_not_plan_related
3         1002    payment        2022-02-02     -30  receipt_not_plan_related
4         1002    payment        2022-03-01     -25  receipt_not_plan_related
5         1002       plan        2022-03-15     250                   initial
6         1002    payment        2022-04-01     -50           receipt_on_plan
7         1002       plan        2022-04-01     200                   expired
8         1002       plan        2022-04-13     200                   current
9         1002    payment        2022-04-26     -25           receipt_on_plan
10        1002    payment        2022-05-01     -25           receipt_on_plan
11        1002    payment        2022-05-04     -25           receipt_on_plan
12        1002    payment        2022-05-10     -25           receipt_on_plan
13        1003    payment        2022-03-10     -20  receipt_not_plan_related
14        1003       plan        2022-03-25     100                      only
15        1003    payment        2022-04-05     -25           receipt_on_plan
16        1003    payment        2022-04-16     -25           receipt_on_plan
17        1003    payment        2022-04-24     -25           receipt_on_plan
18        1003    payment        2022-05-05     -25           receipt_on_plan
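To see what the two helper columns contain, here is a minimal sketch on a made-up single-account frame. The name `demo` and its values are illustrative and not from the question, and the running count is spelled slightly differently from the answer (an integer cumsum grouped by account) but is meant to produce the same numbers. `plans` stays at 0 until the first plan row, and `n_plans` is the account's total.

```python
import pandas as pd

# Hypothetical mini example: one account, rows already in date order.
demo = pd.DataFrame({
    "account_id": ["A", "A", "A", "A", "A"],
    "data_type": ["payment", "plan", "payment", "plan", "payment"],
})

# Running count of 'plan' rows within each account.
demo["plans"] = (
    demo["data_type"].eq("plan").astype(int)
    .groupby(demo["account_id"]).cumsum()
)
# Total number of plans per account, broadcast to every row.
demo["n_plans"] = demo.groupby("account_id")["plans"].transform("max")

print(demo["plans"].tolist())    # [0, 1, 1, 2, 2]
print(demo["n_plans"].tolist())  # [2, 2, 2, 2, 2]
```

A payment with `plans == 0` precedes every plan, a plan with `plans == 1` is the first one, and a plan with `plans == n_plans` is the last one, which is what the np.select conditions test.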
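Note that the records need to be sorted by 'transaction_date' in ascending order within each 'account_id' for this to work, because the "before", "first" and "last" conditions are checked using a cumulative sum computed with GroupBy.transform().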
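If the data is not already in that order, one way to prepare it is sketched below. The frame `raw` and its values are hypothetical. Converting to datetime is a precaution: ISO-formatted strings like the ones in the question happen to sort correctly as text, but other date formats would not.

```python
import pandas as pd

# Hypothetical unsorted input; the values are illustrative only.
raw = pd.DataFrame({
    "account_id": ["1002", "1001", "1002", "1001"],
    "data_type": ["payment", "payment", "plan", "plan"],
    "transaction_date": ["2022-04-01", "2022-05-02", "2022-03-15", "2022-04-12"],
})

# Parse the dates so ordering is chronological, then sort by account
# and date and rebuild a clean 0..n-1 index.
raw["transaction_date"] = pd.to_datetime(raw["transaction_date"])
ordered = (
    raw.sort_values(["account_id", "transaction_date"])
       .reset_index(drop=True)
)

print(ordered[["account_id", "data_type"]])
```

Rows that share an account and a date, such as the 2022-04-01 payment and plan for account 1002 in the sample data, are not ordered by the date alone, so their relative position decides how they are classified. An extra tie-breaking sort column may be needed if that matters for your data.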