Filtering rows in Pandas Groupby based on a condition within the group
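Despite a lot of searching, I have been struggling with this for a few days. I have come across many similar questions, but I couldn't find a solution that works for me.
This is my starting dataframe: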
import pandas as pd

data = {
    "account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
    "data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment", "payment", "payment", "payment", "initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13", "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
    "amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],
}
df = pd.DataFrame(data)
Once loaded into Pandas, this becomes the DataFrame df.
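I want to group by account_id and drop any entries that come before the data_type == "initial_balance" entry. Once I have that, I can cumsum over the remaining rows of each group to arrive at the current balance. The desired result (including the cumsum "account_balance" column) is:
[Image: desired result including the cumsum "account_balance" column]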
I have tried the following:
df.groupby("account_id").filter(lambda x:x["transaction_date"]>=x[x["data_type"]=="initial_balance"]["transaction_date"])
But this only raises an error: ValueError: Can only compare identically-labelled Series objects
I hope I have provided enough information for someone to help. Any help is much appreciated.
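For reference, the error can be reproduced outside of groupby. The sketch below uses the two rows of account 1001 from the sample data; the names g, start, raised and keep are illustrative only. The comparison has a two-row Series on the left and a one-row Series on the right, and pandas refuses to compare Series whose indexes differ. Separately, groupby(...).filter() expects the function to return a single True/False per group, so it keeps or drops whole groups rather than individual rows.

```python
import pandas as pd

# Two rows of one account, taken from the sample data above.
g = pd.DataFrame({
    "data_type": ["initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14"],
})

# One-row Series (index [0]); the left-hand side below has index [0, 1].
start = g.loc[g["data_type"] == "initial_balance", "transaction_date"]

try:
    g["transaction_date"] >= start  # Series vs Series with different indexes
    raised = False
except ValueError:
    raised = True

# Reducing the right-hand side to a scalar makes the comparison element-wise.
keep = g["transaction_date"] >= start.iloc[0]
print(raised, keep.tolist())
```

This only explains the failure; it still does not make filter() suitable for dropping individual rows within a group.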
This should do it.
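grouped_df = df.groupby("account_id")
groups = []
for group in df["account_id"].unique():
    group_df = grouped_df.get_group(group)
    # Label-based slice: keep rows from the first initial_balance row onward.
    first_idx = group_df[group_df["data_type"] == "initial_balance"].index[0]
    group_df = group_df.loc[first_idx:, :].copy()  # copy avoids SettingWithCopyWarning
    # Overwrite amount with the running balance.
    group_df["amount"] = group_df["amount"].cumsum()
    groups.append(group_df)
df = pd.concat(groups)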
Output:
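| | account_id | data_type | transaction_date | amount |
|---|---|---|---|---|
| 0 | 1001 | initial_balance | 2022-04-01 | 100 |
| 1 | 1001 | payment | 2022-04-14 | 80 |
| 3 | 1002 | initial_balance | 2022-04-02 | 200 |
| 4 | 1002 | payment | 2022-04-13 | 180 |
| 5 | 1002 | payment | 2022-05-01 | 160 |
| 6 | 1002 | payment | 2022-05-03 | 140 |
| 8 | 1003 | initial_balance | 2022-04-10 | 150 |
| 9 | 1003 | payment | 2022-04-20 | 100 |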
You can do it like this:
m = (df['data_type'] == "initial_balance").groupby(df['account_id']).cummax()
df_out = df[m].groupby('account_id')['amount'].cumsum()\
.reset_index(name='account_balance')\
.merge(df, left_on='index', right_index=True)
df_out
Output:
index account_balance account_id data_type transaction_date amount
0 0 100 1001 initial_balance 2022-04-01 100
1 1 80 1001 payment 2022-04-14 -20
2 3 200 1002 initial_balance 2022-04-02 200
3 4 180 1002 payment 2022-04-13 -20
4 5 160 1002 payment 2022-05-01 -20
5 6 140 1002 payment 2022-05-03 -20
6 8 150 1003 initial_balance 2022-04-10 150
7 9 100 1003 payment 2022-04-20 -50
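Details: create a boolean Series that is True where data_type equals initial_balance, then group that Series by account_id and apply cummax(). This gives a mask on the original dataframe that keeps the initial_balance record and every record after it within each account_id.
Next, filter the dataframe with that mask, group by account_id and take the cumsum of amount, then merge the result back onto the original dataframe with an inner join to drop the unwanted records.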
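The steps above can also be run one at a time to inspect the intermediate mask. The following is a minimal sketch on the question's data; the names is_start, mask and kept are illustrative, and it assigns the running balance directly to the filtered frame instead of merging, which is a variation on the code above rather than the same code. The astype(bool) cast is only a precaution in case cummax() returns a non-boolean dtype.

```python
import pandas as pd

data = {
    "account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
    "data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment", "payment", "payment", "payment", "initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13", "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
    "amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],
}
df = pd.DataFrame(data)

# Step 1: True only on the initial_balance rows.
is_start = df["data_type"].eq("initial_balance")

# Step 2: within each account, once True is seen it stays True.
mask = is_start.groupby(df["account_id"]).cummax().astype(bool)

# Step 3: keep the masked rows and compute the running balance per account.
kept = df[mask].copy()
kept["account_balance"] = kept.groupby("account_id")["amount"].cumsum()
print(kept)
```

Rows 2 and 7, the payments dated before their account's initial balance, are the only ones dropped, and the original index labels are preserved.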
You can group by account_id, filter out the rows before the first initial_balance, and then apply cumsum() to the amount column:
out = df.groupby('account_id').apply(lambda g: g[g['data_type'].eq('initial_balance').cumsum().eq(1)]).reset_index(drop=True)
out['amount'] = out.groupby('account_id')['amount'].cumsum()
print(out)
account_id data_type transaction_date amount
0 1001 initial_balance 2022-04-01 100
1 1001 payment 2022-04-14 80
2 1002 initial_balance 2022-04-02 200
3 1002 payment 2022-04-13 180
4 1002 payment 2022-05-01 160
5 1002 payment 2022-05-03 140
6 1003 initial_balance 2022-04-10 150
7 1003 payment 2022-04-20 100
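One thing to be aware of: cumsum().eq(1) and cummax() give the same rows for the question's data, because each account has exactly one initial_balance row. They differ if an account has more than one. The sketch below uses a made-up account "2001" (hypothetical, not part of the original data) to show the difference; the 0/1 integer encoding is only there to keep the dtypes unambiguous.

```python
import pandas as pd

# Hypothetical account with two initial_balance rows (not in the original data).
demo = pd.DataFrame({
    "account_id": ["2001"] * 5,
    "data_type": ["payment", "initial_balance", "payment", "initial_balance", "payment"],
    "amount": [-5, 100, -10, 300, -20],
})

# 1 on initial_balance rows, 0 elsewhere.
starts = demo["data_type"].eq("initial_balance").astype(int)
by_account = starts.groupby(demo["account_id"])

# cumsum().eq(1): from the first initial_balance up to, but not including, the second.
first_segment = by_account.cumsum().eq(1)

# cummax().eq(1): every row from the first initial_balance onward.
from_first = by_account.cummax().eq(1)

print(first_segment.tolist())
print(from_first.tolist())
```

Which behaviour is wanted depends on what a second initial_balance row would mean in the real data, which the question does not specify.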