Filtering rows in Pandas Groupby based on a condition within the group

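I've been struggling with this for a few days despite a lot of searching. I've come across many similar questions, but I haven't been able to find a solution that works for me.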

Here is my starting DataFrame:

import pandas as pd

data = {
    "account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
    "data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment", "payment", "payment", "payment", "initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13", "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
    "amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],
}
df = pd.DataFrame(data)

Which, once in Pandas, becomes:

df

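I would like to group by account_id and drop any entries that come before the data_type == "initial_balance" entry. Once I have that, I can cumsum over the remaining rows of each group to arrive at the current balance. So the desired result (including the cumsum "account_balance" column) is: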

Desired result including the cumsum "account_balance" column

I have tried the following:

df.groupby("account_id").filter(lambda x:x["transaction_date"]>=x[x["data_type"]=="initial_balance"]["transaction_date"])

But this only produces an error: ValueError: Can only compare identically-labelled Series objects

I hope I have provided enough information for someone to help. Any help is much appreciated.

This should do it.

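grouped_df = df.groupby("account_id")
groups = []

for group in df["account_id"].unique():
    group_df = grouped_df.get_group(group)
    # Label of the first initial_balance row (raises IndexError if the group has none)
    first = group_df.index[group_df["data_type"] == "initial_balance"][0]
    # Keep that row and everything after it; .copy() avoids SettingWithCopyWarning below
    group_df = group_df.loc[first:, :].copy()
    # Replace amount with the running balance
    group_df["amount"] = group_df["amount"].cumsum()
    groups.append(group_df)

df = pd.concat(groups)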

Output:

  account_id        data_type transaction_date  amount
0       1001  initial_balance       2022-04-01     100
1       1001          payment       2022-04-14      80
3       1002  initial_balance       2022-04-02     200
4       1002          payment       2022-04-13     180
5       1002          payment       2022-05-01     160
6       1002          payment       2022-05-03     140
8       1003  initial_balance       2022-04-10     150
9       1003          payment       2022-04-20     100

You can do it like this:

m = (df['data_type'] == "initial_balance").groupby(df['account_id']).cummax()

df_out = df[m].groupby('account_id')['amount'].cumsum()\
              .reset_index(name='account_balance')\
              .merge(df, left_on='index', right_index=True)
df_out

Output:

   index  account_balance account_id        data_type transaction_date  amount
0      0              100       1001  initial_balance       2022-04-01     100
1      1               80       1001          payment       2022-04-14     -20
2      3              200       1002  initial_balance       2022-04-02     200
3      4              180       1002          payment       2022-04-13     -20
4      5              160       1002          payment       2022-05-01     -20
5      6              140       1002          payment       2022-05-03     -20
6      8              150       1003  initial_balance       2022-04-10     150
7      9              100       1003          payment       2022-04-20     -50

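Details: create a boolean Series that is True where data_type equals initial_balance, then group that Series by account_id and apply cummax(). This gives a mask over the original DataFrame that keeps the initial_balance record and every record after it within each account_id.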

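Next, filter the DataFrame with the mask, group by account_id and take the cumsum of amount, then merge the result back onto the original DataFrame with an inner join to drop the unwanted records.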
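To make the two steps easier to follow, here is a minimal sketch that prints the intermediate mask and then assigns the running balance directly to the filtered rows instead of merging. It assumes the rows of each account are already in the order you want to accumulate them in; the names `mask` and `kept` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
    "data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment",
                  "payment", "payment", "payment", "initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13",
                         "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
    "amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],
})

# True on each initial_balance row; cummax() keeps it True for the rest of that account
mask = (df["data_type"] == "initial_balance").groupby(df["account_id"]).cummax()
print(mask.tolist())

# Keep only the masked rows, then accumulate amount within each account
kept = df[mask].copy()
kept["account_balance"] = kept.groupby("account_id")["amount"].cumsum()
print(kept)
```

Rows 2 and 7 (the payments dated before their account's initial_balance) are the only ones where the mask is False, so they are the only rows dropped.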

You can group by account_id and filter out the rows before the first initial_balance, then take cumsum() of amount:
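# Running count of initial_balance rows per account; rows before the first one count 0.
# Grouping the Series directly avoids groupby.apply, which warns about operating
# on the grouping column in recent pandas releases.
seen = df['data_type'].eq('initial_balance').astype(int).groupby(df['account_id']).cumsum()
out = df[seen.eq(1)].reset_index(drop=True)
out['amount'] = out.groupby('account_id')['amount'].cumsum()
print(out)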

  account_id        data_type transaction_date  amount
0       1001  initial_balance       2022-04-01     100
1       1001          payment       2022-04-14      80
2       1002  initial_balance       2022-04-02     200
3       1002          payment       2022-04-13     180
4       1002          payment       2022-05-01     160
5       1002          payment       2022-05-03     140
6       1003  initial_balance       2022-04-10     150
7       1003          payment       2022-04-20     100
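For completeness, the date comparison attempted in the question can also be made to work. As far as I can tell the original attempt fails for two reasons: groupby.filter expects a single True/False per group (it keeps or drops whole groups, not rows), and the two Series compared inside the lambda have different index labels. Below is a rough sketch of a row-wise alternative. It assumes exactly one initial_balance row per account; `start` and `keep` are made-up names for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
    "data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment",
                  "payment", "payment", "payment", "initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13",
                         "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
    "amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],
})
df["transaction_date"] = pd.to_datetime(df["transaction_date"])

# Date of each account's initial_balance row, indexed by account_id
# (assumption: one such row per account, so the index is unique)
start = (df.loc[df["data_type"] == "initial_balance"]
           .set_index("account_id")["transaction_date"])

# Look up each row's start date by account_id, then compare row by row
keep = df["transaction_date"] >= df["account_id"].map(start)

# Sort by date within each account so the running balance is chronological
out = df[keep].sort_values(["account_id", "transaction_date"])
out["account_balance"] = out.groupby("account_id")["amount"].cumsum()
print(out)
```

Unlike the position-based answers above, this version does not depend on the row order of the input, at the cost of requiring a unique initial_balance date per account.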