Pandas: get the cumulative sum of a column only if the timestamp is greater than that of another column

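For each customer, I want the cumulative sum of a column (Dollar Value), counting only rows where Timestamp 1 is less than Timestamp 2. I could do a Cartesian join of the values by customer, or iterate over the DataFrame, but I wanted to see whether there is an easier way to do this with groupby and apply.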

DataFrame:

import pandas as pd

df = pd.DataFrame({'Customer': ['Alice','Brian','Alice','Alice','Alice','Brian', 'Brian'], 'Timestamp': [1,2,3,4,5,3,6], 'Timestamp 2': [2,5,4,6,7,5,7], 'Dollar Value':[0,1,3,5,3,2,3]})

Sort the values:

df = df.sort_values(['Customer','Timestamp'])

Expected result:

df['Desired_result'] = [0,0,0,3,0,0,3]
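For reference, the Cartesian-join approach mentioned in the question could look roughly like the sketch below. It assumes the intended rule is "for each row, sum Dollar Value over the same customer's rows whose Timestamp 2 is strictly less than this row's Timestamp", which is inferred from the expected result above rather than stated outright. The names `row_id`, `pairs`, `matched` and `cartesian_result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['Alice', 'Brian', 'Alice', 'Alice', 'Alice', 'Brian', 'Brian'],
    'Timestamp': [1, 2, 3, 4, 5, 3, 6],
    'Timestamp 2': [2, 5, 4, 6, 7, 5, 7],
    'Dollar Value': [0, 1, 3, 5, 3, 2, 3],
})
df = df.sort_values(['Customer', 'Timestamp'])

# Keep the original row label so the sums can be mapped back afterwards.
left = df.reset_index().rename(columns={'index': 'row_id'})

# Self-join on Customer: each row is paired with every row of the same customer.
pairs = left.merge(df, on='Customer', suffixes=('', '_other'))

# Assumed rule: keep pairs where the other row's "Timestamp 2" is strictly
# less than this row's "Timestamp".
matched = pairs[pairs['Timestamp 2_other'] < pairs['Timestamp']]

# Sum the matched dollar values per original row; rows without a match get 0.
sums = matched.groupby('row_id')['Dollar Value_other'].sum()
df['cartesian_result'] = sums.reindex(df.index, fill_value=0)
```

On the sample data this reproduces the expected result, at the cost of materialising every within-customer pair of rows.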

This should work.

Select the rows where the condition matches, then take the cumsum:

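# Row-wise condition: True where a row's Timestamp exceeds its own Timestamp 2
# (no row in the sample data satisfies this, so the output there is all zeros)
cond = df["Timestamp"] > df["Timestamp 2"]
# Zero out non-matching rows, then take the running sum within each (cond, Customer) group
df["Dollar Value"].where(cond, 0).groupby([cond, df["Customer"]]).cumsum()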

Edit: based on your comment, this may be what you want:

df = pd.DataFrame({'Customer': ['Alice','Brian','Alice','Alice','Alice','Brian', 'Brian'], 'Timestamp': [1,2,3,4,5,3,6], 'Timestamp 2': [2,5,4,6,7,5,7], 'Dollar Value':[0,1,3,5,3,2,3]})

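import numpy as np

def sum_dollar_value(group):
    group = group.copy()
    # The last row of the group supplies the cutoff timestamp
    last_row = group.iloc[-1, :]
    # Rows whose "Timestamp 2" falls before that cutoff
    cond = group["Timestamp 2"] < last_row["Timestamp"]
    # Write the conditional sum on the last row only; the other rows stay NaN
    group.loc[last_row.name, "result"] = np.sum(group["Dollar Value"].where(cond, 0))
    return group

df.groupby("Customer").apply(sum_dollar_value).reset_index(level=0, drop=True)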
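The function above fills in only the last row of each customer. If a value is wanted on every row, one possible variation is to build a pairwise mask per customer with NumPy broadcasting. This is only a sketch: it assumes the rule is "sum Dollar Value over the same customer's rows whose Timestamp 2 is strictly less than the current row's Timestamp", and `sum_earlier_values` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Customer': ['Alice', 'Brian', 'Alice', 'Alice', 'Alice', 'Brian', 'Brian'],
    'Timestamp': [1, 2, 3, 4, 5, 3, 6],
    'Timestamp 2': [2, 5, 4, 6, 7, 5, 7],
    'Dollar Value': [0, 1, 3, 5, 3, 2, 3],
})
df = df.sort_values(['Customer', 'Timestamp'])

def sum_earlier_values(group):
    # Hypothetical helper: one conditional sum per row of a single customer.
    ts = group['Timestamp'].to_numpy()
    ts2 = group['Timestamp 2'].to_numpy()
    values = group['Dollar Value'].to_numpy()
    # mask[i, j] is True when row j's "Timestamp 2" < row i's "Timestamp".
    mask = ts2[None, :] < ts[:, None]
    # For each row i, add up the values of the rows j selected by the mask.
    totals = np.where(mask, values, 0).sum(axis=1)
    return pd.Series(totals, index=group.index)

# Iterate over the groups explicitly and stitch the pieces back together;
# assignment aligns on the original index.
parts = [sum_earlier_values(g) for _, g in df.groupby('Customer')]
df['result'] = pd.concat(parts)
```

The mask is quadratic in the number of rows per customer, so this suits moderately sized groups; for very large groups a sort-based approach would probably scale better.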

I suggest setting the condition first, then grouping by customer:

# set condition
cond = df["Timestamp"]<df["Timestamp 2"]
df[cond].groupby('Customer')['Dollar Value'].sum()
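To make the behaviour of this snippet concrete, here is a small self-contained run on the question's sample data. Every row there has Timestamp < Timestamp 2, so the filter keeps all rows and the result is one total per customer rather than a per-row value. The second part is an assumed variation (not from the answer) that turns the same condition into a per-row running total; `totals` and `running` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['Alice', 'Brian', 'Alice', 'Alice', 'Alice', 'Brian', 'Brian'],
    'Timestamp': [1, 2, 3, 4, 5, 3, 6],
    'Timestamp 2': [2, 5, 4, 6, 7, 5, 7],
    'Dollar Value': [0, 1, 3, 5, 3, 2, 3],
})

# Row-wise condition; True for every row of the sample data.
cond = df['Timestamp'] < df['Timestamp 2']

# One aggregated value per customer (Alice: 11, Brian: 6 on this data).
totals = df[cond].groupby('Customer')['Dollar Value'].sum()

# Assumed variation: a running total per row, counting only rows where cond holds.
running = df['Dollar Value'].where(cond, 0).groupby(df['Customer']).cumsum()
```

Neither form compares a row against the other rows of the same customer, so both differ from the expected result in the question.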

Note: I borrowed the syntax of condition from the previous answer by