如何将所有行放在字符串匹配数据框下方

How to drop all rows below string match dataframe

我有一个数据框,我只对 string text = "purchase" by session 上面的数据感兴趣。 input dataframe

session Date action flag_purchase
T001 01-01-2021 00.01 click 1
T001 01-01-2021 00.15 play 1
T001 01-01-2021 02.15 pause 1
T001 01-01-2021 03.15 play 1
T001 01-01-2021 04.15 purchase 1
T001 02-01-2021 10.15 play 1
T001 02-01-2021 12.00 pause 1
T001 02-01-2021 13.15 play 1
T002 01-01-2021 00.01 play 0
T002 03-01-2021 00.15 play 0
T002 03-01-2021 02.15 pause 0
T002 03-01-2021 03.15 play 0

我想删除 action = "purchase" 下面的所有行,如果会话中的所有操作都没有文本匹配,会话将保留所有行,所以我想要的输出如下所示:

final result

session Date action flag_purchase
T001 01-01-2021 00.01 click 1
T001 01-01-2021 00.15 play 1
T001 01-01-2021 02.15 pause 1
T001 01-01-2021 03.15 play 1
T001 01-01-2021 04.15 purchase 1
T002 01-01-2021 00.01 play 0
T002 03-01-2021 00.15 play 0
T002 03-01-2021 02.15 pause 0
T002 03-01-2021 03.15 play 0

如果我理解正确,那么您可以执行以下操作:

import pandas as pd
import numpy as np

df = pd.DataFrame({"id":[1,1,1,1,2,2,2,2,3,3],
"action":["pause","play","purchase","purchase","play","purchase","pause","play","play","pause"]})

print(df)

#   id  action
# 0  1  pause
# 1  1  play
# 2  1  purchase
# 3  1  purchase
# 4  2  play
# 5  2  purchase
# 6  2  pause
# 7  2  play
# 8  3  play
# 9  3  pause


def get_idx(row):
    """
    Gets the first index of where "purchase" occurs, then 
    return the rows untill and incl that index
    """

    idx = np.argwhere(row.values=="purchase") #get index
    if idx.size>0: #check if it exists
        idx = idx[0][0]+1
        return row[:idx] #return the rows
    return row #else, return the original rows

df_clean = df.groupby("id")["action"].apply(get_idx).reset_index(drop=False,level=0)

#    id action
# 0  1  pause
# 1  1  play
# 2  1  purchase
# 4  2  play
# 5  2  purchase
# 8  3  play
# 9  3  pause

尝试:

to_remove = lambda x: ~x.shift().eq('purchase').cumsum().astype(bool)
out = df[df.groupby('session')['action'].apply(to_remove)]
print(out)

# Output
   session              Date    action  flag_purchase
0     T001  01-01-2021 00.01     click              1
1     T001  01-01-2021 00.15      play              1
2     T001  01-01-2021 02.15     pause              1
3     T001  01-01-2021 03.15      play              1
4     T001  01-01-2021 04.15  purchase              1
8     T002  01-01-2021 00.01      play              0
9     T002  03-01-2021 00.15      play              0
10    T002  03-01-2021 02.15     pause              0
11    T002  03-01-2021 03.15      play              0