How to drop all rows below a string match in a DataFrame

Question

I have a DataFrame and, for each session, I am only interested in the rows up to and including the row where the action text is "purchase".

Input dataframe:

session	Date	action	flag_purchase
T001	01-01-2021 00.01	click	1
T001	01-01-2021 00.15	play	1
T001	01-01-2021 02.15	pause	1
T001	01-01-2021 03.15	play	1
T001	01-01-2021 04.15	purchase	1
T001	02-01-2021 10.15	play	1
T001	02-01-2021 12.00	pause	1
T001	02-01-2021 13.15	play	1
T002	01-01-2021 00.01	play	0
T002	03-01-2021 00.15	play	0
T002	03-01-2021 02.15	pause	0
T002	03-01-2021 03.15	play	0

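I want to drop all rows below action = "purchase" within each session. If none of the actions in a session match that text, the session keeps all of its rows. My desired output looks like this:

Final result: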

session	Date	action	flag_purchase
T001	01-01-2021 00.01	click	1
T001	01-01-2021 00.15	play	1
T001	01-01-2021 02.15	pause	1
T001	01-01-2021 03.15	play	1
T001	01-01-2021 04.15	purchase	1
T002	01-01-2021 00.01	play	0
T002	03-01-2021 00.15	play	0
T002	03-01-2021 02.15	pause	0
T002	03-01-2021 03.15	play	0

Answer 1

If I understand correctly, you can do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({"id":[1,1,1,1,2,2,2,2,3,3],
"action":["pause","play","purchase","purchase","play","purchase","pause","play","play","pause"]})

print(df)

#   id  action
# 0  1  pause
# 1  1  play
# 2  1  purchase
# 3  1  purchase
# 4  2  play
# 5  2  purchase
# 6  2  pause
# 7  2  play
# 8  3  play
# 9  3  pause


def get_idx(actions):
    """
    Find the first position where "purchase" occurs in one group's
    "action" values, then return the rows up to and including it.
    """

    idx = np.argwhere(actions.values == "purchase")  # positions of all matches
    if idx.size > 0:  # at least one "purchase" in this group
        idx = idx[0][0] + 1  # one past the first match
        return actions.iloc[:idx]  # keep rows through the first purchase
    return actions  # no match: keep the whole group

df_clean = df.groupby("id")["action"].apply(get_idx).reset_index(drop=False,level=0)

#    id action
# 0  1  pause
# 1  1  play
# 2  1  purchase
# 4  2  play
# 5  2  purchase
# 8  3  play
# 9  3  pause
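
The example above only returns the id and action columns. As a rough sketch of the same idea applied to the question's data while keeping every column, something like the following might work. The helper name keep_through_first_purchase is made up for illustration, and the frame is rebuilt by hand from the table in the question.

```python
import pandas as pd

# Input data copied from the question's table
df = pd.DataFrame({
    "session": ["T001"] * 8 + ["T002"] * 4,
    "Date": [
        "01-01-2021 00.01", "01-01-2021 00.15", "01-01-2021 02.15",
        "01-01-2021 03.15", "01-01-2021 04.15", "02-01-2021 10.15",
        "02-01-2021 12.00", "02-01-2021 13.15", "01-01-2021 00.01",
        "03-01-2021 00.15", "03-01-2021 02.15", "03-01-2021 03.15",
    ],
    "action": ["click", "play", "pause", "play", "purchase", "play",
               "pause", "play", "play", "play", "pause", "play"],
    "flag_purchase": [1] * 8 + [0] * 4,
})


def keep_through_first_purchase(group: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: slice one session up to its first purchase."""
    # Positions (not labels) of the rows whose action is "purchase"
    hits = (group["action"] == "purchase").to_numpy().nonzero()[0]
    if hits.size == 0:
        return group  # no purchase in this session: keep every row
    return group.iloc[: hits[0] + 1]  # keep rows through the first purchase


# Slice each session separately, then stitch the pieces back together
parts = [keep_through_first_purchase(g) for _, g in df.groupby("session", sort=False)]
df_clean = pd.concat(parts)
print(df_clean)
```

Looping over the groups and concatenating avoids relying on how groupby(...).apply arranges the group key in the result, which has differed between pandas versions.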

Answer 2

Try:

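# True for rows to keep: no "purchase" has occurred on any earlier row of the session
to_remove = lambda x: ~x.shift().eq('purchase').cumsum().astype(bool)
# transform returns a mask aligned with df's index; on recent pandas versions
# apply may add the group key to the index, which breaks the boolean indexing
out = df[df.groupby('session')['action'].transform(to_remove)]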
print(out)

# Output
   session              Date    action  flag_purchase
0     T001  01-01-2021 00.01     click              1
1     T001  01-01-2021 00.15      play              1
2     T001  01-01-2021 02.15     pause              1
3     T001  01-01-2021 03.15      play              1
4     T001  01-01-2021 04.15  purchase              1
8     T002  01-01-2021 00.01      play              0
9     T002  03-01-2021 00.15      play              0
10    T002  03-01-2021 02.15     pause              0
11    T002  03-01-2021 03.15      play              0
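
To see why the one-liner keeps the purchase row itself, it may help to break the mask into steps on a single session. This is only an illustrative sketch: the small action series and the intermediate names (prev_is_purchase, purchase_seen_before, keep) are invented for the example.

```python
import pandas as pd

# One made-up session, with a purchase in the middle
action = pd.Series(["click", "play", "purchase", "play", "pause"])

# Step 1: shift down by one row, so True marks the row right after a purchase
prev_is_purchase = action.shift().eq("purchase")

# Step 2: a running count turns that single True into
# "a purchase has happened on some earlier row"
purchase_seen_before = prev_is_purchase.cumsum().astype(bool)

# Step 3: invert to get the rows to keep
keep = ~purchase_seen_before

steps = pd.DataFrame({
    "action": action,
    "prev_is_purchase": prev_is_purchase,
    "purchase_seen_before": purchase_seen_before,
    "keep": keep,
})
print(steps)
print(action[keep].tolist())
```

Because of the shift, the purchase row is still marked as kept and only the rows after it are dropped. Computing the mask per session with groupby keeps a purchase at the end of one session from hiding the first row of the next.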
