How to drop all rows below a string match in a dataframe
I have a dataframe and, within each session, I am only interested in the rows at or above the row where the action text is "purchase".
Input dataframe:
session | Date | action | flag_purchase |
---|---|---|---|
T001 | 01-01-2021 00.01 | click | 1 |
T001 | 01-01-2021 00.15 | play | 1 |
T001 | 01-01-2021 02.15 | pause | 1 |
T001 | 01-01-2021 03.15 | play | 1 |
T001 | 01-01-2021 04.15 | purchase | 1 |
T001 | 02-01-2021 10.15 | play | 1 |
T001 | 02-01-2021 12.00 | pause | 1 |
T001 | 02-01-2021 13.15 | play | 1 |
T002 | 01-01-2021 00.01 | play | 0 |
T002 | 03-01-2021 00.15 | play | 0 |
T002 | 03-01-2021 02.15 | pause | 0 |
T002 | 03-01-2021 03.15 | play | 0 |
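I want to drop all rows below the row where action = "purchase". If none of the actions in a session match the text, the session keeps all of its rows. The output I want looks like this: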
Final result:
session | Date | action | flag_purchase |
---|---|---|---|
T001 | 01-01-2021 00.01 | click | 1 |
T001 | 01-01-2021 00.15 | play | 1 |
T001 | 01-01-2021 02.15 | pause | 1 |
T001 | 01-01-2021 03.15 | play | 1 |
T001 | 01-01-2021 04.15 | purchase | 1 |
T002 | 01-01-2021 00.01 | play | 0 |
T002 | 03-01-2021 00.15 | play | 0 |
T002 | 03-01-2021 02.15 | pause | 0 |
T002 | 03-01-2021 03.15 | play | 0 |
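To make the question reproducible, the input table above can be rebuilt roughly as follows. This is only a sketch: the question does not show how the frame was created, so the variable name `df` and the choice to keep `Date` as plain strings are assumptions.

```python
import pandas as pd

# Rebuild the input table from the question.
# Assumption: Date is kept as a plain string column, since its dtype is not stated.
df = pd.DataFrame({
    "session": ["T001"] * 8 + ["T002"] * 4,
    "Date": [
        "01-01-2021 00.01", "01-01-2021 00.15", "01-01-2021 02.15",
        "01-01-2021 03.15", "01-01-2021 04.15", "02-01-2021 10.15",
        "02-01-2021 12.00", "02-01-2021 13.15",
        "01-01-2021 00.01", "03-01-2021 00.15", "03-01-2021 02.15",
        "03-01-2021 03.15",
    ],
    "action": [
        "click", "play", "pause", "play", "purchase", "play", "pause", "play",
        "play", "play", "pause", "play",
    ],
    "flag_purchase": [1] * 8 + [0] * 4,
})

print(df)
```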
If I understand correctly, you can do the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
                   "action": ["pause", "play", "purchase", "purchase", "play", "purchase", "pause", "play", "play", "pause"]})
print(df)
#    id    action
# 0   1     pause
# 1   1      play
# 2   1  purchase
# 3   1  purchase
# 4   2      play
# 5   2  purchase
# 6   2     pause
# 7   2      play
# 8   3      play
# 9   3     pause

def get_idx(row):
    """
    Gets the first position where "purchase" occurs, then
    returns the rows up to and including that position
    """
    idx = np.argwhere(row.values == "purchase")  # positions of all matches
    if idx.size > 0:  # check that at least one match exists
        idx = idx[0][0] + 1
        return row.iloc[:idx]  # positional slice: rows up to and including the first match
    return row  # no match: return the original rows

# level=0 moves the group key ("id") out of the index and back into a column
df_clean = df.groupby("id")["action"].apply(get_idx).reset_index(drop=False, level=0)
print(df_clean)
#    id    action
# 0   1     pause
# 1   1      play
# 2   1  purchase
# 4   2      play
# 5   2  purchase
# 8   3      play
# 9   3     pause
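The answer above works on the `action` column only, so the other columns are not carried over. One possible way to keep every column is to slice each group's full frame and concatenate the pieces. This is a sketch rather than part of the original answer: the helper name `keep_until_purchase`, its parameters, and the small sample frame are made up for illustration.

```python
import pandas as pd


def keep_until_purchase(group, col="action", target="purchase"):
    """Return the rows of one group up to and including the first `target` in `col`."""
    # Positions (not labels) of the matching rows inside this group
    hits = (group[col] == target).to_numpy().nonzero()[0]
    if hits.size > 0:
        return group.iloc[: hits[0] + 1]
    return group  # no match: keep the whole group


# Hypothetical sample data with the same column names as the question
df = pd.DataFrame({
    "session": ["T001", "T001", "T001", "T001", "T002", "T002"],
    "action": ["click", "purchase", "play", "pause", "play", "pause"],
    "flag_purchase": [1, 1, 1, 1, 0, 0],
})

# sort=False keeps the sessions in their order of first appearance
pieces = [keep_until_purchase(g) for _, g in df.groupby("session", sort=False)]
df_clean = pd.concat(pieces)
print(df_clean)
```

Iterating over the groups instead of calling `groupby(...).apply` on the whole frame avoids depending on how `apply` treats the grouping column, which has changed between pandas versions.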
Try:
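# Despite its name, to_remove returns True for the rows to keep:
# everything up to and including the first "purchase" in each session
to_remove = lambda x: ~x.shift().eq('purchase').cumsum().astype(bool)
# transform returns a mask aligned with df's index; with apply, recent pandas
# versions prepend the group key to the index and the mask no longer aligns
out = df[df.groupby('session')['action'].transform(to_remove)]
print(out)
# Output
    session              Date    action  flag_purchase
0      T001  01-01-2021 00.01     click              1
1      T001  01-01-2021 00.15      play              1
2      T001  01-01-2021 02.15     pause              1
3      T001  01-01-2021 03.15      play              1
4      T001  01-01-2021 04.15  purchase              1
8      T002  01-01-2021 00.01      play              0
9      T002  03-01-2021 00.15      play              0
10     T002  03-01-2021 02.15     pause              0
11     T002  03-01-2021 03.15      play              0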
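To see why the one-liner works, it can help to split the chain into its steps for a single session. The series below is a made-up example, and the intermediate variable names are only for illustration.

```python
import pandas as pd

# Hypothetical actions of one session, with two purchases
actions = pd.Series(["click", "purchase", "play", "purchase", "pause"])

# Step 1: look one row back; True on rows that directly follow a purchase
prev_is_purchase = actions.shift().eq("purchase")
# -> [False, False, True, False, True]

# Step 2: a running count of those flags; nonzero once a purchase has been passed
seen_before = prev_is_purchase.cumsum().astype(bool)
# -> [False, False, True, True, True]

# Step 3: invert to get the rows to keep
keep = ~seen_before
# -> [True, True, False, False, False]

print(actions[keep].tolist())  # ['click', 'purchase']
```

The `shift()` is what keeps the purchase row itself: without it, the running count would already be nonzero on the purchase row and that row would be dropped as well.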