Slicing multiple rows in pandas based on a cell condition
Msgtype Date ConvID message
enquire 12/1 689 I want your car
reply 12/3 689 it is available
reply 12/4 689 rent please?
reply 12/6 689 0
accept 12/8 689 please pay through CC
reply 12/8 689 thank you, what about fuel?
reply 12/8 689 you have to take care
enquire 12/3 690 Looking for car
reply 12/4 690 available
accept 12/5 690 paid
reply 12/6 690 thank you
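I want to group this data by ConvID and sort it by Date. For each ConvID, I want the rows up to and including the one where Msgtype == 'accept'. The goal is to analyze the messages exchanged until the booking request for that ConvID is accepted. So for ConvID = 689, I want the rows up to Msgtype == 'accept'; the rows after 'accept' are not needed.
For example, these two rows are not needed for ConvID = 689: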
Msgtype Date ConvID message
reply 12/8 689 thank you, what about fuel?
reply 12/8 689 you have to take care
Similarly, this row is not needed for ConvID = 690:
Msgtype Date ConvID message
reply 12/6 690 thank you
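To make the question reproducible, the sample table can be built as a DataFrame. This is only a sketch: the name df matches what the answers use, and Date is kept as a plain string exactly as displayed, which is an assumption about the original data.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Date is stored as a string here (an assumption; the real column may be a datetime).
rows = [
    ("enquire", "12/1", 689, "I want your car"),
    ("reply",   "12/3", 689, "it is available"),
    ("reply",   "12/4", 689, "rent please?"),
    ("reply",   "12/6", 689, "0"),
    ("accept",  "12/8", 689, "please pay through CC"),
    ("reply",   "12/8", 689, "thank you, what about fuel?"),
    ("reply",   "12/8", 689, "you have to take care"),
    ("enquire", "12/3", 690, "Looking for car"),
    ("reply",   "12/4", 690, "available"),
    ("accept",  "12/5", 690, "paid"),
    ("reply",   "12/6", 690, "thank you"),
]
df = pd.DataFrame(rows, columns=["Msgtype", "Date", "ConvID", "message"])
print(df)
```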
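I think you can use:
mask1 = (df.Msgtype == 'accept')
# transform returns a result aligned to df's index, so it can be used as a boolean mask
mask = mask1.groupby(df.ConvID).transform(lambda x: x.shift(fill_value=False).cumsum()) == 0
print(df[mask].sort_values(['ConvID', 'Date']))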
Msgtype Date ConvID message
0 enquire 12/1 689 I want your car
1 reply 12/3 689 it is available
2 reply 12/4 689 rent please?
3 reply 12/6 689 0
4 accept 12/8 689 please pay through CC
7 enquire 12/3 690 Looking for car
8 reply 12/4 690 available
9 accept 12/5 690 paid
Explanation:
# mask that is True on the 'accept' rows
mask1 = (df.Msgtype == 'accept')
print (mask1)
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
Name: Msgtype, dtype: bool
# per group: shift down by one row (filling the first row with False), then take the cumulative sum
print(mask1.groupby(df.ConvID).transform(lambda x: x.shift(fill_value=False).cumsum()))
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 0
8 0
9 0
10 1
Name: Msgtype, dtype: int32
# True where the per-group cumulative sum is 0, i.e. no 'accept' has occurred before this row
mask = mask1.groupby(df.ConvID).transform(lambda x: x.shift(fill_value=False).cumsum()) == 0
print(mask)
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 True
10 False
Name: Msgtype, dtype: bool
#boolean indexing and sorting
print (df[mask].sort_values(['ConvID','Date']))
Msgtype Date ConvID message
0 enquire 12/1 689 I want your car
1 reply 12/3 689 it is available
2 reply 12/4 689 rent please?
3 reply 12/6 689 0
4 accept 12/8 689 please pay through CC
7 enquire 12/3 690 Looking for car
8 reply 12/4 690 available
9 accept 12/5 690 paid
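The same mask can also be sketched without a lambda, using only the built-in grouped cumsum. This is an alternative formulation rather than the answer's own code. The small frame below is a shortened, illustrative version of the question's data, and the names is_accept and accepts_before are made up for this example. Like the answer above, it assumes the rows of each conversation are already in chronological order.

```python
import pandas as pd

# Shortened, illustrative version of the question's data
df = pd.DataFrame({
    "Msgtype": ["enquire", "reply", "accept", "reply", "enquire", "accept", "reply"],
    "Date": ["12/1", "12/3", "12/8", "12/8", "12/3", "12/5", "12/6"],
    "ConvID": [689, 689, 689, 689, 690, 690, 690],
})

# 1 where the row is an 'accept', 0 elsewhere
is_accept = (df["Msgtype"] == "accept").astype(int)

# Running count of accepts per conversation, minus the current row,
# gives the number of accepts strictly before each row.
accepts_before = is_accept.groupby(df["ConvID"]).cumsum() - is_accept

# Keep rows with no earlier accept: everything up to and including the first accept
result = df[accepts_before == 0]
print(result)
```

Subtracting the current row plays the same role as the shift in the answer: the 'accept' row itself still counts zero earlier accepts, so it is kept.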
A simpler approach:
for name, grp in df.groupby('ConvID'):
    # Sort chronologically with a stable sort so same-day rows keep their order,
    # then renumber the rows from 0
    grp = grp.sort_values('Date', kind='mergesort').reset_index(drop=True)
    # Row positions of the 'accept' messages in this conversation
    accept_pos = grp.index[grp['Msgtype'] == 'accept']
    # Keep rows up to and including the first accept (all rows if there is none)
    req = grp.iloc[:accept_pos[0] + 1] if len(accept_pos) else grp
For each ConvID, req will contain only the rows needed for the analysis.
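One caveat when sorting by Date: if the column holds strings such as 12/10, a plain string sort places them before 12/3. A hedged sketch of parsing the values first is shown below. The year 2016 is a placeholder that does not come from the question, and the month/day format is assumed from the sample rows.

```python
import pandas as pd

dates = pd.Series(["12/3", "12/10", "12/1"])

# String sort is lexicographic, so "12/10" sorts before "12/3"
string_sorted = dates.sort_values().tolist()
print(string_sorted)

# Parse as month/day with an assumed year (2016 is a placeholder, not from the question)
parsed = pd.to_datetime(dates + "/2016", format="%m/%d/%Y")

# Sorting the parsed values gives chronological order
chronological = dates.loc[parsed.sort_values().index].tolist()
print(chronological)
```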