pandas groupby apply with condition on the first occurrence of a column value
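I have a dataframe like the one shown below, where pid and event_date are the index left over from a previous groupby. I now want to apply groupby again, this time on pid only, and flag each person based on two conditions:
- the person (pid = person) has two or more True labels;
- the person's first True instance occurred when he/she was younger than 45.
If both conditions are met, this person/pid should be assigned True in the grouped dataframe.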
                           age  label
pid      event_date
00000001 2000-08-28  76.334247  False
         2000-10-17  76.471233  False
         2000-10-31  76.509589   True
         2000-11-02  76.512329   True
...      ...               ...    ...
00000005 2014-08-15  42.769863  False
         2015-04-04  43.476712  False
         2015-11-06  44.057534   True
         2017-03-06  45.386301   True
So far I have only managed to implement the first condition:
df = (df.groupby(['pid']).apply(lambda x: sum(x['label'])>1).to_frame('label'))
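The second one is tricky for me. How can I condition on the first occurrence of a particular column value? Any suggestions are very welcome. Thanks a lot!
Updated example dataframe: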
import pandas as pd
a = pd.DataFrame(columns=['pid', 'event_date', 'age', 'label'])
a['pid'] = [1, 1, 1, 1, 5, 5, 5, 5]
a['event_date'] = ['2000-08-28', '2000-08-28', '2000-08-28', '2000-08-28',\
'2000-08-28', '2000-08-28', '2000-08-28', '2000-08-28']
a['event_date'] = pd.to_datetime(a.event_date)
a['age'] = [76.334247, 76.471233, 76.509589, 76.512329, 42.769863, 43.476712, 44.057534, 45.386301]
a['label'] = [False, False, True, True, False, False, True, True]
a = (a.groupby(['pid', 'event_date', 'age']).apply(lambda x: x['label'].any()).to_frame('label'))
a.reset_index(level=['age'], inplace=True)
Now if I apply (a.groupby(['pid']).apply(lambda x: sum(x['label'])>1).to_frame('label')), I get
label
pid
1 True
5 True
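This only satisfies the first condition (well, because I skipped the second one). Adding the second condition should mark only pid=5 as True, because only this person/pid was younger than 45 when the first label=True occurred.
After half an (enjoyable) hour, I came up with this: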
condition = a.reset_index().groupby('pid')['label'].sum().ge(2) & a.reset_index().groupby('pid').apply(lambda x: x['age'][x['label'].idxmax()] < 45)
Output:
>>> condition
pid
1 False
5 True
dtype: bool
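The second half of that expression relies on how idxmax behaves on a boolean column, so a small sketch may help. The two toy series below are hypothetical and not taken from the question's data. They illustrate that idxmax returns the index label of the first True, and that it falls back to the first label when the series contains no True at all:

```python
import pandas as pd

# Hypothetical toy series, only for illustrating idxmax on booleans.
labels = pd.Series([False, False, True, True], index=[10, 11, 12, 13])
first_true = labels.idxmax()  # index label of the first True value

# With no True present all values tie, so the first index label is returned.
no_true = pd.Series([False, False], index=[20, 21])
fallback = no_true.idxmax()

print(first_true, fallback)
```

In the expression above this fallback should be harmless, because a pid without any True label already fails the "two or more" check on the other side of the &.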
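If the index were a plain one, rather than a MultiIndex of pid + event_date, this could be shortened a bit (by removing the two .reset_index() calls). If you cannot avoid the MultiIndex from the start and you don't mind modifying a: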
a = a.reset_index()
condition = a.groupby('pid')['label'].sum().ge(2) & a.groupby('pid').apply(lambda x: x['age'][x['label'].idxmax()] < 45)
Expanded:
condition = (
    a.groupby('pid')  # Group by pid
    ['label']  # Get the label column for each group
    .sum()  # Count the True values (each True sums as 1)
    .ge(2)  # Are there two or more?
    &  # Element-wise AND of the two conditions. Each side returns a Series indexed by unique pid, whose value says whether that condition is met for that pid
    a.groupby('pid')  # Group by pid
    .apply(  # Call a function for each group, passing the group (a dataframe) to the function as its first parameter
        lambda x:  # Function start
        x['age'][  # Get item from the age column at the specified index
            x['label'].idxmax()  # Get the index of the highest value of the label column (since they're only boolean values, the highest will be the first True value)
        ] < 45  # Check if it's less than 45
    )
)
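As a possible variation that avoids apply, one could filter to the True rows first and then aggregate per pid. This is only a sketch under a few assumptions: rows are already ordered by event_date within each pid, the sample data is rebuilt here without the event_date column for brevity, and the names grouped_true, summary, n_true and first_age are made up for illustration.

```python
import pandas as pd

# Sample values copied from the question's updated example (event_date omitted).
a = pd.DataFrame({
    'pid': [1, 1, 1, 1, 5, 5, 5, 5],
    'age': [76.334247, 76.471233, 76.509589, 76.512329,
            42.769863, 43.476712, 44.057534, 45.386301],
    'label': [False, False, True, True, False, False, True, True],
})

# Keep only the True rows, then group their ages by pid.
grouped_true = a[a['label']].groupby('pid')['age']

# Named aggregation: number of True rows and the age at the first True row.
summary = grouped_true.agg(n_true='count', first_age='first')

# Both conditions combined element-wise per pid.
condition = summary['n_true'].ge(2) & summary['first_age'].lt(45)

# A pid with no True rows is absent from summary; reindex adds it back as False.
all_pids = pd.Index(a['pid'].unique(), name='pid')
condition = condition.reindex(all_pids, fill_value=False)

print(condition)
```

Filtering first means the age lookup never runs on a group without a True label, at the cost of the extra reindex step to bring such groups back.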