对来自 pandas 数据帧的连续值进行分组
group consective values from pandas daataframe
数据集包含 3 列。 Time(in epoch),Counter, and counterDifference.
仅当值序列大于 6 时,才需要 从 CounterDiffrence 列中找到连续的值组(使用 0 单独标识的组)。如果识别出连续值组,则需要找到时间和计数器的最小值和最大值,然后将其附加到组号(从 0 开始)。因此新数据帧包含 GroupNo、StartTime、EndTime、StartCounter、EndCounter。
如图第一组只有4个连续值。所以它不能被分配为 group.But 第二组有 8 个连续值所以它可以被归类为组。
数据:
{'Time': {6412: 1635515680,
6413: 1635515681,
6414: 1635515681,
6415: 1635515681,
6416: 1635515682,
6418: 1635515690,
6419: 1635515700,
6422: 1635515720,
6423: 1635515724,
6424: 1635515726,
6425: 1635515726,
6426: 1635515730,
6427: 1635515740,
6428: 1635515751,
6429: 1635515756,
6430: 1635515757,
6431: 1635515757,
6432: 1635515760,
6435: 1635515774,
6436: 1635515775,
6437: 1635515775,
6438: 1635515775,
6439: 1635515779,
6440: 1635515780,
6441: 1635515780,
6442: 1635515790,
6443: 1635515794,
6444: 1635515795,
6445: 1635515795,
6446: 1635515797,
6447: 1635515801,
6448: 1635515802,
6449: 1635515807,
6451: 1635515820,
6454: 1635515840,
6455: 1635515850,
6456: 1635515860,
6457: 1635515871,
6458: 1635515881,
6461: 1635515901,
6462: 1635515911,
6463: 1635515921,
6464: 1635515930,
6465: 1635515940,
6468: 1635515960,
6469: 1635515972,
6470: 1635515980,
6471: 1635515991,
6472: 1635515997,
6473: 1635515998},
'Counter': {6412: 46219.0,
6413: 46219.0,
6414: 46219.0,
6415: 46219.0,
6416: 46219.0,
6418: 46222.0,
6419: 46226.0,
6422: 46234.0,
6423: 46236.0,
6424: 46236.0,
6425: 46236.0,
6426: 46236.0,
6427: 46236.0,
6428: 46236.0,
6429: 46236.0,
6430: 46236.0,
6431: 46236.0,
6432: 46236.0,
6435: 46236.0,
6436: 46236.0,
6437: 46236.0,
6438: 46236.0,
6439: 46236.0,
6440: 46236.0,
6441: 46236.0,
6442: 46236.0,
6443: 46236.0,
6444: 46236.0,
6445: 46236.0,
6446: 46236.0,
6447: 46236.0,
6448: 46236.0,
6449: 46236.0,
6451: 46241.0,
6454: 46249.0,
6455: 46253.0,
6456: 46257.0,
6457: 46261.0,
6458: 46265.0,
6461: 46273.0,
6462: 46277.0,
6463: 46281.0,
6464: 46285.0,
6465: 46289.0,
6468: 46297.0,
6469: 46301.0,
6470: 46305.0,
6471: 46309.0,
6472: 46311.0,
6473: 46311.0},
'CounterDifference': {6412: 0.0,
6413: 0.0,
6414: 0.0,
6415: 0.0,
6416: 0.0,
6418: 2.0,
6419: 4.0,
6422: 4.0,
6423: 2.0,
6424: 0.0,
6425: 0.0,
6426: 0.0,
6427: 0.0,
6428: 0.0,
6429: 0.0,
6430: 0.0,
6431: 0.0,
6432: 0.0,
6435: 0.0,
6436: 0.0,
6437: 0.0,
6438: 0.0,
6439: 0.0,
6440: 0.0,
6441: 0.0,
6442: 0.0,
6443: 0.0,
6444: 0.0,
6445: 0.0,
6446: 0.0,
6447: 0.0,
6448: 0.0,
6449: 0.0,
6451: 4.0,
6454: 4.0,
6455: 4.0,
6456: 4.0,
6457: 4.0,
6458: 4.0,
6461: 4.0,
6462: 4.0,
6463: 4.0,
6464: 4.0,
6465: 4.0,
6468: 4.0,
6469: 4.0,
6470: 4.0,
6471: 4.0,
6472: 2.0,
6473: 0.0}}
这可以通过使用 diff
and cumsum
to create appropriate groups and then using groupby
with filter
and agg
来完成。我们首先创建组:
df['group'] = df['CounterDifference'].eq(0).diff().cumsum()
这给出:
Time Counter CounterDifference group
6412 1635515680 46219.0 0.0 NaN
6413 1635515681 46219.0 0.0 0.0
6414 1635515681 46219.0 0.0 0.0
6415 1635515681 46219.0 0.0 0.0
6416 1635515682 46219.0 0.0 0.0
6418 1635515690 46222.0 2.0 1.0
6419 1635515700 46226.0 4.0 1.0
6422 1635515720 46234.0 4.0 1.0
6423 1635515724 46236.0 2.0 1.0
6424 1635515726 46236.0 0.0 2.0
6425 1635515726 46236.0 0.0 2.0
...
6449 1635515807 46236.0 0.0 2.0
6451 1635515820 46241.0 4.0 3.0
6454 1635515840 46249.0 4.0 3.0
...
6471 1635515991 46309.0 4.0 3.0
6472 1635515997 46311.0 2.0 3.0
6473 1635515998 46311.0 0.0 4.0
现在,我们可以筛选组然后聚合:
# filter away groups that are too small or the ones that contains zeros.
df = df.groupby('group').filter(lambda x: x['CounterDifference'].iloc[0] != 0 and len(x) > 6)
# groupby and aggregate the wanted metrics from each group
df = df.groupby('group').agg({'Time': [min, max], 'CounterDifference': [min, max]})
# postprocessing
df = df.reset_index(drop=True).reset_index()
df.columns = ['GroupNo', 'StartTime', 'EndTime', 'StartCounter', 'EndCounter']
最终结果:
GroupNo StartTime EndTime StartCounter EndCounter
0 0 1635515820 1635515997 2.0 4.0
要分配组,groupby CounterDifference,找到大小并有条件地分配。在下面的代码中,我只将组分配给大于 0 的值,并且存在于 4 个及以上的连续组中。
df['group']=np.where(df.groupby('CounterDifference')['Counter'].transform('size').gt(4)&df['CounterDifference'].ne(0),'group','not_valid_group')
Time Counter CounterDifference group
6412 1635515680 46219.0 0.0 not_valid_group
6413 1635515681 46219.0 0.0 not_valid_group
6414 1635515681 46219.0 0.0 not_valid_group
6415 1635515681 46219.0 0.0 not_valid_group
6416 1635515682 46219.0 0.0 not_valid_group
6418 1635515690 46222.0 2.0 not_valid_group
6419 1635515700 46226.0 4.0 not_valid_group
6422 1635515720 46234.0 4.0 not_valid_group
6423 1635515724 46236.0 2.0 not_valid_group
6424 1635515726 46236.0 0.0 not_valid_group
6425 1635515726 46236.0 0.0 not_valid_group
6426 1635515730 46236.0 0.0 not_valid_group
6427 1635515740 46236.0 0.0 not_valid_group
6428 1635515751 46236.0 0.0 not_valid_group
6429 1635515756 46236.0 0.0 not_valid_group
6430 1635515757 46236.0 0.0 not_valid_group
6431 1635515757 46236.0 0.0 not_valid_group
6432 1635515760 46236.0 0.0 not_valid_group
6435 1635515774 46236.0 0.0 not_valid_group
6436 1635515775 46236.0 0.0 not_valid_group
6437 1635515775 46236.0 0.0 not_valid_group
6438 1635515775 46236.0 0.0 not_valid_group
6439 1635515779 46236.0 0.0 not_valid_group
6440 1635515780 46236.0 0.0 not_valid_group
6441 1635515780 46236.0 0.0 not_valid_group
6442 1635515790 46236.0 0.0 not_valid_group
6443 1635515794 46236.0 0.0 not_valid_group
6444 1635515795 46236.0 0.0 not_valid_group
6445 1635515795 46236.0 0.0 not_valid_group
6446 1635515797 46236.0 0.0 not_valid_group
6447 1635515801 46236.0 0.0 not_valid_group
6448 1635515802 46236.0 0.0 not_valid_group
6449 1635515807 46236.0 0.0 not_valid_group
6451 1635515820 46241.0 4.0 group
6454 1635515840 46249.0 4.0 group
6455 1635515850 46253.0 4.0 group
6456 1635515860 46257.0 4.0 group
6457 1635515871 46261.0 4.0 group
6458 1635515881 46265.0 4.0 group
6461 1635515901 46273.0 4.0 group
6462 1635515911 46277.0 4.0 group
6463 1635515921 46281.0 4.0 group
6464 1635515930 46285.0 4.0 group
6465 1635515940 46289.0 4.0 group
6468 1635515960 46297.0 4.0 group
6469 1635515972 46301.0 4.0 group
6470 1635515980 46305.0 4.0 group
6471 1635515991 46309.0 4.0 group
6472 1635515997 46311.0 2.0 not_valid_group
6473 1635515998 46311.0 0.0 not_valid_group
数据集包含 3 列。 Time(in epoch),Counter, and counterDifference.
仅当值序列大于 6 时,才需要 从 CounterDiffrence 列中找到连续的值组(使用 0 单独标识的组)。如果识别出连续值组,则需要找到时间和计数器的最小值和最大值,然后将其附加到组号(从 0 开始)。因此新数据帧包含 GroupNo、StartTime、EndTime、StartCounter、EndCounter。
如图第一组只有4个连续值。所以它不能被分配为 group.But 第二组有 8 个连续值所以它可以被归类为组。
数据:
{'Time': {6412: 1635515680,
6413: 1635515681,
6414: 1635515681,
6415: 1635515681,
6416: 1635515682,
6418: 1635515690,
6419: 1635515700,
6422: 1635515720,
6423: 1635515724,
6424: 1635515726,
6425: 1635515726,
6426: 1635515730,
6427: 1635515740,
6428: 1635515751,
6429: 1635515756,
6430: 1635515757,
6431: 1635515757,
6432: 1635515760,
6435: 1635515774,
6436: 1635515775,
6437: 1635515775,
6438: 1635515775,
6439: 1635515779,
6440: 1635515780,
6441: 1635515780,
6442: 1635515790,
6443: 1635515794,
6444: 1635515795,
6445: 1635515795,
6446: 1635515797,
6447: 1635515801,
6448: 1635515802,
6449: 1635515807,
6451: 1635515820,
6454: 1635515840,
6455: 1635515850,
6456: 1635515860,
6457: 1635515871,
6458: 1635515881,
6461: 1635515901,
6462: 1635515911,
6463: 1635515921,
6464: 1635515930,
6465: 1635515940,
6468: 1635515960,
6469: 1635515972,
6470: 1635515980,
6471: 1635515991,
6472: 1635515997,
6473: 1635515998},
'Counter': {6412: 46219.0,
6413: 46219.0,
6414: 46219.0,
6415: 46219.0,
6416: 46219.0,
6418: 46222.0,
6419: 46226.0,
6422: 46234.0,
6423: 46236.0,
6424: 46236.0,
6425: 46236.0,
6426: 46236.0,
6427: 46236.0,
6428: 46236.0,
6429: 46236.0,
6430: 46236.0,
6431: 46236.0,
6432: 46236.0,
6435: 46236.0,
6436: 46236.0,
6437: 46236.0,
6438: 46236.0,
6439: 46236.0,
6440: 46236.0,
6441: 46236.0,
6442: 46236.0,
6443: 46236.0,
6444: 46236.0,
6445: 46236.0,
6446: 46236.0,
6447: 46236.0,
6448: 46236.0,
6449: 46236.0,
6451: 46241.0,
6454: 46249.0,
6455: 46253.0,
6456: 46257.0,
6457: 46261.0,
6458: 46265.0,
6461: 46273.0,
6462: 46277.0,
6463: 46281.0,
6464: 46285.0,
6465: 46289.0,
6468: 46297.0,
6469: 46301.0,
6470: 46305.0,
6471: 46309.0,
6472: 46311.0,
6473: 46311.0},
'CounterDifference': {6412: 0.0,
6413: 0.0,
6414: 0.0,
6415: 0.0,
6416: 0.0,
6418: 2.0,
6419: 4.0,
6422: 4.0,
6423: 2.0,
6424: 0.0,
6425: 0.0,
6426: 0.0,
6427: 0.0,
6428: 0.0,
6429: 0.0,
6430: 0.0,
6431: 0.0,
6432: 0.0,
6435: 0.0,
6436: 0.0,
6437: 0.0,
6438: 0.0,
6439: 0.0,
6440: 0.0,
6441: 0.0,
6442: 0.0,
6443: 0.0,
6444: 0.0,
6445: 0.0,
6446: 0.0,
6447: 0.0,
6448: 0.0,
6449: 0.0,
6451: 4.0,
6454: 4.0,
6455: 4.0,
6456: 4.0,
6457: 4.0,
6458: 4.0,
6461: 4.0,
6462: 4.0,
6463: 4.0,
6464: 4.0,
6465: 4.0,
6468: 4.0,
6469: 4.0,
6470: 4.0,
6471: 4.0,
6472: 2.0,
6473: 0.0}}
这可以通过使用 diff
and cumsum
to create appropriate groups and then using groupby
with filter
and agg
来完成。我们首先创建组:
df['group'] = df['CounterDifference'].eq(0).diff().cumsum()
这给出:
Time Counter CounterDifference group
6412 1635515680 46219.0 0.0 NaN
6413 1635515681 46219.0 0.0 0.0
6414 1635515681 46219.0 0.0 0.0
6415 1635515681 46219.0 0.0 0.0
6416 1635515682 46219.0 0.0 0.0
6418 1635515690 46222.0 2.0 1.0
6419 1635515700 46226.0 4.0 1.0
6422 1635515720 46234.0 4.0 1.0
6423 1635515724 46236.0 2.0 1.0
6424 1635515726 46236.0 0.0 2.0
6425 1635515726 46236.0 0.0 2.0
...
6449 1635515807 46236.0 0.0 2.0
6451 1635515820 46241.0 4.0 3.0
6454 1635515840 46249.0 4.0 3.0
...
6471 1635515991 46309.0 4.0 3.0
6472 1635515997 46311.0 2.0 3.0
6473 1635515998 46311.0 0.0 4.0
现在,我们可以筛选组然后聚合:
# filter away groups that are too small or the ones that contains zeros.
df = df.groupby('group').filter(lambda x: x['CounterDifference'].iloc[0] != 0 and len(x) > 6)
# groupby and aggregate the wanted metrics from each group
df = df.groupby('group').agg({'Time': [min, max], 'CounterDifference': [min, max]})
# postprocessing
df = df.reset_index(drop=True).reset_index()
df.columns = ['GroupNo', 'StartTime', 'EndTime', 'StartCounter', 'EndCounter']
最终结果:
GroupNo StartTime EndTime StartCounter EndCounter
0 0 1635515820 1635515997 2.0 4.0
要分配组,groupby CounterDifference,找到大小并有条件地分配。在下面的代码中,我只将组分配给大于 0 的值,并且存在于 4 个及以上的连续组中。
df['group']=np.where(df.groupby('CounterDifference')['Counter'].transform('size').gt(4)&df['CounterDifference'].ne(0),'group','not_valid_group')
Time Counter CounterDifference group
6412 1635515680 46219.0 0.0 not_valid_group
6413 1635515681 46219.0 0.0 not_valid_group
6414 1635515681 46219.0 0.0 not_valid_group
6415 1635515681 46219.0 0.0 not_valid_group
6416 1635515682 46219.0 0.0 not_valid_group
6418 1635515690 46222.0 2.0 not_valid_group
6419 1635515700 46226.0 4.0 not_valid_group
6422 1635515720 46234.0 4.0 not_valid_group
6423 1635515724 46236.0 2.0 not_valid_group
6424 1635515726 46236.0 0.0 not_valid_group
6425 1635515726 46236.0 0.0 not_valid_group
6426 1635515730 46236.0 0.0 not_valid_group
6427 1635515740 46236.0 0.0 not_valid_group
6428 1635515751 46236.0 0.0 not_valid_group
6429 1635515756 46236.0 0.0 not_valid_group
6430 1635515757 46236.0 0.0 not_valid_group
6431 1635515757 46236.0 0.0 not_valid_group
6432 1635515760 46236.0 0.0 not_valid_group
6435 1635515774 46236.0 0.0 not_valid_group
6436 1635515775 46236.0 0.0 not_valid_group
6437 1635515775 46236.0 0.0 not_valid_group
6438 1635515775 46236.0 0.0 not_valid_group
6439 1635515779 46236.0 0.0 not_valid_group
6440 1635515780 46236.0 0.0 not_valid_group
6441 1635515780 46236.0 0.0 not_valid_group
6442 1635515790 46236.0 0.0 not_valid_group
6443 1635515794 46236.0 0.0 not_valid_group
6444 1635515795 46236.0 0.0 not_valid_group
6445 1635515795 46236.0 0.0 not_valid_group
6446 1635515797 46236.0 0.0 not_valid_group
6447 1635515801 46236.0 0.0 not_valid_group
6448 1635515802 46236.0 0.0 not_valid_group
6449 1635515807 46236.0 0.0 not_valid_group
6451 1635515820 46241.0 4.0 group
6454 1635515840 46249.0 4.0 group
6455 1635515850 46253.0 4.0 group
6456 1635515860 46257.0 4.0 group
6457 1635515871 46261.0 4.0 group
6458 1635515881 46265.0 4.0 group
6461 1635515901 46273.0 4.0 group
6462 1635515911 46277.0 4.0 group
6463 1635515921 46281.0 4.0 group
6464 1635515930 46285.0 4.0 group
6465 1635515940 46289.0 4.0 group
6468 1635515960 46297.0 4.0 group
6469 1635515972 46301.0 4.0 group
6470 1635515980 46305.0 4.0 group
6471 1635515991 46309.0 4.0 group
6472 1635515997 46311.0 2.0 not_valid_group
6473 1635515998 46311.0 0.0 not_valid_group