Group consecutive values from a pandas DataFrame

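The dataset contains 3 columns: Time (in epoch seconds), Counter, and CounterDifference.

I need to find groups of consecutive values in the CounterDifference column (groups are separated by 0), but only when the run is longer than 6 values. If such a group is identified, I need the minimum and maximum of Time and Counter, attached to a group number (starting from 0). The new DataFrame should therefore contain GroupNo, StartTime, EndTime, StartCounter, EndCounter.

As shown in the figure, the first group has only 4 consecutive values, so it cannot be assigned as a group. But the second group has 8 consecutive values, so it can be classified as a group.

Data: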

{'Time': {6412: 1635515680,
  6413: 1635515681,
  6414: 1635515681,
  6415: 1635515681,
  6416: 1635515682,
  6418: 1635515690,
  6419: 1635515700,
  6422: 1635515720,
  6423: 1635515724,
  6424: 1635515726,
  6425: 1635515726,
  6426: 1635515730,
  6427: 1635515740,
  6428: 1635515751,
  6429: 1635515756,
  6430: 1635515757,
  6431: 1635515757,
  6432: 1635515760,
  6435: 1635515774,
  6436: 1635515775,
  6437: 1635515775,
  6438: 1635515775,
  6439: 1635515779,
  6440: 1635515780,
  6441: 1635515780,
  6442: 1635515790,
  6443: 1635515794,
  6444: 1635515795,
  6445: 1635515795,
  6446: 1635515797,
  6447: 1635515801,
  6448: 1635515802,
  6449: 1635515807,
  6451: 1635515820,
  6454: 1635515840,
  6455: 1635515850,
  6456: 1635515860,
  6457: 1635515871,
  6458: 1635515881,
  6461: 1635515901,
  6462: 1635515911,
  6463: 1635515921,
  6464: 1635515930,
  6465: 1635515940,
  6468: 1635515960,
  6469: 1635515972,
  6470: 1635515980,
  6471: 1635515991,
  6472: 1635515997,
  6473: 1635515998},
 'Counter': {6412: 46219.0,
  6413: 46219.0,
  6414: 46219.0,
  6415: 46219.0,
  6416: 46219.0,
  6418: 46222.0,
  6419: 46226.0,
  6422: 46234.0,
  6423: 46236.0,
  6424: 46236.0,
  6425: 46236.0,
  6426: 46236.0,
  6427: 46236.0,
  6428: 46236.0,
  6429: 46236.0,
  6430: 46236.0,
  6431: 46236.0,
  6432: 46236.0,
  6435: 46236.0,
  6436: 46236.0,
  6437: 46236.0,
  6438: 46236.0,
  6439: 46236.0,
  6440: 46236.0,
  6441: 46236.0,
  6442: 46236.0,
  6443: 46236.0,
  6444: 46236.0,
  6445: 46236.0,
  6446: 46236.0,
  6447: 46236.0,
  6448: 46236.0,
  6449: 46236.0,
  6451: 46241.0,
  6454: 46249.0,
  6455: 46253.0,
  6456: 46257.0,
  6457: 46261.0,
  6458: 46265.0,
  6461: 46273.0,
  6462: 46277.0,
  6463: 46281.0,
  6464: 46285.0,
  6465: 46289.0,
  6468: 46297.0,
  6469: 46301.0,
  6470: 46305.0,
  6471: 46309.0,
  6472: 46311.0,
  6473: 46311.0},
 'CounterDifference': {6412: 0.0,
  6413: 0.0,
  6414: 0.0,
  6415: 0.0,
  6416: 0.0,
  6418: 2.0,
  6419: 4.0,
  6422: 4.0,
  6423: 2.0,
  6424: 0.0,
  6425: 0.0,
  6426: 0.0,
  6427: 0.0,
  6428: 0.0,
  6429: 0.0,
  6430: 0.0,
  6431: 0.0,
  6432: 0.0,
  6435: 0.0,
  6436: 0.0,
  6437: 0.0,
  6438: 0.0,
  6439: 0.0,
  6440: 0.0,
  6441: 0.0,
  6442: 0.0,
  6443: 0.0,
  6444: 0.0,
  6445: 0.0,
  6446: 0.0,
  6447: 0.0,
  6448: 0.0,
  6449: 0.0,
  6451: 4.0,
  6454: 4.0,
  6455: 4.0,
  6456: 4.0,
  6457: 4.0,
  6458: 4.0,
  6461: 4.0,
  6462: 4.0,
  6463: 4.0,
  6464: 4.0,
  6465: 4.0,
  6468: 4.0,
  6469: 4.0,
  6470: 4.0,
  6471: 4.0,
  6472: 2.0,
  6473: 0.0}}

This can be done by using diff and cumsum to create appropriate groups and then using groupby with filter and agg. We first create the groups:

df['group'] = df['CounterDifference'].eq(0).diff().cumsum()

This gives:

            Time  Counter  CounterDifference group
6412  1635515680  46219.0                0.0   NaN
6413  1635515681  46219.0                0.0   0.0
6414  1635515681  46219.0                0.0   0.0
6415  1635515681  46219.0                0.0   0.0
6416  1635515682  46219.0                0.0   0.0
6418  1635515690  46222.0                2.0   1.0
6419  1635515700  46226.0                4.0   1.0
6422  1635515720  46234.0                4.0   1.0
6423  1635515724  46236.0                2.0   1.0
6424  1635515726  46236.0                0.0   2.0
6425  1635515726  46236.0                0.0   2.0
...
6449  1635515807  46236.0                0.0   2.0
6451  1635515820  46241.0                4.0   3.0
6454  1635515840  46249.0                4.0   3.0
...
6471  1635515991  46309.0                4.0   3.0
6472  1635515997  46311.0                2.0   3.0
6473  1635515998  46311.0                0.0   4.0
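
Note that the first row gets NaN as its group, so groupby silently drops it. A minimal sketch of a variant that labels every row, assuming you compare each row with its predecessor via shift instead of diff; the helper name label_runs and the toy values are made up for illustration:

```python
import pandas as pd

def label_runs(values: pd.Series) -> pd.Series:
    """Give every run of consecutive zero / non-zero values its own integer id."""
    is_zero = values.eq(0).astype(int)
    # A new run starts wherever the zero/non-zero flag differs from the previous row.
    # The first row compares against NaN, so it always starts a run.
    starts = is_zero.ne(is_zero.shift())
    # The cumulative sum of the start flags numbers the runs 1, 2, 3, ...;
    # subtracting 1 makes the numbering start at 0.
    return starts.cumsum() - 1

toy = pd.Series([0.0, 0.0, 2.0, 4.0, 0.0, 4.0, 4.0, 2.0, 0.0])
print(label_runs(toy).tolist())
```

Zero and non-zero runs alternate, so in this sketch the non-zero runs are the ones with odd ids when the column starts with a zero.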

Now we can filter the groups and then aggregate:

# drop groups that are too short or that consist of zeros
df = df.groupby('group').filter(lambda x: x['CounterDifference'].iloc[0] != 0 and len(x) > 6)

# group by and aggregate the wanted metrics from each group
df = df.groupby('group').agg({'Time': ['min', 'max'], 'Counter': ['min', 'max']})

# postprocessing: number the groups from 0 and flatten the column names
df = df.reset_index(drop=True).reset_index()
df.columns = ['GroupNo', 'StartTime', 'EndTime', 'StartCounter', 'EndCounter']

Final result:

   GroupNo   StartTime     EndTime  StartCounter  EndCounter
0        0  1635515820  1635515997       46241.0     46311.0
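
The whole pipeline can also be wrapped in one function. The following is only a sketch under stated assumptions: the function name summarize_runs, the min_len parameter, and the toy frame are hypothetical, and named aggregation is used in place of the dict form so the output columns get their final names directly.

```python
import pandas as pd

def summarize_runs(df: pd.DataFrame, min_len: int = 7) -> pd.DataFrame:
    """Summarise each run of consecutive non-zero CounterDifference rows."""
    nonzero = df["CounterDifference"].ne(0)
    # Number the alternating zero / non-zero runs; the first row always starts a run.
    run = nonzero.astype(int).diff().ne(0).cumsum()
    # Length of the run each row belongs to.
    size = run.map(run.value_counts())
    # Keep only non-zero rows that sit in a sufficiently long run.
    kept = df.assign(run=run)[nonzero & size.ge(min_len)]
    out = (
        kept.groupby("run")
        .agg(
            StartTime=("Time", "min"),
            EndTime=("Time", "max"),
            StartCounter=("Counter", "min"),
            EndCounter=("Counter", "max"),
        )
        .reset_index(drop=True)
    )
    out.insert(0, "GroupNo", range(len(out)))  # group numbers start at 0
    return out

# Toy data (made up): a non-zero run of 2 rows and a non-zero run of 7 rows.
toy = pd.DataFrame({
    "Time": range(100, 112),
    "CounterDifference": [0, 4, 4, 0, 4, 4, 4, 4, 4, 4, 2, 0],
})
toy["Counter"] = 1000 + toy["CounterDifference"].cumsum()
result = summarize_runs(toy)
print(result)
```

With min_len=7 this mirrors the "longer than 6" condition; only the 7-row run should survive in the toy example.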

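To assign groups, label each run of consecutive identical CounterDifference values, find the size of each run, and assign conditionally. In the code below (which assumes numpy is imported as np), a row is marked as part of a group only if its value is non-zero and it belongs to a run of more than 4 consecutive identical values.

  run = df['CounterDifference'].diff().ne(0).cumsum()
  df['group'] = np.where(df.groupby(run)['Counter'].transform('size').gt(4) & df['CounterDifference'].ne(0), 'group', 'not_valid_group')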


           Time  Counter  CounterDifference            group
6412  1635515680  46219.0                0.0  not_valid_group
6413  1635515681  46219.0                0.0  not_valid_group
6414  1635515681  46219.0                0.0  not_valid_group
6415  1635515681  46219.0                0.0  not_valid_group
6416  1635515682  46219.0                0.0  not_valid_group
6418  1635515690  46222.0                2.0  not_valid_group
6419  1635515700  46226.0                4.0  not_valid_group
6422  1635515720  46234.0                4.0  not_valid_group
6423  1635515724  46236.0                2.0  not_valid_group
6424  1635515726  46236.0                0.0  not_valid_group
6425  1635515726  46236.0                0.0  not_valid_group
6426  1635515730  46236.0                0.0  not_valid_group
6427  1635515740  46236.0                0.0  not_valid_group
6428  1635515751  46236.0                0.0  not_valid_group
6429  1635515756  46236.0                0.0  not_valid_group
6430  1635515757  46236.0                0.0  not_valid_group
6431  1635515757  46236.0                0.0  not_valid_group
6432  1635515760  46236.0                0.0  not_valid_group
6435  1635515774  46236.0                0.0  not_valid_group
6436  1635515775  46236.0                0.0  not_valid_group
6437  1635515775  46236.0                0.0  not_valid_group
6438  1635515775  46236.0                0.0  not_valid_group
6439  1635515779  46236.0                0.0  not_valid_group
6440  1635515780  46236.0                0.0  not_valid_group
6441  1635515780  46236.0                0.0  not_valid_group
6442  1635515790  46236.0                0.0  not_valid_group
6443  1635515794  46236.0                0.0  not_valid_group
6444  1635515795  46236.0                0.0  not_valid_group
6445  1635515795  46236.0                0.0  not_valid_group
6446  1635515797  46236.0                0.0  not_valid_group
6447  1635515801  46236.0                0.0  not_valid_group
6448  1635515802  46236.0                0.0  not_valid_group
6449  1635515807  46236.0                0.0  not_valid_group
6451  1635515820  46241.0                4.0            group
6454  1635515840  46249.0                4.0            group
6455  1635515850  46253.0                4.0            group
6456  1635515860  46257.0                4.0            group
6457  1635515871  46261.0                4.0            group
6458  1635515881  46265.0                4.0            group
6461  1635515901  46273.0                4.0            group
6462  1635515911  46277.0                4.0            group
6463  1635515921  46281.0                4.0            group
6464  1635515930  46285.0                4.0            group
6465  1635515940  46289.0                4.0            group
6468  1635515960  46297.0                4.0            group
6469  1635515972  46301.0                4.0            group
6470  1635515980  46305.0                4.0            group
6471  1635515991  46309.0                4.0            group
6472  1635515997  46311.0                2.0  not_valid_group
6473  1635515998  46311.0                0.0  not_valid_group
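
One thing to watch with size-based labelling: grouping directly on the column value counts every occurrence of that value anywhere in the column, while grouping on a run id counts only consecutive occurrences. A small sketch of the difference, using made-up toy values:

```python
import pandas as pd

toy = pd.DataFrame(
    {"CounterDifference": [4.0, 4.0, 0.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0]}
)
col = toy["CounterDifference"]

# Size of the group sharing the same value anywhere in the column.
size_by_value = toy.groupby("CounterDifference")["CounterDifference"].transform("size")

# Size of the run of consecutive equal values each row belongs to.
run_id = col.ne(col.shift()).cumsum()
size_by_run = col.groupby(run_id).transform("size")

comparison = pd.DataFrame(
    {"value": col, "size_by_value": size_by_value, "size_by_run": size_by_run}
)
print(comparison)
```

Here the two leading 4.0 rows get a size of 7 when grouped by value but only 2 when grouped by run, so only the run-based size separates the short run from the long one.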