Python:根据序列总和根据条件聚合数据帧值

Python: Aggregate dataframe values according to condition depending on series sum

我有以下虚拟数据:

data = {"trip_id": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "c"],
        "distance": [5, 10, 2, 12, 20, 1, 15, 2, 3, 5, 0, 28],
        "value": [32, 46, 132, 45, 68, 123, 56, 54, 32, 89, 10, 39],
        "timestamp": ["2021-05-09 10:12:18", "2021-06-19 08:22:18", "2021-01-09 15:20:00", "2021-01-09 18:30:18",
                "2021-04-09 24:11:18", "2021-02-19 08:12:18", "2021-06-11 08:56:18", "2021-06-11 08:54:18",
                "2021-06-09 18:12:18", "2021-06-19 16:10:18", "2021-06-03 18:12:18", "2021-06-12 08:22:18"]
       }

df = pd.DataFrame(data, columns = ["trip_id", "distance", "value", "timestamp"])
df

>>
trip_id distance    value   timestamp
0   a   5   32  2021-05-09 10:12:18
1   a   10  46  2021-06-19 08:22:18
2   a   2   132 2021-01-09 15:20:00
3   a   12  45  2021-01-09 18:30:18
4   b   20  68  2021-04-09 24:11:18
5   b   1   123 2021-02-19 08:12:18
6   b   15  56  2021-06-11 08:56:18
7   c   2   54  2021-06-11 08:54:18
8   c   3   32  2021-06-09 18:12:18
9   c   5   89  2021-06-19 16:10:18
10  c   0   10  2021-06-03 18:12:18
11  c   28  39  2021-06-12 08:22:18

我想压缩每 trip_id 距离列中每 30 米的值。为此,我正在尝试:

current_trip_id = ""
dist_segment = []

for index,row in df.iterrows():
    
    if row["trip_id"] == current_trip_id:
        
        if sum(dist_segment) <= 30:
            # compute trip characteristics every 30m
            sorted_trip_char = df.sort_values(["trip_id", "timestamp"])
            grouped_trip_char = sorted_trip_char.groupby("trip_id")
            
            agg_trip_char = grouped_trip_char.agg({'trip_id': 'first',
                                       'distance':'sum',
                                       'value': 'mean',
                                       'timestamp':['first', 'last']
                                      })
            
        else:
            # store trip characteristics to another line up to 30m
            print("More than 30m in index", index)
            
            
    
    # Update loop
    current_trip_id = row["trip_id"]
    dist_segment.append(row["distance"])

但是,我看不出如何仍然保留 id 并跳转到新的值聚合。

这是我想要的输出:

agg_trip_char
>>

          trip_id   distance    value      timestamp
          first     sum         mean       first                  last
trip_id                 
a         a         29         63.75        2021-01-09 15:20:00   2021-06-19 08:22:18
b         b         21         95.50        2021-02-19 08:12:18   2021-06-11 08:56:18
b         b         15         56           2021-06-11 08:56:18   2021-06-11 08:56:18
c         c         10         46.25        2021-06-11 08:54:18   2021-06-03 18:12:18
c         c         28         39           2021-06-12 08:22:18   2021-06-12 08:22:18

您可以修改您的代码如下:

  1. 对列进行排序
df_out = df.sort_values(["trip_id", "timestamp"])
  1. 在同一 trip_id
  2. 内每隔 30 米创建一组
g = df_out.groupby('trip_id')['distance'].cumsum() // 30

在这里,我们使用.groupby() + .cumsum()得到特定trip_iddistance的累计和。然后用整数除以30,将每30m分成不同的组号。

  1. 聚合列
df_out2 = (df_out.groupby(['trip_id', g])
                 .agg({'trip_id': 'first',
                       'distance':'sum',
                       'value': 'mean',
                       'timestamp':['first', 'last']
                      }).droplevel(-1)
          )                 

结果:

print(df_out2)

        trip_id distance  value            timestamp                     
          first      sum   mean                first                 last
trip_id                                                                  
a             a       29  63.75  2021-01-09 15:20:00  2021-06-19 08:22:18
b             b       21  95.50  2021-02-19 08:12:18  2021-04-09 24:11:18
b             b       15  56.00  2021-06-11 08:56:18  2021-06-11 08:56:18
c             c        5  32.00  2021-06-03 18:12:18  2021-06-11 08:54:18
c             c       33  64.00  2021-06-12 08:22:18  2021-06-19 16:10:18

如果我们不对列进行排序,结果将更接近您的预期输出:

#df_out = df.sort_values(["trip_id", "timestamp"])
df_out = df
g = df_out.groupby('trip_id')['distance'].cumsum() // 30
df_out2 = (df_out.groupby(['trip_id', g])
                 .agg({'trip_id': 'first',
                       'distance':'sum',
                       'value': 'mean',
                       'timestamp':['first', 'last']
                      }).droplevel(-1)
          )                 

结果:

print(df_out2)

        trip_id distance  value            timestamp                     
          first      sum   mean                first                 last
trip_id                                                                  
a             a       29  63.75  2021-05-09 10:12:18  2021-01-09 18:30:18
b             b       21  95.50  2021-04-09 24:11:18  2021-02-19 08:12:18
b             b       15  56.00  2021-06-11 08:56:18  2021-06-11 08:56:18
c             c       10  46.25  2021-06-11 08:54:18  2021-06-03 18:12:18
c             c       28  39.00  2021-06-12 08:22:18  2021-06-12 08:22:18

不过,如果需要按时间顺序排序的话,trip_idtimestamp似乎也比较合理。因此,请结合您的实际情况,做出相应的选择。