Python:根据序列总和根据条件聚合数据帧值
Python: Aggregate dataframe values according to condition depending on series sum
我有以下虚拟数据:
data = {"trip_id": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "c"],
"distance": [5, 10, 2, 12, 20, 1, 15, 2, 3, 5, 0, 28],
"value": [32, 46, 132, 45, 68, 123, 56, 54, 32, 89, 10, 39],
"timestamp": ["2021-05-09 10:12:18", "2021-06-19 08:22:18", "2021-01-09 15:20:00", "2021-01-09 18:30:18",
"2021-04-09 24:11:18", "2021-02-19 08:12:18", "2021-06-11 08:56:18", "2021-06-11 08:54:18",
"2021-06-09 18:12:18", "2021-06-19 16:10:18", "2021-06-03 18:12:18", "2021-06-12 08:22:18"]
}
df = pd.DataFrame(data, columns = ["trip_id", "distance", "value", "timestamp"])
df
>>
trip_id distance value timestamp
0 a 5 32 2021-05-09 10:12:18
1 a 10 46 2021-06-19 08:22:18
2 a 2 132 2021-01-09 15:20:00
3 a 12 45 2021-01-09 18:30:18
4 b 20 68 2021-04-09 24:11:18
5 b 1 123 2021-02-19 08:12:18
6 b 15 56 2021-06-11 08:56:18
7 c 2 54 2021-06-11 08:54:18
8 c 3 32 2021-06-09 18:12:18
9 c 5 89 2021-06-19 16:10:18
10 c 0 10 2021-06-03 18:12:18
11 c 28 39 2021-06-12 08:22:18
我想压缩每 trip_id 距离列中每 30 米的值。为此,我正在尝试:
current_trip_id = ""
dist_segment = []
for index,row in df.iterrows():
if row["trip_id"] == current_trip_id:
if sum(dist_segment) <= 30:
# compute trip characteristics every 30m
sorted_trip_char = df.sort_values(["trip_id", "timestamp"])
grouped_trip_char = sorted_trip_char.groupby("trip_id")
agg_trip_char = grouped_trip_char.agg({'trip_id': 'first',
'distance':'sum',
'value': 'mean',
'timestamp':['first', 'last']
})
else:
# store trip characteristics to another line up to 30m
print("More than 30m in index", index)
# Update loop
current_trip_id = row["trip_id"]
dist_segment.append(row["distance"])
但是,我看不出如何仍然保留 id 并跳转到新的值聚合。
这是我想要的输出:
agg_trip_char
>>
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-01-09 15:20:00 2021-06-19 08:22:18
b b 21 95.50 2021-02-19 08:12:18 2021-06-11 08:56:18
b b 15 56 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 10 46.25 2021-06-11 08:54:18 2021-06-03 18:12:18
c c 28 39 2021-06-12 08:22:18 2021-06-12 08:22:18
您可以修改您的代码如下:
- 对列进行排序
df_out = df.sort_values(["trip_id", "timestamp"])
- 在同一
trip_id
内每隔 30 米创建一组
g = df_out.groupby('trip_id')['distance'].cumsum() // 30
在这里,我们使用.groupby()
+ .cumsum()
得到特定trip_id
内distance
的累计和。然后用整数除以30,将每30m分成不同的组号。
- 聚合列
df_out2 = (df_out.groupby(['trip_id', g])
.agg({'trip_id': 'first',
'distance':'sum',
'value': 'mean',
'timestamp':['first', 'last']
}).droplevel(-1)
)
结果:
print(df_out2)
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-01-09 15:20:00 2021-06-19 08:22:18
b b 21 95.50 2021-02-19 08:12:18 2021-04-09 24:11:18
b b 15 56.00 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 5 32.00 2021-06-03 18:12:18 2021-06-11 08:54:18
c c 33 64.00 2021-06-12 08:22:18 2021-06-19 16:10:18
如果我们不对列进行排序,结果将更接近您的预期输出:
#df_out = df.sort_values(["trip_id", "timestamp"])
df_out = df
g = df_out.groupby('trip_id')['distance'].cumsum() // 30
df_out2 = (df_out.groupby(['trip_id', g])
.agg({'trip_id': 'first',
'distance':'sum',
'value': 'mean',
'timestamp':['first', 'last']
}).droplevel(-1)
)
结果:
print(df_out2)
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-05-09 10:12:18 2021-01-09 18:30:18
b b 21 95.50 2021-04-09 24:11:18 2021-02-19 08:12:18
b b 15 56.00 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 10 46.25 2021-06-11 08:54:18 2021-06-03 18:12:18
c c 28 39.00 2021-06-12 08:22:18 2021-06-12 08:22:18
不过,如果需要按时间顺序排序的话,trip_id
和timestamp
似乎也比较合理。因此,请结合您的实际情况,做出相应的选择。
我有以下虚拟数据:
data = {"trip_id": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "c"],
"distance": [5, 10, 2, 12, 20, 1, 15, 2, 3, 5, 0, 28],
"value": [32, 46, 132, 45, 68, 123, 56, 54, 32, 89, 10, 39],
"timestamp": ["2021-05-09 10:12:18", "2021-06-19 08:22:18", "2021-01-09 15:20:00", "2021-01-09 18:30:18",
"2021-04-09 24:11:18", "2021-02-19 08:12:18", "2021-06-11 08:56:18", "2021-06-11 08:54:18",
"2021-06-09 18:12:18", "2021-06-19 16:10:18", "2021-06-03 18:12:18", "2021-06-12 08:22:18"]
}
df = pd.DataFrame(data, columns = ["trip_id", "distance", "value", "timestamp"])
df
>>
trip_id distance value timestamp
0 a 5 32 2021-05-09 10:12:18
1 a 10 46 2021-06-19 08:22:18
2 a 2 132 2021-01-09 15:20:00
3 a 12 45 2021-01-09 18:30:18
4 b 20 68 2021-04-09 24:11:18
5 b 1 123 2021-02-19 08:12:18
6 b 15 56 2021-06-11 08:56:18
7 c 2 54 2021-06-11 08:54:18
8 c 3 32 2021-06-09 18:12:18
9 c 5 89 2021-06-19 16:10:18
10 c 0 10 2021-06-03 18:12:18
11 c 28 39 2021-06-12 08:22:18
我想压缩每 trip_id 距离列中每 30 米的值。为此,我正在尝试:
current_trip_id = ""
dist_segment = []
for index,row in df.iterrows():
if row["trip_id"] == current_trip_id:
if sum(dist_segment) <= 30:
# compute trip characteristics every 30m
sorted_trip_char = df.sort_values(["trip_id", "timestamp"])
grouped_trip_char = sorted_trip_char.groupby("trip_id")
agg_trip_char = grouped_trip_char.agg({'trip_id': 'first',
'distance':'sum',
'value': 'mean',
'timestamp':['first', 'last']
})
else:
# store trip characteristics to another line up to 30m
print("More than 30m in index", index)
# Update loop
current_trip_id = row["trip_id"]
dist_segment.append(row["distance"])
但是,我看不出如何仍然保留 id 并跳转到新的值聚合。
这是我想要的输出:
agg_trip_char
>>
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-01-09 15:20:00 2021-06-19 08:22:18
b b 21 95.50 2021-02-19 08:12:18 2021-06-11 08:56:18
b b 15 56 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 10 46.25 2021-06-11 08:54:18 2021-06-03 18:12:18
c c 28 39 2021-06-12 08:22:18 2021-06-12 08:22:18
您可以修改您的代码如下:
- 对列进行排序
df_out = df.sort_values(["trip_id", "timestamp"])
- 在同一
trip_id
内每隔 30 米创建一组
g = df_out.groupby('trip_id')['distance'].cumsum() // 30
在这里,我们使用.groupby()
+ .cumsum()
得到特定trip_id
内distance
的累计和。然后用整数除以30,将每30m分成不同的组号。
- 聚合列
df_out2 = (df_out.groupby(['trip_id', g])
.agg({'trip_id': 'first',
'distance':'sum',
'value': 'mean',
'timestamp':['first', 'last']
}).droplevel(-1)
)
结果:
print(df_out2)
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-01-09 15:20:00 2021-06-19 08:22:18
b b 21 95.50 2021-02-19 08:12:18 2021-04-09 24:11:18
b b 15 56.00 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 5 32.00 2021-06-03 18:12:18 2021-06-11 08:54:18
c c 33 64.00 2021-06-12 08:22:18 2021-06-19 16:10:18
如果我们不对列进行排序,结果将更接近您的预期输出:
#df_out = df.sort_values(["trip_id", "timestamp"])
df_out = df
g = df_out.groupby('trip_id')['distance'].cumsum() // 30
df_out2 = (df_out.groupby(['trip_id', g])
.agg({'trip_id': 'first',
'distance':'sum',
'value': 'mean',
'timestamp':['first', 'last']
}).droplevel(-1)
)
结果:
print(df_out2)
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-05-09 10:12:18 2021-01-09 18:30:18
b b 21 95.50 2021-04-09 24:11:18 2021-02-19 08:12:18
b b 15 56.00 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 10 46.25 2021-06-11 08:54:18 2021-06-03 18:12:18
c c 28 39.00 2021-06-12 08:22:18 2021-06-12 08:22:18
不过,如果需要按时间顺序排序的话,trip_id
和timestamp
似乎也比较合理。因此,请结合您的实际情况,做出相应的选择。