Flightradar24 pandas groupby and vectorize: a no-looping solution
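I want to run a fast operation on flight radar data to check whether the speed implied by the distance travelled matches the reported speed. I have multiple flights and have been told not to run a double loop over a pandas DataFrame. Here is a sample DataFrame: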
import pandas as pd
from datetime import datetime
from shapely.geometry import Point
from geopy.distance import distance
dates = ['2020-12-26 15:13:01', '2020-12-26 15:13:07','2020-12-26 15:13:19','2020-12-26 15:13:32','2020-12-26 15:13:38']
datetimes = [datetime.fromisoformat(date) for date in dates]
data = {'UTC': datetimes,
        'Callsign': ["1", "1", "2", "2", "2"],
        'Position': [Point(30.542175, -91.13999200000001),
                     Point(30.546204, -91.14020499999999),
                     Point(30.551443, -91.14417299999999),
                     Point(30.553909, -91.15136699999999),
                     Point(30.554489, -91.155075)]
        }
df = pd.DataFrame(data)
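What I want to do is add a new column called "dist". This column should be 0 for the first element of each new callsign, and otherwise the distance between a point and the previous point.
The resulting df should look like this: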
df1 = df
dist = [0,0.27783309075379214,0,0.46131362750613436,0.22464461718704595]
df1['dist'] = dist
What I tried was to first assign a group index:
df['group_index'] = df.groupby('Callsign').cumcount()
Then I group by callsign and try to apply a function:
df['dist'] = df.groupby('Callsign').apply(lambda g: 0 if g.group_index == 0 else distance((g.Position.x , g.Position.y),
(g.Position.shift().x , g.Position.shift().y)).miles)
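I expected this to give 0 for the first index of each group, then run the distance function on all the other rows and return a value in miles. However, it does not work.
The code fails for at least one reason: the .x and .y attributes of the shapely objects are being called on the Series rather than on the objects themselves.
Any ideas on how to fix this would be much appreciated.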
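The failure described above can be reproduced in isolation. The sketch below is illustrative only: `Pt` is a hypothetical stand-in class exposing just `.x` and `.y` (the two attributes the question uses on shapely's `Point`), so the snippet needs nothing beyond pandas. It shows that attribute access on the Series fails, while applying a function to each element works. A second likely problem in the attempt, `0 if g.group_index == 0 else ...`, evaluates a whole Series in a boolean context, which pandas rejects with a ValueError for multi-row groups.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Pt:
    """Hypothetical stand-in for a point object: only .x and .y."""
    x: float
    y: float


positions = pd.Series([Pt(30.542175, -91.139992), Pt(30.546204, -91.140205)])

# The Series itself has no .x attribute; only its elements do.
try:
    positions.x
    series_has_x = True
except AttributeError:
    series_has_x = False

# Element-wise access: call the attribute on each object in the Series.
xs = positions.apply(lambda p: p.x)
ys = positions.apply(lambda p: p.y)

print(series_has_x)
print(xs.tolist(), ys.tolist())
```

Unpacking coordinates into plain float columns this way is also what makes array-based distance formulas possible later on.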
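- Sort df by callsign first, then by timestamp
- Compute the distance between adjacent rows using a temporary column of shifted points
- For the first row of each new callsign, set the distance to 0
- Drop the temporary column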
df = df.sort_values(by=['Callsign', 'UTC'])
df['Position_prev'] = df['Position'].shift().bfill()
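def get_dist(row):
    # Distance in miles between this row's point and the previous row's point
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles

# Apply the function to every row
df['dist'] = df.apply(get_dist, axis=1)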
# Flag row if callsign is different from previous row callsign
new_callsign_rows = df['Callsign'] != df['Callsign'].shift()
# Zero out the first distance of each callsign group
df.loc[new_callsign_rows, 'dist'] = 0.0
# Drop shifted column
df = df.drop(columns='Position_prev')
print(df)
                   UTC  Callsign                              Position      dist
0  2020-12-26 15:13:01         1  POINT (30.542175 -91.13999200000001)  0.000000
1  2020-12-26 15:13:07         1  POINT (30.546204 -91.14020499999999)  0.277833
2  2020-12-26 15:13:19         2  POINT (30.551443 -91.14417299999999)  0.000000
3  2020-12-26 15:13:32         2  POINT (30.553909 -91.15136699999999)  0.461314
4  2020-12-26 15:13:38         2          POINT (30.554489 -91.155075)  0.224645
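The answer above still calls a Python function once per row through `df.apply(..., axis=1)`. The sketch below shows one possible fully array-based variant. It rests on several assumptions that are not in the original. The coordinates are unpacked into plain float columns named `lat` and `lon`, with the first coordinate of each point treated as latitude, matching how the answer passes `(x, y)` to geopy. The previous point is taken per callsign with `groupby(...).shift()`. The distance uses the haversine formula on a sphere with a mean Earth radius of 3958.8 miles, rather than geopy's geodesic calculation, so results may differ from the values above by a fraction of a percent. The `implied_mph` column is a hypothetical addition that aims at the stated goal of comparing against reported speed.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical approximation)

df = pd.DataFrame({
    "UTC": pd.to_datetime(["2020-12-26 15:13:01", "2020-12-26 15:13:07",
                           "2020-12-26 15:13:19", "2020-12-26 15:13:32",
                           "2020-12-26 15:13:38"]),
    "Callsign": ["1", "1", "2", "2", "2"],
    # Assumed layout: coordinates already unpacked from the Point objects.
    "lat": [30.542175, 30.546204, 30.551443, 30.553909, 30.554489],
    "lon": [-91.139992, -91.140205, -91.144173, -91.151367, -91.155075],
}).sort_values(["Callsign", "UTC"])


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars, arrays, or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Previous position within each callsign; the first row of a group gets NaN.
prev = df.groupby("Callsign")[["lat", "lon"]].shift()

# NaN distances mark the first row of each callsign and are set to 0.
df["dist"] = haversine_miles(prev["lat"], prev["lon"],
                             df["lat"], df["lon"]).fillna(0.0)

# Seconds since the previous report of the same callsign (NaN on first rows).
elapsed_s = df.groupby("Callsign")["UTC"].diff().dt.total_seconds()
df["implied_mph"] = df["dist"] / (elapsed_s / 3600.0)

print(df)
```

Because the shift happens inside each group, no separate step is needed to zero out the first row of a callsign. The first row of every group has no elapsed time, so its `implied_mph` is left as NaN rather than guessed.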