Distance for each trip
I want to compute the distance between two GPS coordinates (the first and the last of each tripId) to get the distance of each trip.
My dataframe looks like this:
tripId latitude longitude timestamp
0 1817603 53.155273 8.207176 2021-05-24 00:29:22
1 1817603 53.155271 8.206898 2021-05-24 00:29:38
2 1817603 53.155213 8.206314 2021-05-24 00:29:44
3 1817603 53.155135 8.206429 2021-05-24 00:29:50
4 1817603 53.154950 8.206565 2021-05-24 00:29:56
... ... ... ... ...
195 1817888 53.092805 8.212095 2021-05-24 08:27:54
196 1817888 53.093024 8.211756 2021-05-24 08:27:59
197 1817888 53.093305 8.211383 2021-05-24 08:28:05
198 1817888 53.093594 8.211026 2021-05-24 08:28:10
199 1817888 53.093853 8.210708 2021-05-24 08:28:15
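I used s = pd.Series(haversine_vector(df, df.shift(), Unit.KILOMETERS), index=df.index, name='distance_K')
to do this for each step,
but I need the distance of the whole trip for each id.
I used the following as a test and it works, but I need to know the exact duration of each trip (the final duration):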
for i in range(1,df.shape[0]-1):
    if df['tripId'][i]==df['tripId'][i+1]:
        df['distance'][i]=df['distance'][i-1]+df['distance_K'][i]
    else:
        df['distance'][i]=df['distance_K'][i]
Use groupby_apply to compute the haversine distance for each trip:
import numpy as np

# Inspired by
def haversine_series(sr):
    # sr is the sub-frame of one trip; pair each point with the previous one.
    # The first point is paired with itself, so its distance is 0.
    lon1 = sr['longitude']
    lat1 = sr['latitude']
    lon2 = sr['longitude'].shift(fill_value=sr['longitude'].iloc[0])
    lat2 = sr['latitude'].shift(fill_value=sr['latitude'].iloc[0])
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c  # mean Earth radius in km
    return km

# droplevel(0) removes the tripId level so the result aligns with df's index
df['distance_K'] = df.groupby('tripId').apply(haversine_series).droplevel(0)
Note: I assume your dataframe is already sorted by the timestamp column.
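If the rows are not already in time order, the step-by-step distances pair up the wrong points. A minimal sketch of sorting first, assuming timestamp is stored as strings and using a small toy frame built from three of the sample rows:

```python
import pandas as pd

# Toy frame with rows deliberately out of order (values taken from the sample data).
df = pd.DataFrame({
    "tripId": [1817603, 1817888, 1817603],
    "latitude": [53.155271, 53.092805, 53.155273],
    "longitude": [8.206898, 8.212095, 8.207176],
    "timestamp": ["2021-05-24 00:29:38", "2021-05-24 08:27:54", "2021-05-24 00:29:22"],
})

# Parse the strings so sorting and subtraction operate on real datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Sort by trip, then by time within each trip, and rebuild a clean 0..n-1 index.
df = df.sort_values(["tripId", "timestamp"]).reset_index(drop=True)
```

After this, consecutive rows within a tripId are consecutive in time, which is what the shift-based distance relies on.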
At this point, your dataframe looks like this:
>>> df
tripId latitude longitude timestamp distance_K
0 1817603 53.155273 8.207176 2021-05-24 00:29:22 0.000000
1 1817603 53.155271 8.206898 2021-05-24 00:29:38 0.018538
2 1817603 53.155213 8.206314 2021-05-24 00:29:44 0.039470
3 1817603 53.155135 8.206429 2021-05-24 00:29:50 0.011577
4 1817603 53.154950 8.206565 2021-05-24 00:29:56 0.022481
195 1817888 53.092805 8.212095 2021-05-24 08:27:54 0.000000
196 1817888 53.093024 8.211756 2021-05-24 08:27:59 0.033248
197 1817888 53.093305 8.211383 2021-05-24 08:28:05 0.039958
198 1817888 53.093594 8.211026 2021-05-24 08:28:10 0.040012
199 1817888 53.093853 8.210708 2021-05-24 08:28:15 0.035781
The total distance and time of each trip are now easy to get with groupby_agg:
>>> df.groupby('tripId') \
...     .agg(total_distance=('distance_K', 'sum'),
...          total_time=('timestamp', lambda x: x.max()-x.min())) \
...     .reset_index()
tripId total_distance total_time
0 1817603 0.092066 0 days 00:00:34
1 1817888 0.148999 0 days 00:00:21
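The loop in the question builds a running total that restarts at each new trip. A minimal sketch of the same idea without a Python loop, assuming a distance_K column like the one shown in the output above (the values here are copied from that output):

```python
import pandas as pd

# Per-step distances in km for two trips, copied from the sample output.
df = pd.DataFrame({
    "tripId": [1817603] * 5 + [1817888] * 5,
    "distance_K": [0.0, 0.018538, 0.039470, 0.011577, 0.022481,
                   0.0, 0.033248, 0.039958, 0.040012, 0.035781],
})

# cumsum is applied within each tripId, so the running total resets per trip.
df["distance"] = df.groupby("tripId")["distance_K"].cumsum()

# The last running value of each trip equals the trip's total distance.
totals = df.groupby("tripId")["distance"].last()
```

Here the last running value of each trip matches the total_distance column from the aggregation.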
You can use haversine_vector:
from haversine import haversine_vector
df = df.groupby('tripId').apply(
    lambda g: g.assign(distance=lambda g: [0, *haversine_vector(
        g.iloc[:-1][['latitude', 'longitude']].values,
        g.iloc[1:][['latitude', 'longitude']].values,
    )])
).droplevel(0)
df
# tripId latitude longitude timestamp distance
# 0 1817603 53.155273 8.207176 2021-05-24 00:29:22 0.000000
# 1 1817603 53.155271 8.206898 2021-05-24 00:29:38 0.018538
# 2 1817603 53.155213 8.206314 2021-05-24 00:29:44 0.039470
# 3 1817603 53.155135 8.206429 2021-05-24 00:29:50 0.011577
# 4 1817603 53.154950 8.206565 2021-05-24 00:29:56 0.022481
# 5 1817888 53.092805 8.212095 2021-05-24 08:27:54 0.000000
# 6 1817888 53.093024 8.211756 2021-05-24 08:27:59 0.033248
# 7 1817888 53.093305 8.211383 2021-05-24 08:28:05 0.039958
# 8 1817888 53.093594 8.211026 2021-05-24 08:28:10 0.040012
# 9 1817888 53.093853 8.210708 2021-05-24 08:28:15 0.035781
And to get the total time and distance:
df.groupby('tripId').agg(
    {
        'timestamp': lambda g: g.iloc[-1] - g.iloc[0],
        'distance':'sum'
    }
)
# timestamp distance
# tripId
# 1817603 0 days 00:00:34 0.092066
# 1817888 0 days 00:00:21 0.148999
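Both answers sum the step distances, which gives the length of the path travelled. If what you want is literally the straight-line distance between the first and the last point of each trip, as the question is worded, here is a minimal sketch. It assumes the rows are sorted by time within each trip; haversine_km and the lat_first/lon_first/lat_last/lon_last column names are hypothetical helpers introduced for this example, and the sample rows are the first and last points of each trip from the question.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (mean Earth radius 6371 km), vectorized."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# First and last sample point of each trip, already in time order.
df = pd.DataFrame({
    "tripId": [1817603, 1817603, 1817888, 1817888],
    "latitude": [53.155273, 53.154950, 53.092805, 53.093853],
    "longitude": [8.207176, 8.206565, 8.212095, 8.210708],
})

# One row per trip holding the coordinates of its first and last point.
ends = df.groupby("tripId").agg(
    lat_first=("latitude", "first"), lon_first=("longitude", "first"),
    lat_last=("latitude", "last"), lon_last=("longitude", "last"),
)

# Straight-line (great-circle) distance between the two endpoints.
ends["straight_line_km"] = haversine_km(
    ends["lat_first"], ends["lon_first"], ends["lat_last"], ends["lon_last"]
)
```

This value can never exceed the summed path length, and it is noticeably smaller when a trip changes direction.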