Run a function and sum with the next row
My dataset df looks like this:
main_id time day lat long
1 2019-05-31 1 53.5501667 9.9716466
1 2019-05-31 1 53.6101545 9.9568781
1 2019-05-30 1 53.5501309 9.9716300
1 2019-05-30 2 53.5501309 9.9716300
1 2019-05-30 2 53.4561309 9.1246300
2 2019-06-31 4 53.5501667 9.9716466
2 2019-06-31 4 53.6101545 9.9568781
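For each day, I want to calculate the total distance covered by each main_id. To compute the distance between two sets of coordinates, I can use this function:
import geopy.distance

def find_kms(coords_1, coords_2):
    return geopy.distance.geodesic(coords_1, coords_2).km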
But I'm not sure how to sum the distances when grouping by main_id and day. The final result could be a new df like this:
main_id day total_dist time
1 1 ... 2019-05-31
1 2 .... 2019-05-31
2 4 .... 2019-05-31
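where the exported time is any value, or the first value, from the time column for the corresponding main_id and day.
total_dist calculation:
For example, for the first row, with main_id == 1 and day 1, total_dist would be computed as:
find_kms((53.5501667, 9.9716466), (53.6101545, 9.9568781)) + find_kms((53.6101545, 9.9568781), (53.5501309, 9.9716300))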
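The arithmetic above amounts to summing the distance over consecutive coordinate pairs. A minimal standard-library sketch of that idea follows. It swaps in the haversine (spherical) formula for geopy's geodesic so that it runs without dependencies, which means the result will differ slightly from the geodesic value; haversine_km and EARTH_RADIUS_KM are illustrative names, not part of the question.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical model)

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# The three points of main_id == 1, day 1, in row order
points = [
    (53.5501667, 9.9716466),
    (53.6101545, 9.9568781),
    (53.5501309, 9.9716300),
]

# Pair each point with its successor and add up the leg lengths
total_dist = sum(haversine_km(a, b) for a, b in zip(points, points[1:]))
print(round(total_dist, 2))
```

Pairing points with zip(points, points[1:]) yields one leg per consecutive pair, so a group with n rows contributes n - 1 distances.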
Note that your function is not vectorized, which makes this harder.
(df.assign(dist = df.join(df.groupby(['main_id', 'day'])[['lat', 'long']].
                          shift(), rsuffix='1').bfill().
                  reset_index().groupby('index').
                  apply(lambda x: find_kms(x[['lat','long']].values, x[['lat1','long1']].values))).
    groupby(['main_id', 'day'])['dist'].sum().reset_index())
main_id day dist
0 1 1 13.499279
1 1 2 57.167034
2 2 4 6.747748
Another option is to use reduce:
from functools import reduce
import pandas as pd

def total_dist(x):
    coords = x[['lat', 'long']].values
    # accumulator is (running total, previous point)
    lam = lambda acc, point: (find_kms(acc[1], point) + acc[0], point)
    dist = reduce(lam, coords, (0, coords[0]))[0]
    return pd.Series({'dist': dist})

df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
main_id day dist
0 1 1 13.499351
1 1 2 57.167033
2 2 4 6.747775
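The reduce call above carries a two-part accumulator: the running total and the previous point. A small sketch of just that pattern, using hypothetical planar points and math.dist in place of find_kms so the arithmetic is easy to check by hand (step and points are illustrative names):

```python
from functools import reduce
from math import dist

# Hypothetical planar points, used only to show the accumulator pattern
points = [(0, 0), (3, 4), (3, 10)]

def step(acc, point):
    # acc holds (running total, previous point)
    running_total, previous = acc
    return running_total + dist(previous, point), point

# Seeding with the first point makes the first leg zero-length
total, last_point = reduce(step, points, (0.0, points[0]))
print(total)  # 11.0
```

Seeding the accumulator with the first point means the first step adds a distance of zero, which is why the reduce can iterate over all rows without slicing.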
Edit:
If you also need the count:
(pd.DataFrame({'count':df.groupby(['main_id', 'day']).main_id.count()}).
join(df.groupby(['main_id', 'day']).apply(total_dist)))
Out[157]:
count dist
main_id day
1 1 3 13.499351
2 2 57.167033
2 4 2 6.747775
Another approach that does not use the find_kms function: convert lat/lon to UTM coordinates:
import pandas as pd
import numpy as np
import utm

# utm.from_latlon returns (easting, northing, zone_number, zone_letter)
u = utm.from_latlon(df.lat.values, df.long.values)
df['x'], df['y'] = u[0], u[1]  # easting (x) and northing (y), in metres
# per-group difference from the previous point; the first row of each group gets 0
a = df.groupby(['main_id', 'day'])[['x', 'y']].diff().fillna(0)
df['dist'] = np.sqrt(a.x**2 + a.y**2) / 1000  # planar distance in km
df1 = df.groupby(['main_id', 'day'])[['dist', 'day']].agg({'day': 'count', 'dist': 'sum'}).rename(columns={'day': 'day_count'})
Output of df1:
day_count dist
main_id day
1 1 3 13.494554
2 2 57.145276
2 4 2 6.745386
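For comparison, here is a vectorized sketch that needs neither find_kms nor UTM. It swaps in the haversine formula on a spherical Earth, so the totals are only approximately equal to the geodesic ones. haversine_km and EARTH_RADIUS_KM are illustrative names, and the time column is kept as plain strings because the sample data is reproduced as given.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical model)

def haversine_km(lat1, lon1, lat2, lon2):
    """Element-wise great-circle distance in km; inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    'main_id': [1, 1, 1, 1, 1, 2, 2],
    'time': ['2019-05-31', '2019-05-31', '2019-05-30', '2019-05-30',
             '2019-05-30', '2019-06-31', '2019-06-31'],
    'day': [1, 1, 1, 2, 2, 4, 4],
    'lat': [53.5501667, 53.6101545, 53.5501309, 53.5501309,
            53.4561309, 53.5501667, 53.6101545],
    'long': [9.9716466, 9.9568781, 9.9716300, 9.9716300,
             9.1246300, 9.9716466, 9.9568781],
})

# Previous point within each (main_id, day) group; NaN on each group's first row
prev = df.groupby(['main_id', 'day'])[['lat', 'long']].shift()

# Distance from the previous point; the first row of each group contributes 0
df['dist'] = haversine_km(prev['lat'], prev['long'], df['lat'], df['long']).fillna(0)

result = (df.groupby(['main_id', 'day'])
            .agg(total_dist=('dist', 'sum'), time=('time', 'first'))
            .reset_index())
print(result)
```

Because shift is applied per group, no leg ever spans two different groups, and the named aggregation returns total_dist and the first time value in one pass.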