Run a function and sum with the next row

My dataset df looks like this:

main_id    time               day      lat           long
1          2019-05-31         1        53.5501667    9.9716466  
1          2019-05-31         1        53.6101545    9.9568781
1          2019-05-30         1        53.5501309    9.9716300
1          2019-05-30         2        53.5501309    9.9716300
1          2019-05-30         2        53.4561309    9.1246300
2          2019-06-31         4        53.5501667    9.9716466
2          2019-06-31         4        53.6101545    9.9568781

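I want to calculate the total distance covered by each main_id on each day. To calculate the distance between two sets of coordinates, I can use this function: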

import geopy.distance

def find_kms(coords_1, coords_2):
    # Geodesic distance in kilometres between two (lat, long) pairs
    return geopy.distance.geodesic(coords_1, coords_2).km

But I am not sure how to sum these distances when grouping by main_id and day. The end result could be a new df like this:

main_id      day      total_dist     time
1            1        ...            2019-05-31 
1            2        ....           2019-05-31 
2            4        ....           2019-05-31 

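where the exported time is any value, or the first value, from the time column for the corresponding main_id and day.

total_dist calculation:

For example, for the first row (main_id == 1 and day 1), total_dist would be calculated like this:

find_kms((53.5501667, 9.9716466), (53.6101545, 9.9568781)) + find_kms((53.6101545, 9.9568781), (53.5501309, 9.9716300))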

Note that your function is not vectorized, which makes this harder.

(df.assign(dist = df.join(df.groupby(['main_id', 'day'])[['lat', 'long']].
   shift(), rsuffix='1').bfill().
   reset_index().groupby('index').
   apply(lambda x: find_kms(x[['lat','long']].values, x[['lat1','long1']].values))).
   groupby(['main_id', 'day'])['dist'].sum().reset_index())

  main_id  day       dist
0        1    1  13.499279
1        1    2  57.167034
2        2    4   6.747748
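
To make the computation explicit: each group's total is the sum of distances between consecutive rows of that group. Below is a minimal pure-Python sketch of that logic, not a replacement for the pandas code. It assumes the rows are already ordered so that each (main_id, day) group is contiguous, and it swaps in a haversine (spherical) distance for geopy's geodesic, so the numbers come out close to, but not identical to, the ones above. The names haversine_km, totals_by_group and EARTH_RADIUS_KM are made up for illustration.

```python
import math
from itertools import groupby

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, long) pairs given in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# (main_id, day, lat, long) rows from the question, already in group order
rows = [
    (1, 1, 53.5501667, 9.9716466),
    (1, 1, 53.6101545, 9.9568781),
    (1, 1, 53.5501309, 9.9716300),
    (1, 2, 53.5501309, 9.9716300),
    (1, 2, 53.4561309, 9.1246300),
    (2, 4, 53.5501667, 9.9716466),
    (2, 4, 53.6101545, 9.9568781),
]

def totals_by_group(rows, dist=haversine_km):
    """Sum distances between consecutive points within each (main_id, day)."""
    totals = {}
    # itertools.groupby only merges adjacent rows, hence the ordering assumption
    for key, grp in groupby(rows, key=lambda r: (r[0], r[1])):
        pts = [(r[2], r[3]) for r in grp]
        # zip(pts, pts[1:]) yields each point paired with its successor
        totals[key] = sum(dist(a, b) for a, b in zip(pts, pts[1:]))
    return totals

totals = totals_by_group(rows)
```

Passing a different dist callable (for example the find_kms function from the question) should give the geodesic totals instead of the spherical approximation.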

Another option is to use reduce:

from functools import reduce

def total_dist(x):
    coords = x[['lat', 'long']].values
    lam = lambda x,y: (find_kms(x[1],y) + x[0],y)
    dist = reduce(lam, coords, (0,coords[0]))[0]
    return pd.Series({'dist':dist})

df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
 
   main_id  day       dist
0        1    1  13.499351
1        1    2  57.167033
2        2    4   6.747775
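
The reduce call is compact, so it may help to see the same fold spelled out. The accumulator is a pair (running_total, previous_point); seeding it with the first point makes the first step add a distance of zero. The sketch below is a stand-alone illustration of that pattern only: it uses plain Euclidean distance (math.dist, Python 3.8+) as a placeholder instead of find_kms, and path_length is a hypothetical name.

```python
from functools import reduce
import math

def path_length(points, dist=math.dist):
    """Total length of the path that visits `points` in order."""
    if not points:
        return 0.0

    def step(acc, point):
        # acc carries the running total and the previously visited point
        total, prev = acc
        return (total + dist(prev, point), point)

    # Seed with (0.0, first point): the first step adds dist(p0, p0) == 0
    total, _ = reduce(step, points, (0.0, points[0]))
    return total

length = path_length([(0, 0), (3, 4), (3, 0)])  # legs of length 5 and 4
```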

Edit:

If you need the count:

(pd.DataFrame({'count':df.groupby(['main_id', 'day']).main_id.count()}).
   join(df.groupby(['main_id', 'day']).apply(total_dist)))
Out[157]: 
             count       dist
main_id day                  
1       1        3  13.499351
        2        2  57.167033
2       4        2   6.747775

Another approach that avoids the find_kms function is to convert lat/lon to UTM:

import pandas as pd
import numpy as np
import utm

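u = utm.from_latlon(df.lat.values, df.long.values)
# from_latlon returns (easting, northing, zone_number, zone_letter), in meters
df['x'], df['y'] = u[0], u[1]  # x = easting, y = northing

# Differences between consecutive points within each group; the first row of each group gets 0
a = df.groupby(['main_id', 'day'])[['x', 'y']].diff().fillna(0)

df['dist'] = np.sqrt(a.x**2 + a.y**2) / 1000  # distance in km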

df1=df.groupby(['main_id','day'])[['dist','day']].agg({'day':'count', 'dist': 'sum'}).rename(columns={'day':'day_count'})

Output of df1:

             day_count       dist
main_id day                      
1       1            3  13.494554
        2            2  57.145276
2       4            2   6.745386