Adding new values through a groupby operation using a for loop

I need to add a column containing the change in a worker's coordinates between stages. We have a DataFrame:

import pandas as pd
from geopy.distance import geodesic as GD

d = {'user_id': [26, 26, 26, 26, 26, 26, 9, 9, 9, 9],
     'worker_latitude': [55.114410, 55.114459, 55.114379, 55.114462, 55.114372,
                         55.114389, 65.774064, 65.731034, 65.731034, 65.774057],
     'worker_longitude': [38.927155, 38.927114, 38.927101, 38.927156, 38.927258,
                          38.927120, 37.532380, 37.611746, 37.611746, 37.532346],
     'change': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

df = pd.DataFrame(data=d)

It looks like this:

   user_id  worker_latitude  worker_longitude  change
0       26        55.114410         38.927155       0
1       26        55.114459         38.927114       0
2       26        55.114379         38.927101       0
3       26        55.114462         38.927156       0
4       26        55.114372         38.927258       0
5       26        55.114389         38.927120       0
6        9        65.774064         37.532380       0
7        9        65.731034         37.611746       0
8        9        65.731034         37.611746       0
9        9        65.774057         37.532346       0

Then I need to calculate the distance between each person's previous and current stage. So I used this loop:

for group in df.groupby(by='user_id'):
    group[1].reset_index(inplace=True,drop=True)
    for i in range(1,len(group[1])):
        first_xy=(group[1]['worker_latitude'][i-1],group[1]['worker_longitude'][i-1])
        second_xy=(group[1]['worker_latitude'][i],group[1]['worker_longitude'][i])
        print((round((GD(first_xy, second_xy).km),6)))
        group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)

Then I get:

6.021576
0.0
6.021896
0.00605
0.008945
0.009884
0.011948
0.009007
display(df)
   user_id  worker_latitude  worker_longitude  change
0       26        55.114410         38.927155       0
1       26        55.114459         38.927114       0
2       26        55.114379         38.927101       0
3       26        55.114462         38.927156       0
4       26        55.114372         38.927258       0
5       26        55.114389         38.927120       0
6        9        65.774064         37.532380       0
7        9        65.731034         37.611746       0
8        9        65.731034         37.611746       0
9        9        65.774057         37.532346       0

This means the values are calculated correctly, but for some reason they are not written to the 'change' column. What can be done?

I think the problem may be in the following line:

        group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)

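You are updating the group variable, but you should be updating df. My suggestion to fix this is:

        df.loc[group[1].index[i], "change"] = round((GD(first_xy, second_xy).km),6)

Here group[1].index[i] is the label of the row to update in df, and "change" is the column name. For this to work, remove the reset_index call from the loop and read the coordinates with .iloc[i-1] and .iloc[i]. After reset_index, i is only a position inside the group, so df.loc[i, "change"] would write to the wrong rows for any group that does not start at row 0.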
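A minimal sketch of a loop that writes back into df through the original row labels, on a shortened copy of the data. The helper name distance_km is hypothetical: it uses geopy's geodesic when geopy is installed and otherwise falls back to a haversine approximation on a 6371 km sphere, which is an assumption of this sketch and gives slightly different values.

```python
import math
import pandas as pd

try:
    # Same distance the question uses, when geopy is available.
    from geopy.distance import geodesic

    def distance_km(a, b):
        return geodesic(a, b).km
except ImportError:
    # Assumption: haversine on a sphere of radius 6371 km as a stand-in.
    def distance_km(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(h))

df = pd.DataFrame({
    'user_id': [26, 26, 26, 9, 9, 9],
    'worker_latitude': [55.114410, 55.114459, 55.114379,
                        65.774064, 65.731034, 65.731034],
    'worker_longitude': [38.927155, 38.927114, 38.927101,
                         37.532380, 37.611746, 37.611746],
    'change': [0.0] * 6,  # float column, so distances fit without a dtype change
})

for _, group in df.groupby('user_id'):
    labels = group.index  # original row labels of df; no reset_index here
    for prev, curr in zip(labels[:-1], labels[1:]):
        first_xy = (group.at[prev, 'worker_latitude'], group.at[prev, 'worker_longitude'])
        second_xy = (group.at[curr, 'worker_latitude'], group.at[curr, 'worker_longitude'])
        # Assign on df itself, addressed by the original label.
        df.loc[curr, 'change'] = round(distance_km(first_xy, second_xy), 6)

print(df)
```

The first row of each user keeps its initial 0.0 because it has no previous stage to compare against.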

It doesn't work because you are accessing a copy of the DataFrame and trying to assign values to it.
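A small sketch of that behaviour, using a made-up four-row frame: each group produced by iterating over groupby is a separate object, so an assignment to it is not reflected in df. Depending on the pandas version, the assignment inside the loop may also emit a SettingWithCopyWarning.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [26, 26, 9, 9],
                   'change': [0.0, 0.0, 0.0, 0.0]})

modified = {}
for user_id, group in df.groupby('user_id'):
    group['change'] = 1.0  # changes only this group object
    modified[user_id] = group['change'].tolist()

print(modified)               # the groups were changed
print(df['change'].tolist())  # df still holds the original zeros
```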

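However, rather than iterating over the DataFrame inside groupby, it seems more intuitive to use groupby + shift to get first_xy first, and then apply a custom function that computes GD between first_xy and second_xy to each row: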

def func(x):
    if x.notna().all():
        first_xy = (x['prev_lat'], x['prev_long'])
        second_xy = (x['worker_latitude'], x['worker_longitude'])
        return round((GD(first_xy, second_xy).km), 6)
    else:
        return float('nan')

g = df.groupby('user_id')
df['prev_lat'] = g['worker_latitude'].shift()
df['prev_long'] = g['worker_longitude'].shift()
df['change'] = df.apply(func, axis=1)
df = df.drop(columns=['prev_lat','prev_long'])

Output:

   user_id  worker_latitude  worker_longitude    change
0       26        55.114410         38.927155       NaN
1       26        55.114459         38.927114  0.006050
2       26        55.114379         38.927101  0.008945
3       26        55.114462         38.927156  0.009884
4       26        55.114372         38.927258  0.011948
5       26        55.114389         38.927120  0.009007
6        9        65.774064         37.532380       NaN
7        9        65.731034         37.611746  6.021576
8        9        65.731034         37.611746  0.000000
9        9        65.774057         37.532346  6.021896
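The NaN in the first row of each user comes from shift being applied per group: the first stage of a user has no previous row within that user's group. A minimal sketch on a shortened latitude column shows this:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [26, 26, 26, 9, 9],
    'worker_latitude': [55.114410, 55.114459, 55.114379, 65.774064, 65.731034],
})

# shift() moves values down by one row within each user_id group,
# so a value from user 26 never leaks into the first row of user 9.
df['prev_lat'] = df.groupby('user_id')['worker_latitude'].shift()
print(df)
```

Rows 0 and 3 get NaN in prev_lat, and every other row holds the previous latitude of the same user.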