Adding new values through a groupby operation using a for loop
I need to add a column containing the change in a worker's coordinates between stages. We have a DataFrame:
import pandas as pd
from geopy.distance import geodesic as GD
d = {'user_id': [26, 26, 26, 26, 26, 26, 9, 9, 9, 9],
     'worker_latitude': [55.114410, 55.114459, 55.114379,
                         55.114462, 55.114372, 55.114389, 65.774064, 65.731034, 65.731034, 65.774057],
     'worker_longitude': [38.927155, 38.927114, 38.927101, 38.927156,
                          38.927258, 38.927120, 37.532380, 37.611746, 37.611746, 37.532346],
     'change': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
It looks like this:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Then I need to calculate the distance between each person's previous and current stage. So I used this loop:
for group in df.groupby(by='user_id'):
    group[1].reset_index(inplace=True, drop=True)
    for i in range(1, len(group[1])):
        first_xy = (group[1]['worker_latitude'][i-1], group[1]['worker_longitude'][i-1])
        second_xy = (group[1]['worker_latitude'][i], group[1]['worker_longitude'][i])
        print(round(GD(first_xy, second_xy).km, 6))
        group[1]['change'][i] = round(GD(first_xy, second_xy).km, 6)
Then I get:
6.021576
0.0
6.021896
0.00605
0.008945
0.009884
0.011948
0.009007
display(df)
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
This means the values are calculated correctly, but for some reason they do not end up in the 'change' column. What can be done?
I think the problem may be in the following line:
group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)
You are updating the group variable, but you should be updating the df variable. My suggestion for fixing this is:
df.loc[i, "change"] = round((GD(first_xy, second_xy).km),6)
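Here i must be the label of the row in df that you want to update, and "change" is the column name. Note that the loop in the question resets each group's index, so i there is a position within the group rather than a label in df; the original index labels have to be kept for this assignment to hit the right rows.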
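The suggestion above can be sketched as a loop that keeps the original index labels and writes straight into df. This is only a sketch under stated assumptions: the sample frame is shortened, and haversine_km is a hypothetical stand-in helper (spherical great-circle distance) used so the snippet runs without geopy; swapping in GD(first_xy, second_xy).km from the question should reproduce the geodesic figures.

```python
import math
import pandas as pd

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees.

    Hypothetical stand-in for geopy's GD(p, q).km; a spherical model, so the
    result differs slightly from the geodesic distance.
    """
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(a))

df = pd.DataFrame({
    'user_id': [26, 26, 26, 9, 9, 9],
    'worker_latitude': [55.114410, 55.114459, 55.114379,
                        65.774064, 65.731034, 65.731034],
    'worker_longitude': [38.927155, 38.927114, 38.927101,
                         37.532380, 37.611746, 37.611746],
})
df['change'] = 0.0  # float column, so float distances fit its dtype

for _, group in df.groupby('user_id'):
    labels = group.index  # original row labels of df; the index is not reset
    # Walk over consecutive (previous, current) label pairs within the group.
    for prev, curr in zip(labels[:-1], labels[1:]):
        first_xy = (df.at[prev, 'worker_latitude'], df.at[prev, 'worker_longitude'])
        second_xy = (df.at[curr, 'worker_latitude'], df.at[curr, 'worker_longitude'])
        # Assign into df itself, not into the group copy.
        df.at[curr, 'change'] = round(haversine_km(first_xy, second_xy), 6)

print(df)
```

The first row of each user keeps the initial 0.0 because it has no previous stage to compare with.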
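It doesn't work because you are accessing a copy of the DataFrame and trying to assign values to that copy.
However, rather than iterating over the DataFrame inside groupby, it seems more intuitive to use groupby + shift to get first_xy first, and then apply a custom function that computes the GD between first_xy and second_xy for each row: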
def func(x):
    if x.notna().all():
        first_xy = (x['prev_lat'], x['prev_long'])
        second_xy = (x['worker_latitude'], x['worker_longitude'])
        return round(GD(first_xy, second_xy).km, 6)
    else:
        return float('nan')

g = df.groupby('user_id')
df['prev_lat'] = g['worker_latitude'].shift()
df['prev_long'] = g['worker_longitude'].shift()
df['change'] = df.apply(func, axis=1)
df = df.drop(columns=['prev_lat', 'prev_long'])
Output:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 NaN
1 26 55.114459 38.927114 0.006050
2 26 55.114379 38.927101 0.008945
3 26 55.114462 38.927156 0.009884
4 26 55.114372 38.927258 0.011948
5 26 55.114389 38.927120 0.009007
6 9 65.774064 37.532380 NaN
7 9 65.731034 37.611746 6.021576
8 9 65.731034 37.611746 0.000000
9 9 65.774057 37.532346 6.021896
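If the row-wise apply becomes slow on a large frame, the same groupby + shift idea can be combined with a vectorized distance formula. The sketch below is not the method from the answer: it swaps the geodesic distance for the haversine (spherical great-circle) formula in NumPy, so its values differ slightly from the output above. The function name add_change_haversine and the constant EARTH_RADIUS_KM are made up for this illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; assumption of the spherical model

def add_change_haversine(df):
    """Return a copy of df with a 'change' column holding the great-circle
    distance (km) from each user's previous row; NaN on a user's first row."""
    g = df.groupby('user_id')
    # Previous coordinates within each user; the first row per user becomes NaN.
    lat1 = np.radians(g['worker_latitude'].shift())
    lon1 = np.radians(g['worker_longitude'].shift())
    lat2 = np.radians(df['worker_latitude'])
    lon2 = np.radians(df['worker_longitude'])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    out = df.copy()
    out['change'] = (2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))).round(6)
    return out

sample = pd.DataFrame({
    'user_id': [26, 26, 26, 9, 9, 9],
    'worker_latitude': [55.114410, 55.114459, 55.114379,
                        65.774064, 65.731034, 65.731034],
    'worker_longitude': [38.927155, 38.927114, 38.927101,
                         37.532380, 37.611746, 37.611746],
})
result = add_change_haversine(sample)
print(result)
```

When the exact geodesic values matter, the apply-based version with GD remains the one to use; the vectorized variant trades a little accuracy for speed.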