Dask group by apply compute distance to reference table
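Action
Group by, apply, retrieve the reference from another dataframe, and compute the distance to the reference for every value in the group.
Problem
Unexplained NaN values are introduced, and the results differ between runs.
Attempts
I tried the computation from the applied function on its own (without the group by) and it works fine, so the problem does not seem to be in the computation.
Question
What causes these NaN values? Why does the computation differ across runs?
Example
The following example passes all assertions but gives unexpected results: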
import dask.dataframe as dd
import pandas as pd
import numpy as np

pdf = pd.DataFrame({'x': [232126.703, 232126.674, 232126.650, 232126.644, 232126.966],
                    'y': [579530.01599999995, 579530.05099999998, 579530.09100000001, 579530.15099999995, 579530.23199999996],
                    'z': [16858.0, 16878.0, 16904.0, 16950.0, 16973.0],
                    'hash': [1, 2, 2, 1, 1],
                    'label': [3, 5, 3, 5, 3]})
df = dd.from_pandas(pdf, npartitions=2)

df_pos = pd.DataFrame({'x_c': [232124.703, 232127.674, 232126.650, 232126.644, 232126.966],
                       'y_c': [579533.01599999995, 579531.05099999998, 579530.09100000001, 579530.15099999995, 579530.23199999996],
                       'hash': [1, 2, 3, 4, 5]})

def add_distance(df, df_pos=df_pos):
    ref = df_pos[df_pos.hash == df.name].copy()
    df = df.copy()
    assert df[['x', 'y']].values.shape[1] == ref[['x_c', 'y_c']].values.shape[1]
    assert ref[['x_c', 'y_c']].values.shape[1] == 2
    d_traj = np.linalg.norm(df[['x', 'y']].values - ref[['x_c', 'y_c']].values, axis=1)
    assert np.isnan(d_traj).any() == False
    d_traj = pd.Series(d_traj)
    assert len(d_traj) == len(df)
    df['d_traj'] = d_traj
    return df

df_traj = df.groupby('hash').apply(add_distance, meta=pd.DataFrame(columns=['hash', 'label', 'x', 'y', 'z', 'd_traj']))
df_traj.compute()
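The problem in this example is the original index of df. Each group keeps the row labels it had in the original frame, while pd.Series(d_traj) gets a fresh index starting at 0. Assigning that Series to df['d_traj'] aligns on index labels rather than on position, so values can land on the wrong rows and rows without a matching label end up as NaN. To prevent this, call reset_index() first.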
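The alignment behaviour can be reproduced with pandas alone. The following is a minimal sketch on made-up toy data (the frame, the column name d and the values are illustrative, not taken from the question); it only shows how assigning a Series aligns on labels.

```python
import numpy as np
import pandas as pd

# A group as it might arrive inside apply: it keeps its original row labels.
group = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[0, 3, 4])

# A Series built from a plain array gets the default labels 0, 1, 2.
dist = pd.Series(np.array([10.0, 20.0, 30.0]))

# Assignment aligns on labels: only label 0 exists on both sides,
# so the rows labelled 3 and 4 receive NaN.
misaligned = group.copy()
misaligned["d"] = dist

# After reset_index(drop=True) both sides carry labels 0, 1, 2.
fixed = group.reset_index(drop=True)
fixed["d"] = dist

print(misaligned)
print(fixed)
```

The lengths match in both cases, which is why an assertion on len() does not catch the problem.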
Example
import dask.dataframe as dd
import pandas as pd
import numpy as np

pdf = pd.DataFrame({'x': [232126.703, 232126.674, 232126.650, 232126.644, 232126.966],
                    'y': [579530.01599999995, 579530.05099999998, 579530.09100000001, 579530.15099999995, 579530.23199999996],
                    'z': [16858.0, 16878.0, 16904.0, 16950.0, 16973.0],
                    'hash': [1, 2, 2, 1, 1],
                    'label': [3, 5, 3, 5, 3]})
df = dd.from_pandas(pdf, npartitions=2)

df_pos = pd.DataFrame({'x_c': [232124.703, 232127.674, 232126.650, 232126.644, 232126.966],
                       'y_c': [579533.01599999995, 579531.05099999998, 579530.09100000001, 579530.15099999995, 579530.23199999996],
                       'hash': [1, 2, 3, 4, 5]})

def add_distance(df, df_pos=df_pos):
    ref = df_pos[df_pos.hash == df.name].copy()
    df = df.copy()
    df.reset_index(inplace=True, drop=True)  # added this line!
    d_traj = np.linalg.norm(df[['x', 'y']].values - ref[['x_c', 'y_c']].values, axis=1)
    d_traj = pd.Series(d_traj)
    df['d_traj'] = d_traj
    return df

df_traj = df.groupby('hash').apply(add_distance, meta=pd.DataFrame(columns=['hash', 'label', 'x', 'y', 'z', 'd_traj']))
df_traj.compute()
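If the original row labels need to survive, one possible alternative is to skip the pd.Series wrapper and assign the NumPy array directly, since an array is assigned by position and no label alignment takes place. The sketch below is pandas-only and uses toy data; the helper name add_distance_keep_index and its ref_xy parameter are hypothetical, and it has not been tried inside a Dask groupby here.

```python
import numpy as np
import pandas as pd

def add_distance_keep_index(group, ref_xy):
    """Return a copy of group with a d_traj column.

    ref_xy is a length-2 sequence holding the reference (x, y) point.
    """
    xy = group[["x", "y"]].to_numpy()
    dist = np.linalg.norm(xy - np.asarray(ref_xy), axis=1)
    out = group.copy()
    # A NumPy array is assigned by position, so the original labels are kept.
    out["d_traj"] = dist
    return out

# Toy group with non-consecutive labels, as a group might have after a groupby.
group = pd.DataFrame({"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 1.0]}, index=[0, 3, 4])
result = add_distance_keep_index(group, ref_xy=[0.0, 0.0])
print(result)
```

Compared with reset_index(), this keeps the index intact, which may matter if the result is later joined back to the source frame.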