How can I speed up my 3D Euclidean distance matrix code

Question

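I have written code that computes the distance between every pair of objects (tagID) from their x, y, z coordinates (TX, TY, TZ) at each time step (Frame). The code works, but it is too slow for what I need. My current test data has about 538,792 rows, and my real data will have about 6,880,000 rows. Building these distance matrices currently takes several minutes (maybe 10-15), and since I will have 40 data sets, I would like to speed this up.

The current code is as follows: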

import numpy as np
import pandas as pd

# Sample data frame with correct columns:

data2 = ({'Frame' :[1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7], 
      'tagID' : ['nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3'],
      'TX':[5,2,3,4,5,6,7,5,np.nan,5,2,3,4,5,6,7,5,4,8,3,2],
      'TY':[4,2,3,4,5,9,3,2,np.nan,5,2,3,4,5,6,7,5,4,8,3,2],
      'TZ':[2,3,4,6,7,8,4,3,np.nan,5,2,3,4,5,6,7,5,4,8,3,2]})

df = pd.DataFrame(data2)

    Frame tagID   TX   TY   TZ
0       1   nb1  5.0  4.0  2.0
1       1   nb2  2.0  2.0  3.0
2       1   nb3  3.0  3.0  4.0
3       2   nb1  4.0  4.0  6.0
4       2   nb2  5.0  5.0  7.0
5       2   nb3  6.0  9.0  8.0
6       3   nb1  7.0  3.0  4.0
7       3   nb2  5.0  2.0  3.0
8       3   nb3  NaN  NaN  NaN
9       4   nb1  5.0  5.0  5.0
10      4   nb2  2.0  2.0  2.0
11      4   nb3  3.0  3.0  3.0
12      5   nb1  4.0  4.0  4.0
13      5   nb2  5.0  5.0  5.0
14      5   nb3  6.0  6.0  6.0
15      6   nb1  7.0  7.0  7.0
16      6   nb2  5.0  5.0  5.0
17      6   nb3  4.0  4.0  4.0
18      7   nb1  8.0  8.0  8.0
19      7   nb2  3.0  3.0  3.0
20      7   nb3  2.0  2.0  2.0


# Calculate the squared distance between all x points:

TXdf = [] 
for i in range(1,df['Frame'].max()+1):
    boox = df['Frame'] == i 
    tempx = df[boox] 
    tx=tempx['TX'].apply(lambda x : (tempx['TX']-x)**2) 
    tx.columns=tempx.tagID   
    tx['ID']=tempx.tagID 
    tx['Frame'] = tempx.Frame 
    TXdf.append(tx) 
TXdfFinal = pd.concat(TXdf) # concatenate the per-frame results
print(TXdfFinal)
TXdfFinal.info()

# Calculate the squared distance between all y points:

print('y-diff sum')
TYdf = [] 
for i in range(1,df['Frame'].max()+1):
    booy = df['Frame'] == i 
    tempy = df[booy] 
    ty=tempy['TY'].apply(lambda x : (tempy['TY']-x)**2) 
    ty.columns=tempy.tagID   
    ty['ID']=tempy.tagID 
    ty['Frame'] = tempy.Frame 
    TYdf.append(ty) 
TYdfFinal = pd.concat(TYdf) 
print(TYdfFinal)
TYdfFinal.info()

# Calculate the squared distance between all z points:

print('z-diff sum')
TZdf = [] 
for i in range(1,df['Frame'].max()+1):
    booz = df['Frame'] == i 
    tempz = df[booz] 
    tz=tempz['TZ'].apply(lambda x : (tempz['TZ']-x)**2) 
    tz.columns=tempz.tagID  
    tz['ID']=tempz.tagID 
    tz['Frame'] = tempz.Frame 
    TZdf.append(tz) 
TZdfFinal = pd.concat(TZdf)


# Add all squared differences together:

euSum = TXdfFinal + TYdfFinal + TZdfFinal

# Square root the sum of the differences of each coordinate for Euclidean distance and add Frame and ID columns back on:

euDist = euSum.loc[:, euSum.columns !='ID'].apply(lambda x: x**0.5)
euDist['tagID'] = list(TXdfFinal['ID'])
euDist['Frame'] = list(TXdfFinal['Frame'])


# Add the distance matrix to the original dataframe based on Frame and ID columns:

new_df = pd.merge(df, euDist,  how='left', left_on=['Frame','tagID'], right_on = ['Frame','tagID'])

    Frame tagID   TX   TY   TZ      nb1     nb2      nb3
0       1   nb1  5.0  4.0  2.0   0.0000  3.7417   3.0000
1       1   nb2  2.0  2.0  3.0   3.7417  0.0000   1.7321
2       1   nb3  3.0  3.0  4.0   3.0000  1.7321   0.0000
3       2   nb1  4.0  4.0  6.0   0.0000  1.7321   5.7446
4       2   nb2  5.0  5.0  7.0   1.7321  0.0000   4.2426
5       2   nb3  6.0  9.0  8.0   5.7446  4.2426   0.0000
6       3   nb1  7.0  3.0  4.0   0.0000  2.4495      NaN
7       3   nb2  5.0  2.0  3.0   2.4495  0.0000      NaN
8       3   nb3  NaN  NaN  NaN      NaN     NaN      NaN
9       4   nb1  5.0  5.0  5.0   0.0000  5.1962   3.4641
10      4   nb2  2.0  2.0  2.0   5.1962  0.0000   1.7321
11      4   nb3  3.0  3.0  3.0   3.4641  1.7321   0.0000
12      5   nb1  4.0  4.0  4.0   0.0000  1.7321   3.4641
13      5   nb2  5.0  5.0  5.0   1.7321  0.0000   1.7321
14      5   nb3  6.0  6.0  6.0   3.4641  1.7321   0.0000
15      6   nb1  7.0  7.0  7.0   0.0000  3.4641   5.1962
16      6   nb2  5.0  5.0  5.0   3.4641  0.0000   1.7321
17      6   nb3  4.0  4.0  4.0   5.1962  1.7321   0.0000
18      7   nb1  8.0  8.0  8.0   0.0000  8.6603  10.3923
19      7   nb2  3.0  3.0  3.0   8.6603  0.0000   1.7321
20      7   nb3  2.0  2.0  2.0  10.3923  1.7321   0.0000

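I tried using both euclidean() and pdist() with metric='euclidean', but I could not get the iteration right.

Any suggestions on how to get the same result faster would be much appreciated.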

Answer 1

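You could try reducing the number of for loops from three to one. It looks like you are iterating over the same frames three times. Try doing all of the calculations in a single loop.

That should cut the running time by about two thirds.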
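
As a rough illustration of that suggestion, the three loops can be folded into one pass that handles TX, TY and TZ together. This is only a sketch, not the asker's code: the function name distances_single_loop is made up here, it swaps the per-element apply for NumPy broadcasting, and it assumes each tagID appears at most once per frame and that the index of df is unique.

```python
import numpy as np
import pandas as pd

def distances_single_loop(df):
    """Hypothetical helper: one loop over frames, all three axes at once."""
    pieces = []
    for _, group in df.groupby('Frame', sort=False):
        xyz = group[['TX', 'TY', 'TZ']].to_numpy(dtype=float)
        # Pairwise coordinate differences via broadcasting: shape (n, n, 3)
        diff = xyz[:, None, :] - xyz[None, :, :]
        # Euclidean distance; NaN coordinates propagate to NaN distances
        dist = np.sqrt((diff ** 2).sum(axis=2))
        pieces.append(pd.DataFrame(dist,
                                   index=group.index,
                                   columns=group['tagID'].to_numpy()))
    # Assumption: df has a unique index, so join lines the rows back up
    return df.join(pd.concat(pieces))
```

Calling new_df = distances_single_loop(df) on the sample frame should give the same layout as the merged result in the question, with one distance column per tagID.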

Answer 2

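A method from scipy:

import numpy as np
from scipy.spatial import distance
# One cdist call per frame; assumes rows are sorted by Frame and each frame lists nb1, nb2, nb3 in that order
df['nb1'], df['nb2'], df['nb3'] = np.concatenate([distance.cdist(y, y, metric='euclidean') for x, y in df[['TX','TY','TZ']].groupby(df['Frame'])]).T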
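
If every frame holds the same set of tags exactly once (true for the sample data, but only an assumption for the real data), the per-frame Python loop might be dropped altogether by reshaping the coordinates into a (frames, tags, 3) array. The sketch below uses plain NumPy broadcasting rather than cdist; the name distances_vectorized is hypothetical, and it also assumes df has a unique index.

```python
import numpy as np
import pandas as pd

def distances_vectorized(df):
    """Hypothetical helper: all frames at once, no Python-level loop."""
    ordered = df.sort_values(['Frame', 'tagID'])
    tags = ordered['tagID'].unique()
    n_tags = len(tags)

    # Assumption check: each frame contains every tag exactly once
    per_frame = ordered.groupby('Frame')['tagID'].agg(['size', 'nunique'])
    if not ((per_frame['size'] == n_tags) & (per_frame['nunique'] == n_tags)).all():
        raise ValueError('every frame must contain each tagID exactly once')

    # Coordinates as (n_frames, n_tags, 3)
    xyz = ordered[['TX', 'TY', 'TZ']].to_numpy(dtype=float).reshape(-1, n_tags, 3)
    # Pairwise differences inside each frame: (n_frames, n_tags, n_tags, 3)
    diff = xyz[:, :, None, :] - xyz[:, None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1)).reshape(-1, n_tags)

    out = pd.DataFrame(dist, index=ordered.index, columns=tags)
    # Join on the original index so the input row order is kept
    return df.join(out)
```

The intermediate diff array is about three times the size of the final distance table, so for millions of rows it may be necessary to process the frames in chunks.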

python

performance

euclidean-distance

pandas

distance-matrix