How to speed up nearest search in Pandas (perhaps by vectorizing code)
I have two dataframes. Each contains a location (X, Y) and a value at that point. For each point in the first dataframe, I want to find the nearest point in the second dataframe and then take the difference between the values. I have working code, but it uses a for loop and is slow.
Any suggestions on how to speed this up? I know that getting rid of for loops in pandas is usually a good idea for performance, but I don't see how to do it in this case.
Here is some sample code:
import pandas as pd
import numpy as np
df1=pd.DataFrame(np.random.rand(10,3), columns=['val', 'X', 'Y'])
df2=pd.DataFrame(np.random.rand(10,3), columns=['val', 'X', 'Y'])
nearest = df1.copy()  # CORRECTION: this had been just = df1, which caused a problem when comparing to submitted answers.
for idx, row in nearest.iterrows():
    # Find the point in df2 closest to the selected point:
    closest = df2.loc[((df2['X'] - row['X'])**2 + (df2['Y'] - row['Y'])**2).idxmin()]
    # Set the value to the difference between the current row and the nearest one.
    nearest.loc[idx, 'val'] = df1.loc[idx, 'val'] - closest['val']
Since I am using this on much larger dataframes, the computation takes a very long time.
Thanks,
A cool solution is to exploit the complex data type (built into both Python and NumPy).
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.rand(10,3), columns=['val', 'X', 'Y'])
df2=pd.DataFrame(np.random.rand(10,3), columns=['val', 'X', 'Y'])
# dataframes to numpy arrays of complex numbers
p1 = (df1['X'] + 1j * df1['Y']).values
p2 = (df2['X'] + 1j * df2['Y']).values
# calculate all the distances, between each point in
# df1 and each point in df2 (using an array-broadcasting trick)
all_dists = abs(p1[..., np.newaxis] - p2)
# find indices of the minimal distance from df1 to df2,
# and from df2 to df1
nearest_idxs1 = np.argmin(all_dists, axis = 0)
nearest_idxs2 = np.argmin(all_dists, axis = 1)
# extract the rows from the dataframes
nearest_points1 = df1.iloc[nearest_idxs1].reset_index()
nearest_points2 = df2.iloc[nearest_idxs2].reset_index()
This is likely much faster than looping, but it consumes a lot of memory if your series get large (quadratic in the number of points).
This solution also works when the two point sets have different lengths.
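As an aside, when that quadratic memory cost becomes a problem, a spatial index avoids materializing the full distance matrix. Here is a sketch using scipy's cKDTree (scipy is an assumption here; it is not used elsewhere in this answer):

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df1 = pd.DataFrame(np.random.rand(10, 3), columns=['val', 'X', 'Y'])
df2 = pd.DataFrame(np.random.rand(10, 3), columns=['val', 'X', 'Y'])

# Index df2's points, then look up the nearest neighbour of each df1 point.
tree = cKDTree(df2[['X', 'Y']].values)
dists, idxs = tree.query(df1[['X', 'Y']].values, k=1)

# Difference between each df1 value and its nearest df2 value.
diff = df1['val'].values - df2['val'].values[idxs]
```

The tree costs O(m log m) to build and each query is O(log m), so memory stays linear in the number of points.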
Here is a concrete example to show how the complex-number approach works:
df1 = pd.DataFrame([ [987, 0, 0], [888, 2,2], [2345, 3,3] ], columns=['val', 'X', 'Y'])
df2 = pd.DataFrame([ [ 1000, 1, 1 ], [2000, 9, 9] ] , columns=['val', 'X', 'Y'])
df1
val X Y
0 987 0 0
1 888 2 2
2 2345 3 3
df2
val X Y
0 1000 1 1
1 2000 9 9
Here, for every point in df1, df2[0] = (1,1) is the nearest point (as shown by nearest_idxs2 below). Considering the reverse problem: for (1,1), either (0,0) or (2,2) is nearest (argmin returns the first, (0,0)); for (9,9), df1[2] = (3,3) is nearest (as shown by nearest_idxs1 below).
p1 = (df1['X'] + 1j * df1['Y']).values
p2 = (df2['X'] + 1j * df2['Y']).values
all_dists = abs(p1[..., np.newaxis] - p2)
nearest_idxs1 = np.argmin(all_dists, axis = 0)
nearest_idxs2 = np.argmin(all_dists, axis = 1)
nearest_idxs1
array([0, 2])
nearest_idxs2
array([0, 0, 0])
# It's nearest_points2 you're after:
nearest_points2 = df2.iloc[nearest_idxs2].reset_index()
nearest_points2
index val X Y
0 0 1000 1 1
1 0 1000 1 1
2 0 1000 1 1
df1['val'] - nearest_points2['val']
0 -13
1 -112
2 1345
To solve the reverse problem (for each point in df2, find the nearest point in df1), use nearest_points1
and df2['val'] - nearest_points1['val'].
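Wiring this back into the question's setup, the whole loop collapses to a few lines (a sketch reusing the example data above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[987, 0, 0], [888, 2, 2], [2345, 3, 3]], columns=['val', 'X', 'Y'])
df2 = pd.DataFrame([[1000, 1, 1], [2000, 9, 9]], columns=['val', 'X', 'Y'])

# Nearest df2 point for each df1 point, via the complex-number trick.
p1 = (df1['X'] + 1j * df1['Y']).values
p2 = (df2['X'] + 1j * df2['Y']).values
nearest_idxs2 = np.argmin(abs(p1[..., np.newaxis] - p2), axis=1)

# Drop-in replacement for the question's 'nearest' dataframe.
nearest = df1.copy()
nearest['val'] = df1['val'].values - df2['val'].values[nearest_idxs2]
```

This produces the same `nearest` dataframe as the original loop, without any Python-level iteration.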