How to optimize my code to calculate Euclidean distance
I am computing the Euclidean distance between two points. I have about 13,000 rows in a DataFrame. I have to find the Euclidean distance of each row against all 13,000 rows and then derive a similarity score. Running the code is very time-consuming (more than 24 hours).
Below is my code:
# Empty the existing dataframe
df_similar = pd.DataFrame()
print(df_similar)
# 'i' iterates over all ids in the dataframe
# Length of df_distance is 13000
for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])
    # To avoid duplicates we assign "index" the value of "i" each time,
    # so the comparison starts from index "i" itself.
    if i < len(df_distance):
        index = i
    # This loop compares one id against all 13000 ids. This is time-consuming,
    # as we have to iterate each id against all 13000 ids.
    for j in range(len(df_distance)):
        # "a" is the id we are comparing from
        a = df_distance.iloc[i, 2:]
        # "b" is the id we are comparing against
        b = df_distance.iloc[index, 2:]
        value = euclidean_dist(a, b)
        # Temporary dictionary used to load the data into the dataframe
        row = {
            'id': df_distance['id'][i],
            'id_match': df_distance['id'][index],
            'similarity_distance': value
        }
        df_50 = df_50.append(row, ignore_index=True)
        # When "index" reaches the end of the array, reset it to 0
        # so the comparison of "b" with "a" wraps around.
        if index == len(df_distance) - 1:
            index = 0
        else:
            index += 1
    # Append the content of "df_50" into "df_similar" once per iteration of "i"
    df_similar = df_similar.append(df_50, ignore_index=True)
I think the nested for loops are what is taking most of the time.
The Euclidean distance function I use in my code:
from sklearn.metrics.pairwise import euclidean_distances

def euclidean_dist(a, b):
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value
Sample df_distance data (shown as an image in the original post).
Note: in the image, the values from the position column to the end are scaled; only these values are used to calculate the distance.
The output is in the following format (also shown as an image).
Try using numpy and do something like this:
import pandas as pd
import numpy as np

def numpy_euclidian_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance

# Initialise data as lists.
data = {'num1': [1, 2, 3, 4], 'num2': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate the distance of the whole column at once using numpy
distance = numpy_euclidian_distance(df.iloc[:, 0], df.iloc[:, 1])
print(distance)
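Note that numpy_euclidian_distance above collapses two columns into a single scalar; the question needs every row compared against every other row. The same idea extends to an all-pairs distance matrix via broadcasting. A sketch, with X standing in for the scaled feature matrix:

```python
import numpy as np

# Toy stand-in for the 13000-row feature matrix.
X = np.array([[1.0, 20.0],
              [2.0, 21.0],
              [3.0, 19.0],
              [4.0, 18.0]])

# Broadcast (n,1,d) - (1,n,d) -> (n,n,d), then reduce over the feature axis.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(D.shape)  # (4, 4)
```

The (n, n, d) intermediate can get large for 13,000 rows; scipy.spatial.distance.cdist(X, X) computes the same matrix without materializing it.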
OK, so based on the comments I think you want the top 50 distances; using KDTree does that in one faster step. As a warning, KDTree only beats brute force when columns**2 < rows, so given your data's shape there may be faster implementations, but this is probably still the simplest:
from scipy.spatial import KDTree
X = df_distance.values
X_tree = KDTree(X)
k_d, k_i = X_tree.query(X, k = 50) # shape of each is (13k, 50)
k_i[i] will be the list of indices of the 50 points closest to index i (for 0 <= i < 13000), and k_d[i] will be the corresponding distances.
Edit: this should get you the dataframe you want, using a multi-index:
df_d = {
    idx: {
        df_distance['id'][k_i[i, j]]: d for j, d in enumerate(k_d[i])
    } for i, idx in enumerate(df_distance['id'])
}
out = pd.DataFrame(df_d).T
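To see the query step in isolation, here is a self-contained sketch on random data (the sizes are illustrative, not the real 13,000-row matrix):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((100, 5))        # toy stand-in for df_distance's feature values

tree = KDTree(X)
k_d, k_i = tree.query(X, k=3)   # 3 nearest neighbours of every row

# Querying the tree with its own points means each row's nearest
# neighbour is itself, at distance zero; real neighbours start at column 1.
print(k_d.shape, k_i.shape)     # (100, 3) (100, 3)
```

One caveat: in the question, the first two columns of df_distance are ids, so the tree should be built on df_distance.iloc[:, 2:].values rather than df_distance.values, or the id columns will distort the distances.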