How to optimize my code to calculate Euclidean distance

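I am trying to find the Euclidean distance between two points. I have about 13,000 rows in a DataFrame. For each row, I need to compute the Euclidean distance to all 13,000 rows and then derive a similarity score. The code takes a very long time to run (more than 24 hours).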

Below is my code:

# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)

# 'i' indexes every id in the DataFrame
# len(df_distance) is 13000

for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])

    # To avoid duplicates, set "index" to "i" each time so that the
    # comparison starts from row "i" itself.
    if i < len(df_distance):
        index = i

    # This loop is used to iterate one id with all 13000 id's. This is time consuming as we have to iterate each id against all 13000 id's 
    for j in (range(len(df_distance))):

        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]        

        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]

        value = euclidean_dist(a,b)

        # Create a temp dictionary to load the data into dataframe
        dict = {
            'id': df_distance['id'][i], 
            'id_match': df_distance['id'][index], 
            'similarity_distance':value
        }


        df_50 = df_50.append(dict,ignore_index=True)

        # If "index" has reached the last row of the DataFrame,
        # reset it to 0 so that the comparison of "b" with "a" continues from the start.
        if index == len(df_distance)-1:
            index = 0
        else:
            index +=1

    # Append the content of "df_50" into "df_similar" once for the iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)

I think the for loops are what take most of the time.

This is the Euclidean distance function used in my code:

from sklearn.metrics.pairwise import euclidean_distances

def euclidean_dist(a, b):
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value

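Sample df_distance data. Note: in the image, the values are scaled from a given column position through the last column, and only these values are used to calculate the distance.

The output should be in the following format.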

Try using numpy, something like this:

import pandas as pd
import numpy as np 

def numpy_euclidian_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance 
    
    
# initialise data of lists.
data = {'num1':[1, 2, 3, 4], 'num2':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)

# Calculate the distance between the two whole columns at once using numpy
distance = numpy_euclidian_distance(df.iloc[:,0],df.iloc[:,1])
print(distance)
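
The same arithmetic can be applied to every pair of rows at once instead of one pair per call, which removes the Python loops entirely. What follows is only a minimal sketch under assumptions: `pairwise_euclidean` and the toy `df_distance` are made-up names and data for illustration, and the feature columns are assumed to start at position 2, as in the question's `iloc[i, 2:]`.

```python
import numpy as np
import pandas as pd

def pairwise_euclidean(X):
    """Return the full (n, n) matrix of Euclidean distances between rows of X."""
    X = np.asarray(X, dtype=float)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b, evaluated for all pairs at once
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can leave tiny negative values; clip them before the square root
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.sqrt(sq_dists)

# Hypothetical stand-in for df_distance: two leading non-feature columns, then features
df_distance = pd.DataFrame({
    'id': [101, 102, 103],
    'name': ['a', 'b', 'c'],
    'f1': [0.0, 3.0, 0.0],
    'f2': [0.0, 4.0, 1.0],
})
features = df_distance.iloc[:, 2:].to_numpy(dtype=float)
dist_matrix = pairwise_euclidean(features)
print(dist_matrix)
```

Keep in mind that for 13,000 rows the full float64 matrix holds 13,000 x 13,000 values, roughly 1.35 GB, so it may have to be computed in blocks of rows if memory is tight.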

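OK, so based on the comments I think you want the 50 nearest distances, which is a lot faster with a KDTree in one step. As a warning, KDTree is only going to be faster than brute force when columns**2 < rows, so with 13,000 rows and more than about 114 columns there may be a faster way to implement it, but this is still probably the easiest: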

from scipy.spatial import KDTree
# Use only the feature columns (position 2 onward, as in the question), not the ids
X = df_distance.iloc[:, 2:].values
X_tree = KDTree(X)
k_d, k_i = X_tree.query(X, k = 50)  # shape of each is (13k, 50)

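k_i[i] will be the list of indices of the 50 points closest to the point at index i (0 <= i < 13000), and k_d[i] will be the corresponding distances.

Edit: this should get you the DataFrame you want, using a multi-index: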
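
One detail worth checking on a small case: when the tree is queried with the same points it was built from, each point's nearest neighbour is itself, at distance 0. If the point itself should not count as a match, one option is to ask for one extra neighbour and drop the first column. The sketch below uses made-up one-dimensional data, not the asker's DataFrame, and assumes there are no duplicate rows.

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical toy data: four points on a line, so neighbours are easy to verify
X = np.array([[0.0], [1.0], [3.0], [7.0]])
tree = KDTree(X)

# Ask for k + 1 neighbours because each point's nearest neighbour is itself
k = 2
k_d, k_i = tree.query(X, k=k + 1)

# Drop the first column (the point itself, at distance 0)
neighbour_dist = k_d[:, 1:]
neighbour_idx = k_i[:, 1:]
print(neighbour_idx)
print(neighbour_dist)
```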

df_d = {
        idx: {
              df_distance['id'][k_i[i, j]]: d for j, d in enumerate(k_d[i])
              } for i, idx in enumerate(df_distance['id'])
        }
out = pd.DataFrame(df_d).T
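
The loop in the question builds rows of `id`, `id_match` and `similarity_distance`. If that long format is what is needed, it can probably be assembled straight from the `k_d` and `k_i` arrays without any Python-level loop. This is a sketch under assumptions: the toy `df_distance` and its `label` column are invented for illustration, features are assumed to start at column position 2, and `k` is kept small so the result is easy to inspect (the question uses 50).

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical toy stand-in for df_distance (features start at column 2)
df_distance = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'label': ['a', 'b', 'c', 'd'],
    'x': [0.0, 1.0, 3.0, 7.0],
})
k = 2  # 50 in the question; small here so the output is easy to read

X = df_distance.iloc[:, 2:].to_numpy(dtype=float)
k_d, k_i = KDTree(X).query(X, k=k)

ids = df_distance['id'].to_numpy()
df_similar = pd.DataFrame({
    'id': np.repeat(ids, k),             # each id repeated once per neighbour
    'id_match': ids[k_i.ravel()],        # neighbour positions mapped back to ids
    'similarity_distance': k_d.ravel(),  # distances in the same row-major order
})
print(df_similar)
```

Each id appears first as its own match with distance 0, because the query points are the same points the tree was built from.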