Find the dealer nearest to the given customer location when the dealer data set has 25k+ records and customer has 200k+ records using Python

Question

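I have two tables: dealers and customers. For each customer location in the customer table, I need to find the closest dealer from the dealer table.

I have code that works, but it takes several hours to run. I need help optimizing my solution.

The dealer table has 25k+ rows and the customer table has 200k+ rows. Both tables have 3 main columns: (DealerID, Lat, Long) and (CustomerID, Lat, Long) respectively. My output looks something like this (Distance is in miles):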

CustomerID	Lat	Long	ClosestDealer	Distance
Customer1	61.61	-149.58	Dealer3	15.53
Customer2	42.37	-72.52	Dealer258	8.02
Customer3	42.42	-72.1	Dealer1076	32.92
Customer4	31.59	-89.87	Dealer32	3.85
Customer5	36.75	-94.84	Dealer726	7.90

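My current solution: Looping through all the rows to find the minimum distance takes a long time. To optimize this, I grouped the data in both tables by the rounded latitude and longitude values and then added the two together to get my final group (see the 'LatLongGroup' column below).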

CustomerID	Lat	Long	LatGroup	LongGroup	LatLongGroup
Customer1	61.61	-149.58	61	-149	-88
Customer2	42.37	-72.52	42	-72	-30
Customer3	42.42	-72.1	42	-72	-30
Customer4	31.59	-89.87	31	-89	-58
Customer5	36.75	-94.84	36	-94	-58
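
The grouping in the table above can be reproduced with a short sketch. It assumes the "rounding" is truncation toward zero, which is what the sample values suggest (61.61 becomes 61, -149.58 becomes -149); the frame below simply re-enters the five sample rows. The last step counts how many distinct (LatGroup, LongGroup) cells share one LatLongGroup value.

```python
import pandas as pd

# The five sample customers from the table above.
customers = pd.DataFrame({
    "CustomerID": ["Customer1", "Customer2", "Customer3", "Customer4", "Customer5"],
    "Lat": [61.61, 42.37, 42.42, 31.59, 36.75],
    "Long": [-149.58, -72.52, -72.1, -89.87, -94.84],
})

# Assumption: astype(int) truncates toward zero, matching the table values.
customers["LatGroup"] = customers["Lat"].astype(int)
customers["LongGroup"] = customers["Long"].astype(int)
customers["LatLongGroup"] = customers["LatGroup"] + customers["LongGroup"]

# Number of distinct (LatGroup, LongGroup) cells behind each summed group value.
cells_per_group = (
    customers.drop_duplicates(["LatGroup", "LongGroup"])
    .groupby("LatLongGroup")
    .size()
)
print(customers)
print(cells_per_group)
```

In this sample, Customer4 (31, -89) and Customer5 (36, -94) fall into different cells but both get the group value -58, so the sum alone does not identify a single geographic cell.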

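Both tables are sorted by the 'LatLongGroup' column. I have a separate table called group, which gives the start and end row numbers of each group in the dealer table. For each customer, I then match the records in the dealer table that have the same 'LatLongGroup' as the customer. This narrows the search for the closest dealer.

But sometimes the closest dealer does not belong to the same group, so to avoid that pitfall I search not only the matching group but also the groups just above and below it.

Please let me know the best way to optimize this, or whether there is an easier way to find the closest dealer for a data set this large. Any direction is much appreciated. Thanks!

Currently used code: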

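import geopy.distance
import pandas as pd

# df_c (customers) and df_s (dealers) are sorted by 'LatLongGroup';
# group holds the start/end row numbers of each group in df_s.
col_names = ["CustomerKey","DealerKey","Dist"]
df = pd.DataFrame(columns = col_names)
c = 0
for i in range(0,len(df_c)):
    print(i)
    row = {'CustomerKey':df_c.loc[i,'ZIPBRANDKEY'],'DealerKey':'','Dist':0}
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    # Locate the customer's group, then widen the dealer row range to
    # include the groups directly above and below it
    a = group[group['LatLongGroup'] == df_c.LatLongGroup[i]].index[0]
    if(a-1 >= 0):
        start = group.loc[a-1,'Start']
    else:
        start = group.loc[a,'Start']
    if(a+1 < len(group)):
        end = group.loc[a+1,'End']
    else:
        end = group.loc[a,'End']
    t1 = 0
    # Scan every candidate dealer and keep the smallest distance (in miles)
    for j in range(start,end):
        dist = round(geopy.distance.distance(df_c.Lat_long[i], df_s.Lat_long[j]).miles,2)
        if(t1 == 0):
            min_dist = dist
            dealerkey = df_s.loc[j,'DEALER_BRAND_KEY']
            t1 = 1
        elif(dist < min_dist):
            min_dist = dist
            dealerkey = df_s.loc[j,'DEALER_BRAND_KEY']
    df.loc[c,'DealerKey'] = dealerkey
    df.loc[c,'Dist'] = min_dist
    c = c+1
df.head()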

For reference, the group dataframe mentioned above looks like this:

Group	Start	End
-138	0	7
-137	7	15
-136	15	53
-135	53	55
-88	55	78

Answer 1

I faced the same problem a few weeks ago, and I found the best approach was to use the K-Nearest Neighbors algorithm.

# Dropping the duplicates in order to get the unique dealer list

dealer_df = df_s[["DealerID", "LAT", "LNG"]].drop_duplicates()
dealer_df = dealer_df.set_index("DealerID")

from sklearn.neighbors import KNeighborsClassifier

# Instantiating with n_neighbors as 1 and weights as "distance"

knn = KNeighborsClassifier(n_neighbors=1, weights="distance", n_jobs=-1)
knn.fit(dealer_df.values, dealer_df.index)

df_c["Nearest Dealer"] = knn.predict(df_c[["LAT", "LNG"]].values)

I have used the same approach on nearly 1.8 million data points, and it took about 5 minutes.
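
The classifier above returns only the label of the nearest dealer, and it compares raw latitude/longitude values with the default Euclidean metric. If the distance in miles is also needed, one possible variation (not the code from this answer) is scikit-learn's BallTree with the haversine metric. The sketch below uses hypothetical toy frames named dealers and customers with Lat/Long columns, and takes 3958.8 miles as an approximate mean Earth radius.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)

# Hypothetical toy data standing in for the real dealer and customer tables.
dealers = pd.DataFrame({"DealerID": ["Dealer1", "Dealer2", "Dealer3"],
                        "Lat": [61.5, 42.4, 31.6],
                        "Long": [-149.5, -72.5, -89.9]})
customers = pd.DataFrame({"CustomerID": ["Customer1", "Customer2", "Customer4"],
                          "Lat": [61.61, 42.37, 31.59],
                          "Long": [-149.58, -72.52, -89.87]})

# The haversine metric expects [latitude, longitude] pairs in radians.
tree = BallTree(np.radians(dealers[["Lat", "Long"]].to_numpy()), metric="haversine")
dist_rad, idx = tree.query(np.radians(customers[["Lat", "Long"]].to_numpy()), k=1)

# idx and dist_rad have shape (n_customers, 1); take the single neighbour column.
customers["ClosestDealer"] = dealers["DealerID"].to_numpy()[idx[:, 0]]
customers["Distance"] = dist_rad[:, 0] * EARTH_RADIUS_MILES  # radians to miles
print(customers)
```

The haversine formula treats the Earth as a sphere, so the distances can differ slightly from the ellipsoidal distances that geopy computes in the question.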

Answer 2

Generate sample data:

import numpy as np
import pandas as pd

N = 25000
dealers = pd.DataFrame({"DealerID": "Dealer" + pd.RangeIndex(1, N+1).astype(str),
                        "Lat": np.random.uniform(30, 65, N),
                        "Long": np.random.uniform(-150, -70, N)}
                      ).set_index("DealerID")

N = 200000
customers = pd.DataFrame({"CustomerID": "Customer" + pd.RangeIndex(1, N+1).astype(str),
                          "Lat": np.random.uniform(30, 65, N),
                          "Long": np.random.uniform(-150, -70, N)}
                        ).set_index("CustomerID")

>>> dealers
                   Lat        Long
DealerID
Dealer1      53.923040  -96.238974
Dealer2      33.375229 -136.379545
Dealer3      30.635395 -107.639308
Dealer4      50.264205  -97.563283
Dealer5      52.366663 -130.242301
...                ...         ...
Dealer24996  62.369789 -140.430366
Dealer24997  43.079035 -126.496873
Dealer24998  43.858461  -97.471257
Dealer24999  34.433920 -135.038754
Dealer25000  61.967902  -95.496924

[25000 rows x 2 columns]

>>> customers
                      Lat        Long
CustomerID
Customer1       30.748900 -133.231319
Customer2       38.636134  -98.618844
Customer3       60.282135  -97.100096
Customer4       42.995473 -120.135218
Customer5       50.809563  -80.662491
...                   ...         ...
Customer199996  47.387618  -88.420528
Customer199997  53.618939 -124.432385
Customer199998  58.506937 -146.024708
Customer199999  48.329325 -129.149631
Customer200000  36.599969 -145.019091

[200000 rows x 2 columns]

You can use KDTree from Scipy:

from scipy.spatial import KDTree

distances, indices = KDTree(dealers).query(customers)

A few seconds later:

>>> customers.assign(ClosestDealer=dealers.iloc[indices].index, Distance=distances)
                      Lat        Long ClosestDealer  Distance
CustomerID
Customer1       30.748900 -133.231319   Dealer22102  0.189255
Customer2       38.636134  -98.618844    Dealer1510  0.282966
Customer3       60.282135  -97.100096    Dealer2715  0.182832
Customer4       42.995473 -120.135218   Dealer10539  0.423006
Customer5       50.809563  -80.662491   Dealer12022  0.091765
...                   ...         ...           ...       ...
Customer199996  47.387618  -88.420528   Dealer17124  0.325962
Customer199997  53.618939 -124.432385    Dealer9177  0.133110
Customer199998  58.506937 -146.024708   Dealer15558  0.299639
Customer199999  48.329325 -129.149631   Dealer18371  0.023172
Customer200000  36.599969 -145.019091    Dealer2316  0.199344

[200000 rows x 4 columns]
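
Note that the Distance column above is a Euclidean distance in degrees of latitude/longitude, not miles, and a degree of longitude covers less ground at high latitudes. One way to keep the KDTree approach while getting great-circle distances is to index 3D points on the unit sphere and convert the chord length back to an arc. The following is a minimal sketch under a spherical-Earth assumption; the helper names to_unit_vectors and nearest_dealers and the radius constant are illustrative, not part of the answer.

```python
import numpy as np
from scipy.spatial import KDTree

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)


def to_unit_vectors(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3D points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))


def nearest_dealers(dealer_lat, dealer_lon, cust_lat, cust_lon):
    """Return (dealer row positions, great-circle distances in miles)."""
    tree = KDTree(to_unit_vectors(dealer_lat, dealer_lon))
    chord, idx = tree.query(to_unit_vectors(cust_lat, cust_lon))
    # A chord of length c on the unit sphere subtends an angle of 2*arcsin(c/2).
    # Chord length grows with the angle, so the nearest chord is the nearest arc.
    angle = 2.0 * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0))
    return idx, angle * EARTH_RADIUS_MILES


# Small random sample in the same coordinate ranges as the data above.
rng = np.random.default_rng(0)
d_lat, d_lon = rng.uniform(30, 65, 500), rng.uniform(-150, -70, 500)
c_lat, c_lon = rng.uniform(30, 65, 2000), rng.uniform(-150, -70, 2000)
idx, miles = nearest_dealers(d_lat, d_lon, c_lat, c_lon)
print(idx[:5], miles[:5])
```

With the frames from this answer, the call would take dealers["Lat"], dealers["Long"], customers["Lat"] and customers["Long"], and the result could be attached with customers.assign(ClosestDealer=dealers.index[idx], Distance=miles).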

Tags: python, optimization, geopy, python-3.x, pandas