Find the dealer nearest to each customer location using Python when the dealer data set has 25k+ records and the customer data set has 200k+ records
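I have two tables: dealers and customers. For each customer location in the customer table, I need to find the closest dealer in the dealer table.
I have working code, but it takes hours to run. I need help optimizing my solution.
The dealer table has 25k+ rows and the customer table has 200k+ rows. Each table has 3 main columns: (DealerID, Lat, Long) and (CustomerID, Lat, Long) respectively. My output looks like this: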
CustomerID | Lat | Long | ClosestDealer | Distance |
---|---|---|---|---|
Customer1 | 61.61 | -149.58 | Dealer3 | 15.53 |
Customer2 | 42.37 | -72.52 | Dealer258 | 8.02 |
Customer3 | 42.42 | -72.1 | Dealer1076 | 32.92 |
Customer4 | 31.59 | -89.87 | Dealer32 | 3.85 |
Customer5 | 36.75 | -94.84 | Dealer726 | 7.90 |
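My current solution: Looping through every row to find the minimum distance takes a very long time. To optimize this, I grouped the data in both tables by their rounded latitude and longitude values, then added the two together to get a final group (see the 'LatLongGroup' column below).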
CustomerID | Lat | Long | LatGroup | LongGroup | LatLongGroup |
---|---|---|---|---|---|
Customer1 | 61.61 | -149.58 | 61 | -149 | -88 |
Customer2 | 42.37 | -72.52 | 42 | -72 | -30 |
Customer3 | 42.42 | -72.1 | 42 | -72 | -30 |
Customer4 | 31.59 | -89.87 | 31 | -89 | -58 |
Customer5 | 36.75 | -94.84 | 36 | -94 | -58 |
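The grouping step described above can be sketched in pandas as follows. This is only an illustration built from the five sample rows: the values in the table are consistent with truncating toward zero rather than rounding to the nearest integer, so the sketch assumes `np.trunc`. The frame name `customers` is a placeholder, not a name from the original code.

```python
import numpy as np
import pandas as pd

# The five sample customers from the table above
customers = pd.DataFrame({
    "CustomerID": ["Customer1", "Customer2", "Customer3", "Customer4", "Customer5"],
    "Lat": [61.61, 42.37, 42.42, 31.59, 36.75],
    "Long": [-149.58, -72.52, -72.1, -89.87, -94.84],
})

# Assumption: whole-degree groups are obtained by truncating toward zero,
# which reproduces the LatGroup/LongGroup values shown in the table.
customers["LatGroup"] = np.trunc(customers["Lat"]).astype(int)
customers["LongGroup"] = np.trunc(customers["Long"]).astype(int)

# The final group is the sum of the two whole-degree groups
customers["LatLongGroup"] = customers["LatGroup"] + customers["LongGroup"]

print(customers)
```

Because the key is a sum, different cells can share a group: Customer4 (31, -89) and Customer5 (36, -94) both land in group -58 even though they are about five degrees apart in each direction. Searching the group above and below therefore scans some dealers that are far away and can still miss nearby ones.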
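Both tables are sorted by the 'LatLongGroup' column. I have a separate table called group that gives the start and end row numbers of each group in the dealer table.
I then match the records in the dealer table that have the same 'LatLongGroup' as the customer. This narrows down the search for the closest dealer.
But sometimes the closest dealer may not belong to the same group, so to avoid any pitfalls I search not only the matching group but also the groups immediately above and below it.
Please let me know the best way to optimize this, or whether there is a simpler way to find the closest dealer for data sets this large. Any direction is much appreciated. Thanks!
Currently used code: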
import geopy.distance
import pandas as pd

# df_c: customers, df_s: dealers, both sorted by 'LatLongGroup';
# group: one row per group with the dealer 'Start'/'End' row numbers.
rows = []
for i in range(len(df_c)):
    print(i)
    # Position of this customer's group in the group table
    a = group[group['LatLongGroup'] == df_c.LatLongGroup[i]].index[0]
    # Widen the search to the neighbouring groups where they exist
    start = group.loc[a - 1, 'Start'] if a - 1 >= 0 else group.loc[a, 'Start']
    end = group.loc[a + 1, 'End'] if a + 1 < len(group) else group.loc[a, 'End']
    min_dist, dealerkey = None, ''
    for j in range(start, end):
        dist = round(geopy.distance.distance(df_c.Lat_long[i], df_s.Lat_long[j]).miles, 2)
        if min_dist is None or dist < min_dist:
            min_dist = dist
            dealerkey = df_s.loc[j, 'DEALER_BRAND_KEY']
    rows.append({'CustomerKey': df_c.loc[i, 'ZIPBRANDKEY'],
                 'DealerKey': dealerkey,
                 'Dist': min_dist})
# DataFrame.append was removed in pandas 2.0, so build the frame once at the end
df = pd.DataFrame(rows, columns=["CustomerKey", "DealerKey", "Dist"])
df.head()
For reference, the group data frame mentioned above looks like this:
Group | Start | End |
---|---|---|
-138 | 0 | 7 |
-137 | 7 | 15 |
-136 | 15 | 53 |
-135 | 53 | 55 |
-88 | 55 | 78 |
I ran into the same problem a few weeks ago, and I think the best approach is to use the K-Nearest Neighbors algorithm.
# Dropping the duplicates in order to get the unique dealer list
dealer_df = df_s[["DealerID", "LAT", "LNG"]].drop_duplicates()
dealer_df = dealer_df.set_index("DealerID")
from sklearn.neighbors import KNeighborsClassifier
# Instantiating with n_neighbors as 1 and weights as "distance"
knn = KNeighborsClassifier(n_neighbors=1, weights="distance", n_jobs=-1)
knn.fit(dealer_df.values, dealer_df.index)
df_c["Nearest Dealer"] = knn.predict(df_c[["LAT", "LNG"]].values)
I have used the same approach on nearly 1.8 million data points, and it took about 5 minutes.
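The classifier above returns only the nearest dealer's label. If the distance is needed as well, as in the output the question asks for, one option is scikit-learn's `NearestNeighbors`, whose `kneighbors` method returns both distances and row positions. The sketch below uses a few made-up dealers and customers (the IDs and coordinates are illustrative only), and with the default metric the distance is Euclidean distance in degrees, not miles.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Made-up sample data, for illustration only
dealers = pd.DataFrame({
    "DealerID": ["D1", "D2", "D3"],
    "LAT": [40.0, 42.0, 35.0],
    "LNG": [-75.0, -72.0, -90.0],
}).set_index("DealerID")
customers = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "LAT": [42.37, 36.75],
    "LNG": [-72.52, -94.84],
}).set_index("CustomerID")

# Fit on the dealer coordinates, then query one neighbour per customer
nn = NearestNeighbors(n_neighbors=1).fit(dealers[["LAT", "LNG"]].to_numpy())
dist, idx = nn.kneighbors(customers[["LAT", "LNG"]].to_numpy())

# kneighbors returns arrays of shape (n_customers, 1); take the only column
result = customers.assign(
    NearestDealer=dealers.index[idx[:, 0]].to_numpy(),
    Distance=dist[:, 0],  # Euclidean distance in degrees
)
print(result)
```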
Generate sample data:
import numpy as np
import pandas as pd

N = 25000
dealers = pd.DataFrame({"DealerID": "Dealer" + pd.RangeIndex(1, N+1).astype(str),
                        "Lat": np.random.uniform(30, 65, N),
                        "Long": np.random.uniform(-150, -70, N)}
                       ).set_index("DealerID")
N = 200000
customers = pd.DataFrame({"CustomerID": "Customer" + pd.RangeIndex(1, N+1).astype(str),
                          "Lat": np.random.uniform(30, 65, N),
                          "Long": np.random.uniform(-150, -70, N)}
                         ).set_index("CustomerID")
>>> dealers
Lat Long
DealerID
Dealer1 53.923040 -96.238974
Dealer2 33.375229 -136.379545
Dealer3 30.635395 -107.639308
Dealer4 50.264205 -97.563283
Dealer5 52.366663 -130.242301
... ... ...
Dealer24996 62.369789 -140.430366
Dealer24997 43.079035 -126.496873
Dealer24998 43.858461 -97.471257
Dealer24999 34.433920 -135.038754
Dealer25000 61.967902 -95.496924
[25000 rows x 2 columns]
>>> customers
Lat Long
CustomerID
Customer1 30.748900 -133.231319
Customer2 38.636134 -98.618844
Customer3 60.282135 -97.100096
Customer4 42.995473 -120.135218
Customer5 50.809563 -80.662491
... ... ...
Customer199996 47.387618 -88.420528
Customer199997 53.618939 -124.432385
Customer199998 58.506937 -146.024708
Customer199999 48.329325 -129.149631
Customer200000 36.599969 -145.019091
[200000 rows x 2 columns]
You can use KDTree from SciPy:
from scipy.spatial import KDTree
distances, indices = KDTree(dealers).query(customers)
A few seconds later:
>>> customers.assign(ClosestDealer=dealers.iloc[indices].index, Distance=distances)
Lat Long ClosestDealer Distance
CustomerID
Customer1 30.748900 -133.231319 Dealer22102 0.189255
Customer2 38.636134 -98.618844 Dealer1510 0.282966
Customer3 60.282135 -97.100096 Dealer2715 0.182832
Customer4 42.995473 -120.135218 Dealer10539 0.423006
Customer5 50.809563 -80.662491 Dealer12022 0.091765
... ... ... ... ...
Customer199996 47.387618 -88.420528 Dealer17124 0.325962
Customer199997 53.618939 -124.432385 Dealer9177 0.133110
Customer199998 58.506937 -146.024708 Dealer15558 0.299639
Customer199999 48.329325 -129.149631 Dealer18371 0.023172
Customer200000 36.599969 -145.019091 Dealer2316 0.199344
[200000 rows x 4 columns]
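Note that the Distance column above is a Euclidean distance in degrees, because the tree is built on raw Lat/Long values. A degree of longitude shrinks toward the poles, so at high latitudes this can pick a different dealer than a great-circle distance would. One possible workaround, sketched below under the assumption of a spherical Earth with a mean radius of 3958.8 miles, is to build the KDTree on 3D unit vectors and convert the chord length back to an arc. The helper names `to_unit_xyz` and `nearest_great_circle` are hypothetical, and the result is a spherical approximation rather than the ellipsoidal geodesic that geopy computes.

```python
import numpy as np
from scipy.spatial import KDTree

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical model)

def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

def nearest_great_circle(dealer_lat, dealer_lon, cust_lat, cust_lon):
    """Return (dealer row positions, great-circle distances in miles)."""
    tree = KDTree(to_unit_xyz(dealer_lat, dealer_lon))
    # Chord length grows monotonically with arc length, so the nearest
    # neighbour by chord is also the nearest by great-circle distance.
    chord, idx = tree.query(to_unit_xyz(cust_lat, cust_lon))
    arc = 2.0 * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0))
    return idx, arc * EARTH_RADIUS_MILES

# Made-up example at latitude 60: the dealer 3 degrees east is closer on
# the sphere than the dealer 2 degrees north, although it is farther away
# when measured in raw degrees.
idx, miles = nearest_great_circle([60.0, 62.0], [3.0, 0.0], [60.0], [0.0])
print(idx, miles)
```

With the sample frames above, the call would look like `nearest_great_circle(dealers["Lat"], dealers["Long"], customers["Lat"], customers["Long"])`, and the returned positions can be mapped to IDs with `dealers.index[idx]`.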