Is there a faster way than for loops and if statements to find the nearest point to another point in Python?
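Is there a faster way (in Python, on the CPU) to do the same thing as the function below? I have used for loops and if statements and am wondering whether there is a quicker approach. The function currently takes about 1 minute per 100 postcodes, and I have about 70,000 to get through.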
The 2 dataframes used are:
postcode_df
which contains 71,092 rows and the columns:
- Postcode, e.g. "BL4 7PD"
- Latitude, e.g. 53.577653
- Longitude, e.g. -2.434136
e.g.
postcode_df = pd.DataFrame({"Postcode":["SK12 2LH", "SK7 6LQ"],
"Latitude":[53.362549, 53.373812],
"Longitude":[-2.061329, -2.120956]})
air
which contains 421 rows and the columns:
- TubeRef, e.g. "ABC01"
- Latitude, e.g. 53.55108
- Longitude, e.g. -2.396236
e.g.
air = pd.DataFrame({"TubeRef":["Stkprt35", "Stkprt07", "Stkprt33"],
"Latitude":[53.365085, 53.379502, 53.407510],
"Longitude":[-2.0763, -2.120777, -2.145632]})
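The function loops over every postcode in postcode_df and, for each postcode, loops over every TubeRef, calculates the distance between them (using geopy), and saves the TubeRef with the shortest distance.
The output df, postcode_nearest_tube_refs, contains the nearest tube for each postcode, with the columns:
- Postcode, e.g. "BL4 7PD"
- Nearest Air Tube, e.g. "ABC01"
- Distance to Air Tube KM, e.g. 1.035848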
# define function to get nearest air quality monitoring tube per postcode
def get_nearest_tubes(constituency_list):
    postcodes = []
    nearest_tubes = []
    distances_to_tubes = []
    for postcode in postcode_df["Postcode"]:
        closest_tube = ""
        shortest_dist = 500
        postcode_lat = postcode_df.loc[postcode_df["Postcode"]==postcode, "Latitude"]
        postcode_long = postcode_df.loc[postcode_df["Postcode"]==postcode, "Longitude"]
        postcode_coord = (float(postcode_lat), float(postcode_long))
        for tuberef in air["TubeRef"]:
            tube_lat = air.loc[air["TubeRef"]==tuberef, "Latitude"]
            tube_long = air.loc[air["TubeRef"]==tuberef, "Longitude"]
            tube_coord = (float(tube_lat), float(tube_long))
            # calculate distance between postcode and tube
            dist_to_tube = geopy.distance.distance(postcode_coord, tube_coord).km
            if dist_to_tube < shortest_dist:
                shortest_dist = dist_to_tube
                closest_tube = str(tuberef)
        # save postcode's tuberef with shortest distance
        postcodes.append(str(postcode))
        nearest_tubes.append(str(closest_tube))
        distances_to_tubes.append(shortest_dist)
    # create dataframe of the postcodes, nearest tuberefs and distance
    postcode_nearest_tube_refs = pd.DataFrame({"Postcode":postcodes,
                                               "Nearest Air Tube":nearest_tubes,
                                               "Distance to Air Tube KM": distances_to_tubes})
    return postcode_nearest_tube_refs
The libraries I'm using are:
import numpy as np
import pandas as pd
# !pip install geopy
import geopy.distance
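You can use numpy to compute the distance matrix between every point in set A and every point in set B, then for each point in one set take the point in the other set at the minimum distance.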
import numpy as np
import pandas as pd
dfA = pd.DataFrame({'lat':np.random.uniform(0, 30, 3), 'lon':np.random.uniform(0, 30, 3), 'id':[1,2,3]})
dfB = pd.DataFrame({'lat':np.random.uniform(0, 30, 3), 'lon':np.random.uniform(0, 30, 3), 'id':['a', 'b', 'c']})
lat1 = dfA.lat.values.reshape(-1, 1)
lat2 = dfB.lat.values.reshape(1, -1)
lon1 = dfA.lon.values.reshape(-1, 1)
lon2 = dfB.lon.values.reshape(1, -1)
dists = np.sqrt((lat1 - lat2)**2 + (lon1-lon2)**2)
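# dists has shape (len(dfA), len(dfB)); argmin over axis=0 picks,
# for each point in dfB, the row of the closest point in dfA
for id1, id2 in zip(dfB.id, dfA.id.iloc[np.argmin(dists, axis=0)]):
    print(f'the closest point in dfA to {id1} is {id2}')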
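The Euclidean formula above treats degrees of latitude and longitude as planar coordinates, which distorts distances away from the equator. Below is a minimal sketch of the same broadcasting idea using the haversine great-circle formula and the column names from the question. The function name nearest_by_haversine and the mean Earth radius of 6371 km are assumptions for illustration, and the distances will differ slightly from geopy's geodesic ones.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def nearest_by_haversine(postcode_df, air):
    """Hypothetical helper: nearest tube per postcode via a full distance matrix."""
    # Column vectors for postcodes and row vectors for tubes, so that
    # broadcasting yields an (n_postcodes, n_tubes) matrix.
    lat1 = np.radians(postcode_df["Latitude"].to_numpy())[:, None]
    lon1 = np.radians(postcode_df["Longitude"].to_numpy())[:, None]
    lat2 = np.radians(air["Latitude"].to_numpy())[None, :]
    lon2 = np.radians(air["Longitude"].to_numpy())[None, :]

    # Haversine formula, evaluated for every postcode/tube pair at once
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dists_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    # Index of the closest tube for each postcode (one per row)
    nearest = dists_km.argmin(axis=1)
    rows = np.arange(len(postcode_df))
    return pd.DataFrame({
        "Postcode": postcode_df["Postcode"].to_numpy(),
        "Nearest Air Tube": air["TubeRef"].to_numpy()[nearest],
        "Distance to Air Tube KM": dists_km[rows, nearest],
    })


postcode_df = pd.DataFrame({"Postcode": ["SK12 2LH", "SK7 6LQ"],
                            "Latitude": [53.362549, 53.373812],
                            "Longitude": [-2.061329, -2.120956]})
air = pd.DataFrame({"TubeRef": ["Stkprt35", "Stkprt07", "Stkprt33"],
                    "Latitude": [53.365085, 53.379502, 53.407510],
                    "Longitude": [-2.0763, -2.120777, -2.145632]})

result = nearest_by_haversine(postcode_df, air)
print(result)
```

With 71,092 postcodes and 421 tubes the full matrix holds roughly 30 million floats, which should fit in memory on most machines; for much larger inputs the postcodes could be processed in chunks.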
Here is a working example that takes a few seconds (<10).
Import the libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
import uuid
I generate some random data; this also takes a second, but at least we have realistic volumes.
np_rand_post = 5 * np.random.random((72000,2))
np_rand_post = np_rand_post + np.array((53.577653, -2.434136))
and use UUIDs as fake postcodes
postcode_df = pd.DataFrame( np_rand_post , columns=['lat', 'long'])
postcode_df['postcode'] = [uuid.uuid4().hex[:6] for _ in range(72000)]
postcode_df.head()
We do the same for the air tubes
np_rand = 5 * np.random.random((500,2))
np_rand = np_rand + np.array((53.55108, -2.396236))
and again use UUIDs as fake refs
tube_df = pd.DataFrame( np_rand , columns=['lat', 'long'])
tube_df['ref'] = [uuid.uuid4().hex[:5] for _ in range(500)]
tube_df.head()
Extract the GPS values as numpy arrays
postcode_gps = postcode_df[["lat", "long"]].values
air_gps = tube_df[["lat", "long"]].values
Create the ball tree
postal_radians = np.radians(postcode_gps)
air_radians = np.radians(air_gps)
tree = BallTree(air_radians, leaf_size=15, metric='haversine')
First query for the nearest neighbour
distance, index = tree.query(postal_radians, k=1)
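Note that the returned distance is in radians, not km or meters, so it has to be converted first (here to meters, using the Earth's radius)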
earth_radius = 6371000
distance_in_meters = distance * earth_radius
distance_in_meters
Then use, for example, tube_df.ref[ index[:,0] ] to get the refs.
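To connect this back to the question, here is a minimal sketch that applies the same BallTree steps to the column names from the question and builds the requested output dataframe. The Earth radius of 6371 km is an assumed mean value, and haversine distances will differ slightly from geopy's geodesic distances.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

postcode_df = pd.DataFrame({"Postcode": ["SK12 2LH", "SK7 6LQ"],
                            "Latitude": [53.362549, 53.373812],
                            "Longitude": [-2.061329, -2.120956]})
air = pd.DataFrame({"TubeRef": ["Stkprt35", "Stkprt07", "Stkprt33"],
                    "Latitude": [53.365085, 53.379502, 53.407510],
                    "Longitude": [-2.0763, -2.120777, -2.145632]})

# The haversine metric expects [latitude, longitude] pairs in radians.
air_radians = np.radians(air[["Latitude", "Longitude"]].to_numpy())
postcode_radians = np.radians(postcode_df[["Latitude", "Longitude"]].to_numpy())

# Build the tree on the smaller set (tubes) and query it with every postcode.
tree = BallTree(air_radians, metric="haversine")
distance, index = tree.query(postcode_radians, k=1)  # both have shape (n_postcodes, 1)

postcode_nearest_tube_refs = pd.DataFrame({
    "Postcode": postcode_df["Postcode"].to_numpy(),
    # Positional lookup of the tube refs via the returned indices
    "Nearest Air Tube": air["TubeRef"].to_numpy()[index[:, 0]],
    # Distances come back in radians; scale by the Earth's radius to get km
    "Distance to Air Tube KM": distance[:, 0] * EARTH_RADIUS_KM,
})
print(postcode_nearest_tube_refs)
```

Indexing the underlying numpy array by position avoids relying on the dataframe's index labels, which matters if air has been filtered or re-indexed.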