How to get the distance between two geographic coordinates of two different dataframes?
I'm working on a university project and I have two pandas dataframes:
# Libraries
import pandas as pd
from geopy import distance
# Dataframes
df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})
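I need to calculate the distance between the latitude/longitude coordinates of the two dataframes, so I used geopy. If the distance between any pair of coordinates is below a threshold of 100 meters, I have to assign the value 1 in the 'nearby' column. I wrote the following code: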
threshold = 100  # meters
df1['nearby'] = 0
for i in range(0, len(df1)):
    for j in range(0, len(df2)):
        coord_geo_1 = (df1['lat'].iloc[i], df1['long'].iloc[i])
        coord_geo_2 = (df2['lat'].iloc[j], df2['long'].iloc[j])
        var_distance = (distance.distance(coord_geo_1, coord_geo_2).km) * 1000
        if var_distance < threshold:
            df1['nearby'].iloc[i] = 1
The code works, although it raises a warning. However, I would like to find a way to replace the nested for() loops. Is that possible?
# Output:
id    lat   long  nearby
 1 -23.48 -46.36       0
 2 -22.94 -45.40       0
 3 -23.22 -45.80       1
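A side note on the warning: it most likely comes from the chained assignment `df1['nearby'].iloc[i] = 1`, which first selects a column and then writes into that selection. A minimal sketch of a single-step assignment follows, assuming that is indeed the cause. The `hits` list is a hypothetical stand-in for the row positions that the distance check found.

```python
import pandas as pd

# Small frame with the same 'nearby' column as in the question.
df1 = pd.DataFrame({'id': [1, 2, 3], 'nearby': 0})

# Hypothetical row positions that turned out to be within the threshold.
hits = [2]

for i in hits:
    # One indexing step on the frame itself: row label taken from position i,
    # column selected by name, so no intermediate Series is written to.
    df1.loc[df1.index[i], 'nearby'] = 1

print(df1['nearby'].tolist())  # [0, 0, 1]
```

Writing through `df1.loc[...]` keeps the assignment on the original frame rather than on a temporary object.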
You can cross-merge the two dataframes to get the distance between every id in df1 and every id in df2:
dfm = pd.merge(df1, df2, how = 'cross', suffixes = ['','_2'])
dfm['dist'] = dfm.apply(lambda r: distance.distance((r['lat'],r['long']),(r['lat_2'],r['long_2'])).km * 1000 , axis=1)
dfm
which looks like this:
      id     lat    long    id_2    lat_2    long_2      dist
--  ----  ------  ------  ------  -------  --------  --------
 0     1  -23.48  -46.36     100   -28.48    -46.36    553941
 1     1  -23.48  -46.36     200   -22.94     -46.4   59943.4
 2     1  -23.48  -46.36     300   -23.22     -45.8   64095.5
 3     2  -22.94   -45.4     100   -28.48    -46.36    621251
 4     2  -22.94   -45.4     200   -22.94     -46.4    102568
 5     2  -22.94   -45.4     300   -23.22     -45.8   51393.4
 6     3  -23.22   -45.8     100   -28.48    -46.36    585430
 7     3  -23.22   -45.8     200   -22.94     -46.4   68854.7
 8     3  -23.22   -45.8     300   -23.22     -45.8         0
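You can test whether the 'dist' column is below the threshold directly, but if the requirement is to aggregate per id from df1, you can do, for example: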
res = df1.merge(dfm.groupby('id').apply(lambda g:any(g['dist'] < threshold)*1).rename('nearby'), on = 'id')
res
which now looks like this:
      id     lat    long    nearby
--  ----  ------  ------  --------
 0     1  -23.48  -46.36         0
 1     2  -22.94   -45.4         0
 2     3  -23.22   -45.8         1
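The per-id aggregation can also be written without `groupby(...).apply`, by building the boolean mask first and then grouping it. This is only a sketch of that variant. The `dfm` frame below is a hand-built stand-in that copies the 'id' and 'dist' values from the table above, so it runs without geopy.

```python
import pandas as pd

# Stand-in for the cross-merged frame: one row per (df1 id, df2 id) pair,
# with distances in meters copied from the table above.
dfm = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'id_2': [100, 200, 300] * 3,
    'dist': [553941, 59943.4, 64095.5, 621251, 102568, 51393.4,
             585430, 68854.7, 0],
})
threshold = 100  # meters

# Flag each pair, then ask per df1 id whether any pair is under the threshold.
nearby = (
    dfm['dist'].lt(threshold)
    .groupby(dfm['id'])
    .any()
    .astype(int)
    .rename('nearby')
    .reset_index()
)
print(nearby)
```

The result has one row per df1 id and can be merged back with `df1.merge(nearby, on='id')`, as in the snippet above.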
If you can use the scikit-learn library, its haversine_distances function computes the distances between two sets of coordinates. So you get:
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# threshold in meters; change as needed
threshold = 100  # meters
# mean Earth radius, used to convert radians to meters
earth_radius = 6371000  # meters

df1['nearby'] = (
    # get the distance between all points of each DataFrame
    haversine_distances(
        # inputs must be [lat, long] in radians, hence * np.pi / 180
        X=df1[['lat', 'long']].to_numpy() * np.pi / 180,
        Y=df2[['lat', 'long']].to_numpy() * np.pi / 180)
    # convert the distance to meters
    * earth_radius
    # compare to your threshold
    < threshold
    # check whether any point from df2 is near each point of df1
).any(axis=1).astype(int)

print(df1)
#    id    lat   long  nearby
# 0   1 -23.48 -46.36       0
# 1   2 -22.94 -45.40       0
# 2   3 -23.22 -45.80       1
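For reference, the same great-circle computation can be sketched in plain NumPy with broadcasting, which shows what the pairwise distance matrix contains. The helper name `haversine_matrix` is made up for this illustration, and the spherical Earth radius of 6371000 m is the same assumption as above.

```python
import numpy as np

def haversine_matrix(lat1, lon1, lat2, lon2, radius=6371000.0):
    """Pairwise great-circle distances in meters; all inputs in degrees."""
    # Column vectors for the first set, row vectors for the second,
    # so broadcasting yields a (len(first), len(second)) matrix.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against rounding pushing 'a' slightly outside [0, 1].
    return 2 * radius * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Coordinates from df1 and df2 in the question.
dist = haversine_matrix([-23.48, -22.94, -23.22], [-46.36, -45.40, -45.80],
                        [-28.48, -22.94, -23.22], [-46.36, -46.40, -45.80])
nearby = (dist < 100).any(axis=1).astype(int)
print(nearby)  # [0 0 1]
```

Because the sphere is only an approximation of the Earth, these values can differ slightly from the ellipsoidal distances geopy returns by default, which matters little at a 100 m threshold unless points sit right at the boundary.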
Edit: the OP asked for a version that uses geopy's distance, so here is one way.
df1['nearby'] = (
    np.array(
        # distance in km for every (df1 point, df2 point) pair
        [[distance.distance(coord1, coord2).km
          for coord2 in df2[['lat', 'long']].to_numpy()]
         for coord1 in df1[['lat', 'long']].to_numpy()]
    ) * 1000 < threshold
).any(axis=1).astype(int)