坐标之间的距离 Python vs R 计算时间

Question

我正在尝试计算 WGS84 椭球上一个点与许多其他点之间的距离 - 不是半正弦近似 ，如中所述其他答案。我想在 Python 中执行此操作，但相对于 R 而言计算时间非常长。我下面的 Python 脚本需要将近 23 秒，而 R 中的等效脚本需要 0.13 秒。有什么关于加快我的 python 代码的建议吗？

Python 脚本：

import numpy as np
import pandas as pd
import xarray as xr
from geopy.distance import geodesic
from timeit import default_timer as timer
df = pd.DataFrame()
city_coord_orig = (4.351749, 50.845701) 
city_coord_orig_r = tuple(reversed(city_coord_orig))
N = 100000
np.random.normal()
df['or'] = [city_coord_orig_r] * N
df['new'] = df.apply(lambda x: (x['or'][0] + np.random.normal(), x['or'][1] + np.random.normal()), axis=1)
start = timer()
df['d2city2'] = df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1)
end = timer()
print(end - start)

R 脚本

# clean up
rm(list = ls())
# read libraries
library(geosphere)

city.coord.orig <- c(4.351749, 50.845701)
N<-100000
many <- data.frame(x=rep(city.coord.orig[1], N) + rnorm(N), 
                   y=rep(city.coord.orig[2], N) + rnorm(N))
city.coord.orig <- c(4.351749, 50.845701)
start_time <- Sys.time()
many$d2city <- distGeo(city.coord.orig, many[,c("x","y")]) 
end_time <- Sys.time()
end_time - start_time

Answer 1

您正在使用 .apply()，它使用一个简单的循环来运行每一行的函数。距离计算完全在 Python 中完成（geopy 使用 geographiclib 似乎只写在 Python 中）。非矢量化距离计算很慢，你需要的是使用编译代码的矢量化解决方案，就像 calculating the Haversine distance.

pyproj offers verctorised WSG84 distance calculations (the methods of the pyproj.Geod class accept numpy arrays) and wraps the PROJ4 library，意思是运行在本地机器代码中进行这些计算：

from pyproj import Geod

# split out coordinates into separate columns
df[['or_lat', 'or_lon']] = pd.DataFrame(df['or'].tolist(), index=df.index)
df[['new_lat', 'new_lon']] = pd.DataFrame(df['new'].tolist(), index=df.index)

wsg84 = Geod(ellps='WGS84')
# numpy matrix of the lon / lat columns, iterable in column order
or_and_new = df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T
df['d2city2'] = wsg84.inv(*or_and_new)[-1] / 1000  # as km

这个打卡时间要好得多：

>>> from timeit import Timer
>>> count, total = Timer(
...     "wsg84.inv(*df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T)[-1] / 1000",
...     'from __main__ import wsg84, df'
... ).autorange()
>>> total / count * 10 ** 3  # milliseconds
66.09873340003105

66毫秒计算100k距离，不错！

为了进行比较 objective，这是您在同一台机器上的 geopy / df.apply() 版本：

>>> count, total = Timer("df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1)", 'from __main__ import geodesic, df').autorange()
>>> total / count * 10 ** 3  # milliseconds
25844.119450000107

25.8 秒，甚至不在同一个球场。

坐标之间的距离 Python vs R 计算时间

Distance betweeen coordinates Python vs R computation time

python

performance

wgs84

distance

pandas