Efficiently compute pairwise haversine distances between two datasets - NumPy / Python
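I want to compute the geographic distance between latitude/longitude pairs.
I have checked this thread,
but when I use it on two different sets of coordinates, I get an error.
df1 can have millions of rows, so any other way to compute accurate geo distances in less time would be really helpful.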
import numpy as np
import pandas as pd

length1 = 1000
d1 = np.random.uniform(-90, 90, length1)
d2 = np.random.uniform(-180, 180, length1)
length2 = 100
d3 = np.random.uniform(-90, 90, length2)
d4 = np.random.uniform(-180, 180, length2)
coords = tuple(zip(d1, d2))
df1 = pd.DataFrame({'coordinates':coords})
coords = tuple(zip(d3, d4))
df2 = pd.DataFrame({'coordinates':coords})
def get_diff(df1, df2):
    data1 = np.array(df1['coordinates'].tolist())
    data2 = np.array(df2['coordinates'].tolist())
    lat1 = data1[:,0]
    lng1 = data1[:,1]
    lat2 = data2[:,0]
    lng2 = data2[:,1]
    #print(lat1.shape)
    #print(lng1.shape)
    #print(lat2.shape)
    #print(lng2.shape)
    diff_lat = lat1[:,None] - lat2
    diff_lng = lng1[:,None] - lng2
    #print(diff_lat.shape)
    #print(diff_lng.shape)
    d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat1) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

get_diff(df1, df2)
ValueError Traceback (most recent call last)
<ipython-input-58-df06c7cff72c> in <module>
----> 1 get_diff(df1, df2)
<ipython-input-57-9bd8f10189e6> in get_diff(df1, df2)
26 print(diff_lat.shape)
27 print(diff_lng.shape)
---> 28 d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat1) * np.sin(diff_lng/2)**2
29 return 2 * 6371 * np.arcsin(np.sqrt(d))
ValueError: operands could not be broadcast together with shapes (1000,1000) (1000,100)
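Use simple print statements to show the shapes of the terms in the equation. The operands in the expression do not have compatible shapes: np.cos(lat1[:,None])*np.cos(lat1) broadcasts to (1000,1000), while the sin terms are (1000,100). Broadcasting (roughly the vectorized equivalent of zip) needs each pair of dimensions to be equal or to be 1, so the second cosine should use lat2, not lat1.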
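To make the shape mismatch concrete, here is a small sketch that reproduces it. The zero-filled arrays are placeholders I made up so that only the shapes matter; they are not the question's data.

```python
import numpy as np

lat1 = np.zeros(1000)   # stands in for the 1000 latitudes from df1
lat2 = np.zeros(100)    # stands in for the 100 latitudes from df2

diff = lat1[:, None] - lat2                    # (1000, 1) - (100,)  -> (1000, 100)
wrong = np.cos(lat1[:, None]) * np.cos(lat1)   # (1000, 1) * (1000,) -> (1000, 1000)
right = np.cos(lat1[:, None]) * np.cos(lat2)   # (1000, 1) * (100,)  -> (1000, 100)

print(diff.shape, wrong.shape, right.shape)

# Combining the (1000, 1000) term with a (1000, 100) term is what raises
# the ValueError shown in the question.
try:
    wrong * diff
    mismatch = False
except ValueError:
    mismatch = True
print(mismatch)
```

Only the version that pairs lat1[:, None] with lat2 lines up with the (1000, 100) difference arrays.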
Pairwise haversine distances
Here's a vectorized way based on broadcasting:
def convert_to_arrays(df1, df2):
    d1 = np.array(df1['coordinates'].tolist())
    d2 = np.array(df2['coordinates'].tolist())
    return d1,d2

def broadcasting_based_lng_lat(data1, data2):
    # data1, data2 are the data arrays with 2 cols and they hold
    # lat., lng. values in those cols respectively
    data1 = np.deg2rad(data1)
    data2 = np.deg2rad(data2)

    lat1 = data1[:,0]
    lng1 = data1[:,1]

    lat2 = data2[:,0]
    lng2 = data2[:,1]

    diff_lat = lat1[:,None] - lat2
    diff_lng = lng1[:,None] - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))
Hence, to solve your case and get all pairwise haversine distances (in km), it would be:
broadcasting_based_lng_lat(*convert_to_arrays(df1,df2))
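Since the question mentions that df1 can reach millions of rows, the full (len(df1), len(df2)) result may not fit in memory at once. One option, sketched below under my own assumptions, is to run the same broadcasting formula over blocks of rows. The function name pairwise_haversine_chunked, the chunk_size and radius parameters, and the np.clip guard against rounding are all hypothetical additions for illustration, not part of the original answer.

```python
import numpy as np

def pairwise_haversine_chunked(data1, data2, chunk_size=10_000, radius=6371.0):
    """Yield (start, block) pairs, where block holds the distances in km from
    rows data1[start:start + chunk_size] to every row of data2.
    Both inputs are (n, 2) arrays of lat, lng in degrees."""
    data1 = np.deg2rad(np.asarray(data1, dtype=float))
    data2 = np.deg2rad(np.asarray(data2, dtype=float))
    lat2, lng2 = data2[:, 0], data2[:, 1]
    cos_lat2 = np.cos(lat2)  # reused by every block
    for start in range(0, len(data1), chunk_size):
        lat1 = data1[start:start + chunk_size, 0][:, None]
        lng1 = data1[start:start + chunk_size, 1][:, None]
        d = (np.sin((lat1 - lat2) / 2) ** 2
             + np.cos(lat1) * cos_lat2 * np.sin((lng1 - lng2) / 2) ** 2)
        # Rounding can push d marginally outside [0, 1]; clip before sqrt/arcsin.
        d = np.clip(d, 0.0, 1.0)
        yield start, 2 * radius * np.arcsin(np.sqrt(d))

# Usage: index of the nearest point in pts2 for each row of pts1,
# without ever holding the full distance matrix.
rng = np.random.default_rng(0)
pts1 = np.column_stack([rng.uniform(-90, 90, 2500), rng.uniform(-180, 180, 2500)])
pts2 = np.column_stack([rng.uniform(-90, 90, 100), rng.uniform(-180, 180, 100)])

nearest = np.empty(len(pts1), dtype=int)
for start, block in pairwise_haversine_chunked(pts1, pts2, chunk_size=1000):
    nearest[start:start + len(block)] = block.argmin(axis=1)
```

Each block is reduced (here with argmin) before the next one is computed, so peak memory is roughly chunk_size * len(data2) floats rather than the whole matrix.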
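Elementwise haversine distances
For elementwise haversine distances between two datasets, where each holds latitude and longitude in two columns (or as lists of two elements each), we can skip the extension to 2D and end up with something like this: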
def broadcasting_based_lng_lat_elementwise(data1, data2):
    # data1, data2 are the data arrays with 2 cols and they hold
    # lat., lng. values in those cols respectively
    data1 = np.deg2rad(data1)
    data2 = np.deg2rad(data2)

    lat1 = data1[:,0]
    lng1 = data1[:,1]

    lat2 = data2[:,0]
    lng2 = data2[:,1]

    diff_lat = lat1 - lat2
    diff_lng = lng1 - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1)*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))
Sample run, with a dataframe holding the two datasets in two columns:
In [42]: np.random.seed(0)
    ...: a = np.random.randint(10,100,(5,2)).tolist()
    ...: b = np.random.randint(10,100,(5,2)).tolist()
    ...: df = pd.DataFrame({'A':a,'B':b})

In [43]: df
Out[43]:
          A         B
0  [54, 57]  [80, 98]
1  [74, 77]  [98, 22]
2  [77, 19]  [68, 75]
3  [93, 31]  [49, 97]
4  [46, 97]  [56, 98]

In [44]: from haversine import haversine
In [45]: [haversine(i,j) for (i,j) in zip(df.A,df.B)]
Out[45]:
[3235.9659882513424,
2399.6124657290075,
2012.0851666001824,
4702.8069773315865,
1114.1193334220534]
In [46]: broadcasting_based_lng_lat_elementwise(np.vstack(df.A), np.vstack(df.B))
Out[46]:
array([3235.96151855, 2399.60915125, 2012.08238739, 4702.80048155,
1114.11779454])
Those slight differences are mainly because the haversine library assumes an earth radius of 6371.0088 km, while we are taking it as 6371 km here.
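If the results need to line up more closely with a library that uses a different earth radius, the radius can be exposed as a parameter. The sketch below is a variation on the elementwise function; the name haversine_elementwise and the radius argument are my own additions for illustration. Because the radius is only a scale factor, the two results differ by exactly the ratio of the radii.

```python
import numpy as np

def haversine_elementwise(data1, data2, radius=6371.0):
    # data1, data2: (n, 2) arrays of lat, lng in degrees; result is in the
    # same unit as radius (km by default).
    data1 = np.deg2rad(np.asarray(data1, dtype=float))
    data2 = np.deg2rad(np.asarray(data2, dtype=float))
    lat1, lng1 = data1[:, 0], data1[:, 1]
    lat2, lng2 = data2[:, 0], data2[:, 1]
    d = (np.sin((lat1 - lat2) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng1 - lng2) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(d))

# Two of the row pairs from the sample run above.
a = np.array([[54, 57], [46, 97]])
b = np.array([[80, 98], [56, 98]])

km_6371 = haversine_elementwise(a, b)                     # radius used in this answer
km_mean = haversine_elementwise(a, b, radius=6371.0088)   # radius quoted for the haversine library
print(km_mean - km_6371)   # per-pair gap caused by the radius alone
```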