Python: Computing the distance between two point coordinates using two columns

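I want to compute the distance between two coordinates. I know I can compute the haversine distance between two points. However, I would like to know whether there is an easier way than writing a loop that applies the formula over the whole column (my loop also raises an error).

Here is some sample data: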

import random
import pandas as pd

# Random values for the duration from one point to another
random_values = random.sample(range(2,20), 8)
random_values

# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]

df = pd.DataFrame(
    {'duration': random_values,
     'latitude': lat_coor,
     'longitude': lon_coor
    })

df

   duration   latitude  longitude
0         5  11.923855  57.723843
1         2  11.923862  57.723831
2        10  11.923851  57.723839
3        19  11.923847  57.723831
4        16  11.923865  57.723827
5         4  11.923841  57.723831
6        13  11.923860  57.723835
7         3  11.923846  57.723827

To compute the distance, this is what I tried:

# Looping over each row to compute the Haversine distance between two points
# Earth's radius (in m)
R = 6373.0 * 1000

lat = df["latitude"]
lon = df["longitude"]


for i in lat:
    lat1 = lat[i]
    lat2 = lat[i+1]
    
    for j in lon:
        lon1 = lon[i]
        lon2 = lon[i+1]
        
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        
        # Haversine formula
        a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        distance = R * c
        
        print(distance) # in m

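However, this is the error I get:

The two points used to compute each distance should come from the same columns, taken from consecutive rows.

First distance value:

11.923855 57.723843 (point1/observation1)

11.923862 57.723831 (point2/observation2)

Second distance value:

11.923862 57.723831 (point1/observation2)

11.923851 57.723839 (point2/observation3)

Third distance value:

11.923851 57.723839 (point1/observation3)

11.923847 57.723831 (point2/observation4)

... (and so on)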

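As I understand it, you want the pairwise haversine distances between all points in df. Here is how that can be done:

Be careful when using this approach with many points, because it adds one column per point and the DataFrame grows quickly.

Setup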

import random
import pandas as pd
random_values = random.sample(range(2,20), 8)
random_values

# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]

df = pd.DataFrame(
    {'duration': random_values,
     'latitude': lat_coor,
     'longitude': lon_coor
    })

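Convert to radians

import math
df['lat_rad'] = df.latitude.apply(math.radians)
df['long_rad'] = df.longitude.apply(math.radians)  # use the longitude column here

Compute the pairwise distances

from sklearn.metrics.pairwise import haversine_distances

for idx_from, from_point in df.iterrows():
    for idx_to, to_point in df.iterrows():
        column_name = f"Distance_to_point_{idx_from}"
        # haversine_distances expects [latitude, longitude] pairs in radians
        # and returns distances on the unit sphere
        haversine_matrix = haversine_distances([[from_point.lat_rad, from_point.long_rad], [to_point.lat_rad, to_point.long_rad]])
        # Earth's radius in m, divided by 1000 to get km
        point_distance = haversine_matrix[0][1] * 6371000/1000
        df.loc[idx_to, column_name] = point_distance
df

The output below was generated with long_rad taken from the latitude column (its lat_rad and long_rad columns are identical), so its distances, in km, differ from what the code above produces.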

    duration    latitude    longitude   lat_rad long_rad    Distance_to_point_0 Distance_to_point_1 Distance_to_point_2 Distance_to_point_3 Distance_to_point_4 Distance_to_point_5 Distance_to_point_6 Distance_to_point_7
0   3   11.923855   57.723843   0.20811052928038845 0.20811052928038845 0.0 0.0010889626934743966   0.0006222644021223135   0.001244528808978787    0.0015556609862946524   0.002177925427923575    0.000777830496776312    0.0014000949117650525
1   13  11.923862   57.723831   0.2081106514534361  0.2081106514534361  0.0010889626934743966   0.0 0.0017112270955967099   0.002333491502453183    0.0004666982928202561   0.00326688812139797 0.00031113219669808446  0.0024890576052394482
2   14  11.923851   57.723839   0.2081104594672184  0.2081104594672184  0.0006222644021223135   0.0017112270955967099   0.0 0.0006222644068564735   0.002177925388416966    0.0015556610258012616   0.0014000948988986254   0.0007778305096427389
3   4   11.923847   57.723831   0.20811038965404832 0.20811038965404832 0.001244528808978787    0.002333491502453183    0.0006222644068564735   0.0 0.0028001897952734385   0.0009333966189447881   0.002022359305755099    0.0001555661027862654
4   5   11.923865   57.723827   0.20811070381331365 0.20811070381331365 0.0015556609862946524   0.0004666982928202561   0.002177925388416966    0.0028001897952734385   0.0 0.003733586414218225    0.0007778304895183407   0.002955755898059704
5   7   11.923841   57.723831   0.20811028493429318 0.20811028493429318 0.002177925427923575    0.00326688812139797 0.0015556610258012616   0.0009333966189447881   0.003733586414218225    0.0 0.002955755924699886    0.0007778305161585227
6   9   11.92386    57.723835   0.20811061654685106 0.20811061654685106 0.000777830496776312    0.00031113219669808446  0.0014000948988986254   0.002022359305755099    0.0007778304895183407   0.002955755924699886    0.0 0.002177925408541364
7   8   11.923846   57.723827   0.20811037220075576 0.20811037220075576 0.0014000949117650525   0.0024890576052394482   0.0007778305096427389   0.0001555661027862654   0.002955755898059704    0.0007778305161585227   0.002177925408541364    0.0
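
The nested iterrows loops can be avoided altogether. The following is a minimal sketch that builds the same kind of pairwise matrix with NumPy broadcasting rather than scikit-learn; the function name pairwise_haversine_m and the radius constant are assumptions made for illustration, and the result is in metres rather than km.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed value)


def pairwise_haversine_m(lat_deg, lon_deg):
    """Return an (n, n) matrix of great-circle distances in metres.

    Hypothetical helper: inputs are sequences of degrees.
    """
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # A column vector minus a row vector yields every pairwise difference.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point excursions outside [0, 1].
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


lat_coor = [11.923855, 11.923862, 11.923851, 11.923847,
            11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831,
            57.723827, 57.723831, 57.723835, 57.723827]

dist = pairwise_haversine_m(lat_coor, lon_coor)
print(dist.shape)  # (8, 8)
```

Entry dist[i, j] is the distance from point i to point j, so the matrix is symmetric with zeros on the diagonal. Keeping it as a separate array also avoids adding one column per point to df.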

OK, first you can create a DataFrame that pairs each measurement with the previous one:

df2 = pd.concat([df.add_suffix('_pre').shift(), df], axis=1)
df2

This outputs:

   duration_pre  latitude_pre  longitude_pre  duration   latitude  longitude
0           NaN           NaN            NaN         5  11.923855  57.723843
1           5.0     11.923855      57.723843         2  11.923862  57.723831
2           2.0     11.923862      57.723831        10  11.923851  57.723839
…

Then create a haversine function and apply it to the rows:

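import math

def haversine(lat1, lon1, lat2, lon2):
    R = 6373.0 * 1000  # approximate Earth radius in m
    # math.sin and math.cos expect radians, so convert the degree inputs first
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

df2.apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)

This computes the distance in m between each row and the previous one (so the first row is NaN). The figures printed in this answer were produced by passing degrees straight into the trigonometric functions, which inflates them; with the conversion to radians, the consecutive distances for this data come out at roughly 1 to 3 m.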

0           NaN
1     75.754755
2     81.120210
3     48.123604
…
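
A quick way to sanity-check any haversine implementation is to feed it a pair of points whose separation is known in advance: one degree of latitude along a meridian is about 111 km. The sketch below does that with the standard library only; the function name haversine_m and the default radius of 6,371,000 m are assumptions for illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in metres; all four inputs are in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


# One degree of latitude on the prime meridian.
one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
print(round(one_degree / 1000, 1))  # 111.2

# First two observations from the sample data.
first_pair = haversine_m(11.923855, 57.723843, 11.923862, 57.723831)
```

If the check returns something far from 111 km, the most likely cause is a missing degree-to-radian conversion.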

And if you want the new column added to the original DataFrame in a single line:

df['distance'] = pd.concat([df.add_suffix('_pre').shift(), df], axis=1).apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)

Output:

   duration   latitude  longitude    distance
0         5  11.923855  57.723843         NaN
1         2  11.923862  57.723831   75.754755
2        10  11.923851  57.723839   81.120210
3        19  11.923847  57.723831   48.123604
4        16  11.923865  57.723827  116.515304
5         4  11.923841  57.723831  154.307571
6        13  11.923860  57.723835  122.794838
7         3  11.923846  57.723827   98.115312
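
The row-wise apply can also be replaced by column arithmetic, which answers the original request for a loop-free approach. This is a minimal sketch, assuming the latitude and longitude columns hold degrees; the column name distance_m and the radius value are assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "latitude": [11.923855, 11.923862, 11.923851, 11.923847,
                 11.923865, 11.923841, 11.923860, 11.923846],
    "longitude": [57.723843, 57.723831, 57.723839, 57.723831,
                  57.723827, 57.723831, 57.723835, 57.723827],
})

R = 6_371_000  # mean Earth radius in metres (assumed value)

# NumPy ufuncs applied to a Series return a Series, so the index is kept.
lat = np.radians(df["latitude"])
lon = np.radians(df["longitude"])

# shift() moves every value down one row, giving the previous observation.
prev_lat = lat.shift()
dlat = lat - prev_lat
dlon = lon - lon.shift()

a = np.sin(dlat / 2) ** 2 + np.cos(prev_lat) * np.cos(lat) * np.sin(dlon / 2) ** 2
df["distance_m"] = 2 * R * np.arcsin(np.sqrt(a))

print(df)
```

The first row has no predecessor, so its distance is NaN, matching the behaviour of the concat-and-apply version.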

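You confused the index with the values themselves, so you hit a KeyError because there is no lat[i] in your example (for instance lat[11.923855]). Once i is changed to an index, your code would run past the last row of lat and lon with [i+1]. Since you want to compare each row with the previous one, how about starting at index 1 and looking back by 1? That way you never go out of range. This edited version of your code does not crash: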

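for i in range(1, len(lat)):
    # lat and lon share the same index i, so a second loop over lon is not needed.
    # math.sin and math.cos expect radians, so convert the degree values first.
    lat1 = math.radians(lat[i - 1])
    lat2 = math.radians(lat[i])
    lon1 = math.radians(lon[i - 1])
    lon2 = math.radians(lon[i])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c

    print(distance)  # in m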
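
Index arithmetic can be avoided entirely by pairing each point with its successor using zip, which removes both the KeyError and the out-of-range risk. The sketch below uses plain lists and the standard library; the helper names haversine_m and consecutive_distances are hypothetical, and the radius is an assumed mean value.

```python
import math


def haversine_m(p, q, radius_m=6_371_000):
    """Distance in metres between two (lat, lon) tuples given in degrees."""
    phi1, lmb1 = map(math.radians, p)
    phi2, lmb2 = map(math.radians, q)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lmb2 - lmb1) / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def consecutive_distances(lats, lons):
    """Distance from each point to the next one; returns len(lats) - 1 values."""
    points = list(zip(lats, lons))
    # zip(points, points[1:]) stops at the shorter sequence, so it never
    # reads past the last point.
    return [haversine_m(p, q) for p, q in zip(points, points[1:])]


lat_coor = [11.923855, 11.923862, 11.923851, 11.923847,
            11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831,
            57.723827, 57.723831, 57.723835, 57.723827]

distances = consecutive_distances(lat_coor, lon_coor)
for d in distances:
    print(f"{d:.2f} m")
```

With eight observations this yields seven distances, one per consecutive pair.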