From pandas dataframe to tuples (for haversine module)
I have a pandas dataframe my_df with the following columns:
id lat1 lon1 lat2 lon2
1 45 0 41 3
2 40 1 42 4
3 42 2 37 1
Basically, I would like to do the following:
import haversine
haversine.haversine((45, 0), (41, 3)) # just to show syntax of haversine()
> 507.20410687342115
# what I'd like to do
my_df["dist"] = haversine.haversine((my_df["lat1"], my_df["lon1"]),(my_df["lat2"], my_df["lon2"]))
TypeError: cannot convert the series to <class 'float'>
Using this, I tried the following:
my_df['dist'] = haversine.haversine(
list(zip(*[my_df[['lat1','lon1']][c].values.tolist() for c in my_df[['lat1','lon1']]]))
,
list(zip(*[my_df[['lat2','lon2']][c].values.tolist() for c in my_df[['lat2','lon2']]]))
)
File "blabla\lib\site-packages\haversine\__init__.py", line 20, in haversine
    lat1, lng1 = point1
ValueError: too many values to unpack (expected 2)
Any idea what I am doing wrong, or how I can achieve what I want?
Use apply with axis=1:
my_df["dist"] = my_df.apply(lambda row : haversine.haversine((row["lat1"], row["lon1"]),(row["lat2"], row["lon2"])), axis=1)
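This calls the haversine function on each row. The function understands scalar values, not array-like values, which is why you get the error. Calling apply with axis=1 iterates row-wise, so we can access each column value and pass the values in the form the function expects: two (lat, lon) tuples.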
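For anyone who wants to try the row-wise approach without installing the haversine package, here is a minimal sketch. scalar_haversine is a hypothetical stand-in written for this example (it is not the package's code) and assumes an earth radius of 6371 km; the dataframe is rebuilt from the values in the question.

```python
import math

import pandas as pd


def scalar_haversine(point1, point2, earth_radius=6371.0):
    """Hypothetical stand-in for haversine.haversine.

    Takes two (lat, lon) tuples in decimal degrees and returns kilometres.
    """
    # Each point must hold exactly two scalars; a list of many tuples
    # fails here with "too many values to unpack".
    lat1, lon1 = point1
    lat2, lon2 = point2
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius * math.asin(math.sqrt(a))


my_df = pd.DataFrame({
    "id": [1, 2, 3],
    "lat1": [45, 40, 42], "lon1": [0, 1, 2],
    "lat2": [41, 42, 37], "lon2": [3, 4, 1],
})

# axis=1 passes one row at a time, so row["lat1"] etc. are scalars
my_df["dist"] = my_df.apply(
    lambda row: scalar_haversine((row["lat1"], row["lon1"]),
                                 (row["lat2"], row["lon2"])),
    axis=1,
)
print(my_df)
```

Keep in mind that apply with axis=1 is essentially a Python-level loop over the rows, so it is convenient but can be slow on large dataframes.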
I also don't know what the difference is, but there is a vectorized version of the haversine formula.
How about using a vectorized approach:
import numpy as np  # pd.np was removed in pandas 2.0, so use NumPy directly
import pandas as pd

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    Slightly modified version of an existing implementation.

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians).

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    # haversine term for every pair of points at once
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Demo:
In [38]: df
Out[38]:
id lat1 lon1 lat2 lon2
0 1 45 0 41 3
1 2 40 1 42 4
2 3 42 2 37 1
In [39]: df['dist'] = haversine(df.lat1, df.lon1, df.lat2, df.lon2)
In [40]: df
Out[40]:
id lat1 lon1 lat2 lon2 dist
0 1 45 0 41 3 507.204107
1 2 40 1 42 4 335.876312
2 3 42 2 37 1 562.543582
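If you want to convince yourself that the vectorized version agrees with a plain row-by-row calculation, a quick cross-check along these lines should do. This is only a sketch: haversine_vec and haversine_scalar are names made up for this example, both assume an earth radius of 6371 km, and the inputs are assumed to be in decimal degrees.

```python
import math

import numpy as np
import pandas as pd


def haversine_vec(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Vectorized great-circle distance in km (inputs in decimal degrees)."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(x, dtype=float)) for x in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))


def haversine_scalar(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Same formula for a single pair of points, using only the math module."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * math.asin(math.sqrt(a))


df = pd.DataFrame({
    "id": [1, 2, 3],
    "lat1": [45, 40, 42], "lon1": [0, 1, 2],
    "lat2": [41, 42, 37], "lon2": [3, 4, 1],
})

# One call over whole columns
vectorized = haversine_vec(df["lat1"], df["lon1"], df["lat2"], df["lon2"])

# Reference values computed one row at a time
row_wise = np.array([
    haversine_scalar(r.lat1, r.lon1, r.lat2, r.lon2)
    for r in df.itertuples(index=False)
])

matches = np.allclose(vectorized, row_wise)
print(matches)
```

The vectorized call works on whole columns in NumPy instead of looping in Python, which is why it is usually the better choice once the dataframe gets large.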