Pandas Dataframe: join items in range based on their geo coordinates (longitude and latitude)
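I have a dataframe containing places with their latitude and longitude. Imagine cities, for example.
import pandas as pd

df = pd.DataFrame([{'city': "Berlin", 'lat': 52.5243700, 'lng': 13.4105300},
                   {'city': "Potsdam", 'lat': 52.3988600, 'lng': 13.0656600},
                   {'city': "Hamburg", 'lat': 53.5753200, 'lng': 10.0153400}])
Now I'm trying to get all cities within a given radius of each city: say, all cities within 500 km of Berlin, all cities within 500 km of Hamburg, and so on. I would do this by duplicating the original dataframe and joining the two copies with a distance function.
The intermediate result would look something like this: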
Berlin --> Potsdam
Berlin --> Hamburg
Potsdam --> Berlin
Potsdam --> Hamburg
Hamburg --> Potsdam
Hamburg --> Berlin
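After grouping (reducing), the final result should look like this. Remark: it would be nice if the list of values included all columns of the city.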
Berlin --> [Potsdam, Hamburg]
Potsdam --> [Berlin, Hamburg]
Hamburg --> [Berlin, Potsdam]
Or just the number of cities within 500 km of each city.
Berlin --> 2
Potsdam --> 2
Hamburg --> 2
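Since I'm quite new to Python, I would appreciate any help. I'm familiar with the haversine distance, but I'm not sure whether Scipy or Pandas has useful distance/spatial methods.
I'd be glad if you could give me a starting point. So far I have tried following this post.
Update: The original idea behind this question comes from the Two Sigma Connect Rental Listing Kaggle Competition. The idea is to find the listings within 100 m of another listing. This a) indicates density and therefore a popular area, and b) if you compare the addresses, you can find out whether there is a crossing nearby and therefore a noisy area. So you don't need the full item-to-item relation, because you need to compare not only the distance but also the address and other metadata. PS: I'm not uploading a solution to Kaggle. I just want to learn.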
You can use:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r
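As a quick sanity check, the function can be called on two of the cities from the question. This is only a small sketch: the function is repeated so the snippet runs on its own, and the `r` keyword argument is my addition for illustration. Note that the argument order is longitude first, then latitude.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2, r=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Berlin -> Potsdam, arguments passed as (lng, lat, lng, lat)
d = haversine(13.41053, 52.52437, 13.06566, 52.39886)
print(round(d, 1))  # about 27.2 km
```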
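First you need a cross join with merge, then remove the rows with the same values in city_x and city_y by boolean indexing:
df['tmp'] = 1                          # constant helper key for the cross join
df = pd.merge(df, df, on='tmp')
df = df[df.city_x != df.city_y]        # drop pairs of a city with itself
print(df)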
city_x lat_x lng_x tmp city_y lat_y lng_y
1 Berlin 52.52437 13.41053 1 Potsdam 52.39886 13.06566
2 Berlin 52.52437 13.41053 1 Hamburg 53.57532 10.01534
3 Potsdam 52.39886 13.06566 1 Berlin 52.52437 13.41053
5 Potsdam 52.39886 13.06566 1 Hamburg 53.57532 10.01534
6 Hamburg 53.57532 10.01534 1 Berlin 52.52437 13.41053
7 Hamburg 53.57532 10.01534 1 Potsdam 52.39886 13.06566
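On newer pandas versions (1.2 and later, as far as I know) the helper column is not needed, because `merge` accepts `how='cross'`. A minimal sketch, assuming such a version is installed; the variable name `pairs` is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Berlin', 'Potsdam', 'Hamburg'],
    'lat': [52.52437, 52.39886, 53.57532],
    'lng': [13.41053, 13.06566, 10.01534],
})

# Cartesian product of the frame with itself; no temporary key column required.
pairs = pd.merge(df, df, how='cross', suffixes=('_x', '_y'))

# Drop the rows that pair a city with itself.
pairs = pairs[pairs['city_x'] != pairs['city_y']]
print(pairs)
```

The result has the same columns as above, minus `tmp`.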
Then apply the haversine function:
df['dist'] = df.apply(lambda row: haversine(row['lng_x'],
                                            row['lat_x'],
                                            row['lng_y'],
                                            row['lat_y']), axis=1)
Filter by distance (in km):
df = df[df.dist < 500]
print(df)
city_x lat_x lng_x tmp city_y lat_y lng_y dist
1 Berlin 52.52437 13.41053 1 Potsdam 52.39886 13.06566 27.215704
2 Berlin 52.52437 13.41053 1 Hamburg 53.57532 10.01534 255.223782
3 Potsdam 52.39886 13.06566 1 Berlin 52.52437 13.41053 27.215704
5 Potsdam 52.39886 13.06566 1 Hamburg 53.57532 10.01534 242.464120
6 Hamburg 53.57532 10.01534 1 Berlin 52.52437 13.41053 255.223782
7 Hamburg 53.57532 10.01534 1 Potsdam 52.39886 13.06566 242.464120
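The cross join builds all n² pairs before filtering, which gets expensive for something like the 100 m listing case from the question's update. One possible alternative, sketched here under my own assumptions, is a spatial index: convert the points to 3D coordinates on the unit sphere and query `scipy.spatial.cKDTree` with the chord length that corresponds to the radius. The helper names `to_unit_sphere` and `neighbours_within` are hypothetical, and the Earth is treated as a sphere of radius 6371 km.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth

def to_unit_sphere(lat_deg, lng_deg):
    """Convert latitude/longitude in degrees to x, y, z on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lng),
                            np.cos(lat) * np.sin(lng),
                            np.sin(lat)))

def neighbours_within(lat_deg, lng_deg, radius_km):
    """For each point, return the sorted indices of the other points within radius_km."""
    xyz = to_unit_sphere(lat_deg, lng_deg)
    # A great-circle distance d corresponds to a straight-line chord of 2*sin(d / (2R)).
    chord = 2.0 * np.sin(radius_km / (2.0 * EARTH_RADIUS_KM))
    tree = cKDTree(xyz)
    hits = tree.query_ball_point(xyz, r=chord)
    return [sorted(j for j in found if j != i) for i, found in enumerate(hits)]

cities = ['Berlin', 'Potsdam', 'Hamburg']
lat = [52.52437, 52.39886, 53.57532]
lng = [13.41053, 13.06566, 10.01534]

result = neighbours_within(lat, lng, radius_km=100)
for city, idx in zip(cities, result):
    print(city, '-->', [cities[j] for j in idx])
```

With a 100 km radius only Berlin and Potsdam find each other; Hamburg has no neighbours.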
Finally, create a list or get the size with groupby:
df1 = df.groupby('city_x')['city_y'].apply(list)
print(df1)
city_x
Berlin [Potsdam, Hamburg]
Hamburg [Berlin, Potsdam]
Potsdam [Berlin, Hamburg]
Name: city_y, dtype: object
df2 = df.groupby('city_x')['city_y'].size()
print(df2)
city_x
Berlin 2
Hamburg 2
Potsdam 2
dtype: int64
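Both results can also be produced in one pass with named aggregation, and further columns of the neighbouring city can be collected the same way, which goes some way towards the remark in the question. A small sketch; the `pairs` frame below is typed in by hand from the filtered output above, and the output column names are my own choice:

```python
import pandas as pd

# Already-filtered pairs, copied from the output above (only the columns needed here).
pairs = pd.DataFrame({
    'city_x': ['Berlin', 'Berlin', 'Potsdam', 'Potsdam', 'Hamburg', 'Hamburg'],
    'city_y': ['Potsdam', 'Hamburg', 'Berlin', 'Hamburg', 'Berlin', 'Potsdam'],
    'lat_y': [52.39886, 53.57532, 52.52437, 53.57532, 52.52437, 52.39886],
    'dist': [27.215704, 255.223782, 27.215704, 242.464120, 255.223782, 242.464120],
})

summary = pairs.groupby('city_x').agg(
    neighbours=('city_y', list),      # names of the cities in range
    neighbour_lats=('lat_y', list),   # any other column can be collected likewise
    n_neighbours=('city_y', 'size'),  # how many cities are in range
)
print(summary)
```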
You can also use the numpy haversine solution, which works on whole columns at once:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    # Earth radius of 6367 km here, 6371 km above, hence the slightly different distances
    km = 6367 * c
    return km

# starting again from the original df
df['tmp'] = 1
df = pd.merge(df, df, on='tmp')
df = df[df.city_x != df.city_y]
df['dist'] = haversine_np(df['lng_x'], df['lat_x'], df['lng_y'], df['lat_y'])
print(df)
city_x lat_x lng_x tmp city_y lat_y lng_y dist
1 Berlin 52.52437 13.41053 1 Potsdam 52.39886 13.06566 27.198616
2 Berlin 52.52437 13.41053 1 Hamburg 53.57532 10.01534 255.063541
3 Potsdam 52.39886 13.06566 1 Berlin 52.52437 13.41053 27.198616
5 Potsdam 52.39886 13.06566 1 Hamburg 53.57532 10.01534 242.311890
6 Hamburg 53.57532 10.01534 1 Berlin 52.52437 13.41053 255.063541
7 Hamburg 53.57532 10.01534 1 Potsdam 52.39886 13.06566 242.311890
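If only the distances are needed, the same vectorized formula can be applied without the merge by letting numpy broadcast the coordinates against themselves. This is a sketch of that idea, not part of the answer above; the function name `haversine_matrix` and the default radius of 6371 km are my assumptions.

```python
import numpy as np

def haversine_matrix(lat_deg, lng_deg, radius_km=6371.0):
    """Return the full n x n matrix of great-circle distances in km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    # Column vector minus row vector gives all pairwise differences.
    dlat = lat[:, None] - lat[None, :]
    dlng = lng[:, None] - lng[None, :]
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2.0) ** 2)
    return 2.0 * radius_km * np.arcsin(np.sqrt(a))

# Berlin, Potsdam, Hamburg
dist = haversine_matrix([52.52437, 52.39886, 53.57532],
                        [13.41053, 13.06566, 10.01534])
print(np.round(dist, 1))
```

The matrix is symmetric with zeros on the diagonal, so it still takes n² memory; it only avoids the intermediate merged frame.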
UPDATE: I would suggest first building a distance DataFrame:
from scipy.spatial.distance import squareform, pdist
from itertools import combinations

# see definition of "haversine_np()" below (the version taking two points p1, p2)
x = pd.DataFrame({'dist': pdist(df[['lat','lng']], haversine_np)},
                 index=pd.MultiIndex.from_tuples(tuple(combinations(df['city'], 2))))
This efficiently produces a pairwise distance DF (without duplicates):
In [106]: x
Out[106]:
dist
Berlin Potsdam 27.198616
Hamburg 255.063541
Potsdam Hamburg 242.311890
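Because each pair appears only once in this form, getting the neighbours of every city means looking at both ends of each pair. One way to do that, sketched here with the distances copied from the output above, is to append a copy with the two city columns swapped and then group. The column names `a` and `b` and the 250 km radius are only for illustration.

```python
import pandas as pd

# The pairwise distances from above as a flat frame, one row per unordered pair.
pairs = pd.DataFrame({
    'a': ['Berlin', 'Berlin', 'Potsdam'],
    'b': ['Potsdam', 'Hamburg', 'Hamburg'],
    'dist': [27.198616, 255.063541, 242.311890],
})

close = pairs[pairs['dist'] < 250]

# Add the reversed direction so every city shows up in column 'a'.
both = pd.concat([close, close.rename(columns={'a': 'b', 'b': 'a'})],
                 ignore_index=True)

neighbours = both.groupby('a')['b'].apply(list)
print(neighbours)
```

Cities with no neighbour inside the radius do not appear in the result at all, so reindex against the full city list if zero counts matter.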
OLD answer:
Here is a slightly optimized version, which uses the scipy.spatial.distance.pdist method:
import numpy as np
from scipy.spatial.distance import squareform, pdist

# slightly modified version of haversine_np() that takes two (lat, lng) points
def haversine_np(p1, p2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1],
                                         p2[0], p2[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

x = pd.DataFrame(squareform(pdist(df[['lat','lng']], haversine_np)),
                 columns=df.city.unique(),
                 index=df.city.unique())
This gives us:
In [78]: x
Out[78]:
Berlin Potsdam Hamburg
Berlin 0.000000 27.198616 255.063541
Potsdam 27.198616 0.000000 242.311890
Hamburg 255.063541 242.311890 0.000000
Let's count, for each city, the number of cities that are more than 30 km away:
In [81]: x.groupby(level=0, as_index=False) \
...: .apply(lambda c: c[c>30].notnull().sum(1)) \
...: .reset_index(level=0, drop=True)
Out[81]:
Berlin 1
Hamburg 2
Potsdam 1
dtype: int64
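The same count can probably be written more directly with a boolean mask on the square matrix, and the "within 500 km" count from the question follows the same pattern. A minimal sketch with the matrix typed in from the output above; subtracting 1 removes the city itself, whose distance of 0 always passes the `<` test.

```python
import pandas as pd

cities = ['Berlin', 'Potsdam', 'Hamburg']
x = pd.DataFrame([[0.000000, 27.198616, 255.063541],
                  [27.198616, 0.000000, 242.311890],
                  [255.063541, 242.311890, 0.000000]],
                 index=cities, columns=cities)

# Cities more than 30 km away (the diagonal is 0, so it never counts).
farther_than_30 = (x > 30).sum(axis=1)

# Cities within 500 km, not counting the city itself.
within_500 = (x < 500).sum(axis=1) - 1

print(farther_than_30)
print(within_500)
```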