New column with coordinates using geopy and pandas
I have a df:
import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty
df
city_name state_name county_name
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
2 WASHINGTON DC DIST OF COLUMBIA
3 WASHINGTON DC DIST OF COLUMBIA
4 WASHINGTON DC DIST OF COLUMBIA
5 WASHINGTON DC DIST OF COLUMBIA
6 WASHINGTON DC DIST OF COLUMBIA
7 WASHINGTON DC DIST OF COLUMBIA
8 WASHINGTON DC DIST OF COLUMBIA
9 WASHINGTON DC DIST OF COLUMBIA
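I want to get the latitude and longitude coordinates for any one of the columns in this data frame. The documentation (http://geopy.readthedocs.org/en/latest/#data) is straightforward when it comes to looking up individual locations.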
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}
But I would like to apply the function to every row in the df and create a new column. I tried the following:
df['city_coord'] = geolocator.geocode(lambda row: 'state_name' (row))
But I think I am missing something in my code, because I get the following:
city_name state_name county_name coordinates
0 WASHINGTON DC DIST OF COLUMBIA None
1 WASHINGTON DC DIST OF COLUMBIA None
2 WASHINGTON DC DIST OF COLUMBIA None
3 WASHINGTON DC DIST OF COLUMBIA None
4 WASHINGTON DC DIST OF COLUMBIA None
5 WASHINGTON DC DIST OF COLUMBIA None
6 WASHINGTON DC DIST OF COLUMBIA None
7 WASHINGTON DC DIST OF COLUMBIA None
8 WASHINGTON DC DIST OF COLUMBIA None
9 WASHINGTON DC DIST OF COLUMBIA None
I would like something like this, ideally using a lambda function:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
1 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
2 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
3 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
4 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
5 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
6 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
7 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
8 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
9 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
10 GLYNCO GA GLYNN 31.2224512, -81.5101023
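Any help is appreciated. Once I have the coordinates, I would like to plot them on a map. Any recommended resources for mapping coordinates would also be much appreciated. Thanks.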
You can call apply and pass the function you want to execute on every row, like this:
In [9]:
geolocator = Nominatim()
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
city_name state_name county_name \
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
city_coord
0 (District of Columbia, United States of Americ...
1 (District of Columbia, United States of Americ...
Then you can access the latitude and longitude attributes:
In [16]:
df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
Or do it in a single line by calling apply twice:
In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df
Out[17]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
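Your attempt, geolocator.geocode(lambda row: 'state_name' (row)), did nothing useful: it hands the lambda itself to geocode as the query, so no value from any row is ever looked up. That is why you ended up with a column full of None values.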
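One thing the chained apply above does not cover is a query with no match: geocode returns None in that case, and the lambda would then fail on x.latitude. Below is a minimal sketch of a None-safe version. To keep it runnable without network access, it uses a hypothetical stand-in (FakeLocation, fake_geocode) instead of a real geolocator; the "NOWHERE"/"ZZ" row is made up for illustration. With a real geolocator you would pass geolocator.geocode instead (recent geopy releases also expect a user_agent argument when constructing Nominatim).

```python
import pandas as pd


class FakeLocation:
    """Hypothetical stand-in for a geopy Location; only latitude/longitude."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


# Hypothetical lookup table replacing the network call.
# The DC coordinates are the ones shown in the output above.
_KNOWN = {"DC": FakeLocation(38.8937154, -76.9877934586326)}


def fake_geocode(query):
    """Mimic geolocator.geocode: a location, or None when nothing matches."""
    return _KNOWN.get(query)


def to_coord(location):
    """Return (latitude, longitude), or None if the geocoder found nothing."""
    if location is None:
        return None
    return (location.latitude, location.longitude)


df = pd.DataFrame({
    "city_name": ["WASHINGTON", "NOWHERE"],  # second row is made up
    "state_name": ["DC", "ZZ"],
})

# Same two-step apply as above, but the second step tolerates None.
df["city_coord"] = df["state_name"].apply(fake_geocode).apply(to_coord)
print(df)
```

Rows that could not be geocoded keep a missing value in city_coord, so they are easy to filter out or retry later.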
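Edit
@leb makes an interesting point here: if you have a lot of duplicate values, it is more efficient to geocode each unique value once and then add the results: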
In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d
Out[38]:
{'DC': (38.8937154, -76.9877934586326)}
In [40]:
df['city_coord'] = df['state_name'].map(d)
df
Out[40]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
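So the above uses unique to get all the unique values, constructs a dict from them, and then calls map to perform the lookup and add the coordinates. This will be more efficient than geocoding row by row.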
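To make the saving concrete, here is a small sketch of the same unique-then-map idea that counts how many lookups are actually made. The fake_geocode function and the calls list are hypothetical stand-ins so the example runs offline; a real geolocator.geocode call would take their place.

```python
import pandas as pd

calls = []  # records every query sent to the stand-in geocoder


def fake_geocode(query):
    """Hypothetical stand-in for geolocator.geocode; returns (lat, lon)."""
    calls.append(query)
    table = {"DC": (38.8937154, -76.9877934586326)}
    return table.get(query)


df = pd.DataFrame({"state_name": ["DC"] * 10})

# Geocode each distinct value once, then broadcast the result with map.
lookup = {state: fake_geocode(state) for state in df["state_name"].unique()}
df["city_coord"] = df["state_name"].map(lookup)

print(len(calls))  # 1 lookup instead of 10
```

With ten identical rows, the geocoder is called once rather than ten times; the gap widens as the number of repeated values grows.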
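Upvote and accept @EdChum's answer; I just want to add to it. His method works perfectly, but from personal experience I would like to share a few things:
When dealing with geocoding, if you have multiple city/state combinations that repeat, it is much faster to send just one of them to be geocoded and then copy the result to the other rows.
This is very helpful for large data and can be done in two ways: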
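- Based on your data alone, since the rows look like exact duplicates, and only if you want to, drop the extra rows and geocode just one of them. This can be done with drop_duplicates.
- If you want to keep all rows, groupby the city/state combination, geocode the first row of each group by calling head(1), and then copy the result to the remaining rows.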
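A minimal sketch of the keep-all-rows variant follows. It is one possible way to do it, not necessarily what the answer had in mind: it takes the distinct city/state pairs with drop_duplicates, geocodes each pair once, and copies the result back with merge. The fake_geocode function and the "CITY, STATE" query format are assumptions for illustration; the coordinates are the ones from the desired output in the question.

```python
import pandas as pd


def fake_geocode(query):
    """Hypothetical stand-in for geolocator.geocode; returns (lat, lon) or None."""
    table = {
        "WASHINGTON, DC": (38.8949549, -77.0366456),
        "GLYNCO, GA": (31.2224512, -81.5101023),
    }
    return table.get(query)


df = pd.DataFrame({
    "city_name": ["WASHINGTON", "WASHINGTON", "WASHINGTON", "GLYNCO"],
    "state_name": ["DC", "DC", "DC", "GA"],
})

# One row per distinct city/state combination.
combos = df[["city_name", "state_name"]].drop_duplicates().copy()

# Build a "CITY, STATE" query string and geocode each combination once.
queries = combos["city_name"] + ", " + combos["state_name"]
combos["city_coord"] = queries.apply(fake_geocode)

# Copy the coordinates back onto every original row.
df = df.merge(combos, on=["city_name", "state_name"], how="left")
print(df)
```

The left merge keeps every original row and its order, so the only thing that changes is the extra city_coord column.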
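The reason is that every call to Nominatim incurs a small delay, even if you queue the city/state lookups one right after another. As your data grows, these small delays add up, leading to very slow responses and possibly timeouts.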
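One way to soften this, sketched below under stated assumptions, is to wrap the geocode callable so that repeated queries are answered from a cache and real calls are spaced out. The make_polite helper and its min_delay_seconds parameter are hypothetical names written for this example using only the standard library; geopy's own documentation describes a RateLimiter helper in geopy.extra.rate_limiter that may be a better fit in practice.

```python
import time
from functools import lru_cache


def make_polite(geocode, min_delay_seconds=1.0):
    """Wrap a geocode callable: cache repeated queries, space out real calls."""
    last_call = None  # time of the previous real call, if any

    @lru_cache(maxsize=None)
    def cached(query):
        nonlocal last_call
        if last_call is not None:
            # Sleep just long enough to keep real calls min_delay_seconds apart.
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        result = geocode(query)
        last_call = time.monotonic()
        return result

    return cached


calls = []  # records queries that reach the stand-in geocoder


def fake_geocode(query):
    """Hypothetical stand-in for geolocator.geocode."""
    calls.append(query)
    return (38.8937154, -76.9877934586326) if query == "DC" else None


polite = make_polite(fake_geocode, min_delay_seconds=0.01)
results = [polite(q) for q in ["DC", "DC", "GA", "DC"]]
print(calls)  # only the first "DC" and "GA" reach the geocoder
```

The wrapped function can be passed to apply exactly like geolocator.geocode; cached queries return immediately, so only new city/state values pay the delay.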
Again, this is all from first-hand experience. If it is not useful to you now, keep it in mind for future use.