How to use BeautifulSoup to scrape latitude/longitude data from an HTML page
I am trying to scrape the latitude and longitude values from this website:
http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false
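For each provider, if you inspect the element, it looks like this:
<div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22">
How can I use BeautifulSoup to get the latitude and longitude values here?
I tried using a regular expression in my script.
Here is my script: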
Geo = soup.find("div", class_="providerSearchResults")
print Geo.findAll("div", data-lat_= re.compile('[0-9.]'))
But I get this error message: "SyntaxError: keyword can't be an expression"
Also, the "div" tag varies from provider to provider.
It can be:
<div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22">
or
<div class="listingfirst" data-lat="40.66862" data-lng="-73.98574" data-listing="22">
or even
<div class="listing enhancedlisting" data-lat="40.66862" data-lng="-73.98574" data-listing="22">
First, a few requirements:
pip install requests
pip install beautifulsoup4
pip install lxml
latlongbs4.py:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false')
soup = BeautifulSoup(r.text, 'lxml')
# Match any tag that carries both data-* attributes, whatever its class.
latlonglist = soup.find_all(attrs={"data-lat": True, "data-lng": True})
for latlong in latlonglist:
    print(latlong['data-lat'], latlong['data-lng'])
Edit: removed class from the attrs dictionary.
Output:
(latlongbs4)macbook:latlongbs4 joeyoung$ python latlongbs4.py
40.71851 -74.00984
40.77536 -73.97707
40.71961 -74.00347
40.71395 -74.008
40.711614 -74.015901
40.724576 -74.001771
40.7175 -74.00087
40.71961 -74.00347
40.71766 -73.99293
40.71961 -74.00347
40.71848 -73.99648
40.709917 -74.009884
40.71553 -74.00977
40.71702 -73.996
40.71254 -73.99994
40.70869 -74.01164
40.70994 -74.00764
40.707325 -74.003982
40.7184 -74.00098
40.71373 -74.00812
40.710474 -74.009844
40.7175 -74.00087
40.727582 -73.894632
40.763469 -73.963106
40.724853 -73.841097
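Because the live page may change or be unavailable, the same lookup can be tried offline. The following is a minimal sketch, assuming a small hand-written HTML sample (SAMPLE_HTML is hypothetical, modelled on the three div variants from the question) and the built-in "html.parser" backend instead of lxml. It also shows that a compiled regular expression is accepted as an attribute filter when it is passed inside attrs.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the three variants in the question.
SAMPLE_HTML = """
<div class="providerSearchResults">
  <div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22"></div>
  <div class="listingfirst" data-lat="40.71851" data-lng="-74.00984" data-listing="1"></div>
  <div class="listing enhancedlisting" data-lat="40.77536" data-lng="-73.97707" data-listing="2"></div>
  <div class="advert">no coordinates here</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# True means "the attribute is present, whatever its value",
# so the changing class names do not matter.
tags = soup.find_all("div", attrs={"data-lat": True, "data-lng": True})

# Attribute values come back as strings; convert them to floats.
coords = [(float(t["data-lat"]), float(t["data-lng"])) for t in tags]

# A compiled regex can also filter on the attribute value inside attrs.
decimal = re.compile(r"^-?\d+(\.\d+)?$")
regex_tags = soup.find_all("div", attrs={"data-lat": decimal})

for lat, lng in coords:
    print(lat, lng)
```

Filtering on attribute presence is usually enough here; the regex variant is only useful if some tags might carry a data-lat value that is not a number.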
A few notes:
I passed the attributes in a dictionary through the attrs keyword argument because:
Some attributes, like the data-* attributes in HTML 5, have names that
can’t be used as the names of keyword arguments:
You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into find_all() as the attrs
argument:
Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
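The SyntaxError from the question can be reproduced without BeautifulSoup at all, since it comes from Python's own parser: a name containing a hyphen is read as a subtraction, so it cannot appear on the left of a keyword argument. The sketch below uses a hypothetical stand-in function named find_all (not the real BeautifulSoup method) purely to show the two call styles. The exact wording of the error message depends on the Python version.

```python
def find_all(name=None, attrs=None, **kwargs):
    """Hypothetical stand-in that only reports the filters it received."""
    filters = dict(attrs or {})
    filters.update(kwargs)
    return filters

# data-lat=True is parsed as (data - lat) = True, which is not a valid keyword.
bad_call = 'find_all("div", data-lat=True)'
try:
    compile(bad_call, "<example>", "eval")
    error = None
except SyntaxError as exc:
    error = exc

# The same attribute name is fine as a dictionary key.
filters = find_all("div", attrs={"data-lat": True, "data-lng": True})

print(type(error).__name__)
print(filters)
```

The trailing underscore in data-lat_ from the original script does not help, because the hyphen is still there; the class_ spelling works only because class_ is a valid Python identifier.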