How to extract latitude and longitude from a web page
I extracted some information from https://www.peakbagger.com/peak.aspx?pid=10882, which is linked under the first column of https://www.peakbagger.com/list.aspx?lid=5651:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.peakbagger.com/peak.aspx?pid=10882'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
a = soup.select("td")  # all <td> cells on the page
a
From the output I get, I only want to retrieve the latitude and longitude, i.e. 35.360638, 138.727347, which appear in

E<br/>35.360638, 138.727347

Also, rather than retrieving the links on https://www.peakbagger.com/list.aspx?lid=5651 one by one, is there a better way to retrieve all of the latitudes and longitudes?

Thanks
This answer is very specific to your question. The problem here is that the br tag appears inside the td tag. The etree module (part of lxml) lets you access the text that follows a tag (a.k.a. its tail). This code prints the values you showed as your desired output.
import requests
from lxml import etree

with requests.Session() as session:
    r = session.get('https://www.peakbagger.com/peak.aspx?pid=10882')
    r.raise_for_status()
    tree = etree.HTML(r.text)
    # the coordinates are the text after the second <br/> in the first
    # gray table, i.e. that br element's tail
    print(' '.join(tree.xpath('//table[@class="gray"][1]/*//br')[1].tail.split()[:2]))
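To see what tail means in isolation, here is a minimal sketch; the HTML snippet is a made-up simplification of the cell shown in the question:

from lxml import etree

# Simplified stand-in for the page's coordinate cell: the text after <br/>
# belongs to no element, so lxml exposes it as the br element's .tail
snippet = '<table><tr><td>E<br/>35.360638, 138.727347</td></tr></table>'
td = etree.HTML(snippet).find('.//td')
print(td.find('br').tail)  # 35.360638, 138.727347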
As Brutus mentioned, his answer is very specific; if you do not want to use etree, this could be an alternative:

- find() the <td> with the string Latitude/Longitude (WGS84)
- findNext() the next <td> and get its contents
- replace the , and split it on whitespace
- slicing the result to its first two elements gives you a list containing the latitude and longitude
data = soup.find('td', string='Latitude/Longitude (WGS84)')\
.findNext('td')\
.contents[2]\
.replace(',','')\
.split()[:2]
data
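If you need the values as numbers rather than strings, a small follow-up sketch (assuming data holds the two strings extracted above):

lat, lon = map(float, data)  # ['35.360638', '138.727347'] -> floats
print(lat, lon)  # 35.360638 138.727347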
Edit

If you have a list of urls, loop over it. To be considerate to the website and avoid getting banned, add some delay between pages (time.sleep()).
import time
import requests
from bs4 import BeautifulSoup
urls = ['https://www.peakbagger.com/peak.aspx?pid=10882',
'https://www.peakbagger.com/peak.aspx?pid=10866',
'https://www.peakbagger.com/peak.aspx?pid=10840',
'https://www.peakbagger.com/peak.aspx?pid=10868',
'https://www.peakbagger.com/peak.aspx?pid=10832']
data = {}
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    ll = soup.find('td', string='Latitude/Longitude (WGS84)')\
             .findNext('td')\
             .contents[2]\
             .replace(',', '')\
             .split()[:2]
    # key the result by the peak name from the page's <h1>
    data[soup.select_one('h1').get_text()] = {
        'url': url,
        'lat': ll[0],
        'long': ll[1]
    }
    time.sleep(3)
data
Output
{'Fuji-san, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10882',
'lat': '35.360638',
'long': '138.727347'},
'Kita-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10866',
'lat': '35.674537',
'long': '138.238833'},
'Hotaka-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10840',
'lat': '36.289203',
'long': '137.647986'},
'Aino-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10868',
'lat': '35.646037',
'long': '138.228292'},
'Yariga-take, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10832',
'lat': '36.34198',
'long': '137.647625'}}
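Since the question already imports pandas, here is a minimal sketch (assuming the data dict built above) turning the result into a DataFrame:

import pandas as pd

# one row per peak, indexed by peak name, with url/lat/long columns
df = pd.DataFrame.from_dict(data, orient='index')
df[['lat', 'long']] = df[['lat', 'long']].astype(float)  # strings -> numbers
print(df)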