Get url of link using Python web scraping; requests, requests_html, selenium

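I'm new to web scraping, and I'm having trouble getting a link to data on the USGS earthquake "Did You Feel It?" page. The URL I'm trying to get the data from is: https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity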

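I'm trying to automate collecting this data so that I don't have to gather it by hand after every earthquake. The URL of the data I want to pull is consistent except for the earthquake ID, which I have, and a number that doesn't seem to be related to anything, so I figured I could get the URL by web scraping.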
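To make the "consistent except for two pieces" observation concrete, here is a minimal sketch that splits the archive address quoted at the end of this question into its parts. The helper name `split_product_url` and the assumed path layout are illustrative, not anything USGS documents here. The last two lines only test a guess, namely that the unexplained number might be a millisecond Unix timestamp; nothing in the question confirms that.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def split_product_url(url):
    """Split a DYFI archive URL into its path components (hypothetical helper)."""
    parts = urlparse(url).path.strip('/').split('/')
    # Assumed layout: archive/product/<type>/<event_id>/<source>/<number>/<filename>
    if len(parts) != 7 or parts[:2] != ['archive', 'product']:
        raise ValueError('unexpected URL layout: ' + url)
    keys = ('product_type', 'event_id', 'source', 'number', 'filename')
    return dict(zip(keys, parts[2:]))


example = ('https://earthquake.usgs.gov/archive/product/dyfi/'
           'us7000biji/us/1601214674370/dyfi_geo_10km.geojson')
info = split_product_url(example)
print(info)

# Assumption only: interpret the number as milliseconds since the Unix epoch.
guess = datetime.fromtimestamp(int(info['number']) / 1000, tz=timezone.utc)
print(guess.isoformat())
```

Even if the guess holds, the number cannot be predicted from the earthquake ID alone, so the URL still has to be looked up for each event.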

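If you look at the page, there is a dropdown menu called Downloads that lists different data products. I'm trying to get the URL for the DYFI Geospatial Data, UTM aggregated (10 km spacing), so that I can use curl to pull the GeoJSON file.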

I don't know much about web scraping or HTML; most of what I've tried is based on things I found here and on YouTube.

What I've tried:

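I tried using requests to get the HTML and parsing it with Beautiful Soup, but the page is generated dynamically, so the HTML that comes back doesn't contain what I'm looking for.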

import requests
import bs4 #beautiful soup

res = requests.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(link)

This outputs three links, but not the one I need:

<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and Web Services</a>
<a href="https://angular.io/guide/browser-support">view supported
            browsers</a>
<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and
            Web Services</a>
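A short list like this is what a parser returns when most links are added by scripts after the page loads: it only sees anchors that are already in the raw HTML. The sketch below shows this using only the standard library and a made-up HTML snippet. `STATIC_HTML`, `LinkCollector`, and `collect_links` are hypothetical names, and the markup is not the real USGS page.

```python
from html.parser import HTMLParser

# Made-up stand-in for the HTML a plain HTTP request might receive:
# one static link plus an empty placeholder that scripts would fill in later.
STATIC_HTML = """
<html><body>
  <a href="/earthquakes/feed/">Feeds</a>
  <app-root></app-root>
  <script src="main.js"></script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the parsed text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def collect_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.hrefs


static_links = collect_links(STATIC_HTML)
print(static_links)
```

Nothing inside the placeholder element is visible to the parser, because the script that would populate it never runs.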

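I think the USGS site uses JavaScript to populate the Downloads dropdown menu, which is why the plain requests approach doesn't work, so I thought I'd try Selenium. I was hoping it would give me the HTML I can see with the Inspect Element tool, but I haven't had any luck.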

import bs4  # Beautiful Soup, needed for the parsing step below
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

path = "/Users/jon/Desktop/selenium_webdriver/chromedriver"  # path to chromedriver on my machine
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(path))
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
html_eq = driver.page_source
soup = bs4.BeautifulSoup(html_eq, 'html.parser')
for link in soup.find_all('a'):
    print(link)

This outputs more links than my first attempt, but still not the one I'm looking for. Here is the output of my Selenium attempt:

<a _ngcontent-fgi-c8="" class="hazdev-site-logo" href="/" title="U.S. Geological Survey"><img _ngcontent-fgi-c8="" alt="U.S. Geological Survey logo" src="assets/usgs-logo.svg"/></a>
<a _ngcontent-fgi-c8="" class="hazdev-jumplink-navigation" href="#site-sectionnav">Jump to Navigation</a>
<a _ngcontent-fgi-c5="" class="up-one-level ng-star-inserted" href="/earthquakes/map/" templatesidenavigation=""> Latest Earthquakes </a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/executive" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Overview </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Interactive Map </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/region-info" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Regional Information </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Impact </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/tellus" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Felt Report - Tell Us! </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted active-link" href="/earthquakes/eventpage/us7000bi0e/dyfi" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Did You Feel It? </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/technical" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Technical </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/origin" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Origin </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/waveforms" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Waveforms </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/feed/v1.0/detail/us7000bi0e.kml" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Download Event KML </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/map/#%7B%22autoUpdate%22%3Afalse%2C%22basemap%22%3A%22terrain%22%2C%22event%22%3A%22us7000bi0e%22%2C%22feed%22%3A%22us7000bi0e%22%2C%22mapposition%22%3A%5B%5B6.104279985601153%2C-85.06432001439885%5D%2C%5B10.603920014398849%2C-80.56467998560115%5D%5D%2C%22search%22%3A%7B%22id%22%3A%22us7000bi0e%22%2C%22isSearch%22%3Atrue%2C%22name%22%3A%22Search%20Results%22%2C%22params%22%3A%7B%22endtime%22%3A%222020-09-25T17%3A46%3A43.975Z%22%2C%22latitude%22%3A8.3541%2C%22longitude%22%3A-82.8145%2C%22maxradiuskm%22%3A250%2C%22minmagnitude%22%3A2%2C%22starttime%22%3A%222020-08-14T17%3A46%3A43.975Z%22%7D%7D%7D" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> View Nearby Seismicity </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Earthquakes </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/hazards/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Hazards </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/data/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Data &amp; Products </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/learn/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Learn </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/monitoring/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Monitoring </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/research/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Research </div></a>
<a _ngcontent-fgi-c18="" class="tell-us-link" href="/earthquakes/eventpage/us7000bi0e/tellus" queryparamshandling="preserve"> Felt Report - Tell Us! </a>
<a _ngcontent-fgi-c22=""> View all dyfi products (1 total) </a>
<a _ngcontent-fgi-c20="" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity"> US </a>
<a _ngcontent-fgi-c18="" aria-current="true" aria-disabled="false" class="mat-tab-link ng-star-inserted mat-tab-label-active" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/zip" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> ZIP Map </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity-vs-distance" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity Vs. Distance </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses-vs-time" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Responses Vs. Time </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> DYFI Responses </a>
<a _ngcontent-fgi-c28="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map?dyfi-responses-10km=true&amp;shakemap-intensity=false"><img _ngcontent-fgi-c28="" alt="DYFI intensity map" src="https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/us7000bi0e_ciim_geo.jpg"/></a>
<a _ngcontent-fgi-c23="" href="/earthquakes/eventpage/us7000bi0e">Overview</a>
<a _ngcontent-fgi-c32="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact"> Impact Summary </a>
<a _ngcontent-fgi-c18="" href="https://earthquake.usgs.gov/data/dyfi/">Scientific Background for Did You Feel It?</a>
<a href="https://earthquake.usgs.gov/data/comcat/contributor/us/">USGS National Earthquake Information Center, PDE</a>
<a _ngcontent-fgi-c7="" href="/data/comcat/"> ANSS Comprehensive Earthquake Catalog (ComCat) Documentation </a>
<a _ngcontent-fgi-c7="" href="/data/comcat/data-eventterms.php"> Technical terms used on event pages </a>
<a _ngcontent-fgi-c11="" href="mailto:lisa%2Behpweb@usgs.gov">Questions or comments?</a>
<a _ngcontent-fgi-c11="" class="facebook ng-star-inserted" href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Facebook">Facebook</a>
<a _ngcontent-fgi-c11="" class="twitter ng-star-inserted" href="https://twitter.com/intent/tweet?url=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity&amp;text=USGS%20%7C%20M 5.3 - 1 km NNW of Manaca Norte, Panama" title="Share using Twitter">Twitter</a>
<a _ngcontent-fgi-c11="" class="email ng-star-inserted" href="mailto:lisa%2Behpweb@usgs.gov?to=&amp;subject=M 5.3 - 1 km NNW of Manaca Norte, Panama&amp;body=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Email">Email</a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/"> Home </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/aboutus/"> About Us </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/contactus/"> Contacts </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/legal.php"> Legal </a>

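I found a YouTube tutorial on web scraping with requests_html that I thought might work: https://www.youtube.com/watch?v=MeBU-4Xs2RU I can get the example he gives in the video to work with the beer website, but I haven't been able to apply it to my situation.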

Here is the code I tried:

from requests_html import HTMLSession

url_usgs = 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity'

s = HTMLSession()  # the session has to exist before s.get() is called
r_usgs = s.get(url_usgs)

r_usgs.html.render(sleep=1)

downloads = r_usgs.html.xpath('//*[@id="mat-expansion-panel-header-0"]', first=True)
print(downloads.absolute_links)

This didn't return anything, though. I don't know HTML, so I may have picked the XPath of the wrong element.

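If anyone has ideas on how I can get the URL of the 10 km DYFI data from the Downloads menu (https://earthquake.usgs.gov/archive/product/dyfi/us7000biji/us/1601214674370/dyfi_geo_10km.geojson), or can point me toward more in-depth material on web scraping, I would greatly appreciate it.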

You need to click on the "Downloads" menu to expand its content.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time


driver = webdriver.Chrome()
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')

# Get a reference to the download menu. This will run before the page has
# finished loading, so we stick it in a while loop and just keep looping
# until we're successful.
# (Selenium 4 locator style; the older find_element_by_* helpers are gone
# from recent releases.)
while True:
    try:
        download_menu = driver.find_element(By.ID, 'mat-expansion-panel-header-0')
    except NoSuchElementException:
        time.sleep(0.2)
        continue
    else:
        break

# Click on the download menu to expand the content.
download_menu.click()

# Wait the same way for the expanded panel that holds the download links.
while True:
    try:
        downloads = driver.find_element(By.ID, 'cdk-accordion-child-0')
    except NoSuchElementException:
        time.sleep(0.2)
        continue
    else:
        break

# Keep only the links whose visible text mentions GeoJSON.
links = downloads.find_elements(By.CSS_SELECTOR, 'a')
geojson = [link for link in links if 'geojson' in link.text.lower()]

for link in geojson:
    print(link.text, ':', link.get_attribute('href'))


# End the browser session and the driver process.
driver.quit()
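One caveat about the retry loops in the script: they never give up, so the script hangs if an element ID changes. A minimal sketch of the same polling idea with a timeout follows. `wait_for` is a hypothetical helper written for this example, not part of Selenium, and the flaky lookup only simulates an element that appears late. Selenium also ships its own explicit-wait support (`WebDriverWait`), which may be the better fit in real use.

```python
import time


def wait_for(fetch, retry_on=(Exception,), timeout=10.0, interval=0.2):
    """Call fetch() until it stops raising retry_on, or re-raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fetch()
        except retry_on:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last error
            time.sleep(interval)


# Simulated lookup that fails twice before succeeding.
attempts = {'count': 0}


def flaky_lookup():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise LookupError('not there yet')
    return 'element'


result = wait_for(flaky_lookup, retry_on=(LookupError,), timeout=5.0, interval=0.01)
print(result, 'after', attempts['count'], 'attempts')

# With Selenium this might be used as (not run here):
# menu = wait_for(lambda: driver.find_element(By.ID, 'mat-expansion-panel-header-0'),
#                 retry_on=(NoSuchElementException,))
```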

Running the Selenium script above produces:

GEOJSON 645.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_zip.geojson
GEOJSON 844.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_1km.geojson
GEOJSON 1.0 KB : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_10km.geojson

From there you can check the value of each href attribute to find the 10 km data, by looking for the link that contains 10km.
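As a rough sketch of that last step, the filter below works on plain href strings, using the three addresses printed above as sample input. `pick_10km` is a hypothetical helper name; it checks only the file name at the end of each URL, so an unrelated "10km" elsewhere in the path would not match.

```python
def pick_10km(hrefs):
    """Return the hrefs whose file name contains '10km' (hypothetical helper)."""
    return [href for href in hrefs if '10km' in href.rsplit('/', 1)[-1]]


base = 'https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/'
hrefs = [
    base + 'dyfi_zip.geojson',
    base + 'dyfi_geo_1km.geojson',
    base + 'dyfi_geo_10km.geojson',
]

matches = pick_10km(hrefs)
print(matches)
```

In the Selenium script the input list would come from `link.get_attribute('href')` for each link, and the single match is the address to hand to curl.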