Trouble Scraping site with BS4

Normally I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table needed for the research project I'm working on. I plan to verify that the script works on one state before moving on to the URL for my target state.

import requests
import bs4 as bs

url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# I'm printing "table" just to ensure that the information I'm looking for is within this section

I'm not sure whether the site is trying to block people from scraping, but if you look at what printing table outputs, all the information I'm trying to get is inside empty quotes ("").
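One way to see what is going on (a minimal sketch using only the Python standard library, with a made-up HTML fragment standing in for the real response) is to look for Angular `data-ng-bind`/`data-ng-bind-html` attributes on elements that are empty in the raw HTML — that is the telltale sign the content is filled in client-side by JavaScript:

```python
from html.parser import HTMLParser

class NgBindFinder(HTMLParser):
    """Collects tags that carry Angular binding attributes."""
    def __init__(self):
        super().__init__()
        self.bindings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("data-ng-bind", "data-ng-bind-html", "ng-bind"):
                self.bindings.append((tag, name, value))

# Hypothetical fragment mimicking what requests.get() returns for this page:
raw_html = ('<div data-ng-controller="DetailsPageCtrl">'
            '<span data-ng-bind="program.name"></span></div>')

finder = NgBindFinder()
finder.feed(raw_html)
print(finder.bindings)  # → [('span', 'data-ng-bind', 'program.name')]
# The span is empty in the static HTML; Angular fills it in at runtime.
```

On the real page, `requests.get(url).text` would play the role of `raw_html` here.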

The text is rendered with JavaScript. First render the page with dryscrape (if you don't want to use dryscrape, see Web-scraping JavaScript page with Python). The text can then be extracted after rendering, from a different position on the page than before (i.e. from the position where the rendered text ends up).

For example, this code will extract the HTML of the summary:

import bs4 as bs
import dryscrape

url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0]) 

Output:

<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...

So I finally managed to solve the problem and successfully get the data from the JavaScript page. In case anyone runs into the same problem trying to scrape a JavaScript web page with Python on Windows (dryscrape is not Windows-compatible), the code below worked for me.

import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
url = "http://programs.dsireusa.org/system/program/detail/284"
browser.get(url)
html_source = browser.page_source  # HTML after JavaScript has run
browser.quit()

soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    data.append(n.text)
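Since the `ng-binding` divs often come back with stray whitespace and blank strings, a small cleanup pass can help before using the results (a sketch; `clean_fields` and the sample values are hypothetical, not part of the original answer):

```python
def clean_fields(fields):
    """Strip surrounding whitespace and drop empty strings from scraped text."""
    return [f.strip() for f in fields if f and f.strip()]

# Hypothetical raw values like those collected into `data` above:
sample = ["  Net Metering \n", "", "Program Type:", "   "]
print(clean_fields(sample))  # → ['Net Metering', 'Program Type:']
```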