Trouble Scraping site with BS4

Normally I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table needed for the research project I'm working on. I plan to verify that the script works on one state before moving on to the URL for my target state.

import requests
import bs4 as bs

url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# I'm printing "table" just to ensure that the information I'm looking for is within this section

I'm not sure whether the site is trying to block people from scraping, but if you look at what printing table outputs, all the information I'm trying to get is inside empty quotes ("").
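One way to see what is going on (a minimal sketch using only the Python standard library, with a made-up HTML fragment standing in for the real response) is to look for Angular `data-ng-bind`/`data-ng-bind-html` attributes on elements that are empty in the raw HTML — that is the telltale sign the content is filled in client-side by JavaScript:

```python
from html.parser import HTMLParser

class NgBindFinder(HTMLParser):
    """Collects tags that carry Angular binding attributes."""
    def __init__(self):
        super().__init__()
        self.bindings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("data-ng-bind", "data-ng-bind-html", "ng-bind"):
                self.bindings.append((tag, name, value))

# Hypothetical fragment mimicking what requests.get() returns for this page:
raw_html = ('<div data-ng-controller="DetailsPageCtrl">'
            '<span data-ng-bind="program.name"></span></div>')

finder = NgBindFinder()
finder.feed(raw_html)
print(finder.bindings)  # → [('span', 'data-ng-bind', 'program.name')]
# The span is empty in the static HTML; Angular fills it in at runtime.
```

On the real page, `requests.get(url).text` would play the role of `raw_html` here.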

The text is rendered with JavaScript. First render the page with dryscrape (if you don't want to use dryscrape, see Web-scraping JavaScript page with Python). The text can then be extracted after rendering, from a different position on the page than before (i.e. from the position where the rendered text ends up).

For example, this code will extract the HTML of the summary:

import bs4 as bs
import dryscrape

url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0]) 

Output:

<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...

So I finally managed to solve the problem and successfully get the data from the JavaScript page. In case anyone runs into the same problem trying to scrape a JavaScript web page with Python on Windows (dryscrape is not Windows-compatible), the code below worked for me.

import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
url = "http://programs.dsireusa.org/system/program/detail/284"
browser.get(url)
html_source = browser.page_source  # HTML after JavaScript has run
browser.quit()

soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    data.append(n.text)
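Since the `ng-binding` divs often come back with stray whitespace and blank strings, a small cleanup pass can help before using the results (a sketch; `clean_fields` and the sample values are hypothetical, not part of the original answer):

```python
def clean_fields(fields):
    """Strip surrounding whitespace and drop empty strings from scraped text."""
    return [f.strip() for f in fields if f and f.strip()]

# Hypothetical raw values like those collected into `data` above:
sample = ["  Net Metering \n", "", "Program Type:", "   "]
print(clean_fields(sample))  # → ['Net Metering', 'Program Type:']
```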