dryscrape and BeautifulSoup to get all rows in a js rendered iframe

Question

I am trying to scrape the table at http://apps2.eere.energy.gov/wind/windexchange/economics_tools.asp

The table shows 5 entries by default. I am using dryscrape and BeautifulSoup as follows:

import dryscrape
from bs4 import BeautifulSoup
myurl = 'http://apps2.eere.energy.gov/wind/windexchange/economics_tools.asp'
session = dryscrape.Session()
session.visit(myurl)
response = session.body()
soup = BeautifulSoup(response,'lxml')
table = soup.find_all("td")

But this only returns the table's default 5 entries. How can I get all the rows in this table?

Thank you very much!

Answer 1

For this particular page you don't need dryscrape. The whole table you want is already in the HTML source, so you can do this:

from bs4 import BeautifulSoup
import requests

myurl = 'http://apps2.eere.energy.gov/wind/windexchange/economics_tools.asp'
soup = BeautifulSoup(requests.get(myurl).text,'lxml')
table = soup.find_all("td")
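
Note that find_all("td") gives you one flat list of cells. If you want the cells grouped by row, something along these lines might work. This is only a sketch: the table_rows helper and the sample HTML are made up for illustration, and it uses the built-in "html.parser" backend so it runs without lxml.

```python
from bs4 import BeautifulSoup


def table_rows(html):
    """Return the text of each table row as a list of cell strings.

    Hypothetical helper for illustration; it does not distinguish between
    several tables on a page and would flatten nested tables.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Collect header and data cells in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


# Made-up sample markup, not the real page's content.
sample = """<table>
<tr><th>Name</th><th>Type</th></tr>
<tr><td>Tool A</td><td>Spreadsheet</td></tr>
<tr><td>Tool B</td><td>Web app</td></tr>
</table>"""

for row in table_rows(sample):
    print(row)
```

For the real page you would pass requests.get(myurl).text instead of the sample string.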

Or, with your current setup:

table = session.xpath('//td')

will give you the nodes for the td tags in the dryscrape session. In that case you don't need BeautifulSoup at all.

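session.body() gives you the HTML currently loaded in the DOM, which JavaScript is acting on and changing. That is why you only see 5 entries there. So you could write a for loop that clicks the next button in the session on each pass and feeds the body into BeautifulSoup after each iteration, but that seems unnecessary to me.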
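
If you did want to go the click-through route, a rough sketch could look like the one below. The collect_page_bodies helper, the next_xpath argument and the max_pages cap are all my own assumptions, not something taken from the page. It relies only on session.body(), session.at_xpath() and node.click(), and it assumes at_xpath() returns None when nothing matches and that the DOM has been updated by the time click() returns.

```python
def collect_page_bodies(session, next_xpath, max_pages=50):
    """Collect the HTML body of each paginated view of a session.

    Hypothetical helper: `session` is expected to behave like a
    dryscrape.Session, and `next_xpath` must match the page's "next"
    control (the right expression for the real page is not known here).
    """
    bodies = []
    for _ in range(max_pages):
        # Snapshot the DOM as it is currently rendered.
        bodies.append(session.body())
        next_button = session.at_xpath(next_xpath)
        if next_button is None:
            # No "next" control found, so assume this was the last view.
            break
        next_button.click()
    return bodies
```

Each returned body can then be passed to BeautifulSoup as before. If the next button stays in the DOM on the last page (for example as a disabled link), the loop only stops at max_pages, so you would need a stricter XPath or an extra check.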
