Get full HTML for page with dynamic expanded containers with python

Question

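I am trying to pull the full HTML from ratemyprofessors.com, but at the bottom of the page there is a "Load More Ratings" button that lets you see more comments.

I am using requests.get(url) and BeautifulSoup, but that only gives the first 20 comments. Is there a way to have the page load all the comments before it returns?

Here is what I am currently doing. It gives the first 20 comments, but not all of them.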

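    import requests
    from bs4 import BeautifulSoup

    # url is the address of the professor's ratings page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    comments = []
    # Each review body sits in a div with this generated class name
    for j in soup.find_all('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'}):
        comments.append(j.text)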

Answer 1

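BeautifulSoup is an HTML parser for static pages, not a renderer for dynamic web apps. It only sees the HTML that requests.get() returns, and that does not include anything the page loads later through JavaScript.

You could achieve what you want with a headless browser driven by Selenium: render the full page, then click the "Load More Ratings" button repeatedly until there is nothing more to load.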

Example: Clicking on a link via selenium
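A minimal sketch of the click-until-gone loop, in case it helps. The function names, the XPath for the button, the fixed pause between clicks and the cap on clicks are all my own assumptions, not something taken from the site, so inspect the page and adjust them. It also assumes Chrome and Selenium 4 are installed.

```python
import time


def click_until_gone(find_button, max_clicks=200, pause=1.0):
    """Click whatever find_button() returns until it returns None.

    find_button is a hypothetical callable supplied by the caller.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break  # nothing left to load
        button.click()
        clicks += 1
        time.sleep(pause)  # crude fixed wait for the next batch to appear
    return clicks


def fetch_full_html(url, button_xpath="//button[contains(., 'Load More Ratings')]"):
    """Open url in headless Chrome, expand all ratings, return the HTML.

    The default XPath is an assumption about how the button is marked up.
    """
    # Imported here so click_until_gone stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        def find_button():
            # find_elements returns an empty list when nothing matches
            buttons = driver.find_elements(By.XPATH, button_xpath)
            return buttons[0] if buttons else None

        click_until_gone(find_button)
        return driver.page_source
    finally:
        driver.quit()
```

The string returned by fetch_full_html(url) can be passed to BeautifulSoup in place of response.text, so the rest of your parsing code stays the same. If a cookie banner or overlay covers the button, click() can raise an exception, and you would need to dismiss the overlay first.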

Since you are already using Requests, another option that might work is Requests-HTML, which also supports dynamic rendering by calling .html.render() on the response object.

Example: https://requests-html.kennethreitz.org/index.html#requests_html.HTML.render
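Roughly, the Requests-HTML route could look like the sketch below. The function name and the optional session parameter are my own additions for illustration. HTMLSession, .html.render() and .html.html come from the library. Note that render() only executes the page's JavaScript; I have not verified that this alone expands the ratings list, so you may still need to trigger the button yourself.

```python
def fetch_rendered_html(url, session=None, **render_kwargs):
    """Fetch url and return its HTML after JavaScript rendering.

    session is a hypothetical hook for passing in an existing HTMLSession;
    render_kwargs are forwarded to .html.render(), for example sleep=2.
    """
    if session is None:
        # Third-party package: pip install requests-html
        from requests_html import HTMLSession
        session = HTMLSession()

    response = session.get(url)
    # The first call to render() downloads a Chromium build.
    response.html.render(**render_kwargs)
    # After rendering, .html.html holds the updated markup.
    return response.html.html
```

As with the Selenium variant, the returned string can go straight into BeautifulSoup(html, "html.parser").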

Reference: Clicking link using beautifulsoup in python


html

beautifulsoup

expandable

web-scraping

python-3.x