urllib.request + BeautifulSoup cannot scrape a specific page and scrapes the root page instead

Question

I am trying to scrape information from the url http://csgo-stats.com/epsilon-/ but, due to the way the website handles things, BeautifulSoup is only collecting data from the root page, aka http://csgo-stats.com

Is there a way to redirect BS? In the HTML that BS outputs I can see that the page is trying to load my data, but BS captures it too quickly:

<main class="site-content" id="content">
        <div class="loading-spinner" data-request="epsilon-" id="load">
            Loading
        </div>

Here is the code I am using, in case it is needed:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://csgo-stats.com/Epsilon-/"
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(urlopen(url), "html.parser")
print(soup.prettify())

Answer 1

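The problem is that urllib.request does not execute JavaScript, so you only receive the initial HTML, before the page's scripts have loaded any data. Try visiting the page in a browser with JavaScript disabled and you will see the same thing. For more on scraping JavaScript-driven pages, see: Web-scraping JavaScript page with Python
If the site provides an API, it is better to use that and avoid scraping.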
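
To make the point concrete, here is a minimal sketch that checks whether a piece of HTML still contains the "Loading" placeholder from the question, which is a sign that the data would only arrive after JavaScript runs. It uses only the standard library so it can run offline. The names SpinnerFinder, pending_requests and STATIC_HTML are made up for this illustration, and the closing main tag was added to the quoted fragment just to keep it tidy.

```python
from html.parser import HTMLParser


class SpinnerFinder(HTMLParser):
    """Collects the data-request value of every 'loading-spinner' element."""

    def __init__(self):
        super().__init__()
        self.requests = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "loading-spinner" in classes:
            self.requests.append(attr_map.get("data-request"))


def pending_requests(html):
    """Return the data-request values of placeholders still present in html."""
    finder = SpinnerFinder()
    finder.feed(html)
    return finder.requests


# The fragment quoted in the question, i.e. what urlopen() hands back.
STATIC_HTML = """
<main class="site-content" id="content">
    <div class="loading-spinner" data-request="epsilon-" id="load">
        Loading
    </div>
</main>
"""

print(pending_requests(STATIC_HTML))  # prints ['epsilon-']
```

A non-empty result means the response is still the unrendered shell, so no amount of parsing with BeautifulSoup will find the stats in it.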

Answer 2

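Most HTTP and content libraries (Beautiful Soup, requests...) will give you the page source, but that is not what the page looks like once a browser has rendered it. This comes down to how pages are built today: much of the rendering happens later, after all the JavaScript on the page has done its work. That is exactly why you do not see the 'final' content.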

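Now, if you want to collect the content the way the browser renders it, after all the JavaScript has run, you need a different kind of (Python) library, and that library is Selenium.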

More about Selenium: http://www.seleniumhq.org/

Just a warning: Selenium is a fairly big beast with plenty of hairy corners, but it is worth learning (and not only for scraping).
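
As a rough sketch of what that could look like for the page in the question: load the URL in a real browser, poll until the placeholder is gone, then read the rendered HTML. This assumes the element with id="load" is removed once the data arrives, which is a guess based on the fragment in the question and not something confirmed about the site. The function names, the 15 second timeout and the choice of Firefox are also just illustrative. Selenium ships its own WebDriverWait helper that could replace the hand-written loop.

```python
import time


def wait_for_rendered_html(driver, spinner_id="load", timeout=15.0, poll=0.5):
    """Poll until no element with spinner_id remains, then return the page HTML.

    Assumption: the site removes the spinner element when loading finishes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # "id" is the locator strategy string that selenium's By.ID stands for.
        if not driver.find_elements("id", spinner_id):
            return driver.page_source
        time.sleep(poll)
    raise TimeoutError(f"element #{spinner_id} still present after {timeout}s")


def scrape(url):
    """Open url in Firefox and return the HTML after JavaScript has run."""
    # Imported here so the helper above stays usable without selenium installed.
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()  # assumes Firefox and its driver are available
    try:
        driver.get(url)
        return wait_for_rendered_html(driver)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The string returned by scrape("http://csgo-stats.com/Epsilon-/") could then be passed to BeautifulSoup(html, "html.parser") in place of the urlopen() response.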

python

urllib

beautifulsoup

web-scraping

web