How do I scrape websites which don't return the source code using Python?

Question

I am trying to scrape the 'ASX code' for announcements made by companies on the Australian Securities Exchange from the following website: http://www.asx.com.au/asx/statistics/todayAnns.do

So far I have tried using BeautifulSoup with the following code:

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
parser = BeautifulSoup(response.content, 'html.parser')
print(parser)

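However, when I print it, the output is not the same as what I see when I open the page manually and view the page source. I did some googling, looked on Stack Overflow, and believe this is because JavaScript running on the page builds the content, so the HTML is not in the initial response.

However, I'm not sure how to get around this. Any help would be greatly appreciated.

Thanks in advance.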

Answer 1

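Try this. All you need to do is make the scraper wait a moment until the page has finished loading, because, as you may have noticed, the content is loaded dynamically. Upon execution you will get the left-hand header column of the table on that page.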

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
time.sleep(8)  # give the page's JavaScript time to load the announcements

# Parse the rendered DOM rather than the raw HTTP response
soup = BeautifulSoup(driver.page_source, "lxml")
for item in soup.select('.row'):
    print(item.text)
driver.quit()

Partial results:

RLC
RNE
PFM
PDF
HXG
NCZ
NCZ

By the way, I wrote and ran this code using Python 3.5, so there is no issue with using the Selenium bindings on a recent version of Python.
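
A fixed time.sleep(8) either wastes time or is too short on a slow connection. As a variation on the approach above, here is a minimal sketch that uses Selenium's explicit wait instead. The function names fetch_rendered_html and extract_row_text are hypothetical, the 15-second timeout is an arbitrary assumption, and the '.row' selector is simply the one used above, so it may need adjusting if the page markup differs.

```python
from bs4 import BeautifulSoup


def extract_row_text(html, selector=".row"):
    """Return the stripped text of every element matching `selector`."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]


def fetch_rendered_html(url, selector=".row", timeout=15):
    """Load `url` in Chrome and return the page source once `selector` appears."""
    # Selenium is imported here so extract_row_text() can be used
    # (and tested) on saved HTML without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Poll until at least one matching element is in the DOM;
        # raises TimeoutException if none shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return driver.page_source
    finally:
        # Close the browser even if the wait times out.
        driver.quit()
```

Calling extract_row_text(fetch_rendered_html('http://www.asx.com.au/asx/statistics/todayAnns.do')) should then give the same kind of list as the loop above, returning as soon as the first matching element is present rather than after a fixed delay.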

python

selenium

beautifulsoup

web-scraping

dryscrape