Web scraping dynamic content with Splinter module

Question

I am trying to scrape a table that is loaded dynamically via JavaScript (from steamcommunity). I am using a combination of Python, Splinter, and the headless browser PhantomJS.

Here is what I have come up with so far:

from splinter import Browser
import time
import sys

browser = Browser('phantomjs')

url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'   

browser.visit(url)
print(browser.is_element_present_by_xpath('//*[@id="market_commodity_buyreqeusts_table"]', wait_time=5))
price_table = browser.find_by_xpath('//*[@id="market_commodity_buyreqeusts_table"]/table/tbody/tr')

print(price_table)
print(price_table.first)
print(price_table.first.text)
print(price_table.first.value)
browser.quit()

The first method, is_element_present_by_xpath(), makes sure that the table I am interested in has been loaded. Then I try to access the rows of that table.
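To illustrate the idea of waiting for dynamically loaded content, here is a minimal sketch of a generic polling loop. It is not Splinter's implementation of wait_time; the names wait_until and table_is_present are made up for this example, and the "page" is simulated with a counter so the snippet runs without a browser (Python 3 assumed).

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated dynamic content: the "table" only shows up on the third check.
checks = {"count": 0}


def table_is_present():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_until(table_is_present, timeout=2.0, interval=0.01)
print(found)
```

With a real browser, the predicate would be a check against the live page instead of the counter used here.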

As I understand from the Splinter docs, the .find_by_xpath() method returns an ElementList, which is essentially a normal list with some aliases provided.
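As a rough illustration of what "a normal list with some aliases" can look like, here is a small sketch. AliasList is a hypothetical class written for this example, not Splinter's ElementList, and the strings stand in for real page elements.

```python
class AliasList(list):
    """Hypothetical stand-in: a plain list plus a few convenience aliases."""

    @property
    def first(self):
        # Alias for the element at index 0.
        return self[0]

    @property
    def last(self):
        # Alias for the element at index -1.
        return self[-1]

    def is_empty(self):
        return len(self) == 0


rows = AliasList(["row 1", "row 2", "row 3"])

print(rows.first)
print(rows.last)

# Because it is still a list, ordinary iteration and indexing keep working.
for row in rows:
    print(row)
```

The point of the sketch is only that such an object can be indexed and iterated like any other list, while the aliases save a little typing.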

price_table is an ElementList of all the rows of the table. The last two prints give empty results, and I cannot find any reason why .text returns an empty string.

How can I access the elements of the table?

Answer 1

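Have you tried for i in price_table? The source code states that ElementList extends the Python list, so I believe you can iterate over price_table.

EDIT: This is also the first time I have heard of splinter, and it looks like it is just an abstraction over the selenium Python package. If you get stuck, you could have a look at the selenium docs. They are better written.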

from splinter import Browser
import time
import sys

browser = Browser('phantomjs')

url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'   

browser.visit(url)
print(browser.is_element_present_by_xpath('//*[@id="market_commodity_buyreqeusts_table"]', wait_time=5))
price_table = browser.find_by_xpath('//*[@id="market_commodity_buyreqeusts_table"]/table/tbody/tr')

for i in price_table:
    print(i)
    print(i.text)

browser.quit()

Answer 2

I tried the code with different browsers and always got empty text, but I did find the expected data in the html. Maybe it is just a bug in splinter.

from splinter import Browser

#browser = Browser('firefox')
#browser = Browser('phantomjs')

#browser = Browser('chrome') # executable_path='/usr/bin/chromium-browser' ??? error !!!
browser = Browser('chrome') # executable_path='/usr/bin/chromedriver' OK

url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'   

browser.visit(url)

print(browser.is_element_present_by_xpath('//*[@id="market_commodity_buyreqeusts_table"]', wait_time = 5))

price_table = browser.find_by_xpath('//*[@id="market_commodity_buyreqeusts_table"]/table/tbody/tr')

for row in price_table:
    print('row html:', row.html)
    print('row text:', row.text) # empty ???
    for col in row.find_by_tag('td'):
        print('  col html:', col.html)
        print('  col text:', col.text) # empty ???

browser.quit()
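Since the data does show up in the html, one possible workaround is to parse that HTML yourself instead of relying on .text. Below is a minimal sketch using only the standard library (Python 3 assumed). CellTextParser and cells_from_html are names invented for this example, and sample_row_html is made-up markup standing in for whatever row.html returns on the real page.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text content of every <td> or <th> cell in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self.cells.append("".join(self._buffer).strip())
            self._in_cell = False


def cells_from_html(fragment):
    """Return the list of cell texts found in `fragment`."""
    parser = CellTextParser()
    parser.feed(fragment)
    parser.close()
    return parser.cells


# Made-up sample; in the loop above this would be row.html instead.
sample_row_html = '<td align="right">$100.00</td><td align="right">3</td>'
print(cells_from_html(sample_row_html))
```

In the loop above, calling cells_from_html(row.html) for each row would be the equivalent step; whether the real markup matches the sample is an assumption that needs checking against the page.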

Tags: python, web-scraping, phantomjs, splinter