How to extract JS-rendered HTML using Selenium-webdriver and nokogiri?

Question

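Consider two web pages, one and two. The second site is easy to scrape with nokogiri because it does not use JS. The first site, however, cannot be scraped with nokogiri alone. After a lot of searching, I found that if I load the page in an automated web browser, I can scrape the rendered HTML. This is the code I have so far: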

require 'selenium-webdriver'

# creates a browser instance
driver = Selenium::WebDriver.for :chrome

# opens an existing webpage
driver.get 'http://www.bigstub.com/search.aspx' 

# wait is used to let the webpage load up and let the JS render
wait = Selenium::WebDriver::Wait.new(:timeout => 5)

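My problem is that I want to stop waiting as soon as the class I need has appeared on the page. For example, if I raise the timeout to 10 seconds, how would I write code that waits only until the class .title-holder is found?

Pseudocode: rendered_source_page should stop waiting as soon as .include?("title-holder") is true. I just don't know how to write it.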

Update: Regarding the headless question, Selenium has an option that runs the browser in headless mode. It is set with the following code:

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for :chrome, options: options

As for my next question, to give the site time to fully render its JS-generated HTML before scraping, I set the timeout to 5 seconds:

wait = Selenium::WebDriver::Wait.new(:timeout => 5)
wait.until { /title-holder/.match(driver.page_source) }

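wait.until basically means: wait up to 5 seconds until the title-holder class shows up in page_source, i.e. the rendered HTML. It returns as soon as the match succeeds, and raises Selenium::WebDriver::Error::TimeoutError if the 5 seconds run out first. This solved pretty much all of my problems.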
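
To make the behaviour of wait.until concrete, here is a minimal pure-Ruby sketch of the same polling idea. poll_until and PollTimeout are hypothetical names invented for illustration; they are not part of selenium-webdriver, and the real Wait class may differ in its details.

```ruby
# Hypothetical error class for this sketch (not a Selenium class).
class PollTimeout < StandardError; end

# Hypothetical helper: calls the block repeatedly until it returns a truthy
# value, and raises PollTimeout if that does not happen within `timeout` seconds.
def poll_until(timeout: 5, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result # stop as soon as the condition holds

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise PollTimeout, "condition not met within #{timeout} seconds"
    end

    sleep interval # wait a little before checking again
  end
end
```

With a real driver, the block would be something like { /title-holder/.match(driver.page_source) }, so the wait ends on the first successful check rather than after the full timeout.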

Answer 1

I assume you are running selenium on a server, so install Xvfb first:

sudo apt-get install xvfb

Install Firefox:

sudo apt-get install firefox

Add the following two gems to your Gemfile. You need headless because you want to run selenium webdriver on a server. Headless will start and stop Xvfb for you.

# Gemfile

gem 'selenium-webdriver'
gem 'headless'

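Scraping code:

  require 'headless'
  require 'selenium-webdriver'

  headless = Headless.new
  headless.start
  driver = Selenium::WebDriver.for :firefox
  driver.navigate.to 'http://example.com' # the URL must be passed as a string
  wait = Selenium::WebDriver::Wait.new(:timeout => 30)
  # scraping code comes here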

Housekeeping, so that you do not run out of memory:

  driver.quit
  headless.destroy
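
If the scraping code raises an exception, the two cleanup calls above are never reached. One way to guard against that is an ensure block, sketched below. with_cleanup is a hypothetical helper name, and the sketch assumes driver and headless are the objects created in the scraping code.

```ruby
# Hypothetical helper: yields the driver to the block and always shuts down
# both the browser and Xvfb afterwards, even when the block raises.
def with_cleanup(driver, headless)
  yield driver
ensure
  begin
    driver.quit        # close the browser
  ensure
    headless.destroy   # stop Xvfb even if quitting the browser failed
  end
end
```

It would be used as with_cleanup(driver, headless) { |d| ... }, with the scraping code inside the block.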

Hope this helps.

Answer 2

Regarding the headless question, Selenium has an option that runs the browser in headless mode. It is set with the following code:

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for :chrome, options: options

As for my next question, to give the site time to fully render its JS-generated HTML before scraping, I set the timeout to 5 seconds:

wait = Selenium::WebDriver::Wait.new(:timeout => 5)
wait.until { /title-holder/.match(driver.page_source) }

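wait.until basically means: wait up to 5 seconds until the title-holder class shows up in page_source, i.e. the rendered HTML. It returns as soon as the match succeeds, and raises Selenium::WebDriver::Error::TimeoutError if the 5 seconds run out first. This solved pretty much all of my problems.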
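
Once the wait succeeds, the rendered HTML can be handed to nokogiri just like a static page. The sketch below shows one possible way to tie the pieces together. rendered_source and scrape_titles are hypothetical names, the .title-holder selector comes from the question, and the code assumes the selenium-webdriver and nokogiri gems plus Chrome and its driver are installed.

```ruby
# Hypothetical helper: waits until at least one element matches the CSS
# selector, then returns the rendered page source.
def rendered_source(driver, wait, selector)
  wait.until { driver.find_elements(css: selector).any? }
  driver.page_source
end

# Hypothetical end-to-end sketch: headless Chrome renders the page,
# nokogiri parses the result.
def scrape_titles(url, selector: '.title-holder', timeout: 10)
  # Loaded here so the helper above can be used without these gems.
  require 'selenium-webdriver'
  require 'nokogiri'

  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  driver = Selenium::WebDriver.for :chrome, options: options
  begin
    driver.get url
    wait = Selenium::WebDriver::Wait.new(timeout: timeout)
    html = rendered_source(driver, wait, selector)
  ensure
    driver.quit # always close the browser
  end

  # From here on it is ordinary nokogiri scraping.
  Nokogiri::HTML(html).css(selector).map { |node| node.text.strip }
end
```

Checking for the element with find_elements rather than matching a regex against page_source avoids false positives when the string title-holder appears only in a script or stylesheet.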

ruby

nokogiri

scraper

web-scraping

selenium-webdriver