Nutch Selenium Interactive plugin ignores the chromedriver configuration

I have configured nutch-site.xml for a local crawl that includes the interactive Selenium plugin.

I only set up the basics, so the configuration is simple (the properties below are from conf/nutch-site.xml).

<property>
  <name>plugin.includes</name>
  <value>protocol-interactiveselenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
  <name>selenium.driver</name>
  <value>chrome</value>
  <description>
    A String value representing the flavour of Selenium
    WebDriver() to use. Currently the following options
    exist - 'firefox', 'chrome', 'safari', 'opera' and 'remote'.
    If 'remote' is used it is essential to also set correct properties for
    'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host',
    'selenium.hub.protocol', 'selenium.grid.driver', 'selenium.grid.binary'
    and 'selenium.enable.headless'.
  </description>
</property>

<property>
  <name>webdriver.chrome.driver</name>
  <value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
  <description>The path to the ChromeDriver binary</description>
</property>

This is from the Nutch log:

2020-08-17 23:40:57,427 ERROR interactiveselenium.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
        at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:153)
        at org.apache.nutch.protocol.interactiveselenium.HttpResponse.readPlainContent(HttpResponse.java:401)
        at org.apache.nutch.protocol.interactiveselenium.HttpResponse.<init>(HttpResponse.java:280)
        at org.apache.nutch.protocol.interactiveselenium.Http.getResponse(Http.java:57)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:383)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:352)
Caused by: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
        at com.google.common.base.Preconditions.checkState(Preconditions.java:585)
        at org.openqa.selenium.remote.service.DriverService.checkExecutable(DriverService.java:146)
        at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:141)
        at org.openqa.selenium.chrome.ChromeDriverService.access$000(ChromeDriverService.java:35)
        at org.openqa.selenium.chrome.ChromeDriverService$Builder.findDefaultExecutable(ChromeDriverService.java:159)
        at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:355)
        at org.openqa.selenium.chrome.ChromeDriverService.createDefaultService(ChromeDriverService.java:94)
        at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:157)
        at org.apache.nutch.protocol.selenium.HttpWebClient.createChromeWebDriver(HttpWebClient.java:182)
        at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:89)
        ... 5 more
2020-08-17 23:40:57,430 INFO  fetcher.FetcherThread - FetcherThread 46 fetch of https://www.amazon.in/ failed with: java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver

Why is it looking in the wrong place?

In fact, it picks up the other settings in nutch-site.xml correctly. As soon as I included protocol-interactiveselenium, it started fetching with Selenium.

Also, earlier it was looking for /root/geckodriver, which is the Firefox driver. As soon as I changed selenium.driver to chrome, it started looking for /root/chromedriver.

So far so good. Then I changed the webdriver.chrome.driver property, but it does not seem to be taken into account.

Looking at the code of HttpWebClient: the property webdriver.chrome.driver is overwritten by the value of selenium.grid.binary. Pointing the latter to your chromedriver should work. Please open an issue at https://issues.apache.org/jira/projects/NUTCH; it is not clear whether this is a bug or a documentation problem, but either way it should be fixed.
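As a minimal sketch of that workaround, a property like the following could be added to conf/nutch-site.xml. The path is simply the one from the question's webdriver.chrome.driver setting, and the description text is illustrative wording rather than text taken from nutch-default.xml. Whether your Nutch version reads selenium.grid.binary this way is an assumption worth checking against HttpWebClient in your checkout.

```xml
<!-- Assumption: HttpWebClient takes the chromedriver path from
     selenium.grid.binary and uses it to set webdriver.chrome.driver,
     as described in the answer above. The path is the one from the question. -->
<property>
  <name>selenium.grid.binary</name>
  <value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
  <description>Path to the driver binary used by the Selenium plugin
  (illustrative description, not the upstream wording).</description>
</property>
```

If the fetcher log then stops reporting /root/chromedriver and reports the configured path instead, the setting is being picked up.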