Nutch Selenium Interactive plugin ignores the chromedriver configuration
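I configured nutch-site.xml for a local crawl that includes the Selenium interactive plugin.
I only set up the basics, so the configuration is simple (the properties below are from conf/nutch-site.xml).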
<property>
<name>plugin.includes</name>
<value>protocol-interactiveselenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
By default Nutch includes plugins to crawl HTML and various other
document formats via HTTP/HTTPS and indexing the crawled content
into Solr. More plugins are available to support more indexing
backends, to fetch ftp:// and file:// URLs, for focused crawling,
and many other use cases.
</description>
</property>
<property>
<name>selenium.driver</name>
<value>chrome</value>
<description>
A String value representing the flavour of Selenium
WebDriver() to use. Currently the following options
exist - 'firefox', 'chrome', 'safari', 'opera' and 'remote'.
If 'remote' is used it is essential to also set correct properties for
'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host',
'selenium.hub.protocol', 'selenium.grid.driver', 'selenium.grid.binary'
and 'selenium.enable.headless'.
</description>
</property>
<property>
<name>webdriver.chrome.driver</name>
<value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
<description>The path to the ChromeDriver binary</description>
</property>
This is from the Nutch log:
2020-08-17 23:40:57,427 ERROR interactiveselenium.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:153)
at org.apache.nutch.protocol.interactiveselenium.HttpResponse.readPlainContent(HttpResponse.java:401)
at org.apache.nutch.protocol.interactiveselenium.HttpResponse.<init>(HttpResponse.java:280)
at org.apache.nutch.protocol.interactiveselenium.Http.getResponse(Http.java:57)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:383)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:352)
Caused by: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
at com.google.common.base.Preconditions.checkState(Preconditions.java:585)
at org.openqa.selenium.remote.service.DriverService.checkExecutable(DriverService.java:146)
at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:141)
at org.openqa.selenium.chrome.ChromeDriverService.access$000(ChromeDriverService.java:35)
at org.openqa.selenium.chrome.ChromeDriverService$Builder.findDefaultExecutable(ChromeDriverService.java:159)
at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:355)
at org.openqa.selenium.chrome.ChromeDriverService.createDefaultService(ChromeDriverService.java:94)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:157)
at org.apache.nutch.protocol.selenium.HttpWebClient.createChromeWebDriver(HttpWebClient.java:182)
at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:89)
... 5 more
2020-08-17 23:40:57,430 INFO fetcher.FetcherThread - FetcherThread 46 fetch of https://www.amazon.in/ failed with: java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
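Why is it looking in the wrong place?
In fact, it correctly picks up the other settings in nutch-site.xml. As soon as I included protocol-interactiveselenium, it started fetching with Selenium.
Also, earlier it was looking for /root/geckodriver, which is the Firefox driver. Once I changed selenium.driver to chrome, it started looking for /root/chromedriver.
So far so good. Then I changed the webdriver.chrome.driver property, but it does not seem to be taken into account.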
Looking at the code of HttpWebClient: the property webdriver.chrome.driver is overwritten by the value of selenium.grid.binary. Pointing the latter to your chromedriver should work. Please open an issue at https://issues.apache.org/jira/projects/NUTCH. It is not clear whether this is a bug or a documentation problem, but it should be fixed either way.
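A minimal sketch of that workaround in conf/nutch-site.xml, assuming the reading of HttpWebClient above is right and that selenium.grid.binary is where the plugin takes the driver path from when selenium.driver is chrome. The path is the one from the question; the description text is illustrative wording, not copied from nutch-default.xml.

```xml
<!-- Assumption: with selenium.driver=chrome, the plugin reads the driver
     path from selenium.grid.binary and ignores webdriver.chrome.driver. -->
<property>
  <name>selenium.grid.binary</name>
  <value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
  <description>Path to the ChromeDriver binary (illustrative description).</description>
</property>
```

If the fetcher log still reports /root/chromedriver after this change, the property is probably not being read from the nutch-site.xml that the running job actually uses.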