Nutch: How do I make the take.screenshot and screenshot.location properties work?

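I have been learning Nutch (Nutch-1.14) for a week, and it works fine both in local mode and on Hadoop-2.7.2 (pseudo-distributed mode). Today I came across the "take.screenshot" and "screenshot.location" properties in nutch-site.xml. After changing them, Nutch still crawls the seed URLs, but it takes no screenshots, either in local mode or on Hadoop.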

nutch-site.xml settings for local mode

<property>
 <name>take.screenshot</name>
 <value>true</value>
 <description>
  Boolean property determining whether the protocol-htmlunit
  WebDriver should capture a screenshot of the URL. If set to
  true remember to define the 'screenshot.location'
  property as this determines the location screenshots should be
  persisted to on HDFS. If that property is not set, screenshots
  are simply discarded.
 </description>
</property>

<property>
 <name>screenshot.location</name>
 <value>/home/user/nutch-1.14/screenshot</value>
 <description>
  The location on disk where a URL screenshot should be saved
  to if the 'take.screenshot' property is set to true.
  By default this is null, in this case screenshots held in memory
  are simply discarded.
 </description>
</property>

nutch-site.xml settings for Hadoop

<property>
 <name>take.screenshot</name>
 <value>true</value>
</property>

<property>
 <name>screenshot.location</name>
 <value>/screenshot</value>
</property>

Note: the "screenshot" directory exists on HDFS.

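HtmlUnit is a "GUI-Less browser for Java programs" (see http://htmlunit.sourceforge.net/). This means that HtmlUnit does not render the HTML page at all. Internally, every operation works on the DOM tree, with no layout step. That is why there is no screenshot option.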

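Have you enabled protocol-selenium? Basically, these settings only apply to that protocol plugin. By default Nutch uses the protocol-http plugin, which does not support this option, even if you enable these settings in your configuration.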
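
As a rough sketch of what enabling it might look like, the plugin.includes property in nutch-site.xml can list protocol-selenium in place of protocol-http. Everything in the value after the protocol plugin is an assumption based on a typical Nutch 1.x setup, so keep whatever plugins your installation already lists and change only the protocol entry. protocol-selenium also drives a real browser, so one presumably has to be installed on every node that runs the fetcher.

<property>
 <name>plugin.includes</name>
 <!-- Assumed plugin list: only the swap of protocol-http for
      protocol-selenium matters here; the remaining entries are
      illustrative and should match your existing configuration. -->
 <value>protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

With this in place, the take.screenshot and screenshot.location settings from the question would be read by the Selenium-based fetcher rather than ignored. It may be worth checking nutch-default.xml in your version for the exact property names that the selenium plugin expects.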