java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException when parsing with Nutch

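I am a beginner with Nutch. I tried some crawling by following the NutchWiki, and then tried to write a custom parsing plugin with the help of this tutorial. All the configuration is done, and after building with ant my plugin folder is present in build/plugins, in runtime/local/plugin, and also inside the apache-nutch-1.13-SNAPSHOT.job file. When I parse the fetched content, I get the following error.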

Error parsing: http://example.com/: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parsefilter.TagExtractorParseFilter
    at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:469)
    at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:35)
    at org.apache.nutch.parse.html.HtmlParser.setConf(HtmlParser.java:340)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:107)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:45)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parsefilter.TagExtractorParseFilter
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:167)
    at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
    ... 16 more
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parsefilter.TagExtractorParseFilter
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.nutch.plugin.PluginRepository.getCachedClass(PluginRepository.java:331)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
    ... 17 more

I can't figure out what the problem is; I have done everything the tutorial specifies. Any help would be appreciated.
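One quick check that can narrow this kind of error down is to confirm that the built jar really contains the class named in the exception. The following is only a minimal sketch: the class JarClassCheck and its methods are names made up for illustration, and the default jar path is an assumption based on the plugin folder layout, so pass the real location as the first argument.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {

    /** Converts a fully qualified class name to the entry path it has inside a jar. */
    public static String classEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the jar at jarPath contains an entry for className. */
    public static boolean jarContainsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(classEntryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path, relative to the Nutch home directory; override via args[0].
        String jarPath = args.length > 0
                ? args[0]
                : "plugins/TagExtractorParseFilter/TagExtractorParseFilter.jar";
        String className = "org.apache.nutch.parsefilter.TagExtractorParseFilter";
        System.out.println(jarContainsClass(jarPath, className)
                ? "class found in jar"
                : "class missing from jar");
    }
}
```

If the class is in the jar, the build itself is probably fine and the problem is more likely in how the plugin system locates the jar.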

Edit: I temporarily worked around the problem by hardcoding the plugin's classpath in the nutch script file, like this:

CLASSPATH="${CLASSPATH}:$NUTCH_HOME/plugins/TagExtractorParseFilter/TagExtractorParseFilter.jar"
# distributed mode
EXEC_CALL=(hadoop jar "$NUTCH_JOB")

if $local; then
 EXEC_CALL=("$JAVA" $JAVA_HEAP_MAX "${NUTCH_OPTS[@]}" -classpath "$CLASSPATH")
else
.....................

I finally solved the problem. It was caused by a small mistake in my plugin.xml.

Earlier, the runtime section in plugin.xml looked like this:

   <runtime>
      <library name="TagExtractorParseFilter">
         <export name="*"/>
      </library>
   </runtime>

I changed it to:

   <runtime>
      <library name="TagExtractorParseFilter.jar">
         <export name="*"/>
      </library>
   </runtime>

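After that, it worked. The library name apparently has to match the jar file name exactly, including the .jar extension; otherwise the plugin class loader cannot find the class.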
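To catch this kind of mismatch early, a small check can compare the library names declared in plugin.xml with the files actually present in the plugin folder. This is a rough sketch, not part of Nutch: the class PluginLibraryCheck and its methods are made up for illustration, and it assumes that each library name is resolved as a file path relative to the folder containing plugin.xml.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PluginLibraryCheck {

    /** Returns the name attribute of every library element in the given plugin.xml text. */
    public static List<String> libraryNames(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pluginXml)));
            NodeList libs = doc.getElementsByTagName("library");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < libs.getLength(); i++) {
                names.add(((Element) libs.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse plugin.xml", e);
        }
    }

    /** Returns the declared library names that do not exist as regular files in pluginDir. */
    public static List<String> missingLibraries(Path pluginDir) throws IOException {
        byte[] bytes = Files.readAllBytes(pluginDir.resolve("plugin.xml"));
        List<String> missing = new ArrayList<>();
        for (String name : libraryNames(new String(bytes, StandardCharsets.UTF_8))) {
            // Assumption: the name is a path relative to the plugin folder.
            if (!Files.isRegularFile(pluginDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Pass the plugin folder as the first argument; defaults to the current directory.
        Path pluginDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingLibraries(pluginDir);
        if (missing.isEmpty()) {
            System.out.println("all declared libraries exist");
        } else {
            System.out.println("declared but not found: " + missing);
        }
    }
}
```

Run against a plugin folder with my original plugin.xml, this would report TagExtractorParseFilter as declared but not found, because the file on disk is TagExtractorParseFilter.jar.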