Apache Nutch indexing using Elasticsearch

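I'm currently building a search engine on an Apache Nutch and Elasticsearch stack. I'm using Apache Nutch 2.1 and Elasticsearch 1.7.3.

I'm trying to index directly from Nutch by following the instructions here: https://www.mind-it.info/2013/09/26/integrating-nutch-1-7-elasticsearch/. Both Nutch and Elasticsearch are running on my localhost, and the cluster name is "elasticsearch".

These are the parts of nutch-site.xml that I changed: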

<property>
    <name>plugin.includes</name>
    <value>protocol-selenium|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
</property>

After running ant runtime, I tried issuing this command:

bin/nutch elasticindex elasticsearch -all

But it returned this:

Exception in thread "main" java.lang.RuntimeException: job failed: name=elastic-index [elasticsearch], jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:52)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.indexElastic(ElasticIndexerJob.java:60)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:73)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.main(ElasticIndexerJob.java:78)

I'm not sure what went wrong. Here is my hadoop.log:

2016-01-15 15:46:24,106 INFO  elastic.ElasticIndexerJob - Starting
2016-01-15 15:46:24,733 INFO  plugin.PluginRepository - Plugins: looking in: /home/gabrielgagno/apache-nutch-2.1/runtime/local/plugins
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Registered Plugins:
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     the nutch core extension points (nutch-extensionpoints)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic URL Normalizer (urlnormalizer-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic Indexing Filter (index-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Html Parse Plug-in (parse-html)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Http / Https Protocol Plug-in (protocol-httpclient)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     HTTP Framework (lib-http)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter (urlfilter-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Pass-through URL Normalizer (urlnormalizer-pass)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Normalizer (urlnormalizer-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Tika Parser Plug-in (parse-tika)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     OPIC Scoring Plug-in (scoring-opic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     CyberNeko HTML Parser (lib-nekohtml)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Anchor Indexing Filter (index-anchor)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter Framework (lib-regex-filter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository - Registered Extension-Points:
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Protocol (org.apache.nutch.protocol.Protocol)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Parse Filter (org.apache.nutch.parse.ParseFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Filter (org.apache.nutch.net.URLFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Content Parser (org.apache.nutch.parse.Parser)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2016-01-15 15:46:24,822 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:24,822 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:24,824 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:24,824 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:25,827 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-15 15:46:26,521 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] version[1.7.3], pid[18188], build[05d4530/2015-10-15T09:14:17Z]
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] initializing ...
2016-01-15 15:46:26,852 INFO  elasticsearch.plugins - [Layla Miller] loaded [], sites []
2016-01-15 15:46:28,229 WARN  elasticsearch.bootstrap - JNA not found. native methods will be disabled.
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] initialized
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] starting ...
2016-01-15 15:46:28,824 INFO  elasticsearch.transport - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/172.16.3.72:9301]}
2016-01-15 15:46:28,836 INFO  elasticsearch.discovery - [Layla Miller] elasticsearch/_tzxV-I7SSeduY9b8enpPw
2016-01-15 15:46:58,836 WARN  elasticsearch.discovery - [Layla Miller] waited for 30s and no initial state was set by the discovery
2016-01-15 15:46:58,845 INFO  elasticsearch.http - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/172.16.3.72:9201]}
2016-01-15 15:46:58,845 INFO  elasticsearch.node - [Layla Miller] started
2016-01-15 15:46:58,848 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:58,848 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:59,438 INFO  elastic.ElasticWriter - Processing remaining requests [docs = 147, length = 1011442, total docs = 147]
2016-01-15 15:46:59,445 INFO  elastic.ElasticWriter - Processing to finalize last execute
2016-01-15 15:47:59,452 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2016-01-15 15:47:59,453 WARN  mapred.LocalJobRunner - job_local_0001
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:151)
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:141)
    at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:215)
    at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:67)
    at org.elasticsearch.action.bulk.TransportBulkAction.onFailure(TransportBulkAction.java:153)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener.run(TransportAction.java:137)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Can anyone help me solve this? Thanks!

Make sure the Elasticsearch version in Nutch's elastic dependency is the same as the version running on your local server.
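
One quick way to compare the two, sketched in Python under a few assumptions: the server answers HTTP on http://localhost:9200 (the usual default, not stated in the question), and the client library is a jar named elasticsearch-<version>.jar somewhere in Nutch's lib directory. The helper names are made up for illustration.

```python
import json
import re
import urllib.request


def parse_version(version):
    """Turn a string like '1.7.3' into a tuple of ints, e.g. (1, 7, 3)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def version_from_jar_name(jar_name):
    """Extract the version from a file name like 'elasticsearch-1.7.3.jar'.

    Assumes the Maven-style naming convention; hypothetical helper.
    """
    match = re.match(r"^elasticsearch-(\d+\.\d+\.\d+[^/]*?)\.jar$", jar_name)
    if match is None:
        raise ValueError("not an elasticsearch jar name: %r" % jar_name)
    return match.group(1)


def server_version(base_url="http://localhost:9200"):
    """Ask the running server for its version via the root HTTP endpoint.

    The base URL is an assumption; adjust host and port to your setup.
    """
    with urllib.request.urlopen(base_url, timeout=5) as resp:
        info = json.loads(resp.read().decode("utf-8"))
    return info["version"]["number"]


def versions_match(client_version, server_version_str):
    """True when client and server agree on major, minor, and patch."""
    return parse_version(client_version) == parse_version(server_version_str)
```

Calling versions_match(version_from_jar_name(name), server_version()) with the name of the jar you actually find in the lib directory tells you whether the two sides agree.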

If they differ, don't waste your time: push from Nutch to Elasticsearch directly over HTTP instead of going through the Java API.
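
To give an idea of what the HTTP route looks like, here is a rough Python sketch of a request to the _bulk endpoint. This is not Nutch code, only an illustration of the request shape. The index name "nutch", the type name "doc", the base URL, and the function names are all assumptions for the example.

```python
import json
import urllib.request


def build_bulk_body(docs, index="nutch", doc_type="doc"):
    """Serialize (doc_id, source) pairs into the newline-delimited bulk format.

    Each document becomes two lines: an action line and the document itself.
    The index and type names are placeholders, not values from the question.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def post_bulk(body, base_url="http://localhost:9200"):
    """POST a prepared bulk body and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def bulk_had_errors(response):
    """True when the bulk response flags at least one failed item."""
    return bool(response.get("errors", False))
```

Because this only speaks HTTP and JSON, it does not depend on the client library matching the server version the way the Java API does.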