Apache Nutch indexing using Elasticsearch
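I am currently building a search engine with an Apache Nutch and ElasticSearch stack. I am using Apache Nutch 2.1 and ElasticSearch 1.7.3.
I am trying to index directly from Nutch by following the instructions here: https://www.mind-it.info/2013/09/26/integrating-nutch-1-7-elasticsearch/. Both Nutch and Elasticsearch run on my localhost, and the cluster name is "elasticsearch".
These are the parts of nutch-site.xml that I changed: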
<property>
<name>plugin.includes</name>
<value>protocol-selenium|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
After running the command ant runtime, I tried issuing the command
bin/nutch elasticindex elasticsearch -all
but it returned this:
Exception in thread "main" java.lang.RuntimeException: job failed: name=elastic-index [elasticsearch], jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:52)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.indexElastic(ElasticIndexerJob.java:60)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:73)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.main(ElasticIndexerJob.java:78)
I am not sure what went wrong. Here is my hadoop.log:
2016-01-15 15:46:24,106 INFO elastic.ElasticIndexerJob - Starting
2016-01-15 15:46:24,733 INFO plugin.PluginRepository - Plugins: looking in: /home/gabrielgagno/apache-nutch-2.1/runtime/local/plugins
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Registered Plugins:
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Registered Extension-Points:
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Parse Filter (org.apache.nutch.parse.ParseFilter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2016-01-15 15:46:24,822 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:24,822 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:24,824 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:24,824 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:25,827 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-15 15:46:26,521 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2016-01-15 15:46:26,727 INFO elasticsearch.node - [Layla Miller] version[1.7.3], pid[18188], build[05d4530/2015-10-15T09:14:17Z]
2016-01-15 15:46:26,727 INFO elasticsearch.node - [Layla Miller] initializing ...
2016-01-15 15:46:26,852 INFO elasticsearch.plugins - [Layla Miller] loaded [], sites []
2016-01-15 15:46:28,229 WARN elasticsearch.bootstrap - JNA not found. native methods will be disabled.
2016-01-15 15:46:28,756 INFO elasticsearch.node - [Layla Miller] initialized
2016-01-15 15:46:28,756 INFO elasticsearch.node - [Layla Miller] starting ...
2016-01-15 15:46:28,824 INFO elasticsearch.transport - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/172.16.3.72:9301]}
2016-01-15 15:46:28,836 INFO elasticsearch.discovery - [Layla Miller] elasticsearch/_tzxV-I7SSeduY9b8enpPw
2016-01-15 15:46:58,836 WARN elasticsearch.discovery - [Layla Miller] waited for 30s and no initial state was set by the discovery
2016-01-15 15:46:58,845 INFO elasticsearch.http - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/172.16.3.72:9201]}
2016-01-15 15:46:58,845 INFO elasticsearch.node - [Layla Miller] started
2016-01-15 15:46:58,848 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:58,848 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:58,848 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:58,848 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:59,438 INFO elastic.ElasticWriter - Processing remaining requests [docs = 147, length = 1011442, total docs = 147]
2016-01-15 15:46:59,445 INFO elastic.ElasticWriter - Processing to finalize last execute
2016-01-15 15:47:59,452 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2016-01-15 15:47:59,453 WARN mapred.LocalJobRunner - job_local_0001
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:151)
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:141)
at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:215)
at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:67)
at org.elasticsearch.action.bulk.TransportBulkAction.onFailure(TransportBulkAction.java:153)
at org.elasticsearch.action.support.TransportAction$ThreadedActionListener.run(TransportAction.java:137)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Can anyone help me solve this problem? Thanks!
Make sure the Elasticsearch version in the Nutch elastic dependencies is the same as the version running on your local server.
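One way to check this is to compare the client library version with the `version.number` field that the server reports on its root HTTP endpoint. The following is only a sketch: the function names are made up for illustration, the sample JSON is a trimmed stand-in for a real response, and the client version has to be read by hand (for example from the name of the elasticsearch jar that Nutch bundles).

```python
import json


def server_version(root_response_text):
    """Return version.number from the JSON body of GET / on the server."""
    return json.loads(root_response_text)["version"]["number"]


def same_version(client_version, server_version_str):
    """True when the two dotted version strings are identical."""
    return client_version.strip() == server_version_str.strip()


# Illustrative body, shaped like the response of GET http://localhost:9200/
sample = '{"status": 200, "cluster_name": "elasticsearch", "version": {"number": "1.7.3"}}'

if __name__ == "__main__":
    # "1.7.3" here stands for whatever version the bundled client jar has.
    print(same_version("1.7.3", server_version(sample)))
```

Strict equality is the conservative choice here, since it matches the advice to run the same version on both sides.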
If they differ, don't waste time on it: push from Nutch to Elasticsearch directly over HTTP instead of through the Java API.
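As a rough illustration of the HTTP route, the sketch below builds a request body for the Elasticsearch `_bulk` endpoint and posts it with the Python standard library. It is not how Nutch does it internally; the index name, type name, field names, and the `localhost:9200` address are assumptions chosen for the example, and getting the crawled documents out of Nutch is left out entirely.

```python
import json
import urllib.request


def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited _bulk body from (doc_id, source_dict) pairs.

    Each document becomes an action line followed by its source line, and
    every line, including the last, ends with a newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "".join(line + "\n" for line in lines)


def post_bulk(body, base_url="http://localhost:9200"):
    """POST the body to <base_url>/_bulk and return the parsed JSON reply.

    The default base_url is an assumption; point it at your own server.
    """
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical documents; real ones would come from the Nutch crawl data.
example_docs = [
    ("http://example.com/", {"url": "http://example.com/", "title": "Example"}),
    ("http://example.com/a", {"url": "http://example.com/a", "title": "Page A"}),
]

if __name__ == "__main__":
    print(build_bulk_body("nutch", "doc", example_docs))
```

Because this talks to the HTTP port, the client does not join the cluster as a node, so the version of the Java library bundled with Nutch no longer matters.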