Nutch indexing fails with IO exception
Nutch indexing fails when I run the following command:
root@ubuntu:/home/test-tb/Downloads/apache-nutch-1.10# bin/nutch index mycrl/crawldb/ -dir mycrl/segments/
I am using Nutch 1.10 on Ubuntu 12.04 LTS.
The error log details are:
2015-07-09 17:07:36,940 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-07-09 17:07:36,970 INFO solr.SolrMappingReader - source: content dest: content
2015-07-09 17:07:36,970 INFO solr.SolrMappingReader - source: title dest: title
2015-07-09 17:07:36,970 INFO solr.SolrMappingReader - source: host dest: host
2015-07-09 17:07:36,970 INFO solr.SolrMappingReader - source: segment dest: segment
2015-07-09 17:07:36,970 INFO solr.SolrMappingReader - source: boost dest: boost
2015-07-09 17:07:36,970 INFO solr.SolrMappingReader - source: digest dest: digest
2015-07-09 17:07:36,970 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2015-07-09 17:07:37,030 INFO solr.SolrIndexWriter - Indexing 100 documents
2015-07-09 17:07:37,136 INFO solr.SolrIndexWriter - Indexing 100 documents
2015-07-09 17:07:37,166 WARN mapred.LocalJobRunner - job_local1383488781_0001
org.apache.solr.common.SolrException: Not Found
Not Found
request: http://127.0.0.1:8983/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
at org.apache.nutch.indexer.IndexerOutputFormat.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2015-07-09 17:07:37,957 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
This error is returned even though I did not specify any Solr indexing options for Nutch. Am I missing something here? Any pointers would be helpful. Thanks in advance.
First, if you want to crawl and index your data, you should use bin/crawl, since it is a better tool for driving the whole workflow.
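For reference, a rough sketch of how the crawl script is typically invoked. The seed directory urls/ and the round count 2 are placeholders, and the argument list has changed between Nutch releases, so run bin/crawl with no arguments to confirm the exact usage for your 1.10 install:

# Sketch only: seed URL lists in urls/, crawl data written to mycrl/,
# results indexed into the collection1 core, 2 generate/fetch/parse/update rounds.
bin/crawl urls/ mycrl/ http://localhost:8983/solr/collection1 2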
Second, from the stack trace it looks like you have not set the Solr URL correctly. Normally your Solr URL should look like http://domainname:port/solr/corename, but yours points at localhost:8983/solr/update, so the URL is missing the Solr core name. The default core name is collection1. A sketch of the fix follows below.
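As a sketch, assuming Solr is running on localhost:8983 with the default collection1 core, you can either set the solr.server.url property in conf/nutch-site.xml or override it on the command line. The -D override is handled by Hadoop's generic option parsing, so it should come right after the command name:

# Assumption: the target core is collection1 on localhost:8983.
# Re-run the indexing step against the existing crawl data:
bin/nutch index -D solr.server.url=http://localhost:8983/solr/collection1 \
    mycrl/crawldb/ -dir mycrl/segments/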