Error indexing Nutch crawl data into Elasticsearch
I'm using Nutch 1.14 and trying to index a small web crawl into ES v5.3.0, but I keep getting this error:
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Error running:
/home/david/tutorials/nutch/apache-nutch-1.14-src/runtime/local/bin/nutch index -Delastic.server.url=http://localhost:9300/search-index/ searchcrawl//crawldb -linkdb searchcrawl//linkdb searchcrawl//segments/20180824175802
Failed with exit value 255.
I have already done that, but I am still getting the error...
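For reference, here is a minimal sketch of the elastic.* properties that the ElasticIndexWriter output above refers to, as they might appear in conf/nutch-site.xml. The host, port, cluster, and index values below are placeholders for a default local ES install, not my actual configuration:

<!-- Sketch of ElasticIndexWriter settings; all values are placeholders. -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <!-- this plugin talks to the transport port (9300 by default), not the REST port 9200 -->
  <value>9300</value>
</property>
<property>
  <name>elastic.cluster</name>
  <!-- must match cluster.name in elasticsearch.yml -->
  <value>elasticsearch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>search-index</value>
</property>

As far as I understand, plugin.includes in nutch-site.xml also has to list indexer-elasticsearch for this writer to be active at all.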
Update - OK, I've made some progress. Indexing now seems to work and there are no more errors. However, when I check the document count through Kibana with _stats, I get 0, even though Nutch tells me this:
Segment dir is complete: crawl/segments/20180830115119.
Indexer: starting at 2018-08-30 12:19:31
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticRestIndexWriter
elastic.rest.host : hostname
elastic.rest.port : port
elastic.rest.index : elastic index command
elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 9 indexed (add/update)
Indexer: finished at 2018-08-30 12:19:45, elapsed: 00:00:14
I assume this means that 9 documents were sent to ES for indexing?
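To rule out a Kibana index-pattern problem, the count can also be checked directly against Elasticsearch. A quick sketch, assuming the REST port 9200 and the index name search-index from the command above:

# Query the document count directly (assumes REST port 9200 and index "search-index")
curl -s 'http://localhost:9200/search-index/_count?pretty'
# Or read the same _stats endpoint that Kibana uses
curl -s 'http://localhost:9200/search-index/_stats/docs?pretty'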
I used Elasticsearch 6.0 with Nutch 1.14 and it worked very well. I'm using the indexer-elastic-rest plugin with port 9200. I've attached my nutch-site.xml for reference.
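The attached file isn't reproduced here, but a minimal sketch of the elastic.rest.* settings that the ElasticRestIndexWriter output above lists might look like this. The values are placeholders, not the actual attached nutch-site.xml, and the plugin.includes pattern is the Nutch 1.14 default with indexer-solr swapped for indexer-elastic-rest:

<!-- Hypothetical indexer-elastic-rest configuration; placeholder values only. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic-rest|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>elastic.rest.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.rest.port</name>
  <value>9200</value>
</property>
<property>
  <name>elastic.rest.index</name>
  <value>search-index</value>
</property>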