Nutch 1.10 job failed: Bad Request error when indexing to Solr 5.3.1

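I put together a crawler in a test environment, and it ran fine against 2 small websites, including indexing successfully to Solr. So the integration between Nutch and Solr appears to be working.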

The only change I made was to add another website to seed.txt and another line to regex-urlfilters.txt, using exactly the same syntax as for the other sites.

Now when I run the crawler, it runs fine for a while, then crashes with a 'Job failed!' error and very little helpful information.

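Here is the console output. Note that this is the 3rd segment created in the crawl, so 2 segments were indexed successfully before the error occurred. Is there something in the new site that is causing the failure?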

Indexing 20151030150906 to index
/opt/apache-nutch-1.10/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/TestCrawlCore TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20151030150906
Indexer: starting at 2015-10-30 15:14:00
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)

Error running:
  /opt/apache-nutch-1.10/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/TestCrawlCore TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20151030150906
Failed with exit value 255.

Here is the relevant data from hadoop.log:

2015-10-30 15:14:00,854 INFO  indexer.IndexingJob - Indexer: starting at 2015-10-30 15:14:00
2015-10-30 15:14:00,909 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
2015-10-30 15:14:00,909 INFO  indexer.IndexingJob - Indexer: URL filtering: false
2015-10-30 15:14:00,910 INFO  indexer.IndexingJob - Indexer: URL normalizing: false
2015-10-30 15:14:01,113 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-10-30 15:14:01,113 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication


2015-10-30 15:14:01,118 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCrawl/crawldb
2015-10-30 15:14:01,118 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCrawl/linkdb
2015-10-30 15:14:01,119 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCrawl/segments/20151030150906
2015-10-30 15:14:01,264 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-30 15:14:01,722 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2015-10-30 15:14:02,253 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: content dest: content
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: title dest: title
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: host dest: host
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: segment dest: segment
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: boost dest: boost
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: digest dest: digest
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2015-10-30 15:14:02,370 INFO  solr.SolrIndexWriter - Indexing 38 documents
2015-10-30 15:14:02,487 INFO  solr.SolrIndexWriter - Indexing 38 documents
2015-10-30 15:14:02,524 WARN  mapred.LocalJobRunner - job_local593696138_0001
org.apache.solr.common.SolrException: Bad Request

Bad Request

request: http://localhost:8983/solr/TestCrawlCore/update?wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.IndexerOutputFormat.close(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2015-10-30 15:14:03,508 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)

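I am only just getting to grips with this stuff, so I don't know what the next step is to resolve this. Any help would be greatly appreciated, and I am happy to provide more information if it would help.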

It turned out to be a mismatch between the Nutch and Solr schemas.

Thanks to TMBT (see the comments above), I found an additional error in the Solr logs that complained about an unidentified field: "anchor".

All I had to do was copy the anchor field declaration from the Nutch schema into the Solr schema and restart the Solr service. It now runs fine.
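For anyone hitting the same thing, here is a rough sketch of how one might check for this kind of mismatch before indexing. It is not part of Nutch or Solr; the function names (`field_names`, `missing_fields`) and the script name are made up for illustration. It assumes both schemas are available as XML files that declare fields with `<field name="...">` elements, and it ignores `<dynamicField>` patterns, so a field it reports as missing may still be accepted by Solr through a dynamic field.

```python
import sys
import xml.etree.ElementTree as ET


def field_names(schema_root):
    """Collect the name of every <field> element in a parsed schema.

    iter() walks the whole tree, so this works whether the <field>
    elements sit directly under <schema> or inside a <fields> wrapper.
    """
    return {f.get("name") for f in schema_root.iter("field") if f.get("name")}


def missing_fields(nutch_root, solr_root):
    """Return fields declared in the Nutch schema but not in the Solr schema."""
    return sorted(field_names(nutch_root) - field_names(solr_root))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python schema_diff.py <nutch schema> <solr schema>
    nutch = ET.parse(sys.argv[1]).getroot()
    solr = ET.parse(sys.argv[2]).getroot()
    for name in missing_fields(nutch, solr):
        print("declared in Nutch schema but not in Solr schema:", name)
```

In my case a check like this would presumably have listed `anchor`, which is the field Solr rejected. The paths to the two schema files depend on your install, so they are left as command-line arguments.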