Nutch with Solr on HTTPS

Good morning. I'm coming to you because I have a problem with Nutch (1.14) and Solr (7.2).

Everything worked fine until I set up SSL.

With Solr on HTTP, I run this command once the crawl has finished:

bin/nutch index -Dsolr.server.url=http://127.0.0.1:8983/solr/CORENAME crawltest/crawldb/ -linkdb crawltest/linkdb/ crawltest/segments/* -filter -normalize -deleteGone

and it works fine.

But as soon as SSL is enabled and the Solr server is on HTTPS, the data can no longer be sent to Solr. I added the following properties to nutch-site.xml:

<property>
           <name>solr.auth</name>
           <value>true</value>
</property>

<property>
           <name>solr.auth.username</name>
           <value>xxxx</value>
</property>

<property>
           <name>solr.auth.password</name>
           <value>xxxx</value>
</property>

<property>
           <name>solr.server.type</name>
           <value>https</value>
</property>

<property>
           <name>solr.server.url</name>
           <value>https://127.0.0.1:8983/solr/CORENAME</value>
</property>

But when I run the previous command, I get errors like these:

Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://127.0.0.1:8983/solr/CORENAME

&

caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

&

Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Have you managed to send data to Solr over HTTPS? Thanks.

Edit: I fixed this error by following the SSL procedure at https://lucene.apache.org/solr/guide/7_0/enabling-ssl.html and, as a final step, running this:

keytool -import -file /path/to/solr/solr-ssl.pem -alias solr_cert -keystore /path/to/java-cacert

(the Java keystore is jre/lib/security/cacerts; the default password is changeit)

Some progress: after importing the certificate into cacerts, this error no longer appears.
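To check outside of Nutch whether a certificate file is accepted for a given host name, something like the following Python sketch could be used. It is only an illustration: the function names are invented here, and the path in the usage comment is the placeholder from the keytool command above. Note that Python reads the PEM file directly instead of the Java cacerts store that Nutch uses, so this tests the certificate itself (including whether it matches `localhost` or `127.0.0.1`), not the Java truststore.

```python
import socket
import ssl
from typing import Optional


def build_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Create a client TLS context that verifies the server certificate.

    If cafile is given, it is trusted in addition to the system CAs.
    """
    context = ssl.create_default_context()
    if cafile is not None:
        context.load_verify_locations(cafile=cafile)
    return context


def handshake_ok(host: str, port: int, cafile: Optional[str] = None) -> bool:
    """Return True if a verified TLS handshake with host:port succeeds."""
    context = build_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=5) as raw:
            # server_hostname is what the certificate is checked against
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        return False

# Usage (placeholder path, adjust to the real installation):
#   handshake_ok("localhost", 8983, "/path/to/solr/solr-ssl.pem")
```

If the handshake succeeds for one host name but fails for another, the certificate was probably issued for only one of them.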

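Still in the same context, now with SSL and authentication enabled on the Solr server: I use Nutch to crawl URLs and send the data to Solr. Since SSL was set up, I can no longer send data to Solr.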

When I run this:

bin/nutch index -Dsolr.server.url=https://localhost:8983/solr/CORE -Dsolr.auth=true -Dsolr.auth.username='solr' -Dsolr.auth.password='xxxx' crawltest/crawldb/ -linkdb crawltest/linkdb/ crawltest/segments/* -filter -normalize -deleteGone

I get the following two errors:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://localhost:8983/solr/CORE: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 Unauthorized</title>
</head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /solr/CORE/update. Reason:
<pre>    Unauthorized</pre></p>
</body>
</html>
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://localhost:8983/solr/CORE: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 Unauthorized</title>
</head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /solr/CORE/update. Reason:
<pre>    Unauthorized</pre></p>
</body>
</html>

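Edit: The first error was caused by wrong authentication credentials. After filling in the correct values, I get a new error that I don't understand. Do you have any idea?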

2018-06-20 09:47:18,116 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2018-06-20 09:47:19,151 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-06-20 09:47:19,194 INFO  solr.SolrMappingReader - source: content dest: content
2018-06-20 09:47:19,194 INFO  solr.SolrMappingReader - source: title dest: title
2018-06-20 09:47:19,194 INFO  solr.SolrMappingReader - source: host dest: host
2018-06-20 09:47:19,194 INFO  solr.SolrMappingReader - source: segment dest: segment
2018-06-20 09:47:19,194 INFO  solr.SolrMappingReader - source: boost dest: boost
2018-06-20 09:47:19,195 INFO  solr.SolrMappingReader - source: digest dest: digest
2018-06-20 09:47:19,195 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2018-06-20 09:47:19,525 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2018-06-20 09:47:19,525 INFO  solr.SolrIndexWriter - Deleting 0 documents
2018-06-20 09:47:19,808 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2018-06-20 09:47:19,809 INFO  solr.SolrIndexWriter - Deleting 0 documents
2018-06-20 09:47:19,951 WARN  mapred.LocalJobRunner - job_local146539832_0001
java.lang.Exception: java.io.IOException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.IOException
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:234)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:213)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:174)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:87)
    at org.apache.nutch.indexer.IndexerOutputFormat.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:369)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://localhost:8983/solr/ESRF-EXTERNAL
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:589)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:210)
    ... 16 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
    at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
    at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886)
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857)
    at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
    at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
    at org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:115)
    at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:146)
    at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:96)
    at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:112)
    at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
    at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
    ... 20 more
2018-06-20 09:47:20,873 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Stack trace

name | cpuTime | userTime
process reaper (37) java.util.concurrent.SynchronousQueue$TransferStack@24197386 | 1.8587ms | 0.0000ms
process reaper (36) java.util.concurrent.SynchronousQueue$TransferStack@24197386 | 1.2672ms | 0.0000ms
Scheduler-201556483 (31) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6202c0c1 | 1.1534ms | 0.0000ms
searcherExecutor-7-thread-1 (30) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@663d5859 | 63.2030ms | 50.0000ms
DestroyJavaVM (27) | 1164.4748ms | 1040.0000ms
Thread-12 (25) java.lang.Object@233fcafa | 0.1211ms | 0.0000ms
Connection evictor (23) | 0.9319ms | 0.0000ms
Connection evictor (22) | 2.0995ms | 0.0000ms
org.eclipse.jetty.server.session.HashSessionManager@1a052a00Timer (21) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6626f9cf | 4.2127ms | 0.0000ms
qtp2012232625-20 (20) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@12d62902 | 56.7955ms | 50.0000ms
qtp2012232625-19 (19) | 47.6864ms | 40.0000ms
qtp2012232625-18 (18) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@12d62902 | 79.3320ms | 70.0000ms
qtp2012232625-17 (17) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@12d62902 | 100.9593ms | 90.0000ms
qtp2012232625-16-acceptor-0@2d033cc4-ServerConnector@23c4c714{SSL,[ssl, http/1.1]}{0.0.0.0:8983} (16) | 4.5898ms | 0.0000ms
qtp2012232625-15 (15) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@12d62902 | 73.3096ms | 60.0000ms
qtp2012232625-14 (14) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@12d62902 | 18.7950ms | 10.0000ms
qtp2012232625-13 (13) | 79.7804ms | 70.0000ms
qtp2012232625-12 (12) | 70.2385ms | 60.0000ms
qtp2012232625-11 (11) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@12d62902 | 22.1012ms | 10.0000ms
ShutdownMonitor (10) | 0.3055ms | 0.0000ms
Signal Dispatcher (5) | 0.0873ms | 0.0000ms
Finalizer (3) java.lang.ref.ReferenceQueue$Lock@1e254491 | 8.2575ms | 0.0000ms
Reference Handler (2) java.lang.ref.Reference$Lock@431035b5 | 6.3846ms | 0.0000ms

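EDIT 2: As a test, I disabled authentication to check whether the problem came from HTTPS. Without authentication it works! I tried changing the file and putting the configuration in jetty-https.xml instead of jetty.xml.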

I have two accounts configured like this:

<security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr authenticated application</web-resource-name>
      <url-pattern>/</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>admin</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Test Realm</realm-name>
  </login-config>

security.json

{
"authentication":{
   "blockUnknown": true,
   "class":"solr.BasicAuthPlugin",
   "credentials":{"solr":"xxxx"}
},
"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "permissions":[{"name":"security-edit",
      "role":"admin"}],
   "user-role":{"solr":"admin"}
}}

When I run the following command:

bin/nutch index -Dsolr.server.url=https://localhost:8983/solr/MYCORE -Dsolr.auth=true  -Dsolr.auth.username='admin'  -Dsolr.auth.password='xxxx' crawltest/crawldb/ -linkdb crawltest/linkdb/ crawltest/segments/* -filter -normalize -deleteGone

I get this error:

java.lang.Exception: java.io.IOException
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.IOException
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:234)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:213)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:174)
        at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:87)
        at org.apache.nutch.indexer.IndexerOutputFormat.write(IndexerOutputFormat.java:50)
        at org.apache.nutch.indexer.IndexerOutputFormat.write(IndexerOutputFormat.java:41)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
        at org.apache.hadoop.mapred.ReduceTask.collect(ReduceTask.java:422)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:369)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://localhost:8983/solr/MYCORE
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:589)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:210)
        ... 16 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
        at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
        at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886)
        at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857)
        at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
        at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
        at org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:115)
        at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:146)
        at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:96)
        at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:112)
        at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
        at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
        at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
       ... 20 more
2018-06-25 09:38:41,870 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

When I run this:

bin/nutch index -Dsolr.server.url=https://localhost:8983/solr/MYCORE -Dsolr.auth=true  -Dsolr.auth.username='solr'  -Dsolr.auth.password='xxxxx' crawltest/crawldb/ -linkdb crawltest/linkdb/ crawltest/segments/* -filter -normalize -deleteGone

I get this error:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://localhost:8983/solr/MYCORE: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 Unauthorized</title>
</head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /solr/MYCORE/update. Reason:
<pre>    Unauthorized</pre></p>
</body>
</html>

        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://localhost:8983/solr/MYCORE: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 Unauthorized</title>
</head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /solr/MYCORE/update. Reason:
<pre>    Unauthorized</pre></p>
</body>
</html>
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:210)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:174)
        at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:87)
        at org.apache.nutch.indexer.IndexerOutputFormat.write(IndexerOutputFormat.java:50)
        at org.apache.nutch.indexer.IndexerOutputFormat.write(IndexerOutputFormat.java:41)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
        at org.apache.hadoop.mapred.ReduceTask.collect(ReduceTask.java:422)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:369)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-06-25 09:45:20,106 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Or, now, this one:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:8983/solr/MYCORE: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 503 </title>
</head>
<body>
<h2>HTTP ERROR: 503</h2>
<p>Problem accessing /solr/MYCORE/update. Reason:
<pre>    Service Unavailable</pre></p>
<hr />
</body>
</html>

        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:8983/solr/MYCORE: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 503 </title>
</head>
<body>
<h2>HTTP ERROR: 503</h2>
<p>Problem accessing /solr/MYCORE/update. Reason:
<pre>    Service Unavailable</pre></p>
<hr />
</body>
</html>

Solr log:

2018-06-25 14:18:44.352 INFO  (main) [   ] o.e.j.s.Server jetty-9.3.20.v20170531
2018-06-25 14:18:44.597 WARN  (main) [   ] o.e.j.w.WebAppContext Failed startup of context o.e.j.w.WebAppContext@5891e32e{/solr,file:///app/solr-7.2.1/server/solr-webapp/webapp/,UNAVAILABLE}{/app/solr-7.2.1/server/solr-webapp/webapp}
java.lang.IllegalStateException: No LoginService for org.eclipse.jetty.security.authentication.BasicAuthenticator@64c87930 in org.eclipse.jetty.security.ConstraintSecurityHandler@400cff1a
        at org.eclipse.jetty.security.authentication.LoginAuthenticator.setConfiguration(LoginAuthenticator.java:76)
        at org.eclipse.jetty.security.SecurityHandler.doStart(SecurityHandler.java:354)
        at org.eclipse.jetty.security.ConstraintSecurityHandler.doStart(ConstraintSecurityHandler.java:448)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
        at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
        at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
        at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
        at org.eclipse.jetty.server.handler.ScopedHandler.doStart(ScopedHandler.java:120)
        at org.eclipse.jetty.server.session.SessionHandler.doStart(SessionHandler.java:116)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
        at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
        at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
        at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
        at org.eclipse.jetty.server.handler.ScopedHandler.doStart(ScopedHandler.java:120)
        at org.eclipse.jetty.server.handler.ContextHandler.startContext(ContextHandler.java:809)
        at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:345)
        at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1406)
        at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1368)
        at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778)
        at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262)
        at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:522)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
        at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:41)
        at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:188)
        at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:499)
        at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:147)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:180)
        at org.eclipse.jetty.deploy.providers.WebAppProvider.fileAdded(WebAppProvider.java:458)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:64)
        at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:610)
        at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:529)
        at org.eclipse.jetty.util.Scanner.scan(Scanner.java:392)
        at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:313)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:150)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
        at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:561)
        at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:236)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
        at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
        at org.eclipse.jetty.server.Server.start(Server.java:422)
        at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:113)
        at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
        at org.eclipse.jetty.server.Server.doStart(Server.java:389)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
        at org.eclipse.jetty.xml.XmlConfiguration.run(XmlConfiguration.java:1520)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1442)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.eclipse.jetty.start.Main.invokeMain(Main.java:215)
        at org.eclipse.jetty.start.Main.start(Main.java:458)
        at org.eclipse.jetty.start.Main.main(Main.java:76)
2018-06-25 14:18:44.745 INFO  (main) [   ] o.e.j.s.Server Started @799ms

I fixed "No LoginService" by moving security.json to /var/solr/data/ (SOLR_HOME).

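Edit 3: Now, when I try to send the Nutch data to Solr, all I get is a "No allow" error message. I can no longer connect to the admin interface either; I get the same error there. I think it comes from the security.json file: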

{
"authentication":{
   "class":"solr.BasicAuthPlugin",
   "credentials":{"solr":"xxxxxx"}
},
"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "permissions":[{"name":"security-edit","role":"adminRole"},{"name":"collection-admin-edit","role":"adminRole"},{"name":"update","role":"adminRole"},{"name":"all","role":"adminRole"},{"name":"core-admin-edit","role":"adminRole"},{"name":"read","role":"adminRole"},{"name":"config-edit","role":"adminRole"},{"name":"core-admin-read","role":"adminRole"},{"name":"core-admin-read","role":"adminRole"}],
   "user-role":{"solr":"adminRole"}
}}

What am I doing wrong? Thanks.

I'm adding a new answer because the previous one was getting too long.

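SOLVED for API authentication, but not with Nutch: to authenticate through the API, I removed the configuration from jetty-https.xml and webdefault.xml, deleted the realm.properties file, and removed the basic authentication options in […]. I now work only with the security.json file in SOLR_HOME.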

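In fact, the biggest problem was that I was not using the encrypted password when testing the connection; without it, connecting is impossible. On the other hand, I still have the problem with Nutch, which is not allowed.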
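To test the credentials against the update handler without going through Nutch, a request can be built by hand. The sketch below is a minimal illustration under a few assumptions: the helper names are invented, the URL and the user name are the ones used elsewhere in this post, and it relies on Solr's Basic authentication expecting the clear-text password on the wire while security.json holds the salted hash.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_update_request(core_url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a commit-only request to <core_url>/update."""
    request = urllib.request.Request(
        core_url.rstrip("/") + "/update?commit=true",
        data=b"<commit/>",  # XML update command that indexes nothing
        method="POST",
    )
    request.add_header("Content-Type", "text/xml")
    # Send the credentials up front instead of waiting for a 401 challenge.
    request.add_header("Authorization", basic_auth_header(username, password))
    return request

# Usage (needs a running Solr whose certificate Python trusts):
#   req = build_update_request("https://localhost:8983/solr/MYCORE", "solr", "test")
#   with urllib.request.urlopen(req) as resp:   # raises HTTPError on 401/403
#       print(resp.status)
```

A 401 here points to the user name or password, while a 403 would point to the permissions in the authorization section.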

Here is the security.json file:

`{
    "authentication":{
        "class":"solr.BasicAuthPlugin",
        "credentials":{
            "solr":"hzMjhfgN4b9X8KR0QgLB2Um3cUzqDzJygtEBL/O7g5E= CkP7HyXjYvqKNF3F4hBjnVvKGQOkLc/ta4FaNIkqgII="
        }
    },
    "authorization":{
        "class":"solr.RuleBasedAuthorizationPlugin",
        "permissions":[
            {
                "name":"security-edit",
                "role":"adminRole"
            },
            {
                "name":"collection-admin-edit",
                "role":"adminRole"
            },
            {
                "name":"update",
                "role":"adminRole"
            },
            {
                "name":"config-edit",
                "role":"adminRole"
            },
            {
                "name":"core-admin-edit",
                "role":"adminRole"
            },
            {
                "name":"core-admin-read",
                "role":"adminRole"
            },
            {
                "name":"schema-edit",
                "role":"adminRole"
            },
            {
                "name":"all",
                "role":"adminRole"
            }
        ],
        "user-role":{
            "solr":"adminRole"
        }
    }
}
`

The encrypted password corresponds to the value "test".
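For reference, here is a sketch of how such a credential string could be produced. It assumes, to the best of my understanding of Solr's BasicAuthPlugin, that the stored value is `base64(sha256(sha256(salt + password)))` followed by a space and `base64(salt)`; the function names are invented for this example, and I have not checked the output against the value in the file above.

```python
import base64
import hashlib
import os
from typing import Optional


def solr_credential(password: str, salt: Optional[bytes] = None) -> str:
    """Return '<base64 hash> <base64 salt>' using sha256(sha256(salt + password))."""
    if salt is None:
        salt = os.urandom(32)  # fresh random salt for a new credential
    first = hashlib.sha256(salt + password.encode("utf-8")).digest()
    second = hashlib.sha256(first).digest()
    return (
        base64.b64encode(second).decode("ascii")
        + " "
        + base64.b64encode(salt).decode("ascii")
    )


def matches(password: str, credential: str) -> bool:
    """Check a clear-text password against a '<hash> <salt>' credential string."""
    _hash_b64, salt_b64 = credential.split()
    return solr_credential(password, base64.b64decode(salt_b64)) == credential
```

`matches("test", value_from_security_json)` would then tell whether a given credential string really belongs to that password, provided the assumed scheme is the right one.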

To check the syntax of the file, I suggest this: http://json.parser.online.fr/
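The same syntax check can also be done locally, which avoids pasting credentials into a web page. This is a minimal sketch using Python's standard `json` module; the function name and the path in the usage comment are only examples.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return None if text is valid JSON, else a message with line and column."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None

# Usage with a file (example path):
#   with open("/var/solr/data/security.json", encoding="utf-8") as fh:
#       print(json_error(fh.read()) or "valid JSON")
```

A missing comma or brace between two entries is reported with the position where the parser gave up, which is usually just after the real mistake.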

What might I be missing to be able to update Solr from Nutch?

SOLVED by adding an update permission with the data import path:

           {
                "name":"update",
                "path":"/dataimport",
                "role":"adminRole"
            },

Now I can index from Nutch into Solr, but I get a new error while crawling...

`Thu Jun 28 09:21:03 CEST 2018 : Iteration 2 of 5
Generating a new segment
/app/nutch-external/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl//crawldb crawl//segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2018-06-28 09:21:04
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
`

During the first iteration I got this error:

`Authorization challenge processed
No form element found with 'id' = adminRole, trying 'name'.
No form element found with 'id' = adminRole, trying 'name'.
No form element found with 'name' = adminRole
No form element found with 'name' = adminRole
Supported authentication schemes in the order of preference: [ntlm, digest, basic]
Supported authentication schemes in the order of preference: [ntlm, digest, basic]
Challenge for ntlm authentication scheme not available
Challenge for ntlm authentication scheme not available
Challenge for digest authentication scheme not available
basic authentication scheme selected
Using authentication scheme: basic
Authorization challenge processed
No form element found with 'id' = adminRole, trying 'name'.
No form element found with 'name' = adminRole
Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: adminRole
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:506)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:183)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:276)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:342)
Caused by: java.lang.IllegalArgumentException: No form exists: adminRole
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:219)
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:504)
    ... 3 more
Challenge for digest authentication scheme not available
basic authentication scheme selected
Using authentication scheme: basic
Authorization challenge processed
No form element found with 'id' = adminRole, trying 'name'.
`

It still seems to be the security.json file. Do you have any idea? Thanks.

SOLVED: I fixed this by configuring httpclient-auth.xml in /nutch/conf/:
<auth-configuration>
   <credentials username="solr" password="xxxxx">
      <authscope host="localhost" port="8983"/>
   </credentials>
</auth-configuration>
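To double-check that such a file is well-formed and which host and port each credential applies to, a short sketch along these lines may help. The function name is invented, and it only reads the elements and attributes shown in the snippet above.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def list_auth_scopes(xml_text: str) -> List[Tuple[str, str, str]]:
    """Return (username, host, port) for every <authscope> in the document."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    scopes = []
    for cred in root.iter("credentials"):
        for scope in cred.iter("authscope"):
            scopes.append(
                (cred.get("username", ""), scope.get("host", ""), scope.get("port", ""))
            )
    return scopes
```

The host and port listed there have to match the URL passed in `solr.server.url`, otherwise the credentials would not be offered to that server.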

Thanks for your help.