Nutch 1.13: fetch of url failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
fetch of httpurl failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)
Using queue mode : byHost
fetch of httpsurl failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)
I got the above result when running Nutch 1.13 with Solr 6.6.0.
The command I used is:
bin/crawl -i -D solr.server.url=http://myip/solr/nutch/ urls/ crawl 2
Below is the plugin.includes section of my nutch-site.xml:
<name>plugin.includes</name>
<value>
protocol-(http|httpclient)|urlfilter-regex|parse-(html)|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
</value>
Below are the contents of my plugins directory:
[root@localhost apache-nutch-1.13]# ls plugins
creativecommons index-more nutch-extensionpoints protocol-file scoring-similarity urlnormalizer-ajax
feed index-replace parse-ext protocol-ftp subcollection urlnormalizer-basic
headings index-static parsefilter-naivebayes protocol-htmlunit tld urlnormalizer-host
index-anchor language-identifier parsefilter-regex protocol-http urlfilter-automaton urlnormalizer-pass
index-basic lib-htmlunit parse-html protocol-httpclient urlfilter-domain urlnormalizer-protocol
indexer-cloudsearch lib-http parse-js protocol-interactiveselenium urlfilter-domainblacklist urlnormalizer-querystring
indexer-dummy lib-nekohtml parse-metatags protocol-selenium urlfilter-ignoreexempt urlnormalizer-regex
indexer-elastic lib-regex-filter parse-replace publish-rabbitmq urlfilter-prefix urlnormalizer-slash
indexer-solr lib-selenium parse-swf publish-rabitmq urlfilter-regex
index-geoip lib-xml parse-tika scoring-depth urlfilter-suffix
index-links microformats-reltag parse-zip scoring-link urlfilter-validator
index-metadata mimetype-filter plugin scoring-opic urlmeta
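I am stuck on this problem. As you can see, I have included both protocol plugins with protocol-(http|httpclient), but fetching URLs still fails. Thanks in advance.
A newer issue, from hadoop.log: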
2017-09-01 14:35:07,172 INFO solr.SolrIndexWriter - SolrIndexer: deleting 1/1 documents
2017-09-01 14:35:07,321 WARN output.FileOutputCommitter - Output Path is null in cleanupJob()
2017-09-01 14:35:07,323 WARN mapred.LocalJobRunner - job_local1176811933_0001
java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.http.util.Asserts.check(Asserts.java:34)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:122)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2017-09-01 14:35:07,679 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:174)
    at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)
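I somehow solved this problem. I think the whitespace inside the <value> element in nutch-site.xml was causing it. For anyone else who arrives here, this is my new plugin.includes section: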
<name>plugin.includes</name>
<value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html)|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
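For anyone wondering why whitespace would break only the protocol plugins, here is a minimal Java sketch. It assumes (I have not checked the Nutch 1.13 source) that the raw plugin.includes value is compiled as a regular expression without trimming and that each plugin id must match it in full. The class name, the `included` helper and the shortened pattern are hypothetical and only for illustration. Because `|` has the lowest precedence in a regex, a leading newline becomes part of the first alternative only, and a trailing newline part of the last one.

```java
import java.util.regex.Pattern;

public class PluginIncludesWhitespaceDemo {

    // Hypothetical helper: true when the whole plugin id matches the regex.
    // Assumption: this mirrors how plugin ids are checked against plugin.includes.
    static boolean included(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Shortened version of the value from the question.
        String core = "protocol-(http|httpclient)|urlfilter-regex|parse-(html)"
                + "|indexer-solr|urlnormalizer-(pass|regex|basic)";

        // The same value as it appears when <value> spans several lines:
        // a newline before the first alternative and one after the last.
        String padded = "\n" + core + "\n";

        // First alternative is now "\nprotocol-(http|httpclient)", so it no longer matches.
        System.out.println(included(padded, "protocol-http"));        // false
        // Middle alternatives are unaffected by the surrounding whitespace.
        System.out.println(included(padded, "urlfilter-regex"));      // true
        // Last alternative is now "urlnormalizer-(pass|regex|basic)\n".
        System.out.println(included(padded, "urlnormalizer-basic"));  // false

        // With the whitespace removed, the protocol plugin matches again.
        System.out.println(included(padded.trim(), "protocol-http")); // true
    }
}
```

If that assumption holds, it would explain why the filters, parsers and indexers loaded while protocol-http did not: only the first and last entries of the list are affected. Keeping the whole value on one line, as in the snippet above, avoids the problem.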