Nutch 2.3.1 crawling seed URL only
I have to crawl all the inlinks of a few URLs (at most). For this I am using Apache Nutch 2.3.1 with Hadoop and HBase. Below is the nutch-site.xml file used for this purpose.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>crawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more|urdu)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
<name>http.robots.403.allow</name>
<value>true</value>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
</property>
<property>
<name>http.robots.agents</name>
<value>crawler,*</value>
</property>
<!-- language-identifier plugin properties -->
<property>
<name>lang.ngram.min.length</name>
<value>1</value>
</property>
<property>
<name>lang.ngram.max.length</name>
<value>4</value>
</property>
<property>
<name>lang.analyze.max.length</name>
<value>2048</value>
</property>
<property>
<name>lang.extraction.policy</name>
<value>detect,identify</value>
</property>
<property>
<name>lang.identification.only.certain</name>
<value>true</value>
</property>
<!-- Language properties ends here -->
<property>
<name>http.timeout</name>
<value>20000</value>
</property>
<!-- These properties were added because the number of crawled documents had started to decrease -->
<property>
<name>fetcher.max.crawl.delay</name>
<value>10</value>
</property>
<property>
<name>generate.max.count</name>
<value>10000</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
</configuration>
When I crawl these URLs, only the seed URLs are fetched, and then the crawl ends with this message:
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 20
GeneratorJob: finished at 2017-04-21 16:28:35, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1492774111-8887 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
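For reference, the crawl is driven with the standard Nutch 2.x wrapper script, along these lines (a sketch only; the seed directory, crawl id, and number of rounds here are assumptions, not the exact command that was run):

# seed URLs in urls/seed.txt; crawl id "webcrawl"; 5 generate/fetch/parse/updatedb rounds
bin/crawl urls/ webcrawl 5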
A similar question has been asked before, but it was for version 1.1, and the solution I implemented from it does not apply to my case.
Could you check your conf/regex-urlfilter.txt and see whether a URL filter regex is blocking the expected outlinks?
# accept anything else
+.
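For example, if a host-restricting rule like the following sits in front of the catch-all (a hypothetical snippet; example.com stands in for a real host), every outlink that does not match it is silently dropped at generate time:

# hypothetical restrictive filter: accept one host, reject everything else
+^https?://(www\.)?example\.com/
-.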
When you set db.ignore.external.links to true, Nutch will not generate outlinks that point to a different host. You also need to check that the db.ignore.internal.links property in conf/nutch-default.xml is set to false; otherwise, no outlinks will be generated at all.
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
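After adjusting the properties, you can verify that outlinks are actually reaching the webtable (a sketch, assuming the stock Nutch 2.x WebTableReader and a crawl id of "webcrawl"):

# print webtable statistics; a growing URL count means new outlinks are being stored
bin/nutch readdb -crawlId webcrawl -stats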
HTH.