Why does Nutch (v2.3) crawl only the seed URL, instead of crawling an entire website?

Question

I'm trying to crawl an entire, specific website (ignoring external links) using Nutch 2.3 with HBase 0.94.14.

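I followed a step-by-step tutorial (you can find it here) on how to set up and use these tools. However, I haven't been able to achieve my goal. Instead of crawling the entire website whose URL I wrote in the seed.txt file, Nutch retrieves only that base URL in the first round. I need to run further crawls for Nutch to retrieve more URLs.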

The problem is that I don't know how many rounds it takes to crawl the entire website, so I need a way to tell Nutch to "keep crawling until the entire website has been crawled" (in other words, "crawl the entire website in a single round").

Here are the key steps and settings I have followed so far:

Put the base URL in the seed.txt file.

http://www.whads.com/

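Set up Nutch's nutch-site.xml configuration file. After finishing the tutorial, I added a few more properties following suggestions from other Stack Overflow questions (none of which, however, seems to have solved my problem).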

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
        <property>
            <name>http.agent.name</name>
            <value>test-crawler</value>
        </property>
        <property>
            <name>storage.data.store.class</name>
            <value>org.apache.gora.hbase.store.HBaseStore</value>
        </property>
        <property>
            <name>plugin.includes</name>
            <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
        </property>
        <property>
            <name>db.ignore.external.links</name>
            <value>true</value>
        </property>
        <property>
            <name>db.ignore.internal.links</name>
            <value>false</value>
        </property>
        <property>
            <name>fetcher.max.crawl.delay</name>
            <value>-1</value>
        </property>
        <property>
            <name>fetcher.threads.per.queue</name>
            <value>50</value>
            <description></description>
        </property>
        <property> 
            <name>generate.count.mode</name> 
            <value>host</value>
        </property>
        <property> 
            <name>generate.max.count</name> 
            <value>-1</value>
        </property>
</configuration>

Added the "accept anything else" rule to Nutch's regex-urlfilter.txt configuration file, following suggestions from Stack Overflow and the Nutch mailing list.

# Already tried these two filters (one at a time, 
# and each one combined with the 'anything else' one)
#+^http://www.whads.com
#+^http://([a-z0-9]*.)*whads.com/

# accept anything else
+.
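
A candidate accept pattern can be sanity-checked outside Nutch before starting a crawl. The following is only a rough sketch: `url_accepted` is a hypothetical helper, it uses `grep -E` rather than Nutch's own Java regex engine (the two should agree for a pattern this simple, but that is an assumption), and the dots in the pattern are escaped so that they match literal dots only.

```shell
# Hypothetical helper (not part of Nutch): succeed if URL $2 matches pattern $1.
url_accepted() {
  printf '%s\n' "$2" | grep -Eq "$1"
}

# Host-restricted pattern with the dots escaped (assumed equivalent in Java regex).
pattern='^http://([a-z0-9]*\.)*whads\.com/'

for url in 'http://www.whads.com/' 'http://www.whads.com/about' 'http://example.com/'; do
  if url_accepted "$pattern" "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

A rule of the form `+<pattern>` in regex-urlfilter.txt accepts matching URLs, so a URL reported as rejected here would only get through if a later rule such as `+.` accepts it.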

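Crawling: I have tried two different approaches (both yield the same result, with only one URL generated and fetched in the first round):
- Using bin/nutch (following the tutorial):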
```
bin/nutch inject urls
bin/nutch generate -topN 50000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
```
- Using bin/crawl:
```
bin/crawl urls whads 1
```

Am I still missing something? Am I doing something wrong? Or is it that Nutch can't crawl an entire website in one go?

Thanks in advance!

Answer 1

Please update your configuration as shown below:

    <property>
        <name>db.ignore.external.links</name>
        <value>false</value>
    </property>

You are actually ignoring external links, i.e. external URLs are not crawled.

Answer 2

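After trying everything I could find on the Internet and spending a few more days with Nutch, I finally gave up. Someone said it is no longer possible to crawl an entire website in one go with Nutch. So, if anyone with the same problem stumbles upon this question, do what I did: drop Nutch and use something like Scrapy (Python). You need to set up the spiders manually, but it works like a charm, is far more extensible and faster, and the results are better.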

Answer 3

Have you tried using -1 at the end? I can see you are using 1 at the end, which runs the crawl only once.
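
Nutch adds newly discovered links to its database only after a round's parse and updatedb steps, so each round reaches roughly one link level deeper than the previous one. As a rough sketch of running several rounds back to back, the commands from the question can be wrapped in a loop. `NUTCH`, `crawl_rounds`, and the round limit are names made up for illustration. The sketch assumes that aborting on the first failing step is acceptable, and it does not detect when the site has been fully crawled.

```shell
# Path to the nutch launcher; NUTCH is an illustrative variable, not a Nutch setting.
NUTCH="${NUTCH:-bin/nutch}"

# crawl_rounds <seed_dir> <max_rounds>
# Injects the seeds once, then repeats generate/fetch/parse/updatedb.
crawl_rounds() {
  cr_seed_dir="$1"
  cr_max_rounds="$2"
  cr_round=1
  "$NUTCH" inject "$cr_seed_dir" || return 1
  while [ "$cr_round" -le "$cr_max_rounds" ]; do
    echo "round $cr_round of $cr_max_rounds"
    # Each round fetches the URLs added by the previous round's updatedb.
    "$NUTCH" generate -topN 50000 || return 1
    "$NUTCH" fetch -all || return 1
    "$NUTCH" parse -all || return 1
    "$NUTCH" updatedb -all || return 1
    cr_round=$((cr_round + 1))
  done
}
```

For example, `crawl_rounds urls 10` would inject the seeds from the `urls` directory and then run up to ten rounds. Picking a generous round limit is the simplest choice here; how an empty fetch list is reported may differ between Nutch versions, so the sketch does not rely on it.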

