Apache Nutch crawler - keeps retrieving only a single URL

Question

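The inject step keeps retrieving only a single URL while I am trying to crawl CNN. I am using the default configuration (my nutch-site.xml is below). What could be the cause? Given the value I configured, shouldn't it be 10 documents?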

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>crawler1</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>solr.server.url</name>
    <value>http://x.x.x.x:8983/solr/collection1</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>10</value>
  </property>
</configuration>

Answer 1

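A Nutch crawl consists of four basic steps: generate, fetch, parse, and update the database. These steps are the same in Nutch 1.x and Nutch 2.x. Running all four steps to completion makes up one crawl cycle.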

The injector is the first step, which adds URLs to the crawldb, as described here and here.

To populate initial rows for the webtable you can use the InjectorJob.

I assume you have already provided a seed URL, i.e. cnn.com.

generate.max.count limits the number of URLs fetched from a single domain, as described here.

What matters now is how many URLs from cnn.com your crawldb contains.

Option 1

If you have generate.max.count = 10 and you have seeded (injected) more than 10 URLs into the crawldb, then Nutch should fetch no more than 10 URLs when a crawl cycle runs.

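Option 2

If you injected only one URL and ran only one crawl cycle, then the first cycle processes only one document, because there is only one URL in your crawldb. The crawldb is updated at the end of each crawl cycle, so the outlinks discovered in one cycle become available to the next. From the second and third cycles onward, Nutch therefore fetches and parses at most 10 URLs from a given domain per cycle.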
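To make the two options concrete, here is a toy simulation in Python. It is not Nutch code and does not call any Nutch API: `generate`, `crawl_cycle`, and `fake_outlinks` are hypothetical helpers, the link graph (every page links to 25 pages on the same host) is invented, and URLs are grouped by host name as a simple stand-in for the per-domain counting described above. It is only a sketch of why a single seed yields one document in the first cycle and up to 10 per cycle afterwards.

```python
from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_HOST = 10  # plays the role of generate.max.count (assumption: counted per host)


def generate(crawldb, fetched, max_per_host=MAX_PER_HOST):
    """Select unfetched URLs, keeping at most max_per_host per host."""
    per_host = defaultdict(int)
    batch = []
    for url in sorted(crawldb - fetched):
        host = urlparse(url).netloc
        if per_host[host] < max_per_host:
            per_host[host] += 1
            batch.append(url)
    return batch


def crawl_cycle(crawldb, fetched, outlinks):
    """One toy cycle: generate, fetch, parse, update. Returns the fetched batch."""
    batch = generate(crawldb, fetched)
    for url in batch:
        fetched.add(url)               # stands in for "fetch"
        crawldb.update(outlinks(url))  # stands in for "parse" + "update the database"
    return batch


def fake_outlinks(url):
    """Invented link graph: each page links to 25 pages on its own host."""
    host = urlparse(url).netloc
    return {f"http://{host}/page{i}" for i in range(25)}


# Option 2: a single seed URL, three crawl cycles.
crawldb = {"http://cnn.com/"}
fetched = set()
sizes = [len(crawl_cycle(crawldb, fetched, fake_outlinks)) for _ in range(3)]
print(sizes)  # [1, 10, 10]

# Option 1: 15 seed URLs on the same host, first generate step only.
seeded = {f"http://cnn.com/section{i}" for i in range(15)}
first_batch = generate(seeded, set())
print(len(first_batch))  # 10
```

In this sketch the first cycle can only fetch the seed because nothing else is in the crawldb yet; the cap only starts to matter once the update step has added the discovered outlinks.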
