如何在 StormCrawler 中将 URL 作为文本文件播种?
How to seed URLs as a text file in StormCrawler?
我有许多 URL(大约 40,000 个)需要使用 StormCrawler 进行抓取。
有什么方法可以将这些 URL 作为文本文件而不是 crawler.flux 中的列表传递?像这样:
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
parallelism: 1
constructorArgs:
- "URLs.txt"
对于 Solr 和 Elasticsearch,有一些注入器可以从文件中读取 URL,并将它们作为已发现的项目添加到状态索引中。当然,需要使用Solr或者Elasticsearch来保存状态索引。注入器作为拓扑启动,eg.
storm ... com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector .../seeds '*' -conf ...
有一个FileSpout exactly for that purpose. It is used by the topologies mentioned by @sebastian-nagel and you can use them in yours just as well, see for example this topology。
根据 @Julien Nioche 的回答,我写了一个 crawler.flux 来完成我想要的。这是文件:
name: "crawler"
includes:
- resource: true
file: "/crawler-default.yaml"
override: false
- resource: false
file: "crawler-conf.yaml"
override: true
- resource: false
file: "solr-conf.yaml"
override: true
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.solr.persistence.SolrSpout"
parallelism: 1
- id: "filespout"
className: "com.digitalpebble.stormcrawler.spout.FileSpout"
parallelism: 1
constructorArgs:
- "."
- "seeds"
- true
bolts:
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 1
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 1
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 1
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 5
- id: "index"
className: "com.digitalpebble.stormcrawler.solr.bolt.IndexerBolt"
parallelism: 1
- id: "status"
className: "com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt"
parallelism: 1
streams:
- from: "spout"
to: "partitioner"
grouping:
type: SHUFFLE
- from: "partitioner"
to: "fetcher"
grouping:
type: FIELDS
args: ["key"]
- from: "fetcher"
to: "sitemap"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "sitemap"
to: "parse"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "parse"
to: "index"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "fetcher"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "sitemap"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "parse"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "index"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filespout"
to: "status"
grouping:
streamId: "status"
type: CUSTOM
customClass:
className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
constructorArgs:
- "byDomain"
而不是 "." 您可以设置 URL 文件所在的目录,而不是 "seeds" 你可以输入你的 URL 文件名。
我有许多 URL(大约 40,000 个)需要使用 StormCrawler 进行抓取。 有什么方法可以将这些 URL 作为文本文件而不是 crawler.flux 中的列表传递?像这样:
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
parallelism: 1
constructorArgs:
- "URLs.txt"
对于 Solr 和 Elasticsearch,有一些注入器可以从文件中读取 URL,并将它们作为已发现的项目添加到状态索引中。当然,需要使用Solr或者Elasticsearch来保存状态索引。注入器作为拓扑启动,eg.
storm ... com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector .../seeds '*' -conf ...
有一个FileSpout exactly for that purpose. It is used by the topologies mentioned by @sebastian-nagel and you can use them in yours just as well, see for example this topology。
根据 @Julien Nioche 的回答,我写了一个 crawler.flux 来完成我想要的。这是文件:
name: "crawler"
includes:
- resource: true
file: "/crawler-default.yaml"
override: false
- resource: false
file: "crawler-conf.yaml"
override: true
- resource: false
file: "solr-conf.yaml"
override: true
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.solr.persistence.SolrSpout"
parallelism: 1
- id: "filespout"
className: "com.digitalpebble.stormcrawler.spout.FileSpout"
parallelism: 1
constructorArgs:
- "."
- "seeds"
- true
bolts:
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 1
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 1
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 1
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 5
- id: "index"
className: "com.digitalpebble.stormcrawler.solr.bolt.IndexerBolt"
parallelism: 1
- id: "status"
className: "com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt"
parallelism: 1
streams:
- from: "spout"
to: "partitioner"
grouping:
type: SHUFFLE
- from: "partitioner"
to: "fetcher"
grouping:
type: FIELDS
args: ["key"]
- from: "fetcher"
to: "sitemap"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "sitemap"
to: "parse"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "parse"
to: "index"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "fetcher"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "sitemap"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "parse"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "index"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filespout"
to: "status"
grouping:
streamId: "status"
type: CUSTOM
customClass:
className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
constructorArgs:
- "byDomain"
而不是 "." 您可以设置 URL 文件所在的目录,而不是 "seeds" 你可以输入你的 URL 文件名。