Limit the crawl to subpages of the seed url
I have this configuration, which crawls pages based on the seed:
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": false,
    "ignoreOutsideDomain": true
  }
}
However, how can I restrict the crawl to subpages of the seed?
For example, if I have a seed of "https://www.test.com/", then with the above settings the crawler also crawls and adds urls like "https://stg.test.com/" and its subpages, etc.
How can I restrict crawling to "https://www.test.com/" and just subpages of this seed, like "https://www.test.com/test1", "https://www.test.com/test2", etc.?
TIA。
Just set ignoreOutsideHost to true in the HostURLFilter configuration.
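Applying that to the config from the question, the filter entry would look like this (a sketch based on the original snippet, with only the ignoreOutsideHost flag flipped):

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": true,
    "ignoreOutsideDomain": true
  }
}
```

With ignoreOutsideHost set to true, any discovered URL whose host differs from the source page's host (e.g. "https://stg.test.com/" when crawling from "https://www.test.com/") is filtered out, so only pages under the seed's host are kept; the host check is stricter than the domain check, so ignoreOutsideDomain no longer matters in practice.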