Limit the crawl to subpages of the seed url

I have this configuration, which crawls pages based on the seeds:


    {
      "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
      "name": "HostURLFilter",
      "params": {
        "ignoreOutsideHost": false,
        "ignoreOutsideDomain": true
      }
    }


But how can I limit the crawl to subpages of the seed only? For example, if I have a seed such as "https://www.test.com/", with the above settings the crawler also crawls and adds URLs like "https://stg.test.com/" and its subpages.

How can I limit the crawl to "https://www.test.com/" and just the subpages of this seed, like "https://www.test.com/test1", "https://www.test.com/test2", etc.?

TIA.

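Just set ignoreOutsideHost to true in the configuration of the HostURLFilter. With ignoreOutsideDomain alone, any host under test.com (including stg.test.com) is still accepted; ignoreOutsideHost restricts the crawl to the exact hostname, www.test.com.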
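
For reference, the filter entry from the question with that one flag flipped might look like the sketch below. It assumes the entry sits in the same URL filters configuration file the question's snippet came from. Leaving ignoreOutsideDomain at true is a choice made here for illustration; the host check is the stricter of the two, so the domain check should be redundant once it is enabled.

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": true,
    "ignoreOutsideDomain": true
  }
}
```

With this in place, links such as "https://www.test.com/test1" should be kept, while links to "https://stg.test.com/" should be dropped because the hostname differs from that of the page they were found on.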