Limit the crawl to subpages of the seed url
I have this configuration, which crawls pages based on the seed:
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": false,
    "ignoreOutsideDomain": true
  }
}
However, how can I restrict the crawl to subpages of the seed?
For example, if I have a seed of "https://www.test.com/", then with the above settings the crawler also crawls and adds urls like "https://stg.test.com/" and its subpages, etc.
How can I restrict crawling to "https://www.test.com/" and just subpages of this seed, like "https://www.test.com/test1", "https://www.test.com/test2", etc.?
TIA。
Just set ignoreOutsideHost to true in the HostURLFilter configuration.
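Applying that to the config from the question, the filter entry would look like this (a sketch based on the original snippet, with only the ignoreOutsideHost flag flipped):

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": true,
    "ignoreOutsideDomain": true
  }
}
```

With ignoreOutsideHost set to true, any discovered URL whose host differs from the source page's host (e.g. "https://stg.test.com/" when crawling from "https://www.test.com/") is filtered out, so only pages under the seed's host are kept; the host check is stricter than the domain check, so ignoreOutsideDomain no longer matters in practice.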