StormCrawler's ContentParseFilter

Question

If I set StormCrawler's ContentParseFilter to

"pattern": "//DIV[@id=\"site-body\"]",

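does that mean this is the only place it will look for links to other pages when it processes each URL? I want to know whether setting it will make it start ignoring all the URLs in the menus, and so on.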

Thanks! Jim

Answer 1

The ContentFilter lets you restrict the text of a document to the text covered by an XPath expression.

It does not affect link extraction at all; it is only meant to improve the text that gets indexed.
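
To illustrate the distinction, here is a rough conceptual sketch in Python. It is not StormCrawler code and does not use its API; the sample page, the `extract` helper and the lowercase `div` in the XPath are all made up for the example, and it relies only on the standard library's limited XPath support. The point is just that the links come from the whole document while the text comes only from the matched element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with a menu and a main content block.
PAGE = """<html><body>
<div id="menu"><a href="/about">About</a></div>
<div id="site-body"><p>Main article text. <a href="/next">Next</a></p></div>
</body></html>"""


def extract(page, text_xpath):
    """Return (links, text): links from the whole page, text from the XPath match."""
    root = ET.fromstring(page)
    # Links are collected from the entire document; the text filter plays no part here.
    links = [a.get("href") for a in root.iter("a")]
    # Text is restricted to the first element matching the XPath;
    # if nothing matches, fall back to the text of the whole document.
    match = root.find(text_xpath)
    node = match if match is not None else root
    text = " ".join("".join(node.itertext()).split())
    return links, text


links, text = extract(PAGE, ".//div[@id='site-body']")
print(links)
print(text)
```

In this sketch the menu link is still among the extracted links, but the menu's text is left out of the text that would be indexed.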
