How to add more XPath expressions to parsefilter.json in StormCrawler

Question

I am using StormCrawler (v 1.16) and Elasticsearch (v 7.5.0) to extract data from about 5k news websites. I have added some XPath patterns to parsefilter.json to extract the author name. The parsefilter.json looks like this:

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content",
        "parse.datePublished": "//META[@itemprop=\"datePublished\"]/@content",
        "parse.author": [
          "//META[@itemprop=\"author\"]/@content",
          "//input[@id=\"authorname\"]/@value",
          "//META[@name=\"article:author\"]/@content",
          "//META[@name=\"author\"]/@content",
          "//META[@name=\"byline\"]/@content",
          "//META[@name=\"dc.creator\"]/@content",
          "//META[@name=\"byl\"]/@content",
          "//META[@itemprop=\"authorname\"]/@content",
          "//META[@itemprop=\"article:author\"]/@content",
          "//META[@itemprop=\"byline\"]/@content",
          "//META[@itemprop=\"dc.creator\"]/@content",
          "//META[@rel=\"authorname\"]/@content",
          "//META[@rel=\"article:author\"]/@content",
          "//META[@rel=\"byline\"]/@content",
          "//META[@rel=\"dc.creator\"]/@content",
          "//META[@rel=\"author\"]/@content",
          "//META[@id=\"authorname\"]/@content",
          "//META[@id=\"byline\"]/@content",
          "//META[@id=\"dc.creator\"]/@content",
          "//META[@id=\"author\"]/@content",
          "//META[@class=\"authorname\"]/@content",
          "//META[@class=\"article:author\"]/@content",
          "//META[@class=\"byline\"]/@content",
          "//META[@class=\"dc.creator\"]/@content",
          "//META[@class=\"author\"]/@content"
        ]
      }
    },

I have also made changes in crawler-conf.yaml, as shown below:

    indexer.md.mapping:
    - parse.author=author
    metadata.persist:
    - author

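The problem I am facing is that I only get results from the first pattern of "parse.author" (i.e. //META[@itemprop="author"]/@content). What changes should I make so that all patterns are taken as input?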

Answer 1

What changes I should do so that all patterns can be taken as input.

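I read this as "How do I write a single XPath expression that tries all the different ways the author can appear in a document?"

The simplest way: combine all the expressions you already have into a single one using the XPath union operator |: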

input[...]|meta[...]|meta[...]|meta[...]

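Because this may select more than one node, we can state explicitly that we only care about the first match (the first node in document order):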

(input[...]|meta[...]|meta[...]|meta[...])[1]

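This would work, but it would be long and hard to read. XPath can do better.

Your expressions are all very repetitive, which is a good starting point for shrinking them. For example, these two are identical except for the attribute value: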

//meta[@class='author']/@content|//meta[@class='authorname']/@content

We can use or, and it is already shorter:

//meta[@class='author' or @class='authorname']/@content

But with 5 or 6 possible values it is still long. Next attempt, a predicate on the attribute itself:

//meta[@class[.='author' or .='authorname']]/@content

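Slightly shorter, since we don't have to keep typing @class. But it is still long with 5 or 6 possible values. How about a list of values and a substring search (I'm using / as the delimiter):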

//meta[contains(
    '/author/authorname/',
    concat('/', @class, '/')
)]/@content
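To see why the delimiters matter, the substring test can be evaluated by hand for a few attribute values. The three expressions below are only an illustration; the value 'auth' is a hypothetical class name that should not match.

```xpath
contains('/author/authorname/', concat('/', 'authorname', '/'))
contains('/author/authorname/', concat('/', 'auth', '/'))
contains('author authorname', 'auth')
```

The first is true, because '/authorname/' occurs in the list. The second is false, because '/auth/' does not occur anywhere in '/author/authorname/'. The third, written without delimiters, is true even though 'auth' is not one of the intended values, which is exactly the false positive the surrounding / characters prevent.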

Now we can easily extend the list of valid values, and even look at different attributes:

//meta[contains(
    '/author/authorname/article:author/',
    concat('/', @class|@id , '/')
)]/@content

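And since we are looking for pretty much the same strings across several possible attributes, we can use one fixed list of values and check it against all of those attributes: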

//meta[
    contains(
        '/author/article:author/authorname/dc.creator/byline/byl/',
        concat('/', @name|@itemprop|@rel|@id|@class, '/')
    )
]/@content

Combining the previous two points, we get:

(
    //meta[
        contains(
            '/author/article:author/authorname/dc.creator/byline/byl/',
            concat('/', @name|@itemprop|@rel|@id|@class, '/')
        )
    ]/@content
    |
    //input[
        @id='authorname'
    ]/@value
)[1]
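As a rough sketch of how the combined expression might be placed in the parsefilter.json from the question: JSON strings cannot contain raw line breaks, so the expression is collapsed onto one line, and because it only uses single quotes no escaping is needed. Two assumptions are made here and not verified: that XPathFilter evaluates a parenthesized union with a positional predicate like any other expression, and that element names should be cased as in the question's own file (META in upper case, input in lower case). XPath name tests are case-sensitive, so use whichever casing actually matches in your setup.

```json
"params": {
  "parse.author": "(//META[contains('/author/article:author/authorname/dc.creator/byline/byl/', concat('/', @name|@itemprop|@rel|@id|@class, '/'))]/@content | //input[@id='authorname']/@value)[1]"
}
```

Only the "parse.author" entry is shown; the other keys under "params" stay as they are.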

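Caveat: this only works as expected if a <meta> never carries two of these attributes at once (e.g., both @name and @rel), or if the attributes at least have the same value. In XPath 1.0, converting a node-set to a string uses only its first node, so with two different values concat('/', @name|@itemprop|@rel|@id|@class, '/') may pick the wrong one. This is a calculated risk; I don't think this happens often in HTML. But you need to decide, since you are the one who knows your input data.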
