Groovy XmlParser / XmlSlurper: node.localText() position?

Question

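I have a follow-up question to an earlier question.

It explains that, to get the local inner text of an (HTML) node without recursively collecting the nested text of any inner child nodes, one has to use #localText() instead of #text().

For instance, a slightly enhanced example from the original question: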

<html>
    <body>
        <div>
            Text I would like to get1.
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </div>
        <span>
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>

with the solution applied:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0].localText()

would return:

[Text I would like to get1., Text I would like to get2., Text I would like to get3.]

However, when parsing the <span> part of this example

println htmlParsed.body.span[0].localText()

the output is

[Text I would like to get2., Text I would like to get3.]

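The problem I am now facing is that there is apparently no way to determine the position of the text ("between which child nodes"). I would have expected the second call to yield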

[, Text I would like to get2., Text I would like to get3.]

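That would make it clear: position 0 (before child 0) is empty, position 1 (between child 0 and 1) is "Text I would like to get2.", and position 2 (between child 1 and 2) is "Text I would like to get3." But given how the API works, there is apparently no way to tell whether the text returned at index 0 actually sits at position 0 or at any other position, and the same goes for every other index.

I tried this with both XmlSlurper and XmlParser and got the same results.

If I am not mistaken, it is consequently also impossible to fully recreate the original HTML document from the parser's information, because this "text index" information is lost.

My question is: is there any way to find out those text positions? An answer that requires me to change the parser would also be acceptable.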

UPDATE/SOLUTION:

For further reference, here is Will P's answer, applied to the original code:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlParser(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0].children().collect {it in String ? it : null}

This yields:

[Text I would like to get1., null, Text I would like to get2., null, Text I would like to get3.]

One has to use XmlParser instead of XmlSlurper, together with node.children().
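The list of positional "text slots" I originally expected can be derived from children(). The following is only a minimal sketch under some assumptions: it uses a plain XmlParser on well-formed markup instead of TagSoup, it assumes Groovy 3 or later (where XmlParser lives in groovy.xml; on older versions the import can be dropped), and textSlots is a hypothetical helper name, not part of the Groovy API.

```groovy
import groovy.xml.XmlParser

// Hypothetical helper: one text slot per gap around the element children.
// Slot i holds the (trimmed) text that precedes element child i;
// the last slot holds the text after the final element child.
List<String> textSlots(Node node) {
    def slots = [new StringBuilder()]
    node.children().each { child ->
        if (child instanceof String) {
            slots.last().append(child.trim())   // text belongs to the current gap
        } else {
            slots << new StringBuilder()        // an element child opens the next gap
        }
    }
    slots*.toString()
}

def spanXml = '''<span>
    <a href="http://intro.com">extra stuff</a>
    Text I would like to get2.
    <a href="http://example.com">link to example</a>
    Text I would like to get3.
</span>'''

def span = new XmlParser().parseText(spanXml)

// The empty first slot shows that nothing precedes the first <a>.
assert textSlots(span) == ['', 'Text I would like to get2.', 'Text I would like to get3.']
```

With n element children the helper always returns n + 1 slots, so the index of a slot says between which children the text sits.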

Answer 1

I don't know jsoup, and I hope it does not interfere with the solution, but with a pure XmlParser you can get a children() list that contains the raw strings:

html = '''<html>
    <body>
        <div>
            Text I would like to get1.
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </div>
        <span>
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>'''

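// Plain XmlParser: children() returns element nodes and text strings in document order
def root = new XmlParser().parseText html

root.body.div[0].children().with {
    // index 0: the text before the first <a>
    assert get(0).trim() == 'Text I would like to get1.'
    assert get(0).getClass() == String

    // index 1: the first <a> element
    assert get(1).name() == 'a'
    assert get(1).getClass() == Node

    // index 2: the text between the two <a> elements, whitespace preserved
    assert get(2) == '''
            Text I would like to get2.
            '''
}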
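The same check can be run against the <span> case from the question, where localText() alone hides the fact that no text precedes the first link. This is a minimal sketch, assuming Groovy 3 or later (where XmlParser lives in groovy.xml; on older versions drop the import) and a plain XmlParser rather than TagSoup; the variable names are illustrative only.

```groovy
import groovy.xml.XmlParser

def span = new XmlParser().parseText('''<span>
    <a href="http://intro.com">extra stuff</a>
    Text I would like to get2.
    <a href="http://example.com">link to example</a>
    Text I would like to get3.
</span>''')

// localText() returns only the strings, so the position of each one is not visible
assert span.localText()*.trim() == ['Text I would like to get2.', 'Text I would like to get3.']

// children() keeps elements and strings interleaved in document order
def kinds = span.children().collect { it instanceof Node ? '<' + it.name() + '>' : 'text' }
assert kinds == ['<a>', 'text', '<a>', 'text']

// The first child is an element, so nothing but whitespace precedes the first <a>
assert span.children()[0] instanceof Node
```

Whitespace-only runs do not show up as children here because XmlParser drops them by default, which is why the first entry is the <a> node rather than an empty string.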

groovy

xmlslurper

html-parsing