AgilityPack: select InnerText but skip a specific tag

Given an example like this:

<p>there is something here <span>we can't have this</span> again here <em>but we keep this one</em> we are good to go now </p>

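I have a way to remove the span node so that I get only the inner text of all the other tags. But I need to keep the span tag in the document and just skip its InnerText when I read the text. Right now I have this: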

var paragraphe = html.DocumentNode.SelectNodes("p");
for (int i = 0; i < paragraphe.Count; i++)
{
    string innerTextOfP = paragraphe[i].InnerText;
    if (string.IsNullOrEmpty(innerTextOfP))
    {
        //Do something later.
    }
    else
    {
        //something is done here with the text I get.
    }
}

The best approach I can think of is to have another query like:

var nodeSpan = html.DocumentNode.SelectNodes("span");

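and then, while iterating over the children of the P element with a string buffer to collect the text, compare each one (paragraphe.childNode = nodeSpan) and skip its content. But I think Agility Pack has another way to do this kind of thing; I just don't know what it is.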

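In my case I also need to skip the content of a DIV (and its children) if its class is not "contenu".

So the approach I planned for the span does not work well for the DIV part.

How should I do this with AgilityPack?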

Edit: the expected result for this case is:

string innerTextOfP = "there is something here again here but we keep this one we are good to go now"

You can remove the span children from a copy of the paragraph:

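var paragraphes = html.DocumentNode.SelectNodes("//p");

foreach (var p in paragraphes)
{
    var clone = p.Clone(); // to avoid modification of original html

    // By default SelectNodes returns null (not an empty collection) when nothing matches
    var spans = clone.SelectNodes("span");
    if (spans != null)
        foreach (var span in spans)
            clone.RemoveChild(span);

    var divs = clone.SelectNodes("div[not(@class='contenu')]");
    if (divs != null)
        foreach (var div in divs)
            clone.RemoveChild(div);

    // remove other nodes which you want to skip here

    // collapse the whitespace runs left behind by the removed nodes
    string innerTextOfP = Regex.Replace(clone.InnerText, @"\s+", " ");
}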

Note that I use a regular expression to replace several consecutive whitespace characters with a single space. The output is:

there is something here again here but we keep this one we are good to go now
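
For comparison, the child-iteration idea from the question can also be sketched without cloning or removing anything: walk the text nodes under the paragraph and keep only those that have no skipped ancestor. This is a minimal sketch, not a definitive implementation. The names `VisibleText`, `IsSkipped` and `TextWithoutSkipped` are hypothetical helpers invented for illustration, and the skip rules (any span, any div whose class attribute is not exactly "contenu") are assumptions taken from the question.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class VisibleText
{
    // Hypothetical rule: text under a span, or under a div whose class
    // is not exactly "contenu", should be skipped.
    static bool IsSkipped(HtmlNode node) =>
        node.Name == "span" ||
        (node.Name == "div" && node.GetAttributeValue("class", "") != "contenu");

    // Hypothetical helper: concatenates the text nodes below root that have
    // no skipped ancestor between themselves and root. The document is not modified.
    public static string TextWithoutSkipped(HtmlNode root)
    {
        var buffer = new StringBuilder();
        foreach (var text in root.Descendants().Where(n => n.NodeType == HtmlNodeType.Text))
        {
            bool skipped = text.Ancestors()
                               .TakeWhile(a => !ReferenceEquals(a, root))
                               .Any(IsSkipped);
            if (!skipped)
                buffer.Append(text.InnerText);
        }
        // Collapse whitespace runs, as in the regex step above, and trim the ends.
        return Regex.Replace(buffer.ToString(), @"\s+", " ").Trim();
    }

    static void Main()
    {
        var html = new HtmlDocument();
        html.LoadHtml("<p>there is something here <span>we can't have this</span> again here <em>but we keep this one</em> we are good to go now </p>");
        var p = html.DocumentNode.SelectSingleNode("//p");
        Console.WriteLine(TextWithoutSkipped(p));
    }
}
```

Because the check looks at every ancestor up to the paragraph, this variant also skips spans and divs nested deeper than direct children, which the `RemoveChild` version does not attempt. The trade-off is one ancestor walk per text node.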