AgilityPack: select InnerText but skip specific tags
Given an example like this:
<p>there is something here <span>we can't have this</span> again here <em>but we keep this one</em> we are good to go now </p>
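I know how to remove the span node so that I only get the inner text of all the other tags. However, I need to keep the span tag in the document and just skip its InnerText when I read the text. Right now I have this: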
var paragraphe = html.DocumentNode.SelectNodes("p");
for (int i = 0; i < paragraphe.Count; i++)
{
string innerTextOfP = paragraphe[i].InnerText;
if (string.IsNullOrEmpty(innerTextOfP))
{
//Do something later.
}
else
{
//something is done here with the text I get.
}
}
The best approach I can think of is to make a second selection like:
var nodeSpan = html.DocumentNode.SelectNodes("span");
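and then, while iterating over the children of the P element with a string buffer to collect the text, compare each child against that selection (paragraphe.childNode == nodeSpan) and skip its content.
But I suspect Agility Pack has another way to do this kind of thing; I just don't know what it is.
In my case I also need to skip the content of any DIV (and its children) whose class is not "contenu".
So the approach I had in mind for the span does not work well for the DIV part.
How should I do this with Agility Pack?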
Edit: the expected result for this case is:
string innerTextOfP = "there is something here again here but we keep this one we are good to go now"
You can remove the span child nodes from the paragraph:
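// Requires: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
var paragraphes = html.DocumentNode.SelectNodes("//p");
// By default SelectNodes returns null (not an empty collection) when nothing matches
foreach (var p in paragraphes ?? Enumerable.Empty<HtmlNode>())
{
    var clone = p.Clone(); // to avoid modification of original html
    foreach (var span in clone.SelectNodes("span") ?? Enumerable.Empty<HtmlNode>())
        clone.RemoveChild(span);
    foreach (var div in clone.SelectNodes("div[not(@class='contenu')]") ?? Enumerable.Empty<HtmlNode>())
        clone.RemoveChild(div);
    // remove other nodes which you want to skip here
    // collapse whitespace runs, then drop the leading/trailing space
    string innerTextOfP = Regex.Replace(clone.InnerText, @"\s+", " ").Trim();
}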
Note that I use a regular expression to replace several consecutive whitespace characters with a single space. The output is:
there is something here again here but we keep this one we are good to go now
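If cloning and removing nodes feels heavy, the same filtering can also be done while walking the tree, which is close to the string-buffer idea from the question. The following is only a sketch under assumptions: `FilteredText`, `ShouldSkip`, `Collect` and `GetText` are hypothetical helper names, not part of HtmlAgilityPack, and the skip rule is hard-coded to the two cases mentioned in the question. Unlike the clone-based code, it skips matching elements at any depth, not only direct children of the paragraph.

```csharp
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class FilteredText
{
    // Hypothetical rule: skip every <span>, and every <div> whose class
    // attribute is not exactly "contenu" (a div without a class is skipped too).
    private static bool ShouldSkip(HtmlNode node)
    {
        if (node.Name == "span") return true;
        return node.Name == "div" && node.GetAttributeValue("class", "") != "contenu";
    }

    // Depth-first walk: append text nodes, descend into elements that are not skipped.
    private static void Collect(HtmlNode node, StringBuilder buffer)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
                buffer.Append(child.InnerText);
            else if (child.NodeType == HtmlNodeType.Element && !ShouldSkip(child))
                Collect(child, buffer);
            // comment nodes and skipped elements contribute nothing
        }
    }

    // Returns the filtered text with whitespace runs collapsed to single spaces.
    public static string GetText(HtmlNode node)
    {
        var buffer = new StringBuilder();
        Collect(node, buffer);
        return Regex.Replace(buffer.ToString(), @"\s+", " ").Trim();
    }
}
```

Inside the loop over the paragraphs this would be called as `string innerTextOfP = FilteredText.GetText(p);`. The original document is never modified, so no clone is needed.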