HtmlAgilityPack: how to extract the HTML between certain tags
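I need to extract every paragraph from an HTML string, together with all the text that sits between those paragraph tags.
The code below does not work when the text parsed into the HtmlDocument differs from the original text. In the example,
some <br />text
becomes
some <br>text
Here is the code: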
string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
int lastPos = -1;
foreach (HtmlAgilityPack.HtmlNode n in nodes)
{
    if (lastPos > -1)
    {
        // Slice the serialized document from the end of the previous <p> to the start of this one.
        string textNotInP = doc.DocumentNode.OuterHtml.Substring(lastPos, n.StreamPosition - lastPos);
        System.Diagnostics.Debug.WriteLine(textNotInP);
    }
    System.Diagnostics.Debug.WriteLine(n.OuterHtml);
    lastPos = n.StreamPosition + n.OuterHtml.Length;
}
The correct result would be:
<p>firt paragraph</p>
some <br>text
<p>second paragraph</p>
<span>some text between span</span>
<p>third paragraph</p>
But the code above returns this:
<p>firt paragraph</p>
some <br>text<p
<p>second paragraph</p>
pan>some text between span</span><p
<p>third paragraph</p>
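The reason is that StreamPosition returns the node's position relative to the original text, not relative to the parsed HTML held by the HtmlDocument.
Is there a way to get a node's position relative to the parsed HTML?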
You can use the OuterHtml property of each <p> element to get the HTML you need:
string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}
Output:
<p>firt paragraph</p>
<p>another paragraph</p>
<p>hellow word</p>
Alternatively, if you want everything between the first <p> and the last <p> element, you can use the following XPath:
var query = "//node()[preceding-sibling::p or self::p][following-sibling::p or self::p]";
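The XPath selects every node (element or text node) that either has a preceding sibling p or is itself a p, and that also either has a following sibling p or is itself a p.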
var nodes = doc.DocumentNode.SelectNodes(query);
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}
Output:
<p>firt paragraph</p>
some
<br />
text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>
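The output above prints each in-between node on its own line. If you want the content between two paragraphs as a single string, as in the question, one option is to skip positions entirely and walk the parsed sibling nodes. This is only a minimal sketch: it assumes every <p> is a direct child of the document node, as in the sample string, and the `between` buffer is a name chosen here for illustration.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
        var doc = new HtmlDocument();
        doc.LoadHtml(s);

        // Collects the serialized HTML of every non-<p> node seen since the last <p>.
        var between = new StringBuilder();

        // Assumption: the <p> elements are top-level siblings, so ChildNodes is enough.
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "p")
            {
                // Flush whatever was collected before this paragraph, then print the paragraph.
                if (between.Length > 0)
                {
                    Console.WriteLine(between.ToString());
                    between.Clear();
                }
                Console.WriteLine(node.OuterHtml);
            }
            else
            {
                between.Append(node.OuterHtml);
            }
        }
        // Anything left in the buffer here came after the last <p> and is not printed.
    }
}
```

Because this works on the parsed nodes rather than on character offsets, it does not depend on StreamPosition matching the serialized document. Note that content appearing before the first <p> would also be printed by this sketch; add a flag if you need to skip it.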