Find elements between two anchors, including the anchor's parent, with XPath
I am trying to parse an auto-generated HTML file. It comes from a HAT, and I have no influence on the generated HTML.
<!DOCTYPE html>
<html lang="de">
<head>
<!-- Header bla bla -->
</head>
<body class="md-nav-expanded">
<!-- Some HTML-Elements, that doesn't matter -->
<div id="main">
<article>
<div id="topic-content" class="container-fluid">
<!-- Uninteresting div -->
<a id="main-content"></a>
<h2>Steuerelemente</h2>
<div class="main-content">
<h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>
<h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
<p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>
<h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
<p class="rvps3"><span class="rvts8"><br/></span></p>
<p class="rvps2" style="clear: both;">
<span class="rvts6">Autogenerated Text</span>
<!-- This anchor should be ignored, because it has no name attribute -->
<a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
</p>
</div>
<!-- The rest of the HTML doesn't matter -->
</div> <!-- /#topic-content -->
</article>
</div> <!-- /#main -->
</body>
</html>
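I am trying to extract the HTML from MyAnchor1 (including its parent h6, which could be any other element) up to MyAnchor2, then from MyAnchor2 to MyAnchor3, and from MyAnchor3 to the end.
First, I load the file into an HtmlDocument: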
htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(refFile);
Then I find the div 'main-content':
var mainContentDiv = htmlDoc.DocumentNode.SelectNodes("//div[contains(@class, 'main-content')]").FirstOrDefault();
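A side note on this lookup: by default, HtmlAgilityPack's SelectNodes returns null rather than an empty collection when nothing matches, so chaining FirstOrDefault() throws if the div is missing. In addition, contains(@class, 'main-content') also matches class values that merely contain that substring. A minimal sketch of a stricter lookup, assuming the div carries main-content as a whole class token (the names MainContentLookup and FindMainContent are hypothetical, not part of the library):

```csharp
using HtmlAgilityPack;

public static class MainContentLookup
{
    // Returns the first div whose class list contains the exact token
    // "main-content", or null if there is none.
    public static HtmlNode FindMainContent(HtmlDocument doc)
    {
        // Padding the class value with spaces lets us match whole tokens only,
        // so "main-content-old" is not selected.
        return doc.DocumentNode.SelectSingleNode(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' main-content ')]");
    }
}
```

The caller still has to check the result for null before using it.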
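Now I am struggling with how to get the HTML between the anchors. I tried Substring, but the positions reported by the nodes (StartIndex and InnerLength) do not seem to match the string value.
Another approach is to get the anchors themselves, but I don't know how to get the elements up to the next anchor (or the end).
One approach that does not work: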
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var anchorName = anchor.GetAttributeValue<string>("name", null);
        var followingNodes = mainContentDiv.SelectNodes(".//*[preceding::a and following::a[@name = '" + anchorName + "']]");
    }
}
Can anyone help me? Thanks.
Update:
I want 3 HTML parts:
1.
<h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>
2.
<h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
<p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>
and 3.
<h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
<p class="rvps3"><span class="rvts8"><br/></span></p>
<p class="rvps2" style="clear: both;">
<span class="rvts6">Autogenerated Text</span>
<!-- This anchor should be ignored, because it has no name attribute -->
<a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
</p>
Working solution:
In the end I have a working solution that takes the unclear structure of the generated HTML into account:
var mainContentDiv = htmlDoc.DocumentNode.SelectNodes("//div[contains(@class, 'main-content')]").FirstOrDefault();
var childNodes = mainContentDiv.ChildNodes;
var snippets = new Dictionary<string, string>();
snippets.Add("", mainContentDiv.InnerHtml);
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var sb = new StringBuilder();
        var anchorName = anchor.GetAttributeValue<string>("name", null);
        var node = anchor;
        while (node.ParentNode.GetAttributeValue<string>("class", null) != "main-content" && node.ParentNode.SelectNodes(".//a[@name]").Count == 1)
        {
            node = node.ParentNode;
        }
        sb.Append(node.OuterHtml);
        while (node.NextSibling != null)
        {
            var nodeCollection = node.NextSibling.SelectNodes(".//a[@name]");
            if (nodeCollection != null)
                break;
            node = node.NextSibling;
            sb.Append(node.OuterHtml);
        }
        snippets.Add(anchorName, sb.ToString());
    }
}
htmlSnippes.Add(helpContextId, snippets);
Thanks to everyone for the help.
You can try the following code:
List<string> htmlParts = new List<string>();
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var node = anchor.ParentNode;
        StringBuilder sb = new StringBuilder(node.OuterHtml);
        while ((node = node.NextSibling) != null)
        {
            if (node.SelectSingleNode(".//a[@name]") != null)
                break;
            else
                sb.Append(node.OuterHtml);
        }
        htmlParts.Add(sb.ToString());
    }
}
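The code assumes that every anchor is always wrapped in exactly one parent element (like the h6 here) whose following siblings make up the rest of the section. If that is not always the case, you will have to adjust it.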
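If the anchors are not always wrapped in exactly one parent element, one possible adjustment is to walk the direct children of the container once and start a new section whenever a child contains a named anchor. This is only a sketch under assumptions: the names AnchorSplitter and SplitByAnchors are made up for illustration, each top-level child is assumed to hold at most one named anchor (only the first one found is used), and any content before the first anchor is dropped.

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class AnchorSplitter
{
    // Groups the direct children of `container` into sections keyed by the
    // name of the anchor that opens each section.
    public static Dictionary<string, string> SplitByAnchors(HtmlNode container)
    {
        var sections = new Dictionary<string, string>();
        string currentName = null;      // null until the first anchor is seen
        var sb = new StringBuilder();

        foreach (var child in container.ChildNodes)
        {
            // Only element nodes can contain an anchor; text and comment
            // nodes are simply appended to the current section.
            var anchor = child.NodeType == HtmlNodeType.Element
                ? child.SelectSingleNode("descendant-or-self::a[@name]")
                : null;

            if (anchor != null)
            {
                // A new section starts here: store the previous one first.
                if (currentName != null)
                    sections[currentName] = sb.ToString();
                currentName = anchor.GetAttributeValue("name", "");
                sb.Clear();
            }

            if (currentName != null)
                sb.Append(child.OuterHtml);
        }

        // Store the last section, which runs to the end of the container.
        if (currentName != null)
            sections[currentName] = sb.ToString();
        return sections;
    }
}
```

Called as AnchorSplitter.SplitByAnchors(mainContentDiv), this should give one entry per anchor name. Because it looks at whole top-level children, it does not matter how deeply the anchor is nested inside the element that opens a section.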