Find elements between two anchors, including the anchor's parent, with XPath

I am trying to parse an auto-generated HTML file. It comes from HAT, and I have no influence on the generated HTML:

<!DOCTYPE html>
<html lang="de">
    <head>
      <!-- Header bla bla -->
    </head>

    <body class="md-nav-expanded">
      <!-- Some HTML-Elements, that doesn't matter -->

      <div id="main">
        <article>
            <div id="topic-content" class="container-fluid">
                <!-- Uninteresting div -->

                <a id="main-content"></a>

                <h2>Steuerelemente</h2>

                <div class="main-content">
                    
                    <h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
                    <p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>
                    <h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
                    <p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
                    <p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>
                    <h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
                    <p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
                    <p class="rvps3"><span class="rvts8"><br/></span></p>
                    <p class="rvps2" style="clear: both;">
                        <span class="rvts6">Autogenerated Text</span>
                        <!-- This anchor should be ignored, because it has no name attribute -->
                        <a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
                    </p>
                </div>
                <!-- The rest of the HTML doesn't matter -->
            </div>  <!-- /#topic-content -->
        </article>
      </div>  <!-- /#main -->
    </body>
</html>

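I am trying to extract the HTML from MyAnchor1 (including its parent h6, which could be any other element) up to MyAnchor2, then from MyAnchor2 up to MyAnchor3, and from MyAnchor3 to the end.

First, I load the file into an HtmlDocument: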

htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(refFile);

Then I find the 'main-content' div:

var mainContentDiv = htmlDoc.DocumentNode.SelectNodes("//div[contains(@class, 'main-content')]").FirstOrDefault();

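Now I am struggling with how to get the HTML between the anchors. I tried Substring, but the positions in the nodes (StartIndex and InnerLength) do not seem to match the string value.

Another approach is to get the anchors themselves, but I do not know how to get the elements up to the next anchor (or the end).

An approach that does not work: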

var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var anchorName = anchor.GetAttributeValue<string>("name", null);
        var followingNodes = mainContentDiv.SelectNodes(".//*[preceding::a and following::a[@name = '" + anchorName + "']]");
    }
}
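
A note on the expression above: `preceding::a and following::a[@name = '...']` matches elements that have the named anchor after them in document order, so it returns content located before the anchor rather than after it, and it also returns nested descendants such as the spans. What follows is only a minimal sketch of a sibling-based XPath 1.0 alternative, not a definitive implementation. It assumes the HtmlAgilityPack NuGet package is referenced, that every named anchor sits inside a direct child of the main-content div (as in the sample), and that anchor names contain no apostrophe. `XPathSections` and `SelectSection` are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class XPathSections
{
    // Hypothetical helper: returns the concatenated OuterHtml of the section that
    // starts at the direct child of `container` holding <a name="anchorName">.
    public static string SelectSection(HtmlNode container, string anchorName)
    {
        // The direct child that contains the named anchor (the h6 in the sample).
        string start = "*[.//a[@name='" + anchorName + "']]";

        // Following siblings that hold no named anchor themselves and whose nearest
        // preceding anchor-bearing sibling is the start block.
        string rest = start
            + "/following-sibling::*[not(.//a[@name])]"
            + "[preceding-sibling::*[.//a[@name]][1][.//a[@name='" + anchorName + "']]]";

        // SelectNodes returns null when nothing matches.
        var nodes = container.SelectNodes(start + " | " + rest);
        return nodes == null
            ? string.Empty
            : string.Concat(nodes.Select(n => n.OuterHtml));
    }
}

public static class XPathSectionsDemo
{
    public static void Main()
    {
        // Reduced version of the sample document from the question.
        const string html = @"<div class=""main-content"">
<h6><a name=""MyAnchor1""></a>Title 1</h6>
<p>Text 1</p>
<h6><a name=""MyAnchor2""></a>Title 2</h6>
<p>Text 2a</p>
<p>Text 2b</p>
<h6><a name=""MyAnchor3""></a>Title 3</h6>
<p>Text 3 <a href=""https://www.anywhere.com"">Anywhere</a></p>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var container = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'main-content')]");

        foreach (var name in new[] { "MyAnchor1", "MyAnchor2", "MyAnchor3" })
        {
            Console.WriteLine("--- " + name + " ---");
            Console.WriteLine(XPathSections.SelectSection(container, name));
        }
    }
}
```

Because `*` only matches element nodes, whitespace and comments between the top-level elements are dropped from the result. The link without a name attribute does not end a section, since the predicates test `a[@name]` only.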

Can anyone help me? Thanks.

Update:

I want 3 HTML parts:

1.

<h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>

2.

<h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
<p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>

and 3.

<h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
<p class="rvps3"><span class="rvts8"><br/></span></p>
<p class="rvps2" style="clear: both;">
    <span class="rvts6">Autogenerated Text</span>
    <!-- This anchor should be ignored, because it has no name attribute -->
    <a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
</p>

Working solution: In the end, I have a working solution that takes the unclear structure of the generated HTML into account:

var mainContentDiv = htmlDoc.DocumentNode.SelectNodes("//div[contains(@class, 'main-content')]").FirstOrDefault();
var childNodes = mainContentDiv.ChildNodes;

var snippets = new Dictionary<string, string>();
snippets.Add("", mainContentDiv.InnerHtml);

var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
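        var sb = new StringBuilder();

        var anchorName = anchor.GetAttributeValue<string>("name", null);

        // Climb from the anchor towards the main-content div. Stop as soon as the
        // parent is that div or the parent contains more than one named anchor.
        var node = anchor;
        while (node.ParentNode.GetAttributeValue<string>("class", null) != "main-content" && node.ParentNode.SelectNodes(".//a[@name]").Count == 1)
        {
            node = node.ParentNode;
        }

        // Collect the start node and every following sibling, up to (but not
        // including) the next sibling that contains a named anchor.
        sb.Append(node.OuterHtml);
        while (node.NextSibling != null)
        {
            var nodeCollection = node.NextSibling.SelectNodes(".//a[@name]");
            if (nodeCollection != null)
                break;

            node = node.NextSibling;
            sb.Append(node.OuterHtml);
        }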

        snippets.Add(anchorName, sb.ToString());
    }
}

htmlSnippes.Add(helpContextId, snippets);

Thanks to everyone for your help.

You can try the following code:

List<string> htmlParts = new List<string>();
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        // Start each part at the anchor's parent (the h6 in the sample).
        var node = anchor.ParentNode;

        StringBuilder sb = new StringBuilder(node.OuterHtml);

        // Append the following siblings until one contains the next named anchor.
        while ((node = node.NextSibling) != null)
        {
            if (node.SelectSingleNode(".//a[@name]") != null)
                break;
            else
                sb.Append(node.OuterHtml);
        }

        htmlParts.Add(sb.ToString());
    }
}

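The code assumes that every anchor element always has a parent element (such as the h6 in the sample) whose siblings make up the rest of the part. If that is not always the case, you will have to adjust it.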
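
One possible adjustment, as a minimal sketch only: instead of taking the direct parent, climb from each anchor to the direct child of the container and walk the siblings from there. That also covers anchors nested several levels deep or placed directly in the container. It assumes the HtmlAgilityPack NuGet package is referenced. `AnchorSections`, `Split`, and `StartsSection` are hypothetical names introduced here. Unlike the asker's solution, it does not try to split a block that holds several named anchors: such anchors all map to the same block.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class AnchorSections
{
    // Hypothetical helper: maps each anchor name to the HTML of its section.
    public static Dictionary<string, string> Split(HtmlNode container)
    {
        var parts = new Dictionary<string, string>();

        // SelectNodes returns null when nothing matches.
        var anchors = container.SelectNodes(".//a[@name]");
        if (anchors == null)
            return parts;

        foreach (var anchor in anchors)
        {
            // Climb until the node is a direct child of the container.
            var node = anchor;
            while (node.ParentNode != null && node.ParentNode != container)
                node = node.ParentNode;

            // Append following siblings until the next section begins.
            var sb = new StringBuilder(node.OuterHtml);
            for (var next = node.NextSibling; next != null && !StartsSection(next); next = next.NextSibling)
                sb.Append(next.OuterHtml);

            parts[anchor.GetAttributeValue("name", "")] = sb.ToString();
        }

        return parts;
    }

    // A sibling starts a new section if it is a named anchor or contains one.
    private static bool StartsSection(HtmlNode node)
    {
        bool isNamedAnchor = node.Name == "a" && node.Attributes["name"] != null;
        return isNamedAnchor || node.SelectSingleNode(".//a[@name]") != null;
    }
}

public static class AnchorSectionsDemo
{
    public static void Main()
    {
        // Reduced sample: one nested anchor and one anchor directly in the container.
        const string html = @"<div class=""main-content"">
<h6><span><a name=""MyAnchor1""></a></span>Title 1</h6>
<p>Text 1</p>
<a name=""MyAnchor2""></a>
<p>Text 2 <a href=""https://www.anywhere.com"">Anywhere</a></p>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var container = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'main-content')]");

        foreach (var part in AnchorSections.Split(container))
        {
            Console.WriteLine("--- " + part.Key + " ---");
            Console.WriteLine(part.Value);
        }
    }
}
```

The sibling walk keeps text nodes and comments between the elements, so each part preserves the original whitespace of the source.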