C# HtmlAgilityPack: scrape data between two tags

Question

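Using Html Agility Pack, I have to scrape the innerText of all //dd tags that sit between //h2 tags (in this case, between the h2 tags named "Applicant" and "Agent"). How can this be done?

Below is just a fragment of the HTML I have to scrape the data from: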

<!-- Applicants section  -->

    <h2 class="GridTitle">Applicant</h2>
    
        
            <h3 class="DataTitle">1</h3>
        
        <dl class="Grid LeftCol">
            <dt>Name:</dt>
            <dd>Some name here</dd>
            <dt>Legal Form:</dt>
            <dd></dd>              
            <dt>From:</dt>
            <dd>06/08/2020</dd>
        </dl>
        <dl class="Grid RightCol">
            <dt>Address:</dt>
            <dd>Some address here</dd>
            <dt>To:</dt>
            <dd></dd>
        </dl>
    
        
            <h3 class="DataTitle">2</h3>
        
        <dl class="Grid LeftCol">
            <dt>Name:</dt>
            <dd>Some name here1</dd>
            <dt>Legal Form:</dt>
            <dd></dd>              
            <dt>From:</dt>
            <dd>04/08/2010</dd>
        </dl>
        <dl class="Grid RightCol">
            <dt>Address:</dt>
            <dd>Some address here1</dd>
            <dt>To:</dt>
            <dd>06/08/2020</dd>
        </dl>
    



<!-- Agents section  -->

    <h2 class="GridTitle">Agent</h2>

This is what I have tried, but I first need the //dd above //h2 (Agent):

var h2Tags = doc.DocumentNode.SelectNodes("//h2[text() = 'Applicant']");
var h2Tags1 = doc.DocumentNode.SelectNodes("//h2[text() = 'Agent']");
var lineNum = h2Tags[0].Line;
var lineNum1 = h2Tags1[0].Line;
var Applicants = doc.DocumentNode.SelectNodes("//dd").Where(x => x.Line > lineNum).Where(x => x.Line < lineNum1);

foreach (HtmlNode g in Applicants)
{
      TMOwner = g.InnerText;
}

Answer 1

You can do this entirely with XPath queries, as follows. You already have XPath queries that select your start and end h2 nodes. You can then select all the dd nodes between them like so:

var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
// TODO: Handle the case where startnode is null/missing here.

var endnode = startnode.SelectSingleNode("./following::h2");                     // And select the following end node using whatever criteria you need.
// TODO: Handle the case where endnode is null/missing here.

var followingXPath = $"./following::dd";                                         // Select nodes following the current node, which will be startnode
var precedingXPath = $"{endnode.XPath}/preceding::dd";                           // Select nodes preceding the end node explicitly.
var intersectedXPath = $"{followingXPath}[count(. | {precedingXPath}) = count({precedingXPath})]";

var query = startnode.SelectNodes(intersectedXPath);

var innerTexts = query.Select(n => n.InnerText).ToList();
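
One way the two TODOs above might be filled in is sketched here. The class `SectionScraper` and the method `DdTextsBetween` are hypothetical names introduced for illustration and are not part of Html Agility Pack. The sketch assumes the heading text contains no single quote (it is pasted into the XPath string) and that a missing end node should mean "every dd after the start node".

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SectionScraper
{
    // Hypothetical helper: returns the InnerText of every <dd> between the <h2>
    // whose text equals startHeading and the next <h2> in document order.
    public static List<string> DdTextsBetween(HtmlDocument doc, string startHeading)
    {
        // Assumption: startHeading contains no single quote.
        var startnode = doc.DocumentNode.SelectSingleNode($"//h2[text() = '{startHeading}']");
        if (startnode == null)
            return new List<string>();      // Start heading missing: nothing to return.

        var endnode = startnode.SelectSingleNode("./following::h2");

        // Assumption: with no following <h2>, take every <dd> after the start node.
        var xpath = "./following::dd";
        if (endnode != null)
        {
            var precedingXPath = $"{endnode.XPath}/preceding::dd";
            xpath = $"./following::dd[count(. | {precedingXPath}) = count({precedingXPath})]";
        }

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        var query = startnode.SelectNodes(xpath);
        return query == null
            ? new List<string>()
            : query.Select(n => n.InnerText).ToList();
    }
}
```

Calling `SectionScraper.DdTextsBetween(doc, "Applicant")` on the HTML from the question should then give the Applicant values only, including empty strings for the empty dd elements.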

Alternatively, you can combine a simpler XPath query with LINQ's TakeWhile(), like this:

var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
var endnode = startnode.SelectSingleNode("./following::h2");                     // And select the following end node using whatever criteria you need.

var query = startnode.SelectNodes("./following::node()") // Select all nodes following startnode
    .TakeWhile(n => n != endnode)                        // Until endnode is reached
    .Where(n => n.Name == "dd");                         // With name "dd".
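
The TakeWhile() variant can be wrapped with the same null handling. This is only a sketch: `SectionWalker` and `DdTextsUntilNextH2` are hypothetical names, and treating a missing end node as "run to the end of the document" is an assumption rather than something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SectionWalker
{
    // Hypothetical helper: collects the InnerText of each <dd> that follows
    // startnode, stopping at the next <h2> if there is one.
    public static List<string> DdTextsUntilNextH2(HtmlNode startnode)
    {
        if (startnode == null)
            throw new ArgumentNullException(nameof(startnode));

        // endnode is null when startnode is the last <h2> in the document.
        var endnode = startnode.SelectSingleNode("./following::h2");

        // By default SelectNodes returns null when nothing follows startnode.
        var following = startnode.SelectNodes("./following::node()");
        if (following == null)
            return new List<string>();

        return following
            .TakeWhile(n => n != endnode)   // Never true for a null endnode, so all nodes are kept.
            .Where(n => n.Name == "dd")
            .Select(n => n.InnerText)
            .ToList();
    }
}
```

Because the end node is compared by reference, no XPath string has to be built from `endnode.XPath`, which keeps this version shorter than the pure XPath one.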

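Notes:

./following::dd, ./following::h2 and /preceding::dd are examples of axes of location steps. The following axis selects the nodes in the same document as the context node that come after the context node in document order (excluding its descendants), while the preceding axis selects the nodes in the same document that come before the context node in document order (excluding its ancestors).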
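
To make the two axes concrete, here is a small sketch that counts what each one selects from a given context node. `AxisDemo`, `Count` and `SampleHtml` are hypothetical names, and the HTML is a cut-down stand-in for the page in the question.

```csharp
using HtmlAgilityPack;

public static class AxisDemo
{
    // Cut-down sample: two <dd> under "Applicant", one <dd> under "Agent".
    public const string SampleHtml =
        "<html><body>" +
        "<h2>Applicant</h2><dl><dt>Name:</dt><dd>Alice</dd><dt>To:</dt><dd></dd></dl>" +
        "<h2>Agent</h2><dl><dt>Name:</dt><dd>Bob</dd></dl>" +
        "</body></html>";

    // Hypothetical helper: number of nodes the XPath selects relative to context.
    public static int Count(HtmlNode context, string xpath)
    {
        var nodes = context.SelectNodes(xpath);   // null by default when nothing matches
        return nodes == null ? 0 : nodes.Count;
    }
}
```

With `agent` set to the node selected by `//h2[text() = 'Agent']` in a document loaded from `AxisDemo.SampleHtml`, `AxisDemo.Count(agent, "./preceding::dd")` counts the two Applicant dd elements and `AxisDemo.Count(agent, "./following::dd")` counts the single Agent one. Taking the first dl as the context instead, `./following::dd` counts only the Agent dd, because the dl's own dd children are descendants and the following axis skips them.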

If you want to select the next <h2> node with a specific text value, say "Agent", you can do it like this:
```
var endnode = startnode.SelectSingleNode("./following::h2[text() = 'Agent']");
```
The formula for intersectedXPath is taken from this answer by Dimitre Novatchev to "How would you find all nodes between two H3's using XPATH?". The situation there is similar, but your question does not restrict the elements to be selected to siblings.

Demo fiddle here for XPath; here for XPath + LINQ; and here https://bpp.economie.fgov.be/fo-eregister-view/search/details/721770937_EPV/0/0/1/10/0/0/0/null/null?locale=en.

c#

html-agility-pack