C# HtmlAgilityPack: scrape data between two tags

Using Html Agility Pack, I have to scrape the innerText from all //dd tags that sit between two //h2 tags (in this case, between the h2 tags named "Applicant" and "Agent"). How can I do this?

Below is just a fragment of the HTML I have to scrape data from:

<!-- Applicants section  -->

    <h2 class="GridTitle">Applicant</h2>
    
        
            <h3 class="DataTitle">1</h3>
        
        <dl class="Grid LeftCol">
            <dt>Name:</dt>
            <dd>Some name here</dd>
            <dt>Legal Form:</dt>
            <dd></dd>              
            <dt>From:</dt>
            <dd>06/08/2020</dd>
        </dl>
        <dl class="Grid RightCol">
            <dt>Address:</dt>
            <dd>Some address here</dd>
            <dt>To:</dt>
            <dd></dd>
        </dl>
    
        
            <h3 class="DataTitle">2</h3>
        
        <dl class="Grid LeftCol">
            <dt>Name:</dt>
            <dd>Some name here1</dd>
            <dt>Legal Form:</dt>
            <dd></dd>              
            <dt>From:</dt>
            <dd>04/08/2010</dd>
        </dl>
        <dl class="Grid RightCol">
            <dt>Address:</dt>
            <dd>Some address here1</dd>
            <dt>To:</dt>
            <dd>06/08/2020</dd>
        </dl>
    



<!-- Agents section  -->

    <h2 class="GridTitle">Agent</h2>

This is what I've tried, but it only takes the first //dd above the //h2 (Agent):

var h2Tags = doc.DocumentNode.SelectNodes("//h2[text() = 'Applicant']");
var h2Tags1 = doc.DocumentNode.SelectNodes("//h2[text() = 'Agent']");
var lineNum = h2Tags[0].Line;
var lineNum1 = h2Tags1[0].Line;
var Applicants = doc.DocumentNode.SelectNodes("//dd").Where(x => x.Line > lineNum).Where(x => x.Line < lineNum1);

foreach (HtmlNode g in Applicants)
{
      TMOwner = g.InnerText;
}

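You can do this entirely with an XPath query, as shown below. You already have XPath queries that select your start and end h2 nodes. You can then select all the dd nodes between them as follows: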

var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
// TODO: Handle the case where startnode is null/missing here.

var endnode = startnode.SelectSingleNode("./following::h2");                     // And select the following end node using whatever criteria you need.
// TODO: Handle the case where endnode is null/missing here.

var followingXPath = $"./following::dd";                                         // Select nodes following the current node, which will be startnode
var precedingXPath = $"{endnode.XPath}/preceding::dd";                           // Select nodes preceding the end node explicitly.
var intersectedXPath = $"{followingXPath}[count(. | {precedingXPath}) = count({precedingXPath})]";

var query = startnode.SelectNodes(intersectedXPath);

var innerTexts = query.Select(n => n.InnerText).ToList();

Alternatively, you can combine a simpler XPath query with LINQ's TakeWhile(), like this:

var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
var endnode = startnode.SelectSingleNode("./following::h2");                     // And select the following end node using whatever criteria you need.

var query = startnode.SelectNodes("./following::node()") // Select all nodes following startnode
    .TakeWhile(n => n != endnode)                        // Until endnode is reached
    .Where(n => n.Name == "dd");                         // With name "dd".
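
The TODO comments in the first snippet leave the missing-node cases open. One possible way to fill them in for the TakeWhile() variant is sketched below. HeadingScraper and DdTextsAfterHeading are hypothetical names made up for this illustration (they are not part of Html Agility Pack), and the sketch assumes headingText contains no single quote, since it is pasted directly into the XPath string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HeadingScraper
{
    // Hypothetical helper: returns the text of every <dd> that follows the <h2>
    // whose text equals headingText, stopping at the next <h2> (if there is one).
    // Assumption: headingText contains no single quote.
    public static List<string> DdTextsAfterHeading(HtmlDocument doc, string headingText)
    {
        var startnode = doc.DocumentNode.SelectSingleNode($"//h2[text() = '{headingText}']");
        if (startnode == null)
            return new List<string>();      // No such heading: nothing to scrape.

        // endnode is null when startnode is the last <h2>; in that case the
        // TakeWhile below never stops early and runs to the end of the document.
        var endnode = startnode.SelectSingleNode("./following::h2");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var following = startnode.SelectNodes("./following::node()");
        if (following == null)
            return new List<string>();

        return following
            .TakeWhile(n => n != endnode)   // Until endnode is reached
            .Where(n => n.Name == "dd")     // Keep only <dd> elements
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToList();
    }
}
```

Called as HeadingScraper.DdTextsAfterHeading(doc, "Applicant"), this returns one string per dd element, so empty dd elements such as "Legal Form:" in the question come back as empty strings rather than being skipped.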

Notes:

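  • ./following::dd, ./following::h2 and /preceding::dd are examples of axes of location steps. The following axis selects nodes in the same document as the context node that come after the context node in document order (excluding its descendants), while the preceding axis selects nodes in the same document that come before the context node in document order (excluding its ancestors).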

    If you want to select the next <h2> node with a specific text value, say "Agent", you can do:

    var endnode = startnode.SelectSingleNode("./following::h2[text() = 'Agent']");
    
  • The formula for intersectedXPath is taken from this answer by Dimitre Novatchev to "How would you find all nodes between two H3's using XPATH?". The situation there is similar, but your question does not restrict the elements to be selected to siblings.
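
To see the intersection formula in isolation, here is a minimal sketch that uses only System.Xml, so it runs without Html Agility Pack. The XML fragment, the IntersectionDemo class and the ns1/ns2 names are invented for this illustration. The idea is that a node from the first set belongs to the second set exactly when adding it to the second set does not change that set's count.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class IntersectionDemo
{
    public static List<string> Run()
    {
        // Made-up, well-formed fragment: one <dd> before the start heading,
        // two between the headings, and one after the end heading.
        var xml = new XmlDocument();
        xml.LoadXml(
            "<root>" +
            "<dd>before</dd>" +
            "<h2>Applicant</h2>" +
            "<dl><dd>one</dd><dd>two</dd></dl>" +
            "<h2>Agent</h2>" +
            "<dl><dd>after</dd></dl>" +
            "</root>");

        // ns1: every <dd> after the start heading (one, two, after).
        const string ns1 = "//h2[. = 'Applicant']/following::dd";
        // ns2: every <dd> before the end heading (before, one, two).
        const string ns2 = "//h2[. = 'Agent']/preceding::dd";

        // Keep a node of ns1 only if its union with ns2 is no larger than ns2,
        // which means the node is already a member of ns2.
        string intersected = $"{ns1}[count(. | {ns2}) = count({ns2})]";

        var result = new List<string>();
        var nodes = xml.SelectNodes(intersected);
        if (nodes != null)
        {
            foreach (XmlNode node in nodes)
                result.Add(node.InnerText);
        }
        return result;
    }
}
```

IntersectionDemo.Run() returns the two strings "one" and "two": "before" is not in ns1, and "after" fails the predicate because adding it to ns2 raises the count from 3 to 4.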

Demo fiddles: here for XPath; here for XPath + LINQ; and here: https://bpp.economie.fgov.be/fo-eregister-view/search/details/721770937_EPV/0/0/1/10/0/0/0/null/null?locale=en