C# HtmlAgilityPack: scrape data between two tags
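Using Html Agility Pack, I have to scrape the innerText of all the //dd tags that sit between two //h2 tags (in this case, between the h2 tags named "Applicant" and "Agent"). How can this be done?
Below is just a fragment of the HTML from which I have to scrape the data: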
<!-- Applicants section -->
<h2 class="GridTitle">Applicant</h2>
<h3 class="DataTitle">1</h3>
<dl class="Grid LeftCol">
<dt>Name:</dt>
<dd>Some name here</dd>
<dt>Legal Form:</dt>
<dd></dd>
<dt>From:</dt>
<dd>06/08/2020</dd>
</dl>
<dl class="Grid RightCol">
<dt>Address:</dt>
<dd>Some address here</dd>
<dt>To:</dt>
<dd></dd>
</dl>
<h3 class="DataTitle">2</h3>
<dl class="Grid LeftCol">
<dt>Name:</dt>
<dd>Some name here1</dd>
<dt>Legal Form:</dt>
<dd></dd>
<dt>From:</dt>
<dd>04/08/2010</dd>
</dl>
<dl class="Grid RightCol">
<dt>Address:</dt>
<dd>Some address here1</dd>
<dt>To:</dt>
<dd>06/08/2020</dd>
</dl>
<!-- Agents section -->
<h2 class="GridTitle">Agent</h2>
This is what I have tried, but it takes the first //dd above //h2 (Agent):
var h2Tags = doc.DocumentNode.SelectNodes("//h2[text() = 'Applicant']");
var h2Tags1 = doc.DocumentNode.SelectNodes("//h2[text() = 'Agent']");
var lineNum = h2Tags[0].Line;
var lineNum1 = h2Tags1[0].Line;
var Applicants = doc.DocumentNode.SelectNodes("//dd").Where(x => x.Line > lineNum).Where(x => x.Line < lineNum1);
foreach (HtmlNode g in Applicants)
{
    TMOwner = g.InnerText;
}
You can do this purely with XPath queries, as follows. You already have XPath queries that select your start and end h2 nodes. You can then select all the dd nodes between them like this:
var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
// TODO: Handle the case where startnode is null/missing here.
var endnode = startnode.SelectSingleNode("./following::h2"); // And select the following end node using whatever criteria you need.
// TODO: Handle the case where endnode is null/missing here.
var followingXPath = $"./following::dd"; // Select nodes following the current node, which will be startnode
var precedingXPath = $"{endnode.XPath}/preceding::dd"; // Select nodes preceding the end node explicitly.
var intersectedXPath = $"{followingXPath}[count(. | {precedingXPath}) = count({precedingXPath})]";
var query = startnode.SelectNodes(intersectedXPath);
var innerTexts = query.Select(n => n.InnerText).ToList();
Alternatively, you can combine a simpler XPath query with LINQ's TakeWhile(), like this:
var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
var endnode = startnode.SelectSingleNode("./following::h2"); // And select the following end node using whatever criteria you need.
var query = startnode.SelectNodes("./following::node()") // Select all nodes following startnode
    .TakeWhile(n => n != endnode) // Until endnode is reached
    .Where(n => n.Name == "dd"); // With name "dd".
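The TakeWhile() approach above can be sketched as a small self-contained program with the null checks filled in. This is only a sketch under stated assumptions: it needs the HtmlAgilityPack NuGet package, the helper name DdTextsBetween and the class SectionScraper are hypothetical (they are not part of Html Agility Pack), and the "Some agent" line in the sample markup is made up purely to show that nodes after the end heading are excluded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

static class SectionScraper
{
    // Hypothetical helper: returns the text of every <dd> located between the
    // <h2> whose text is startHeading and the next <h2> whose text is endHeading.
    // Note: headings containing an apostrophe would break the XPath literals below.
    public static List<string> DdTextsBetween(HtmlDocument doc, string startHeading, string endHeading)
    {
        var start = doc.DocumentNode.SelectSingleNode($"//h2[text() = '{startHeading}']");
        if (start == null)
            return new List<string>(); // Start heading not found.

        // May be null when the end heading is missing; TakeWhile then runs to the end.
        var end = start.SelectSingleNode($"./following::h2[text() = '{endHeading}']");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var following = start.SelectNodes("./following::node()");
        if (following == null)
            return new List<string>();

        return following
            .TakeWhile(n => n != end)       // Stop at the end heading.
            .Where(n => n.Name == "dd")     // Keep only <dd> elements.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToList();
    }
}

class Program
{
    static void Main()
    {
        // Shortened version of the markup from the question.
        const string html = @"
<h2 class=""GridTitle"">Applicant</h2>
<h3 class=""DataTitle"">1</h3>
<dl class=""Grid LeftCol"">
  <dt>Name:</dt><dd>Some name here</dd>
  <dt>Legal Form:</dt><dd></dd>
</dl>
<h2 class=""GridTitle"">Agent</h2>
<dl><dt>Name:</dt><dd>Some agent</dd></dl>"; // Made-up line, for illustration only.

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var text in SectionScraper.DdTextsBetween(doc, "Applicant", "Agent"))
            Console.WriteLine($"[{text}]");
    }
}
```

Empty dd elements come back as empty strings, so the positions still line up with the dt labels if you want to pair them later.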
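Notes:
./following::dd, ./following::h2 and /preceding::dd are examples of the axes of location steps. The following axis selects the nodes in the same document as the context node that come after it in document order (excluding its descendants), while the preceding axis selects the nodes in the same document as the context node that come before it in document order (excluding its ancestors).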
If you want to select the next <h2> node with a specific text value, say "Agent", you can do this:
var endnode = startnode.SelectSingleNode("./following::h2[text() = 'Agent']");
The formula for intersectedXPath is taken from this answer by Dimitre Novatchev to How would you find all nodes between two H3's using XPATH?. The situation there is similar, but your question does not restrict the elements to be selected to siblings.
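To make the intersection formula concrete: a node belongs to node-set ns2 exactly when adding it to ns2 does not change the count, so ns1[count(. | ns2) = count(ns2)] keeps only the nodes of ns1 that are also in ns2. The sketch below illustrates that idea with the standard System.Xml classes instead of Html Agility Pack, only so that it runs without any extra package; the class name IntersectionDemo, the method Intersect, and the tiny XML sample are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class IntersectionDemo
{
    // Hypothetical helper: evaluates ns1[count(. | ns2) = count(ns2)].
    // ns2 must be an absolute path, because it is evaluated inside a predicate
    // where the context node is each candidate node of ns1.
    public static List<string> Intersect(XmlDocument xml, string ns1, string ns2)
    {
        string xpath = $"{ns1}[count(. | {ns2}) = count({ns2})]";
        var result = new List<string>();
        foreach (XmlNode node in xml.SelectNodes(xpath))
            result.Add(node.InnerText);
        return result;
    }

    static void Main()
    {
        var xml = new XmlDocument();
        xml.LoadXml(
            "<root>" +
            "<h2>Applicant</h2><dl><dd>a</dd><dd>b</dd></dl>" +
            "<h2>Agent</h2><dl><dd>c</dd></dl>" +
            "</root>");

        // ns1: every dd after the Applicant heading (a, b, c).
        string ns1 = "//h2[. = 'Applicant']/following::dd";
        // ns2: every dd before the Agent heading (a, b).
        string ns2 = "//h2[. = 'Agent']/preceding::dd";

        foreach (string text in Intersect(xml, ns1, ns2))
            Console.WriteLine(text); // Prints "a" then "b".
    }
}
```

In the answer's code the same role is played by followingXPath (ns1, relative to startnode) and precedingXPath (ns2, made absolute through endnode.XPath).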
Demo fiddles: here for XPath; here for XPath + LINQ; and here (https://bpp.economie.fgov.be/fo-eregister-view/search/details/721770937_EPV/0/0/1/10/0/0/0/null/null?locale=en).