HTML Parsing C# HTMLAgilityPack
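I am having trouble reading certain values from an HTML string using HTMLAgilityPack.
The two items I want to read are Newspaper: 82548828 and Fish: 8545852485.
With the code I have written so far, I only get the Newspaper item back.
I suspect the XPath I am using is not quite right. I think the XPath for the first loop is correct, since it returns the two.
I expected my second loop to iterate over those two items (it thinks there are 6???).
Also, is div2.SelectSingleNode(sXPathT); the correct way to extract the GroupLabel? Or is there a better way?
Thanks
The complete test code is below: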
string strTestHTML = "<div class=\"content\" data-id=\"123456789\">" +
" <div class=\"m-group item\">" +
" <span class=\"group\">" +
" <a href=\"javascript:void(0);\">" +
" <span class=\"group-label\">Newspaper </span>" +
" <span class=\"group-value\">82548828</span>" +
" </a>" +
" </span>" +
" <span class=\"group\">" +
" <a href=\"javascript:void(0);\">" +
" <span class=\"group-label\">Fish </span>" +
" <span class=\"group-value\">8545852485</span>" +
" </a>" +
" </span>" +
" </div>" +
"</div>";
//<div class="content" data-id="123456789">
string sNewXpath = "//div[contains(@class,'content') and contains(@data-id, '" + "123456789" + "')]";
//<div class="m-group item">
string sSecondXPath = "/div[contains(@class,'m-group item')]";
//<span class="group"
string sThirdXPath = "//span[contains(@class,'group')]";
string sXPathT = "//span[contains(@class,'group-label')]";
string sXPathO = "//span[contains(@class,'group-value')]";
HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);
foreach (HtmlNode div in Doc.DocumentNode.SelectNodes(sNewXpath + sSecondXPath))
{
foreach (HtmlNode div2 in div.SelectNodes(sThirdXPath))
{
var vOddL = div2.SelectSingleNode(sXPathT);
var vOddP = div2.SelectSingleNode(sXPathO);
string GroupLabel = vOddL.InnerText.Trim();
string GroupValue = vOddP.InnerText.Trim();
}
}
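Edit:
I figured out why I was getting 6 items in the for loop.
sThirdXPath was: string sThirdXPath = "//span[contains(@class,'group')]";
It should be:
string sThirdXPath = "//span[@class='group']";
I am still trying to find the right way to query the HtmlNode held in div2 for the values of interest. I assume it needs an XPath that matches inside the current node rather than across the whole HTML document.
Updated HTML sample: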
<div class="content" data-id="123456789">
<div class="m-group item">
<span class="group">
<a href="javascript:void(0);">
<span class="group-label">Newspaper </span>
<span class="group-value">82548828</span>
</a>
</span>
<span class="group">
<a href="javascript:void(0);">
<span class="group-label">Fish </span>
<span class="group-value">8545852485</span>
</a>
</span>
</div>
</div>
<div class="content" data-id="987654321">
<div class="m-group item">
<span class="group">
<a href="javascript:void(0);">
<span class="group-label">Bread</span>
<span class="group-value">82548828</span>
</a>
</span>
<span class="group">
<a href="javascript:void(0);">
<span class="group-label">Milk </span>
<span class="group-value">8545852485</span>
</a>
</span>
</div>
</div>
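In the sample above, what is the correct XPath to access just Bread and its value, and Milk and its value? I assume I need to filter on data-id="987654321" in the XPath?
Your suspicion is correct. You have specified each XPath query as a full path (a leading // searches the whole document), so no loop is needed. To get the "Newspaper" and "Fish" nodes in this example, you can simply use SelectNodes instead of looping and calling SelectSingleNode. You can of course iterate over the result set if there are more items; in this example I access them by index because there are only two.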
string sXPathT = "//span[contains(@class,'group-label')]";
string sXPathO = "//span[contains(@class,'group-value')]";
HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);
var vOddL = Doc.DocumentNode.SelectNodes(sXPathT);
var vOddP = Doc.DocumentNode.SelectNodes(sXPathO);
string GroupLabelNewsPaper = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelFish = vOddL.ElementAt(1).InnerText.Trim();
string GroupValueNewspaper = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueFish = vOddP.ElementAt(1).InnerText.Trim();
Console.WriteLine($"{GroupLabelNewsPaper}\t{GroupValueNewspaper}");
Console.WriteLine($"{GroupLabelFish}\t{GroupValueFish}");
Output:
Newspaper 82548828
Fish 8545852485
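As a side note on the six matches mentioned in the question's edit, here is a minimal sketch of how the three ways of testing the class attribute differ. The markup is a shortened copy of the question's sample, and the class name ClassMatchDemo and the local helper Count are hypothetical names used only for illustration. The third expression is a common XPath 1.0 idiom for matching one class token; it is not something the original code used.

```csharp
using System;
using HtmlAgilityPack;

class ClassMatchDemo
{
    static void Main()
    {
        // Shortened copy of the question's markup: one content div, two groups.
        string html =
            "<div class=\"content\" data-id=\"123456789\">" +
            "<div class=\"m-group item\">" +
            "<span class=\"group\"><a href=\"javascript:void(0);\">" +
            "<span class=\"group-label\">Newspaper </span>" +
            "<span class=\"group-value\">82548828</span></a></span>" +
            "<span class=\"group\"><a href=\"javascript:void(0);\">" +
            "<span class=\"group-label\">Fish </span>" +
            "<span class=\"group-value\">8545852485</span></a></span>" +
            "</div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before counting.
        int Count(string xpath) => doc.DocumentNode.SelectNodes(xpath)?.Count ?? 0;

        // Substring test: also matches "group-label" and "group-value".
        Console.WriteLine(Count("//span[contains(@class,'group')]")); // 6

        // Exact attribute test: matches only class="group".
        Console.WriteLine(Count("//span[@class='group']")); // 2

        // Token test: matches "group" as one whole class name, even if the
        // element carries additional classes.
        Console.WriteLine(Count(
            "//span[contains(concat(' ', normalize-space(@class), ' '), ' group ')]")); // 2
    }
}
```

The exact test is enough for this markup. The token test is only worth the extra length if an element might carry several classes, as the "m-group item" div does.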
Update:
To get a specific content node, you can use this XPath:
string xpathForDataId = "//div[@class='content' and @data-id='987654321']";
You can filter the div with the expression above and then get the nodes beneath it like this:
string sXPathT = ".//span[contains(@class,'group-label')]";
string sXPathO = ".//span[contains(@class,'group-value')]";
string xpathForDataId = "//div[@class='content' and @data-id='987654321']";
HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);
var specificNode = Doc.DocumentNode.SelectSingleNode(xpathForDataId);
var vOddL = specificNode.SelectNodes(sXPathT);
var vOddP = specificNode.SelectNodes(sXPathO);
string GroupLabelBread = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelMilk = vOddL.ElementAt(1).InnerText.Trim();
string GroupValueBread = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueMilk = vOddP.ElementAt(1).InnerText.Trim();
Console.WriteLine($"{GroupLabelBread}\t{GroupValueBread}");
Console.WriteLine($"{GroupLabelMilk}\t{GroupValueMilk}");
Note the ".//" in sXPathT and sXPathO. That way we search only the current context rather than the whole document.
Output:
Bread 82548828
Milk 8545852485
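If the number of groups is not known in advance, the same relative ".//" queries can be run once per group instead of indexing two parallel lists. The following is only a sketch under that assumption: the class name GroupLoopDemo and the variable names are hypothetical, and the markup is a condensed copy of the updated sample. It also shows why the question's loop kept returning Newspaper: a query that starts with "//" is evaluated against the whole document, even when it is called on an inner node.

```csharp
using System;
using HtmlAgilityPack;

class GroupLoopDemo
{
    static void Main()
    {
        // Condensed copy of the updated sample: two content divs.
        string html =
            "<div class=\"content\" data-id=\"123456789\"><div class=\"m-group item\">" +
            "<span class=\"group\"><a><span class=\"group-label\">Newspaper </span>" +
            "<span class=\"group-value\">82548828</span></a></span>" +
            "<span class=\"group\"><a><span class=\"group-label\">Fish </span>" +
            "<span class=\"group-value\">8545852485</span></a></span>" +
            "</div></div>" +
            "<div class=\"content\" data-id=\"987654321\"><div class=\"m-group item\">" +
            "<span class=\"group\"><a><span class=\"group-label\">Bread</span>" +
            "<span class=\"group-value\">82548828</span></a></span>" +
            "<span class=\"group\"><a><span class=\"group-label\">Milk </span>" +
            "<span class=\"group-value\">8545852485</span></a></span>" +
            "</div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pick the content div by its data-id.
        HtmlNode content = doc.DocumentNode.SelectSingleNode(
            "//div[@class='content' and @data-id='987654321']");
        if (content == null)
        {
            Console.WriteLine("content div not found");
            return;
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection groups = content.SelectNodes(".//span[@class='group']");
        if (groups == null)
        {
            return;
        }

        foreach (HtmlNode groupNode in groups)
        {
            // ".//" keeps each lookup inside the current group.
            HtmlNode label = groupNode.SelectSingleNode(".//span[@class='group-label']");
            HtmlNode value = groupNode.SelectSingleNode(".//span[@class='group-value']");
            if (label == null || value == null)
            {
                continue; // skip incomplete groups
            }
            Console.WriteLine($"{label.InnerText.Trim()}\t{value.InnerText.Trim()}");
        }
        // Prints Bread and Milk, each followed by a tab and its value.

        // Without the leading dot the search starts at the document root,
        // so the first label of the whole document comes back.
        HtmlNode documentWide = content.SelectSingleNode("//span[@class='group-label']");
        Console.WriteLine(documentWide.InnerText.Trim()); // Newspaper
    }
}
```

Pairing the label and value inside one group avoids relying on two separately selected lists staying in the same order and having the same length.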