HtmlAgilityPack: select data from a sub-node

Question

I need to parse the following HTML:

<h3 class='bar'>
  <a href='http://anysite.com/index.php?showuser=7195' title='Profile view'>THIS_IS_USERNAME</a>
  &nbsp;
  <a href='http://anysite.com/index.php?showuser=7195&amp;f=' class='__user __id7195' title='Profile view'>
      <img src='http://anysite.com/public/style_images/car/user_popup.png' alt='' />
   </a>
</h3>

I need to select the username ("THIS_IS_USERNAME") and the profile link ("http://anysite.com/index.php?showuser=7195").

I can select the top-level h3 node with the following code:

List<HtmlNode> resultSearch = HTMLPage.DocumentNode.Descendants()
                .Where(
                         x => x.Name.Equals("h3")
                         && x.Attributes["class"] != null
                         && x.Attributes["class"].Value.Equals("bar")                         
                      )
                .ToList();

But how do I get not the "h3" node itself but the "a" inside it, whose href attribute holds the profile link and whose text holds the username?

Answer 1

Just query the link node directly; its title attribute is distinctive enough to identify it.

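In this case XPath is probably simpler, because it takes care of all the intermediate null checks. It is also just as type-safe as your Linq query, which would be full of constant strings anyway: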

// XPath string comparisons are case-sensitive, so the values must match the markup exactly.
var node = HTMLPage.DocumentNode.SelectSingleNode("//h3[@class='bar']/a[@title='Profile view' and @href]");
if (node != null)
{
    string link = node.Attributes["href"].Value;
    string username = node.InnerText;
}
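
To try the XPath approach end to end, here is a minimal sketch that inlines the markup from the question and loads it with HtmlDocument.LoadHtml. The class and method names (XPathDemo, FindProfile) are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced. The href is passed through HtmlEntity.DeEntitize because attribute values in the page can contain entities such as &amp;amp;.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Markup from the question, inlined so the sketch needs no network access.
    private const string Html = @"<h3 class='bar'>
  <a href='http://anysite.com/index.php?showuser=7195' title='Profile view'>THIS_IS_USERNAME</a>
  &nbsp;
  <a href='http://anysite.com/index.php?showuser=7195&amp;f=' class='__user __id7195' title='Profile view'>
      <img src='http://anysite.com/public/style_images/car/user_popup.png' alt='' />
   </a>
</h3>";

    // Hypothetical helper: returns null when no matching link exists.
    public static (string Link, string Username)? FindProfile(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns the first match in document order, or null.
        var node = doc.DocumentNode.SelectSingleNode(
            "//h3[@class='bar']/a[@title='Profile view' and @href]");
        if (node == null)
        {
            return null;
        }

        // GetAttributeValue falls back to "" if the attribute is missing.
        string link = HtmlEntity.DeEntitize(node.GetAttributeValue("href", ""));
        string username = node.InnerText.Trim();
        return (link, username);
    }

    public static void Main()
    {
        var profile = FindProfile(Html);
        if (profile != null)
        {
            Console.WriteLine($"{profile.Value.Username} -> {profile.Value.Link}");
        }
    }
}
```

Note that `/a` in the XPath only matches direct children of the h3; use `//a` after the h3 step if the link may be nested deeper.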

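You can write similar code with Linq syntax: it first searches for the link tags and then walks back up to find an h3 parent. That way you don't have to check for intermediate nulls ;):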

var node = HTMLPage.DocumentNode.Descendants("a")
    // keep only links that sit somewhere inside <h3 class="bar">
    .Where(a =>
        a.Ancestors("h3")
            .Any(h3 =>
                h3.Attributes["class"] != null
                && h3.Attributes["class"].Value == "bar"
            )
    )
    // of those, keep links that have the profile title and an href
    .Where(a =>
        a.Attributes["title"] != null
        && a.Attributes["title"].Value == "Profile view"
        && a.Attributes["href"] != null
    )
    .FirstOrDefault();

if (node != null)
{
    string link = node.Attributes["href"].Value;
    string username = node.InnerText;
}
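
If the repeated null checks feel noisy, the same Linq query can be sketched with HtmlNode.GetAttributeValue(name, default), which returns the supplied default when the attribute is absent. The names LinqDemo and FindProfileLink are hypothetical, and the caller is assumed to pass in an already loaded HtmlDocument.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class LinqDemo
{
    // Hypothetical helper: first <a title="Profile view" href=...> inside <h3 class="bar">, or null.
    public static HtmlNode FindProfileLink(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants("a")
            // a missing class attribute yields "", which never equals "bar"
            .Where(a => a.Ancestors("h3")
                .Any(h3 => h3.GetAttributeValue("class", "") == "bar"))
            .FirstOrDefault(a =>
                a.GetAttributeValue("title", "") == "Profile view"
                && a.Attributes["href"] != null);
    }
}
```

Unlike the XPath `/a` step, Ancestors("h3") accepts a link at any depth below the h3, so the two versions can differ on more deeply nested markup.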

Or you can rely on its position as the first <a> child of the "bar" h3:

// A call to First() would throw an exception if the h3 isn't found.
// FirstOrDefault() plus a fallback to an empty h3 node lets the rest of
// the chain run and simply yield null in that case.

var node = (HTMLPage.DocumentNode.Descendants("h3")
    .FirstOrDefault(h3 =>
        h3.Attributes["class"] != null
        && h3.Attributes["class"].Value == "bar")
    ?? HTMLPage.CreateElement("h3"))
    .Elements("a").FirstOrDefault();

if (node != null)
{
    string link = node.Attributes["href"].Value;
    string username = node.InnerText;
}
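
On C# 6 or later, the empty-node fallback can arguably be replaced by the null-conditional operator. The following is only a sketch of that alternative; FirstChildDemo and FirstLinkInBar are made-up names, and the document is assumed to be loaded already.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class FirstChildDemo
{
    // Hypothetical helper: first direct <a> child of the first <h3 class="bar">, or null.
    public static HtmlNode FirstLinkInBar(HtmlDocument doc)
    {
        var h3 = doc.DocumentNode.Descendants("h3")
            .FirstOrDefault(h => h.GetAttributeValue("class", "") == "bar");

        // ?. short-circuits to null when no matching h3 was found,
        // so no placeholder element has to be created.
        return h3?.Elements("a").FirstOrDefault();
    }
}
```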

html

c#

linq

parsing

html-agility-pack