Parsing HTML using LINQ

Question

I need help parsing an HTML file. I'm new to C# and LINQ, and nothing I've tried has managed to extract "link" and "Name 1" from rows like these:

     <tr class="Row">
              <td width="80">
                <div align="left"> <a href="link">details</a>
                </div>
              </td> 
              <td width="152">Name 1</td> 
              <td width="151">Name 2</td> 
              <td width="152">Name 3</td> 
              <td width="151">Name 4</td> 
              <td width="151">Name 5</td> 
              <td width="152">Name 6</td>
      </tr>

      <tr class="Row">
              <td width="80">
                <div align="left"> <a href="link">details</a>
                </div>
              </td> 
              <td width="152">Name 1</td> 
              <td width="151">Name 2</td> 
              <td width="152">Name 3</td> 
              <td width="151">Name 4</td> 
              <td width="151">Name 5</td> 
              <td width="152">Name 6</td>
      </tr>

Here is what I tried:

                var links = htmlDoc.DocumentNode.Descendants()
                    .Where(n => n.Name == "tr")
                    .Where(x => x.Attributes["class"] != null && x.Attributes["class"].Value == "Row")
                    .Select(x => x.Descendants()
                    .Where(s => s.Name == "href"));

                foreach (var link in links)
                {
                    Debug.WriteLine(link);
                }

Answer 1

    var nodes = htmlDoc.DocumentNode.Descendants()
        .Where(n => n.Name == "a" ||
            // <td> cells other than the width="80" link cell, directly under <tr class="Row">
            (n.Name == "td"
             && n.Attributes["width"] != null
             && n.Attributes["width"].Value != "80"
             && n.ParentNode.Name == "tr"
             && n.ParentNode.Attributes["class"] != null
             && n.ParentNode.Attributes["class"].Value == "Row"));

    foreach (var node in nodes)
    {
        if (node.Attributes["href"] != null)
        {
            // <a> element: print the link target
            Debug.WriteLine(node.Attributes["href"].Value);
        }
        else
        {
            // <td> element: print the cell text
            Debug.WriteLine(node.InnerText);
        }
    }

You need something like this. It selects every node named a, plus every td node whose width is not 80 and whose parent tr node has class="Row".
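If the goal is to keep each link together with the "Name 1" cell from the same row, the filter above can be reshaped into a per-row projection. The following is only a sketch: it assumes `htmlDoc` is an `HtmlDocument` from the HtmlAgilityPack NuGet package (the thread never names the library), and `LinqRowDemo`, `Extract`, and the sample markup are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package

public static class LinqRowDemo
{
    // Hypothetical helper: one (Href, Name) pair per <tr class="Row">.
    public static List<(string Href, string Name)> Extract(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        return htmlDoc.DocumentNode.Descendants("tr")
            .Where(tr => tr.GetAttributeValue("class", "") == "Row")
            .Select(tr => (
                // First <a> anywhere inside the row; empty string if there is none.
                Href: tr.Descendants("a")
                        .Select(a => a.GetAttributeValue("href", ""))
                        .FirstOrDefault() ?? "",
                // Direct <td> children, skipping the first cell that holds the link.
                Name: tr.Elements("td")
                        .Skip(1)
                        .Select(td => td.InnerText.Trim())
                        .FirstOrDefault() ?? ""))
            .ToList();
    }

    public static void Main()
    {
        // Sample markup modelled on the question's rows.
        const string html =
            "<table><tr class=\"Row\">" +
            "<td width=\"80\"><div align=\"left\"><a href=\"link\">details</a></div></td>" +
            "<td width=\"152\">Name 1</td><td width=\"151\">Name 2</td>" +
            "</tr></table>";

        foreach (var row in Extract(html))
            Console.WriteLine($"{row.Href} | {row.Name}");
    }
}
```

Projecting per row, rather than flattening all a and td nodes into one sequence, keeps the link and the name paired even when a row is missing one of them.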

Answer 2

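Your LINQ query doesn't reflect the structure of the HTML: href is an attribute of the a element, not a node name, so filtering descendants by Name == "href" matches nothing. This can be done more simply with XPath: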

    // Note: SelectNodes returns null (not an empty collection) when nothing matches.
    var links = htmlDoc.DocumentNode
        .SelectNodes("//tr[@class='Row']/td/div/a")
        .Select(aElem => aElem.Attributes["href"].Value);
c#

linq

parsing

html-parsing