How to extract a specific link in C#?

Question

I am using HtmlAgilityPack to extract some data from a website that contains the following markup:

 <div class="pull-right">
          <ul class="list-inline">
            <li class="social">
              <a target="_blank" href="https://www.facebook.com/wsat.a?ref=ts&amp;fref=ts" class="">
                <i class="icon fa fa-facebook" aria-hidden="true"></i>
              </a>
            </li>
            <li class="social">
              <a target="_blank" href="https://twitter.com/wsat_News" class="">
                <i class="icon fa fa-twitter" aria-hidden="true"></i>
              </a>
            </li>
            <li>
                <a href="/user" class="hide">
                <i class=" icon fa fa-user" aria-hidden="true"></i>
              </a>
            </li>
            <li>
              <a onclick="ga('send', 'event', 'PDF', 'Download', '');" href="https://wsat.com/pdf/issue15170/index.html" target="_blank" class="">

                PDF
                <i class="icon fa fa-file-pdf-o" aria-hidden="true"></i>
              </a>
            </li>

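I have managed to write the code below, which extracts the first link in the HTML, i.e. https://www.facebook.com/wsat. However, I only want to extract the PDF link, https://wsat.com/pdf/issue15170/index.html, and so far I have had no luck. How do I specify which link to extract?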

        var url = "https://wsat.com/";
        var HttpClient = new HttpClient();
        var html = await HttpClient.GetStringAsync(url);
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);


        var links = htmlDocument.DocumentNode.Descendants("div").Where(node => node.GetAttributeValue("class", "").Equals("pull-right")).ToList();

        var alink = links.First().Descendants("a").FirstOrDefault().ChildAttributes("href")?.FirstOrDefault().Value;

        await Launcher.OpenAsync(alink);

Answer 1

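In your query, Descendants("a") selects all the links inside the root div, and FirstOrDefault() then returns only the first of them. What you can do instead is map each link to its href and then use string operations on the resulting collection to find the right one.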

        var alink = links.First().Descendants("a")
            .Select(node => node.ChildAttributes("href").FirstOrDefault()?.Value)
            .Where(s => !string.IsNullOrEmpty(s))
            .ToList();
        foreach (var l in alink)
        {
            Console.WriteLine(l);
        }
        Console.WriteLine();

        var wsatCom = alink.FirstOrDefault(s => s.StartsWith("https://wsat.com"));
        Console.WriteLine(wsatCom);

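Also, the ?. operator needs to go after FirstOrDefault() rather than before it if you want to handle links that have no href. I believe that in that case ChildAttributes("href") returns an empty collection, FirstOrDefault() returns null, and accessing .Value throws a NullReferenceException.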
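To make the point about where ?. belongs concrete, here is a minimal sketch that uses only the base class library. FakeAttribute and NullConditionalDemo are hypothetical names made up for this illustration, not HtmlAgilityPack types; the empty sequence stands in for what ChildAttributes("href") is believed to return for a link without an href.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an attribute object; not an HtmlAgilityPack type.
public sealed class FakeAttribute
{
    public string Value { get; }
    public FakeAttribute(string value) { Value = value; }
}

public static class NullConditionalDemo
{
    // ?. comes after FirstOrDefault(), so a null element short-circuits to null
    // instead of throwing when .Value is read.
    public static string FirstValueOrNull(IEnumerable<FakeAttribute> attributes)
    {
        return attributes.FirstOrDefault()?.Value;
    }

    // Same placement as in the question: ?. guards the sequence, not the element,
    // so an empty (non-null) sequence still ends in a NullReferenceException.
    public static string FirstValueUnsafe(IEnumerable<FakeAttribute> attributes)
    {
        return attributes?.FirstOrDefault().Value;
    }
}
```

With an empty sequence, FirstValueOrNull returns null while FirstValueUnsafe throws, which is the difference described above.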

Answer 2

Use an XPath expression as the selector:

var alink = htmlDocument.DocumentNode
    .SelectSingleNode("//li/a[contains(@onclick, 'PDF')]")
    .GetAttributeValue("href", "");

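XPath explanation (as requested):

It selects an a tag that is a direct child of an li tag at any depth in the document, where the a tag's onclick attribute contains the string 'PDF'.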
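One thing to watch for: SelectSingleNode returns null when nothing matches, so calling GetAttributeValue directly on the result can throw if the page layout changes. A hedged sketch of a null-safe wrapper follows; PdfLinkXPath and FindPdfHref are hypothetical names chosen for this example, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using HtmlAgilityPack;

public static class PdfLinkXPath
{
    // Returns the href of the first <a> directly under an <li> whose onclick
    // contains "PDF". Returns null when no such node exists, and an empty
    // string when the node exists but has no href attribute.
    public static string FindPdfHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode("//li/a[contains(@onclick, 'PDF')]");
        return node?.GetAttributeValue("href", "");
    }
}
```

The caller can then check for null or empty before passing the result to something like Launcher.OpenAsync.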

Answer 3

Could a regex help you? I think it is much easier than walking the links with HTML Agility Pack, and it feels less like relying on luck.

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"https:\/\/wsat\.com\/[\w\-\.]+[^#?\s][^""]+";
        string input = @"<div class=""pull-right"">
          <ul class=""list-inline"">
            <li class=""social"">
              <a target=""_blank"" href=""https://www.facebook.com/wsat.a?ref=ts&amp;fref=ts"" class="""">
                <i class=""icon fa fa-facebook"" aria-hidden=""true""></i>
              </a>
            </li>
            <li class=""social"">
              <a target=""_blank"" href=""https://twitter.com/wsat_News"" class="""">
                <i class=""icon fa fa-twitter"" aria-hidden=""true""></i>
              </a>
            </li>
            <li>
                <a href=""/user"" class=""hide"">
                <i class="" icon fa fa-user"" aria-hidden=""true""></i>
              </a>
            </li>
            <li>
              <a onclick=""ga('send', 'event', 'PDF', 'Download', '');"" href=""https://wsat.com/pdf/issue15170/index.html"" target=""_blank"" class="""">

                PDF
                <i class=""icon fa fa-file-pdf-o"" aria-hidden=""true""></i>
              </a>
            </li>";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
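If you only need the URL itself rather than every match with its index, a variation is to anchor the pattern on the href attribute and use a capture group. This is only a sketch under some assumptions: the href is double-quoted, the PDF links all start with https://wsat.com/pdf/, and PdfLinkRegex / FindPdfHref are hypothetical names invented for the example.

```csharp
using System.Text.RegularExpressions;

public static class PdfLinkRegex
{
    // Matches href="https://wsat.com/pdf/..." and captures the URL
    // up to (but not including) the closing double quote.
    private static readonly Regex PdfHref =
        new Regex(@"href=""(https://wsat\.com/pdf/[^""]+)""", RegexOptions.IgnoreCase);

    // Returns the first captured PDF URL, or null when the HTML has no such link.
    public static string FindPdfHref(string html)
    {
        Match m = PdfHref.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Because it works on raw text, it will not notice things like single-quoted attributes or commented-out markup, so treat it as a quick check rather than a parser.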

Answer 4

For this kind of work I would suggest AngleSharp, which lets you use CSS selectors to select any element you need.

var doc = new HtmlParser().ParseDocument(myHtml);
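// Each <a> is the only element child of its <li>, so nth-child has to be applied to the <li>.
var pdfUrl = doc.QuerySelector("ul.list-inline li:nth-child(4) > a")?.GetAttribute("href");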

Or

var links = doc.QuerySelectorAll("ul.list-inline a").Where(a => a.GetAttribute("href")?.StartsWith("https://wsat.com/pdf/") == true).ToList();

The advantage is that you can test your selector at any time in any browser's developer console, without having to write and compile your C#.
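As a position-independent alternative, a CSS attribute-prefix selector can pick the link by its href instead of its place in the list. The following is a minimal sketch, assuming the AngleSharp NuGet package (0.10 or later, where HtmlParser lives in AngleSharp.Html.Parser); PdfLinkCss and FindPdfHref are hypothetical names used only here.

```csharp
using AngleSharp.Html.Parser;

public static class PdfLinkCss
{
    // Returns the href of the first <a> inside ul.list-inline whose href starts
    // with https://wsat.com/pdf/, or null when there is no such element.
    public static string FindPdfHref(string html)
    {
        var doc = new HtmlParser().ParseDocument(html);
        var anchor = doc.QuerySelector("ul.list-inline a[href^='https://wsat.com/pdf/']");
        return anchor?.GetAttribute("href");
    }
}
```

The same selector string can be tried in a browser console with document.querySelector(...) before it goes into the C# code.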


c#

html-agility-pack