Get all links on the same domain using HtmlAgilityPack

I am writing code to scrape the links on a given web page. I am trying to use HtmlAgilityPack to read the HTML contents (using www.google.co.uk in my example). The code I am using is as follows:

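using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    // Nothing is awaited here, so a synchronous Main is enough.
    static void Main(string[] args)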
    {
        var links = GetLinks(new Uri("https://www.google.co.uk"));
        foreach (var link in links)
        {
            Console.WriteLine(link);
        }
    }

    private static List<string> GetLinks(Uri uri)
    {
        var doc = new HtmlWeb().Load(uri);
        return doc.DocumentNode.Descendants("a")
            .Select(a => a.GetAttributeValue("href", null))
            .Distinct()
            .Where(u => !string.IsNullOrEmpty(u)).ToList();
    }
}

I am removing empty and duplicate links. This gives the following result:

 - https://www.google.co.uk/imghp?hl=en&tab=wi
 - https://maps.google.co.uk/maps?hl=en&tab=wl
 - https://play.google.com/?hl=en&tab=w8
 - https://www.youtube.com/?gl=GB&tab=w1 
 - https://news.google.com/?tab=wn
 - https://mail.google.com/mail/?tab=wm 
 - https://drive.google.com/?tab=wo
 - https://www.google.co.uk/intl/en/about/products?tab=wh
 - http://www.google.co.uk/history/optout?hl=en 
 - /preferences?hl=en
 - https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.co.uk/&ec=GAZAAQ
 - /advanced_search?hl=en-GB&amp;authuser=0 
 - /intl/en/ads/ 
 - /services/
 - /intl/en/about.html
 - https://www.google.co.uk/setprefdomain?prefdom=US&amp;sig=K_eDMDym3RsPb7-MzvJkS4b2Eg4ns%3D
 - /intl/en/policies/privacy/ 
 - /intl/en/policies/terms/

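I want to narrow the links down further and select only those that match the same subdomain "www.google.co.uk", including relative URLs. The resulting list would be narrowed to: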

 - https://www.google.co.uk/imghp?hl=en&tab=wi
 - https://www.google.co.uk/intl/en/about/products?tab=wh
 - http://www.google.co.uk/history/optout?hl=en 
 - /preferences?hl=en
 - /advanced_search?hl=en-GB&amp;authuser=0 
 - /intl/en/ads/ 
 - /services/
 - /intl/en/about.html
 - https://www.google.co.uk/setprefdomain?prefdom=US&amp;sig=K_eDMDym3RsPb7-MzvJkS4b2Eg4ns%3D
 - /intl/en/policies/privacy/ 
 - /intl/en/policies/terms/

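I would like to modify the code above in the most efficient way to achieve this, but I am not sure of the best way to do it with HtmlAgilityPack. I came up with this solution: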

private static List<string> GetLinks(Uri uri)
{
    var doc = new HtmlWeb().Load(uri);
    return doc.DocumentNode.Descendants("a")
        .Select(a => 
        { 
            // An empty default avoids a NullReferenceException on StartsWith when href is missing.
            var val = a.GetAttributeValue("href", string.Empty);

            if (val.StartsWith("/"))
                val = $"{uri.Scheme}://{uri.Host}{val}";

            return val;
        })
        .Distinct()
        .Where(u =>
        {
            return !string.IsNullOrEmpty(u) 
                   && u.Contains(uri.Host); // using contains here is a problem
        }).ToList();
}

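I am well aware of the amount of string manipulation involved here: relative URLs are converted to fully qualified ones and then matched with Contains, which can produce incorrect results. Does anyone have a less wasteful solution for this, in terms of string comparison and manipulation?

Any suggestions are much appreciated!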
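To illustrate why Contains is a problem, here is a minimal sketch using one of the links from the list above. The accounts.google.com URL carries the host name inside its query string, so a substring check accepts it even though its actual host is different. The class and method names (`ContainsPitfall`, `MatchesBySubstring`, `MatchesByParsedHost`) are hypothetical and only serve the illustration.

```csharp
using System;

public static class ContainsPitfall
{
    // The check used in the solution above: true whenever the host text
    // appears anywhere in the link, including the query string.
    public static bool MatchesBySubstring(string link, string host)
    {
        return link.Contains(host);
    }

    // Stricter check for absolute links: parse the link and compare only its host part.
    public static bool MatchesByParsedHost(string link, string host)
    {
        return Uri.TryCreate(link, UriKind.Absolute, out Uri parsed)
               && string.Equals(parsed.Host, host, StringComparison.OrdinalIgnoreCase);
    }
}
```

For the link `https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.co.uk/&ec=GAZAAQ` and the host `www.google.co.uk`, the substring check accepts the link while the parsed-host check rejects it.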

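Unless your HTML is huge and contains a lot of links, I would not worry too much about performance optimization here.

It sounds like the main problem is just finding an alternative to Contains in the Where clause. I would use a Regex for this. You can build the match pattern once at the start of the method and then use IsMatch in the Where, like so: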

private static List<string> GetLinks(Uri uri)
{
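    // Escape the host so "." matches literally, and require a delimiter (or the end of the
    // string) after it so that hosts such as "www.google.co.uk.example.com" do not match.
    var regex = new Regex("^https?://" + Regex.Escape(uri.Host) + "(?=[:/?#]|$)", RegexOptions.IgnoreCase);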
    var doc = new HtmlWeb().Load(uri);

    return doc.DocumentNode
        .Descendants("a")
        .Select(a =>
        {
            var val = a.GetAttributeValue("href", string.Empty);
            return val.StartsWith("/") ? uri.GetLeftPart(UriPartial.Authority) + val : val;
        })
        .Distinct()
        .Where(u => !string.IsNullOrEmpty(u) && regex.IsMatch(u))
        .ToList();
}

Fiddle: https://dotnetfiddle.net/BVdv1Y
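As an alternative to the regex, and only as a sketch rather than part of the answer above, `System.Uri` can resolve relative links against the page address and expose the host for an exact comparison, which avoids both the manual string concatenation and the pattern matching. `LinkFilter` and `SameHostLinks` are hypothetical names; the hrefs are assumed to come from the same `GetAttributeValue("href", ...)` calls used earlier.

```csharp
using System;
using System.Collections.Generic;

public static class LinkFilter
{
    // Hypothetical helper: resolves each href against baseUri and keeps only
    // http/https links whose host equals baseUri.Host (case-insensitive).
    // Returns distinct, fully qualified URLs in their original order.
    public static List<string> SameHostLinks(Uri baseUri, IEnumerable<string> hrefs)
    {
        var result = new List<string>();
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (var href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href))
                continue;

            // Resolves relative hrefs against baseUri; absolute hrefs are kept as they are.
            // TryCreate returns false instead of throwing on malformed input.
            if (!Uri.TryCreate(baseUri, href, out Uri resolved))
                continue;

            // Skip non-web schemes such as mailto: or javascript:
            if (resolved.Scheme != Uri.UriSchemeHttp && resolved.Scheme != Uri.UriSchemeHttps)
                continue;

            // Compare the parsed host only, so the host name appearing in a
            // path or query string of another site does not count as a match.
            if (!string.Equals(resolved.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase))
                continue;

            if (seen.Add(resolved.AbsoluteUri))
                result.Add(resolved.AbsoluteUri);
        }

        return result;
    }
}
```

Inside `GetLinks`, this could be called as `LinkFilter.SameHostLinks(uri, doc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", string.Empty)))`. Note that it returns fully qualified URLs, so relative links such as `/services/` come back with the scheme and host prepended.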