Get all links on the same domain using HtmlAgilityPack
I'm writing code to scrape the links on a given web page. I'm trying HtmlAgilityPack to read the HTML contents (using www.google.co.uk in my example). The code I'm using is as follows:
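using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    // Nothing is awaited here, so a synchronous Main is enough.
    static void Main(string[] args)
    {
        var links = GetLinks(new Uri("https://www.google.co.uk"));
        foreach (var link in links)
        {
            Console.WriteLine(link);
        }
    }

    // Returns every distinct, non-empty href found on the page.
    private static List<string> GetLinks(Uri uri)
    {
        var doc = new HtmlWeb().Load(uri);
        return doc.DocumentNode.Descendants("a")
            .Select(a => a.GetAttributeValue("href", null))
            .Distinct()
            .Where(u => !string.IsNullOrEmpty(u)).ToList();
    }
}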
I'm removing empty and duplicate links. This gives the following results:
- https://www.google.co.uk/imghp?hl=en&tab=wi
- https://maps.google.co.uk/maps?hl=en&tab=wl
- https://play.google.com/?hl=en&tab=w8
- https://www.youtube.com/?gl=GB&tab=w1
- https://news.google.com/?tab=wn
- https://mail.google.com/mail/?tab=wm
- https://drive.google.com/?tab=wo
- https://www.google.co.uk/intl/en/about/products?tab=wh
- http://www.google.co.uk/history/optout?hl=en
- /preferences?hl=en
- https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.co.uk/&ec=GAZAAQ
- /advanced_search?hl=en-GB&authuser=0
- /intl/en/ads/
- /services/
- /intl/en/about.html
- https://www.google.co.uk/setprefdomain?prefdom=US&sig=K_eDMDym3RsPb7-MzvJkS4b2Eg4ns%3D
- /intl/en/policies/privacy/
- /intl/en/policies/terms/
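I'd like to narrow the links down further and select only those that match the same subdomain "www.google.co.uk", including relative URLs. The resulting list would be reduced to: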
- https://www.google.co.uk/imghp?hl=en&tab=wi
- https://www.google.co.uk/intl/en/about/products?tab=wh
- http://www.google.co.uk/history/optout?hl=en
- /preferences?hl=en
- /advanced_search?hl=en-GB&authuser=0
- /intl/en/ads/
- /services/
- /intl/en/about.html
- https://www.google.co.uk/setprefdomain?prefdom=US&sig=K_eDMDym3RsPb7-MzvJkS4b2Eg4ns%3D
- /intl/en/policies/privacy/
- /intl/en/policies/terms/
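I'd like to modify the code above in the most efficient way to achieve this, but I'm not sure of the best way to do it with HtmlAgilityPack. I came up with this solution: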
private static List<string> GetLinks(Uri uri)
{
    var doc = new HtmlWeb().Load(uri);
    return doc.DocumentNode.Descendants("a")
        .Select(a =>
        {
            // Default to an empty string so anchors without an href do not throw below.
            var val = a.GetAttributeValue("href", string.Empty);
            if (val.StartsWith("/"))
                val = $"{uri.Scheme}://{uri.Host}{val}";
            return val;
        })
        .Distinct()
        .Where(u =>
        {
            return !string.IsNullOrEmpty(u)
                && u.Contains(uri.Host); // using contains here is a problem
        }).ToList();
}
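I'm very aware of the amount of string manipulation involved here: relative URLs are changed to fully qualified ones and then matched with "contains", which could leave incorrect results. Does anyone have a less wasteful solution (in terms of string comparison and manipulation)?
Any suggestions are much appreciated!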
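To make the concern about Contains concrete, here is a minimal sketch using the accounts.google.com link from the output above. The class and method names (ContainsPitfall, NaiveSameHost, ParsedSameHost) are hypothetical and only serve the illustration; the second method is one possible stricter check, not the only one.

```csharp
using System;

public static class ContainsPitfall
{
    // Check from the attempt above: true if the host name appears anywhere in the URL.
    public static bool NaiveSameHost(string url, string host) =>
        url.Contains(host);

    // Stricter check: parse the URL and compare only its host component.
    public static bool ParsedSameHost(string url, string host) =>
        Uri.TryCreate(url, UriKind.Absolute, out var parsed)
        && string.Equals(parsed.Host, host, StringComparison.OrdinalIgnoreCase);
}
```

For "https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.co.uk/&ec=GAZAAQ" and the host "www.google.co.uk", NaiveSameHost returns true because the host name occurs inside the query string, while ParsedSameHost returns false because the parsed host is accounts.google.com.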
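Unless your HTML is huge and has a lot of links, I wouldn't worry that much about optimizing performance here.
It sounds like the main issue is just finding an alternative to Contains in the Where clause. I would use a Regex for that. You can build the match pattern once at the start of the method and then use IsMatch in the Where, like so: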
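private static List<string> GetLinks(Uri uri)
{
    // Escape the host so its dots match literally, and require the host to be
    // followed by a port, path, query, fragment, or the end of the string.
    // Without this, "www.google.co.uk.example.com" would also match.
    var regex = new Regex("^https?://" + Regex.Escape(uri.Host) + "(?=[:/?#]|$)", RegexOptions.IgnoreCase);
    var doc = new HtmlWeb().Load(uri);
    return doc.DocumentNode
        .Descendants("a")
        .Select(a =>
        {
            var val = a.GetAttributeValue("href", string.Empty);
            // Turn root-relative links into absolute ones on the same authority.
            return val.StartsWith("/") ? uri.GetLeftPart(UriPartial.Authority) + val : val;
        })
        .Distinct()
        .Where(u => !string.IsNullOrEmpty(u) && regex.IsMatch(u))
        .ToList();
}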
Fiddle: https://dotnetfiddle.net/BVdv1Y
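As an alternative to the regex, here is a minimal sketch that lets System.Uri do the parsing: each href is resolved against the page URI and only the parsed Host values are compared. SameHostLinks and Filter are hypothetical names, and the helper takes the href values as plain strings so that it can run without HtmlAgilityPack. Note that it returns the resolved absolute URLs rather than the original relative ones, and that it treats http and https links on the same host alike.

```csharp
using System;
using System.Collections.Generic;

public static class SameHostLinks
{
    // Hypothetical helper: resolves each href against baseUri and keeps
    // those whose host equals baseUri.Host, without duplicates.
    public static List<string> Filter(Uri baseUri, IEnumerable<string> hrefs)
    {
        var result = new List<string>();
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (var href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href))
                continue;

            // Relative hrefs such as "/services/" are resolved against baseUri;
            // absolute hrefs are parsed as they are.
            if (!Uri.TryCreate(baseUri, href, out var resolved))
                continue;

            // Compare the parsed host only, so a host name that merely appears
            // in a query string or as a prefix of another host does not match.
            if (!string.Equals(resolved.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase))
                continue;

            // Keep the first occurrence of each resolved URL.
            if (seen.Add(resolved.AbsoluteUri))
                result.Add(resolved.AbsoluteUri);
        }

        return result;
    }
}
```

With HtmlAgilityPack, the input could be the same projection used in the question, for example doc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", string.Empty)), passed as the second argument together with the page URI as the first.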