How can I extract links from a string with HTML content using HtmlAgilityPack?

Question

for (int i = 0; i < numberoflinks; i++)
{
    // The page is already downloaded into a string here...
    string downloadString = client.DownloadString(mainlink + i + ".html");
    // ...but HtmlWeb.Load only accepts a URL, so the string goes unused.
    var document = new HtmlWeb().Load(url);
    var urls = document.DocumentNode.Descendants("img")
                       .Select(e => e.GetAttributeValue("src", null))
                       .Where(s => !String.IsNullOrEmpty(s));
}

The problem is that HtmlWeb().Load expects a URL, but I want to load the string downloadString, which already contains the HTML content.

Update:

This is what I have tried now:

for (int i = 0; i < numberoflinks; i++)
{
    string downloadString = client.DownloadString(mainlink + i + ".html");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.Load(downloadString);
    var urls = document.DocumentNode.Descendants("img")
                       .Select(e => e.GetAttributeValue("src", null))
                       .Where(s => !String.IsNullOrEmpty(s));
}

But I get an exception on this line:

document.Load(downloadString);

Illegal characters in path.

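What I want to do is download/extract all the .JPG images from each link, without first saving the page to disk. That is: download the page content into a string, extract every image link in the HTML that ends in .JPG, and then download those JPGs.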

Answer 1

You should be able to use HtmlDocument's LoadHtml() method to process a string of HTML.

From the source code:

/// <summary>
/// Loads the HTML document from the specified string.
/// </summary>
/// <param name="html">String containing the HTML document to load. May not be null.</param>
public void LoadHtml(string html)

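The Load(string) overload you called expects a file name and treats its argument as a path on disk. Passing it the HTML markup itself is what triggers the "illegal characters in path" message.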
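Putting this together with the goal stated in the question, here is a minimal sketch of how the loop body might look with LoadHtml(). The class name JpgLinkExtractor, the helper ExtractJpgUrls, the sample markup, and the example.com URL are all made up for illustration and are not part of Html Agility Pack. It also assumes every src value is a well-formed absolute or relative URL; a malformed one would make the Uri constructor throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class JpgLinkExtractor
{
    // Hypothetical helper (not part of Html Agility Pack): returns the absolute
    // URLs of all <img> elements in the markup whose path ends in ".jpg".
    public static List<Uri> ExtractJpgUrls(string html, Uri pageUri)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // parses markup held in memory; no file or URL involved

        return document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", string.Empty))
            .Where(s => !string.IsNullOrEmpty(s))
            .Select(s => new Uri(pageUri, s)) // resolves relative src values against the page
            .Where(u => u.AbsolutePath.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    static void Main()
    {
        // Inline sample markup standing in for client.DownloadString(mainlink + i + ".html").
        string html = "<html><body><img src='a.JPG'><img src='/img/b.png'>"
                    + "<img src='http://example.com/c.jpg'></body></html>";
        var pageUri = new Uri("http://example.com/pages/1.html"); // placeholder URL

        foreach (Uri jpg in ExtractJpgUrls(html, pageUri))
        {
            Console.WriteLine(jpg);
            // The image itself could then be fetched with the same WebClient,
            // e.g. client.DownloadData(jpg) to keep the bytes in memory.
        }
    }
}
```

Resolving each src against the page URL matters because pages often use relative image paths, which a WebClient cannot download as-is. The comparison ignores case so that both .JPG and .jpg match.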

.net

c#

winforms

html-agility-pack