Ignore whitespace in HtmlNode.InnerText

I have this HTML snippet:

<p>Rendered on a website, 
this will all be on one line.</p>
<p>This would be on another line.</p>

C# code:

HtmlDocument doc = new HtmlDocument();
doc.Load(path);

string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

Now "text" will span 3 lines:

Rendered on a website, 
this will all be on one line.
This would be on another line.

But I want:

Rendered on a website, this will all be on one line.
This would be on another line.

Is this possible with HtmlAgilityPack?

You could do it like this:

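using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

string html = @"<p>Rendered on a website,
                this will all be on one line.</p>
                <p>This would be on another line.</p>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
Regex r = new Regex(@"\s+");
// Join the line break that follows a comma, then split on the remaining breaks
// (both "\r\n" and "\n", since the literal's line endings depend on the source file).
var sentences = text.Replace(",\r\n", ", ").Replace(",\n", ", ")
                    .Split(new string[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
// Collapse runs of whitespace inside each sentence and trim the ends.
var finalText = string.Join("\r\n", sentences.Select(s => r.Replace(s, " ").Trim()));

Console.WriteLine(text + "\n");
Console.WriteLine(finalText + "\n");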

You don't really need the regex; I only used it to get rid of the tab/spacing characters that came from hardcoding the HTML in the html variable.
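
If matching on ",\r\n" feels too tied to this particular input, a variation is to treat each <p> element as one output line and collapse the whitespace inside it. This is only a sketch under a few assumptions: that paragraphs are the unit you want one line per, that the markup lives in <p> tags, and that the helper name ParagraphText.Extract is purely illustrative (it is not part of HtmlAgilityPack). It uses SelectNodes, InnerText and HtmlEntity.DeEntitize from the library.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ParagraphText
{
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Returns one line of text per <p> element, with inner whitespace collapsed.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
            return string.Empty;

        var lines = paragraphs
            .Select(p => HtmlEntity.DeEntitize(p.InnerText))
            // Any run of spaces, tabs or line breaks becomes a single space.
            .Select(t => Whitespace.Replace(t, " ").Trim())
            .Where(t => t.Length > 0);

        return string.Join("\r\n", lines);
    }
}
```

Calling ParagraphText.Extract on the HTML from the question should give the two lines asked for, because the line break inside the first <p> is collapsed into a space while the break between paragraphs comes from the join. Text outside <p> elements is ignored by this sketch.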

Output of the two Console.WriteLine calls: