Ignore whitespace in HtmlNode.InnerText
I have this HTML snippet:
<p>Rendered on a website,
this will all be on one line.</p>
<p>This would be on another line.</p>
C# code:
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
Now "text" spans 3 lines:
Rendered on a website,
this will all be on one line.
This would be on another line.
But I want:
Rendered on a website, this will all be on one line.
This would be on another line.
Is this possible with HtmlAgilityPack?
You can do it like this:
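using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

string html = @"<p>Rendered on a website,
this will all be on one line.</p>
<p>This would be on another line.</p>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
// Matches any run of whitespace so it can be collapsed into a single space.
Regex r = new Regex(@"\s+");
// Rejoin a line that ends in a comma with the line after it, then split on the
// remaining line breaks. Note: this assumes "\r\n" line endings in the input.
var sentences = text.Replace(",\r\n", ", ").Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var finalText = string.Join("\r\n", sentences.Select(s => r.Replace(s, " ").Trim()));
Console.WriteLine(text + "\n");
Console.WriteLine(finalText + "\n");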
You don't really need the regex; I only used it to get rid of the tab/spacing
characters that were added by hard-coding the HTML in the html variable.
Output:
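
As a variation on the approach above, here is a minimal sketch that works per paragraph instead of splitting the combined text on "\r\n". It assumes that every output line should correspond to one <p> element, which holds for the snippet in the question but may not hold for other markup. The class name ParagraphText and the method name Extract are hypothetical, chosen only for illustration; SelectNodes, InnerText and HtmlEntity.DeEntitize are the HtmlAgilityPack calls.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class ParagraphText
{
    // Matches any run of whitespace (spaces, tabs, CR, LF).
    static readonly Regex Whitespace = new Regex(@"\s+");

    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();

        // SelectNodes returns null when nothing matches, so guard against that.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                // Decode entities, then collapse whitespace inside this paragraph only.
                string raw = HtmlEntity.DeEntitize(p.InnerText);
                string line = Whitespace.Replace(raw, " ").Trim();
                if (line.Length > 0)
                {
                    lines.Add(line);
                }
            }
        }

        // One output line per non-empty paragraph.
        return string.Join(Environment.NewLine, lines);
    }

    static void Main()
    {
        string html = "<p>Rendered on a website,\r\nthis will all be on one line.</p>\r\n"
                    + "<p>This would be on another line.</p>";
        Console.WriteLine(Extract(html));
    }
}
```

Because the line breaks come from the <p> boundaries rather than from the source text, this version does not depend on whether the input uses "\r\n" or "\n", and it does not need the special case for a line ending in a comma. For the snippet in the question it should give the two lines asked for.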