使用正确的换行符将(渲染)HTML 转换为文本

Convert (render) HTML to Text with correct line-breaks

我需要将 HTML 字符串转换为纯文本(最好使用 HTML 敏捷包)。使用适当的空格,尤其是 适当的换行符 .

"proper line-breaks" 我的意思是这段代码:

<div>
    <div>
        <div>
            line1
        </div>
    </div>
</div>
<div>line2</div>

应转换为

line1
line2

只有一个换行符。

我见过的大多数解决方案只是将所有 <div> <br> <p> 标签转换为 \n,这显然是 s*cks.

对于 C# 的 html-to-plaintext 呈现 logic 有什么建议吗?不是完整的代码,至少像 "replace all closing DIVs with line-breaks, but only if the next sibling is not a DIV too" 这样的常见逻辑答案确实有帮助。

我尝试过的事情:简单地获取 .InnerText 属性(显然错误),正则表达式(缓慢,痛苦,大量黑客攻击,正则表达式比 HtmlAgilityPack 慢 12 倍 - 我测量了它) ,这个 solution 和类似的(returns 然后需要更多的换行符)

我对 html-agility-pack 了解不多,但这里有一个 c# 替代方案。

    public string GetPlainText()
    {
        WebRequest request = WebRequest.Create("URL for page you want to 'stringify'");
        WebResponse response = request.GetResponse();
        Stream data = response.GetResponseStream();
        string html = String.Empty;
        using (StreamReader sr = new StreamReader(data))
        {
            html = sr.ReadToEnd();
        }

        html = Regex.Replace(html, "<.*?>", "\n");

        html = Regex.Replace(html, @"\r|\n|\n|\r", @"$");
        html = Regex.Replace(html, @"$ +", @"$");
        html = Regex.Replace(html, @"($)+", Environment.NewLine);

        return html;
    }

如果您打算在 html 页面中显示此内容,请将 Environment.NewLine 替换为 <br/>

下面的 class 提供了 innerText 的替代实现。它不会为后续的 div 发出一个以上的换行符,因为它只考虑区分不同文本内容的标签。评估每个文本节点的父节点以确定是否要插入换行符或 space。因此,任何不包含直接文本的标签都会被自动忽略。

您提供的案例提供了与您期望的相同的结果。此外:

<div>ABC<br>DEF<span>GHI</span></div>

给予

ABC
DEF GHI

同时

<div>ABC<br>DEF<div>GHI</div></div>

给予

ABC
DEF
GHI

因为 div 是块标记。 scriptstyle 元素被完全忽略。 HttpUtility.HtmlDecode 实用方法(在 System.Web 中)用于解码 HTML 转义文本,如 &amp;。多次出现的白色 space (\s+) 被替换为单个 space。 br 标签如果重复不会导致多个换行符。

static class HtmlTextProvider
{
    private static readonly HashSet<string> InlineElementNames = new HashSet<string>
    {
        //from https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elemente
        "b", "big", "i", "small", "tt", "abbr", "acronym", 
        "cite", "code", "dfn", "em", "kbd", "strong", "samp", 
        "var", "a", "bdo", "br", "img", "map", "object", "q", 
        "script", "span", "sub", "sup", "button", "input", "label", 
        "select", "textarea"
    }; 

    private static readonly Regex WhitespaceNormalizer = new Regex(@"(\s+)", RegexOptions.Compiled);

    private static readonly HashSet<string> ExcludedElementNames = new HashSet<string>
    {
        "script", "style"
    }; 

    public static string GetFormattedInnerText(this HtmlDocument document)
    {
        var textBuilder = new StringBuilder();
        var root = document.DocumentNode;
        foreach (var node in root.Descendants())
        {
            if (node is HtmlTextNode && !ExcludedElementNames.Contains(node.ParentNode.Name))
            {
                var text = HttpUtility.HtmlDecode(node.InnerText);
                text = WhitespaceNormalizer.Replace(text, " ").Trim();
                if(string.IsNullOrWhiteSpace(text)) continue;
                var whitespace = InlineElementNames.Contains(node.ParentNode.Name) ? " " : Environment.NewLine;
                //only 
                if (EndsWith(textBuilder, " ") && whitespace == Environment.NewLine)
                {
                    textBuilder.Remove(textBuilder.Length - 1, 1);
                    textBuilder.AppendLine();
                }
                textBuilder.Append(text);
                textBuilder.Append(whitespace);
                if (!char.IsWhiteSpace(textBuilder[textBuilder.Length - 1]))
                {
                    if (InlineElementNames.Contains(node.ParentNode.Name))
                    {
                        textBuilder.Append(' ');
                    }
                    else
                    {
                        textBuilder.AppendLine();
                    }
                }
            }
            else if (node.Name == "br" && EndsWith(textBuilder, Environment.NewLine))
            {
                textBuilder.AppendLine();
            }
        }
        return textBuilder.ToString().TrimEnd(Environment.NewLine.ToCharArray());
    }

    private static bool EndsWith(StringBuilder builder, string value)
    {
        return builder.Length > value.Length && builder.ToString(builder.Length - value.Length, value.Length) == value;
    }
}

下面的代码适合我:

 static void Main(string[] args)
        {
              StringBuilder sb = new StringBuilder();
        string path = new WebClient().DownloadString("https://www.google.com");
        HtmlDocument htmlDoc = new HtmlDocument();
        ////htmlDoc.LoadHtml(File.ReadAllText(path));
        htmlDoc.LoadHtml(path);
        var bodySegment = htmlDoc.DocumentNode.Descendants("body").FirstOrDefault();
        if (bodySegment != null)
        {
            foreach (var item in bodySegment.ChildNodes)
            {
                if (item.NodeType == HtmlNodeType.Element && string.Compare(item.Name, "script", true) != 0)
                {
                    foreach (var a in item.Descendants())
                    {
                        if (string.Compare(a.Name, "script", true) == 0 || string.Compare(a.Name, "style", true) == 0)
                        {
                            a.InnerHtml = string.Empty;
                        }
                    }
                    sb.AppendLine(item.InnerText.Trim());
                }
            }
        }


            Console.WriteLine(sb.ToString());
            Console.Read();
        }

下面的代码与提供的示例一起正常工作,甚至处理了一些奇怪的东西,如 <div><br></div>,还有一些地方需要改进,但基本思想是存在的。看评论。

public static string FormatLineBreaks(string html)
{
    //first - remove all the existing '\n' from HTML
    //they mean nothing in HTML, but break our logic
    html = html.Replace("\r", "").Replace("\n", " ");

    //now create an Html Agile Doc object
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //remove comments, head, style and script tags
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//comment() | //script | //style | //head"))
    {
        node.ParentNode.RemoveChild(node);
    }

    //now remove all "meaningless" inline elements like "span"
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//span | //label")) //add "b", "i" if required
    {
        node.ParentNode.ReplaceChild(HtmlNode.CreateNode(node.InnerHtml), node);
    }

    //block-elements - convert to line-breaks
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//p | //div")) //you could add more tags here
    {
        //we add a "\n" ONLY if the node contains some plain text as "direct" child
        //meaning - text is not nested inside children, but only one-level deep

        //use XPath to find direct "text" in element
        var txtNode = node.SelectSingleNode("text()");

        //no "direct" text - NOT ADDDING the \n !!!!
        if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

        //"surround" the node with line breaks
        node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
        node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
    }

    //todo: might need to replace multiple "\n\n" into one here, I'm still testing...

    //now BR tags - simply replace with "\n" and forget
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//br"))
        node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

    //finally - return the text which will have our inserted line-breaks in it
    return doc.DocumentNode.InnerText.Trim();

    //todo - you should probably add "&code;" processing, to decode all the &nbsp; and such
}    

//here's the extension method I use
private static HtmlNodeCollection SafeSelectNodes(this HtmlNode node, string selector)
{
    return (node.SelectNodes(selector) ?? new HtmlNodeCollection(node));
}

我不认为 SO 是为了交换编写完整代码解决方案的赏金。我认为最好的答案是那些提供指导并帮助您自己解决问题的答案。本着这种精神,我想到了一个应该可行的过程:

  1. 用单个space替换任何长度的白色space字符(这是代表标准HTML白色space处理规则)
  2. </div> 的所有实例替换为换行符
  3. 用单个换行符折叠任何多个换行符实例
  4. 用换行符替换 </p><br><br/> 的实例
  5. 删除所有剩余的 html open/close 标签
  6. 展开任何实体,例如&trade; 根据需要
  7. Trim 删除尾随和前导的输出 spaces

基本上,您希望每个段落或换行符制表符有一个换行符,但要用一个闭包折叠多个 div 闭包 - 所以先做这些。

最后请注意,您真正执行的是HTML布局,这取决于标签的CSS。您看到的行为是因为 div 默认为块 display/layout 模式。 CSS 会改变这一点。如果没有无头 layout/rendering 引擎,即可以处理 CSS.

的东西,就没有简单的方法来解决这个问题。

但对于您的简单示例,上述方法应该是合理的。

疑虑:

  1. 不可见标签(脚本、样式)
  2. 块级标签
  3. 内嵌标签
  4. Br 标签
  5. 可换行空格(前导、尾随和多个空格)
  6. Hard spaces
  7. 实体

代数决定:

  plain-text = Process(Plain(html))

  Plain(node-s) => Plain(node-0), Plain(node-1), ..., Plain(node-N)
  Plain(BR) => BR
  Plain(not-visible-element(child-s)) => nil
  Plain(block-element(child-s)) => BS, Plain(child-s), BE
  Plain(inline-element(child-s)) => Plain(child-s)   
  Plain(text) => ch-0, ch-1, .., ch-N

  Process(symbol-s) => Process(start-line, symbol-s)

  Process(start-line, BR, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(start-line, BS, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, BE, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
  Process(start-line, space, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, common-symbol, symbol-s) => Print(common-symbol), 
                                                  Process(not-ws, symbol-s)

  Process(not-ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(not-ws, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
  Process(not-ws, space, symbol-s) => Process(ws, symbol-s)
  Process(not-ws, common-symbol, symbol-s) => Process(ws, symbol-s)

  Process(ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(ws, hard-space, symbol-s) => Print(' '), Print(' '), 
                                       Process(not-ws, symbol-s)
  Process(ws, space, symbol-s) => Process(ws, symbol-s)
  Process(ws, common-symbol, symbol-s) => Print(' '), Print(common-symbol),
                                          Process(not-ws, symbol-s)

HtmlAgilityPack 和 System.Xml.Linq 的 C# 决策:

  //HtmlAgilityPack part
  public static string ToPlainText(this HtmlAgilityPack.HtmlDocument doc)
  {
    var builder = new System.Text.StringBuilder();
    var state = ToPlainTextState.StartLine;

    Plain(builder, ref state, new[]{doc.DocumentNode});
    return builder.ToString();
  }
  static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<HtmlAgilityPack.HtmlNode> nodes)
  {
    foreach (var node in nodes)
    {
      if (node is HtmlAgilityPack.HtmlTextNode)
      {
        var text = (HtmlAgilityPack.HtmlTextNode)node;
        Process(builder, ref state, HtmlAgilityPack.HtmlEntity.DeEntitize(text.Text).ToCharArray());
      }
      else
      {
        var tag = node.Name.ToLower();

        if (tag == "br")
        {
          builder.AppendLine();
          state = ToPlainTextState.StartLine;
        }
        else if (NonVisibleTags.Contains(tag))
        {
        }
        else if (InlineTags.Contains(tag))
        {
          Plain(builder, ref state, node.ChildNodes);
        }
        else
        {
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
          Plain(builder, ref state, node.ChildNodes);
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
        }

      }

    }
  }

  //System.Xml.Linq part
  public static string ToPlainText(this IEnumerable<XNode> nodes)
  {
    var builder = new System.Text.StringBuilder();
    var state = ToPlainTextState.StartLine;

    Plain(builder, ref state, nodes);
    return builder.ToString();
  }
  static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<XNode> nodes)
  {
    foreach (var node in nodes)
    {
      if (node is XElement)
      {
        var element = (XElement)node;
        var tag = element.Name.LocalName.ToLower();

        if (tag == "br")
        {
          builder.AppendLine();
          state = ToPlainTextState.StartLine;
        }
        else if (NonVisibleTags.Contains(tag))
        {
        }
        else if (InlineTags.Contains(tag))
        {
          Plain(builder, ref state, element.Nodes());
        }
        else
        {
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
          Plain(builder, ref state, element.Nodes());
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
        }

      }
      else if (node is XText)
      {
        var text = (XText)node;
        Process(builder, ref state, text.Value.ToCharArray());
      }
    }
  }
  //common part
  public static void Process(System.Text.StringBuilder builder, ref ToPlainTextState state, params char[] chars)
  {
    foreach (var ch in chars)
    {
      if (char.IsWhiteSpace(ch))
      {
        if (IsHardSpace(ch))
        {
          if (state == ToPlainTextState.WhiteSpace)
            builder.Append(' ');
          builder.Append(' ');
          state = ToPlainTextState.NotWhiteSpace;
        }
        else
        {
          if (state == ToPlainTextState.NotWhiteSpace)
            state = ToPlainTextState.WhiteSpace;
        }
      }
      else
      {
        if (state == ToPlainTextState.WhiteSpace)
          builder.Append(' ');
        builder.Append(ch);
        state = ToPlainTextState.NotWhiteSpace;
      }
    }
  }
  static bool IsHardSpace(char ch)
  {
    return ch == 0xA0 || ch ==  0x2007 || ch == 0x202F;
  }

  private static readonly HashSet<string> InlineTags = new HashSet<string>
  {
      //from https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elemente
      "b", "big", "i", "small", "tt", "abbr", "acronym", 
      "cite", "code", "dfn", "em", "kbd", "strong", "samp", 
      "var", "a", "bdo", "br", "img", "map", "object", "q", 
      "script", "span", "sub", "sup", "button", "input", "label", 
      "select", "textarea"
  };

  private static readonly HashSet<string> NonVisibleTags = new HashSet<string>
  {
      "script", "style"
  };

  public enum ToPlainTextState
  {
    StartLine = 0,
    NotWhiteSpace,
    WhiteSpace,
  }

}

示例:

// <div>  1 </div>  2 <div> 3  </div>
1
2
3
//  <div>1  <br/><br/>&#160; <b> 2 </b> <div>   </div><div> </div>  &#160;3</div>
1

  2
 3
//  <span>1<style> text </style><i>2</i></span>3
123
//<div>
//    <div>
//        <div>
//            line1
//        </div>
//    </div>
//</div>
//<div>line2</div>
line1
line2

我的项目总是使用 CsQuery。据推测,它比 HtmlAgilityPack 更快,并且更容易使用 css 选择器而不是 xpath。

var html = @"<div>
    <div>
        <div>
            line1
        </div>
    </div>
</div>
<div>line2</div>";

var lines = CQ.Create(html)
              .Text()
              .Replace("\r\n", "\n") // I like to do this before splitting on line breaks
              .Split('\n')
              .Select(s => s.Trim()) // Trim elements
              .Where(s => !s.IsNullOrWhiteSpace()) // Remove empty lines
              ;

var result = string.Join(Environment.NewLine, lines);

上面的代码按预期工作,但是如果您有一个更复杂的示例和预期的结果,则可以轻松容纳此代码。

例如,如果你想保留 <br>,你可以在 html 变量中将其替换为类似“---br---”的内容,并在最后一次拆分它结果。

No-regex 解决方案:

while (text.IndexOf("\n\n") > -1 || text.IndexOf("\n \n") > -1)
{
    text = text.Replace("\n\n", "\n");
    text = text.Replace("\n \n", "\n");
}

正则表达式:

text = Regex.Replace(text, @"^\s*$\n|\r", "", RegexOptions.Multiline).TrimEnd();

另外,我记得,

text = HtmlAgilityPack.HtmlEntity.DeEntitize(text);

帮个忙。

03/2021 更新

更新包括 HtmlAgilityPack 更改(新方法而不是不存在的方法)和 HTML 解码 HTML 实体(即。)。

public static string FormatLineBreaks(string html)
{
    //first - remove all the existing '\n' from HTML
    //they mean nothing in HTML, but break our logic
    html = html.Replace("\r", "").Replace("\n", " ");

    //now create an Html Agile Doc object
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //remove comments, head, style and script tags
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//comment() | //script | //style | //head") ?? Enumerable.Empty<HtmlNode>())
    {
        node.ParentNode.RemoveChild(node);
    }

    //now remove all "meaningless" inline elements like "span"
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span | //label") ?? Enumerable.Empty<HtmlNode>()) //add "b", "i" if required
    {
        node.ParentNode.ReplaceChild(HtmlNode.CreateNode(node.InnerHtml), node);
    }

    //block-elements - convert to line-breaks
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p | //div") ?? Enumerable.Empty<HtmlNode>()) //you could add more tags here
    {
        //we add a "\n" ONLY if the node contains some plain text as "direct" child
        //meaning - text is not nested inside children, but only one-level deep

        //use XPath to find direct "text" in element
        var txtNode = node.SelectSingleNode("text()");

        //no "direct" text - NOT ADDDING the \n !!!!
        if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

        //"surround" the node with line breaks
        node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
        node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
    }

    //todo: might need to replace multiple "\n\n" into one here, I'm still testing...

    //now BR tags - simply replace with "\n" and forget
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//br") ?? Enumerable.Empty<HtmlNode>())
        node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

    //finally - return the text which will have our inserted line-breaks in it
    return WebUtility.HtmlDecode(doc.DocumentNode.InnerText.Trim());

    //todo - you should probably add "&code;" processing, to decode all the &nbsp; and such
}