How to traverse an FSharp.Data HtmlDocument to extract content as a string?

I want to write a function that takes me from this HTML:

<div>
  <h1>Some header.</h1>
  <ul>
    <li>
      <p>Hello world!</p>
    </li>
    <li>
      <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
    </li>
  </ul>
</div>

to this string:

Some header. Hello world! What is going on? This is a link.

In other words, I want this test to pass:

let testInput: string = """
<div>
  <h1>Some header.</h1>
  <ul>
    <li>
      <p>Hello world!</p>
    </li>
    <li>
      <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
    </li>
  </ul>
</div>
"""

let getContentsFromHtmlDocument (doc: HtmlDocument) =
  let getInner (node: HtmlNode): string =
    // How can I traverse this tree?
    ""
  let result =
    doc.Descendants ["h1"; "p"; "a"]
    |> Seq.map getInner
    |> List.ofSeq
    |> List.fold (+) ""
  result

[<Test>]
let Test1 () =
    let htmlDoc: HtmlDocument = HtmlDocument.Parse(testInput)
    let res = getContentsFromHtmlDocument htmlDoc
    Assert.AreEqual("Some header. Hello world! What is going on? This is a link.", res)

But I can't figure out how to traverse the tree. Any help would be greatly appreciated. Thanks!

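HtmlNodeExtensions provides the extension methods you would normally use to traverse the tree. For your specific use case, there is HtmlNodeExtensions.DirectInnerText(n), which can also be called as node.DirectInnerText().

That said, to pass the test you need the inner texts separated by spaces, and String.Join does that more efficiently than folding a list with (+).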

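open System       // String.Join
open FSharp.Data  // HtmlDocument, HtmlNode and their extension methods

let getContentsFromHtmlDocument (doc: HtmlDocument) =
    // Text that sits directly inside the node, ignoring child elements.
    let getInner (node: HtmlNode): string =
        node.DirectInnerText()

    let result =
        doc.Descendants ["h1"; "p"; "a"]
        |> Seq.map getInner
        |> fun all -> String.Join(" ", all)

    result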

There is still a problem, though:

<p>What is going on? <a href="http://example.com">This is a link</a>.</p>

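This will be joined as:

"What is going on? . This is a link" instead of "What is going on? This is a link.", because DirectInnerText on the <p> returns only its own text nodes ("What is going on? ." with the link skipped), and the <a> is then visited as a separate descendant. The latter result cannot be produced with your current structure.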
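
One possible way around this, offered as a sketch and not as the only approach, is to change the structure slightly: select only the block-level elements (h1 and p) and call InnerText() on them, which also includes the text of nested elements such as the link. The function name getContentsWithInlineText and the shortened sample input are made up for illustration, and the sketch assumes no selected element is nested inside another selected element (otherwise its text would appear twice).

```fsharp
// Assumes the FSharp.Data package is referenced, e.g. in a script:
// #r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical variant of getContentsFromHtmlDocument: pick only the
// block-level elements and let InnerText() flatten their inline children.
let getContentsWithInlineText (doc: HtmlDocument) : string =
    doc.Descendants ["h1"; "p"]
    |> Seq.map (fun node -> node.InnerText())
    |> String.concat " "

// Shortened version of the input from the question.
let sample = """<div><h1>Some header.</h1><p>What is going on? <a href="http://example.com">This is a link</a>.</p></div>"""

let result = getContentsWithInlineText (HtmlDocument.Parse sample)
printfn "%s" result
```

Because the <a> is no longer selected on its own, its text stays in place inside the surrounding paragraph, so the trailing "." ends up after the link text.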