How to traverse an FSharp.Data HtmlDocument to extract content as a string?
I want to write a function that takes me from this HTML:
    <div>
      <h1>Some header.</h1>
      <ul>
        <li>
          <p>Hello world!</p>
        </li>
        <li>
          <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
        </li>
      </ul>
    </div>
to this string:
Some header. Hello world! What is going on? This is a link.
In other words, I want this test to pass:
    open FSharp.Data
    open NUnit.Framework

    let testInput: string = """
    <div>
      <h1>Some header.</h1>
      <ul>
        <li>
          <p>Hello world!</p>
        </li>
        <li>
          <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
        </li>
      </ul>
    </div>
    """

    let getContentsFromHtmlDocument (doc: HtmlDocument) =
        let getInner (node: HtmlNode): string =
            // How can I traverse this tree?
            ""
        let result =
            doc.Descendants ["h1"; "p"; "a"]
            |> Seq.map getInner
            |> List.ofSeq
            |> List.fold (+) ""
        result

    [<Test>]
    let Test1 () =
        let htmlDoc: HtmlDocument = HtmlDocument.Parse(testInput)
        let res = getContentsFromHtmlDocument htmlDoc
        Assert.AreEqual("Some header. Hello world! What is going on? This is a link.", res)
But I can't figure out how to traverse the tree. Any help would be greatly appreciated! Thanks.
HtmlNodeExtensions contains the extension methods you would normally use to traverse the tree. For your specific use case, there is HtmlNodeExtensions.DirectInnerText(n).
That said, to pass the test you need the inner texts separated by spaces, and String.Join does that more efficiently than folding with (+).
    open System
    open FSharp.Data

    let getContentsFromHtmlDocument (doc: HtmlDocument) =
        let getInner (node: HtmlNode): string =
            // Only the text directly inside this node, not the text of its child elements.
            node.DirectInnerText()
        let result =
            doc.Descendants ["h1"; "p"; "a"]
            |> Seq.map getInner
            |> fun all -> String.Join(" ", all)
        result
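One problem remains:
    <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
This will be joined as:
    What is going on? . This is a link
instead of
    What is going on? This is a link.
DirectInnerText returns only the text directly inside the <p>, so the trailing "." ends up before the link text, which is emitted as a separate item. Your current structure cannot handle that.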
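One possible way around this, as a minimal sketch rather than a definitive fix: select only the block-level elements ("h1" and "p") and call InnerText() on them instead of DirectInnerText(), so the link text stays in place inside its paragraph. This assumes the parser keeps the space before the <a> tag, and that trimming each piece and joining with a single space is the separator behavior you want. The `sample` binding is just an illustrative name for the test input.

```fsharp
// F# script sketch; run with `dotnet fsi`. Assumes the FSharp.Data NuGet package.
#r "nuget: FSharp.Data"

open FSharp.Data

/// Collects the text of every <h1> and <p> element, including text nested
/// in inline children such as <a>, and joins the pieces with single spaces.
let getContentsFromHtmlDocument (doc: HtmlDocument) : string =
    doc.Descendants ["h1"; "p"]
    // InnerText() concatenates the node's own text with that of its descendants.
    |> Seq.map (fun node -> node.InnerText().Trim())
    // Skip elements that contain no text at all.
    |> Seq.filter (fun text -> text <> "")
    |> String.concat " "

// Illustrative input: the HTML from the question.
let sample = """
<div>
  <h1>Some header.</h1>
  <ul>
    <li>
      <p>Hello world!</p>
    </li>
    <li>
      <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
    </li>
  </ul>
</div>
"""

HtmlDocument.Parse sample
|> getContentsFromHtmlDocument
|> printfn "%s"
```

"a" is left out of the selector on purpose: the link text is already part of its parent paragraph's InnerText(), so selecting it again would duplicate it.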