Filter inner element from a tree via lens
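I'll readily admit that I'm not good with lenses, but isn't learning by example a good thing? I want to take some HTML, parse it with taggy-lens, and then remove all script elements from it. Here is my attempt: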
#!/usr/bin/env stack
-- stack --resolver lts-7.1 --install-ghc runghc --package text --package lens --package taggy-lens --package string-class --package classy-prelude
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}

import ClassyPrelude
import Control.Lens hiding (children, element)
import Data.String.Class (toText, fromText, toString)
import Data.Text (Text)
import Text.Taggy.Lens
import qualified Text.Taggy.Lens as Taggy
import qualified Text.Taggy.Renderer as Renderer

somehtmlSmall :: Text
somehtmlSmall =
    "<!doctype html><html><body>\
    \<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div>\
    \</body></html>"

renderWithoutScriptTag :: Text
renderWithoutScriptTag =
    let mArticle :: Maybe Taggy.Element
        mArticle =
            (fromText somehtmlSmall) ^? html .
            allAttributed (ix "id" . only "article")
        mArticleFiltered =
            fmap
                (\el ->
                      el ^.. to universe . traverse .
                      filtered (\n -> n ^. name /= "script"))
                mArticle
    in maybe "" (toText . concatMap Renderer.render) mArticleFiltered

main :: IO ()
main = print renderWithoutScriptTag
Mark this file as executable and run it, and you will see:
➜ tmp ./scraping-question.hs
"<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div><div>first</div><div>second</div><div>third</div>"
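So, that didn't work. I would like to:
- have a working solution
- understand the working solution
If you could help me understand where my problem is, I would appreciate it. Thanks!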
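The root of the problem is universe, which flattens the DOM tree into a list. If you look at the output again, you will see that the filtering itself works, but the tree structure is lost: you get the unmodified article element (with all of its children, including the script, still inside it), followed by each of its descendant elements minus the script element.
One Control.Lens.Plated combinator that can do what you want is transform, which transforms "every element in the tree, in a bottom-up manner":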
transform :: Plated a => (a -> a) -> a -> a
In particular, you can use it to filter the child nodes recursively:
renderWithoutScriptTag :: Text
renderWithoutScriptTag =
    let mArticle :: Maybe Taggy.Element
        mArticle =
            (fromText somehtmlSmall) ^? html .
            allAttributed (ix "id" . only "article")
        mArticleFiltered =
            fmap
                (transform (children %~ filter (\n ->
                    n ^? element . name /= Just "script")))
                mArticle
    in maybe "" (toText . Renderer.render) mArticleFiltered
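
To see the difference between universe and transform without the HTML machinery, here is a minimal sketch on a toy tree. It assumes only the lens and containers packages and uses Data.Tree (for which lens provides a Plated instance) as a stand-in for taggy's element type. The names article, flattened, pruned and dropScripts are made up for this illustration and are not part of taggy-lens.

```haskell
import Control.Lens (transform, universe)
import Data.Tree (Tree (..))

-- Toy stand-in for the article element: labels play the role of tag names.
article :: Tree String
article =
  Node "article"
    [ Node "div" []
    , Node "script" []
    , Node "div" [Node "script" []]
    ]

-- universe yields the node itself plus every descendant as a flat list.
-- Filtering that list drops the "script" entries from the list, but the
-- first entry (the whole article) still contains its script children.
flattened :: [Tree String]
flattened = filter (\n -> rootLabel n /= "script") (universe article)

-- transform rebuilds the tree bottom-up, applying the function at every
-- node, so "script" children are removed at every depth and the tree
-- structure is kept.
pruned :: Tree String
pruned = transform dropScripts article
  where
    dropScripts (Node l cs) =
      Node l (filter (\c -> rootLabel c /= "script") cs)

main :: IO ()
main = do
  print (map rootLabel flattened)
  print pruned
```

The filtering predicate is the same in both cases; what differs is that transform puts the filtered children back into their parent, which is the behaviour the children %~ filter ... version above relies on.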