Using JSoup to parse a String with Clojure
I am using JSoup from Clojure to parse an HTML string. The source is as follows.
Dependencies
:dependencies [[org.clojure/clojure "1.10.1"]
               [org.jsoup/jsoup "1.13.1"]]
Source code
(require '[clojure.string :as str])
;; Jsoup has to be imported before Jsoup/parse can be resolved
(import '(org.jsoup Jsoup))
(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))
(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph paragraphs}))
(fetch_html HTML)
Expected result
{:title "Website title",
:paragraph ["Sample paragraph number 1"
"Sample paragraph number 2"]}
Unfortunately, the result is not what I expected
user ==> (fetch_html HTML)
{:title "Website title", :paragraph []}
I have a Clojure wrapper for TagSoup that might be useful. Try running it in this template project. To use it in your project, add the line
[tupelo "21.01.05"]
to the :dependencies in project.clj.
Code example:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [tupelo.parse.tagsoup :as tagsoup]))
(dotest
  (let [html "<html>
<head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"]
    (is= (tagsoup/parse html)
      {:tag :html,
       :attrs {},
       :content [{:tag :head,
                  :attrs {},
                  :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
                 {:tag :body,
                  :attrs {},
                  :content [{:tag :p, :attrs {}, :content ["Sample paragraph number 1 "]}
                            {:tag :p, :attrs {}, :content ["Sample paragraph number 2"]}]}]})))
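The tree above is a plain nested map, so getting from it to the paragraph texts the question asks for needs only core Clojure. The following is a minimal sketch, not part of the wrapper library: the namespace, the `parsed` var (a literal copy of the tree shown above) and the helper `texts-for-tag` are hypothetical names introduced for illustration.

```clojure
(ns enlive-sketch
  (:require [clojure.string :as str]))

;; Literal copy of the tree shown in the test above, so that this
;; sketch runs without the tupelo dependency.
(def parsed
  {:tag :html, :attrs {},
   :content [{:tag :head, :attrs {},
              :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
             {:tag :body, :attrs {},
              :content [{:tag :p, :attrs {}, :content ["Sample paragraph number 1 "]}
                        {:tag :p, :attrs {}, :content ["Sample paragraph number 2"]}]}]})

;; Hypothetical helper: walk every node depth-first, keep the element
;; maps whose :tag matches, and join and trim their direct string children.
(defn texts-for-tag [tree tag]
  (->> (tree-seq map? :content tree)
       (filter #(and (map? %) (= tag (:tag %))))
       (mapv #(str/trim (apply str (filter string? (:content %)))))))

(texts-for-tag parsed :p)
```

Only direct string children are collected here; text inside nested elements such as `<b>` within a `<p>` would need a recursive variant.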
Details
If you look at the source code, you will easily see why using a wrapper function makes sense!
(ns tupelo.parse.tagsoup
  (:use tupelo.core)
  (:require
    [schema.core :as s]
    [tupelo.parse.xml :as xml]
    [tupelo.string :as ts]
    [tupelo.schema :as tsk]))
(s/defn ^:private tagsoup-parse-fn
  [input-source :- org.xml.sax.InputSource
   content-handler]
  (doto (org.ccil.cowan.tagsoup.Parser.)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/default-attributes" false)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/cdata-elements" true)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace" true)
    (.setContentHandler content-handler)
    (.setProperty "http://www.ccil.org/~cowan/tagsoup/properties/auto-detector"
      (proxy [org.ccil.cowan.tagsoup.AutoDetector] []
        (autoDetectingReader [^java.io.InputStream is]
          (java.io.InputStreamReader. is "UTF-8"))))
    (.setProperty "http://xml.org/sax/properties/lexical-handler" content-handler)
    (.parse input-source)))
; #todo make use string input: (ts/string->stream html-str)
(s/defn parse-raw :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/parse-raw-streaming
    (org.xml.sax.InputSource.
      (ts/string->stream html-str))
    tagsoup-parse-fn))
; #todo make use string input: (ts/string->stream html-str)
(s/defn parse :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/enlive-remove-whitespace
    (xml/enlive-normalize
      (parse-raw
        html-str))))
(.getElementsByTag ...) returns a sequence of Element objects; you need to call the .text() method on each element to get its text value. I am using Jsoup version 1.13.1.
(ns core
  (:import (org.jsoup Jsoup))
  (:require [clojure.string :as str]))
(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))
(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph (mapv #(.text %) paragraphs)}))
(fetch_html HTML)
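A related variation, offered only as a sketch: JSoup also accepts CSS selectors through `.select`, and the resulting `Elements` collection has an `.eachText` method that collects the text of each matched element that has any. The namespace and the helper name `select-texts` below are hypothetical, and the code assumes the same org.jsoup/jsoup 1.13.1 dependency on the classpath.

```clojure
(ns jsoup-sketch
  (:import (org.jsoup Jsoup)))

;; Hypothetical helper: parse the HTML string, run a CSS query against it,
;; and copy the resulting java.util.List of strings into a Clojure vector.
(defn select-texts [^String html ^String css-query]
  (let [doc (Jsoup/parse html)]
    (vec (.eachText (.select doc css-query)))))

(select-texts
  "<html><body><p>Sample paragraph number 1 </p><p>Sample paragraph number 2</p></body></html>"
  "p")
```

A selector such as "body > p" or "p.intro" can narrow the match in ways that `.getElementsByTag` cannot, which may matter on real pages.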
You could also consider Reaver, a Clojure library that wraps JSoup, or any of the other wrappers suggested here.