Turning an HTML structure into a Clojure structure

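I have an HTML page containing a structure that I would like to turn into a Clojure data structure. I'm hitting a mental block on how to approach this in an idiomatic way: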


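<div class=“group”>
  <h2>title1</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>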


The structure I would like to end up with:

["Title1" "subhead1" "path1"]
["Title1" "subhead2" "path2"]
["Title2" "subhead3" "path3"]
["Title3" "subhead4" "path4"]
["Title3" "subhead5" "path5"]
["Title3" "subhead6" "path6"]


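I have read David Nolen's enlive tutorial. It offers a good solution when there is parity between groups and subgroups, but in this case the number of subgroups per group can be arbitrary.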



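(require '[hickory.core :as html])

;; Build a predicate that matches hickory elements with the given tag and class.
;; The set literal is the predicate: it returns the triple on a match, nil otherwise.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

;; The curly quotes are part of the class values because the source HTML uses them.
(def group? (classifier :div "“group”"))
(def subgroup? (classifier :div "“subgroup”"))
(def path? (classifier :a nil))
(defn identifier? [tag] (classifier tag nil))

;; Return the single element of x; the precondition fails unless x has exactly one element.
(defn only [x]
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

;; Text content of the unique child of `element` that has the given tag.
(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

;; One [title subheading href] row per link, for any number of subgroups per group.
(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))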


(require '[clojure.pprint :as pprint])

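(def data
"<div class=“group”>
  <h2>title1</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>")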

(pprint/pprint (process data))
;; (["title1" "subheading1" "“path1”"]
;;  ["title1" "subheading2" "“path2”"]
;;  ["title2" "subheading3" "“path3”"])


  • Parsing: parse it with a Clojure HTML parser or any other parser.
  • Custom data structure: transform the parsed HTML, using clojure.walk if needed.
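
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: `parsed` is hand-written, hypothetical data in the {:tag :attrs :content} node shape that parsers such as enlive and hickory produce, so no parser is actually called, and the helper names (`strip-blank-text`, `child-text`, `has-class?`, `rows`) are made up for this example. The class values use plain `group` and `subgroup`; with the curly quotes from the question's HTML, the quotes would be part of the attribute values.

```clojure
(require '[clojure.string :as str]
         '[clojure.walk :as walk])

;; Hypothetical result of step 1 (parsing). A whitespace text node is included
;; on purpose, since real parsers emit them between tags.
(def parsed
  [{:tag :div :attrs {:class "group"}
    :content ["\n"
              {:tag :h2 :attrs nil :content ["title1"]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :p :attrs nil :content ["unused"]}
                         {:tag :h3 :attrs nil :content ["subheading1"]}
                         {:tag :a :attrs {:href "path1"} :content nil}]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :h3 :attrs nil :content ["subheading2"]}
                         {:tag :a :attrs {:href "path2"} :content nil}]}]}
   {:tag :div :attrs {:class "group"}
    :content [{:tag :h2 :attrs nil :content ["title2"]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :h3 :attrs nil :content ["subheading3"]}
                         {:tag :a :attrs {:href "path3"} :content nil}]}]}])

(defn strip-blank-text
  "Step 2a: walk the tree and drop whitespace-only text nodes from every :content."
  [tree]
  (walk/postwalk
    (fn [node]
      (if (and (map? node) (:content node))
        (update node :content
                (fn [children]
                  (vec (remove #(and (string? %) (str/blank? %)) children))))
        node))
    tree))

(defn child-text
  "Text of the first child of `node` with the given tag, or nil if there is none."
  [node tag]
  (->> (:content node)
       (filter #(= tag (:tag %)))
       first :content first))

(defn has-class? [node klass]
  (= klass (get-in node [:attrs :class])))

(defn rows
  "Step 2b: flatten groups and subgroups into [title subheading path] vectors.
  Works for any number of subgroups per group."
  [tree]
  (vec
    (for [group    tree
          :when    (has-class? group "group")
          subgroup (:content group)
          :when    (has-class? subgroup "subgroup")
          link     (:content subgroup)
          :when    (= :a (:tag link))]
      [(child-text group :h2)
       (child-text subgroup :h3)
       (get-in link [:attrs :href])])))

(rows (strip-blank-text parsed))
;; => [["title1" "subheading1" "path1"]
;;     ["title1" "subheading2" "path2"]
;;     ["title2" "subheading3" "path3"]]
```

Keeping the whitespace cleanup separate from the flattening means the `for` comprehension only has to describe the group, subgroup and link nesting.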

You can solve this problem with the tupelo.forest library. Here is an annotated unit test showing the approach. You can find more information in the API docs and in both the unit tests and the example demos. Additional documentation is forthcoming.

  (with-forest (new-forest)
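    (let [html-str        "<div class=“group”>
                              <h2>title1</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading1</h3>
                                <a href=“path1” />
                              </div>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading2</h3>
                                <a href=“path2” />
                              </div>
                            </div>
                            <div class=“group”>
                              <h2>title2</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading3</h3>
                                <a href=“path3” />
                              </div>
                            </div>"

          ; The source truncates this form; the completion below is an assumption:
          ; parse with enlive, requiring [net.cgrand.enlive-html :as en-html]
          enlive-tree     (->> html-str
                            java.io.StringReader.
                            en-html/html-resource
                            first)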
          root-hid        (add-tree-enlive enlive-tree)
          tree-1          (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; just done to keep things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? [hid] ; whitespace pred fn
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >>              (apply remove-hid blank-leaf-hids) ; delete whitespace nodes found
          tree-2          (hid->hiccup root-hid)
          >>              (is= tree-2 [:html
                                       [:body
                                        [:div {:class "“group”"}
                                         [:h2 "title1"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading1"]
                                          [:a {:href "“path1”"}]]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading2"]
                                          [:a {:href "“path2”"}]]]
                                        [:div {:class "“group”"}
                                         [:h2 "title2"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading3"]
                                          [:a {:href "“path3”"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths    (find-paths root-hid [:** :div :h2])
          >>              (is= (format-paths div-h2-paths)
                            [[{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                             [{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. "group"); the next-to-last (-2) hid in each vector
          div-hids        (mapv #(idx % -2) div-h2-paths)
          ; for each of div-hids, find and collect nested :h3 values
          dif-h3-paths    (vec
                            (lazy-gen ; yield only works inside a lazy-gen block
                              (doseq [div-hid div-hids]
                                (let [h2-value  (find-leaf-value div-hid [:div :h2])
                                      h3-paths  (find-paths div-hid [:** :h3])
                                      h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                                  (doseq [h3-value h3-values]
                                    (yield [h2-value h3-value]))))))]
      (is= dif-h3-paths
        [["title1" "subheading1"]
         ["title1" "subheading2"]
         ["title2" "subheading3"]])))
