How can I extract text from an HTML element containing a mix of `p` tags and inner text?

Question

I'm scraping a website with poorly structured HTML using Reaver, a Clojure wrapper around jsoup. Here is an example of the HTML structure:

<div id="article">
  <aside>unwanted text</aside>
  <p>Some text</p>
  <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
  <p>More text</p>
  <h2>A headline</h2>
  <figure><figcaption>unwanted text</figcaption></figure>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <dl>
    <dd>unwanted text</dd>
  </dl>
  <p>Etc etc</p>
</div>

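This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are inside p tags and some are contained directly in the div. I also need the headlines and the anchor tag text.

I know how to parse and extract the text from all the p, a, and h tags, and I can select the div and extract its inner text, but the problem is that I end up with two selections of text that I need to merge somehow.

How can I extract the text from this div so that all of the text from the p, a, and h tags, as well as the text directly inside the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.

Here is what I am currently using to extract, but the inner div text is missing from the results: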

(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))

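Also note that other, unwanted elements appear in this div, e.g. aside, figure, etc. These elements contain text, as well as nested elements with text, none of which should be included in the result.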

Answer 1

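You can extract the entire article as a jsoup object (probably an Element) and then convert it to an EDN representation using reaver/to-edn. You then walk its :content and handle both the strings (which result from TextNodes) and the elements whose :tag you are interested in.

(Code written by vaer-k)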

(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

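;; A child counts as text if it is a raw string (from a TextNode)
;; or an element whose tag is in the set below.
;; Heading tags such as :h2 are not in this set; add them if headings should be kept.
(defn text-elem?
  [element]
  (or (string? element)
      (contains? #{:p :a :b :i} (:tag element))))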

(defn extract-text
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))

(defn extract-article [url]
  (-> url
      get-article
      extract-text))
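
To make the idea concrete without a network call, here is a minimal sketch that runs the same kind of walk over a hand-written map. The `{:tag ... :content [...]}` shape is an assumption based on the code above, and the names `sample-article`, `wanted-tags`, `node-text`, and `ordered-texts` are hypothetical. Unlike `extract-text`, it returns one string per top-level child rather than a single concatenated string, and it also keeps headings.

```clojure
(require '[clojure.string :as str])

;; Hand-written stand-in for the EDN of the article div (assumed shape).
(def sample-article
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :h2 :content ["A headline"]}
             "Here is raw text directly in the div"
             {:tag :p :content ["Another paragraph"]}
             "More raw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside"
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}]})

;; Tags whose text should be kept (hypothetical choice for this sketch).
(def wanted-tags #{:p :a :h1 :h2 :h3 :h4 :h5 :h6})

(defn node-text
  "Concatenates every string found beneath a node, depth first."
  [node]
  (if (string? node)
    node
    (apply str (map node-text (:content node)))))

(defn ordered-texts
  "Returns the text of raw strings and wanted elements among the
  direct children of the article, in document order."
  [{:keys [content]}]
  (->> content
       (filter #(or (string? %) (contains? wanted-tags (:tag %))))
       (map node-text)
       (map str/trim)
       (remove str/blank?)
       vec))

(ordered-texts sample-article)
;; => ["Some text" "A headline" "Here is raw text directly in the div"
;;     "Another paragraph" "More raw text with an" "anchor tag" "inside"]
```

Note that raw text interrupted by an anchor comes back as three separate pieces, so a further merging step would be needed if each visual paragraph should be a single string.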

Answer 2

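You can solve this problem using the tupelo.forest library, which was presented in an "Unsession" at Clojure/Conj 2019 last week.

Below is the solution written as a unit test. First, some declarations and sample data: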

(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))

(def html-src
  "<div id=\"article\">
    <aside>unwanted text</aside>
    <p>Some text</p>
    <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
    <p>More text</p>
    <h2>A headline</h2>
    <figure><figcaption>unwanted text</figcaption></figure>
    <p>More text</p>
    Here is a paragraph made of some raw text directly in the div
    <p>Another paragraph of text</p>
    More raw text and this one has an <a>anchor tag</a> inside
    <dl>
    <dd>unwanted text</dd>
    </dl>
    <p>Etc etc</p>
  </div> ")

To start, we add the HTML data (a tree) to the forest after removing all newlines, etc. Internally, this uses the Java "TagSoup" parser:

(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid            (add-tree-html
                                (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or
                                      (= tag :aside)
                                      (= tag :nav)
                                      (= tag :figure)
                                      (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))

spyx-pretty shows the data in "bush" format (similar to Hiccup format):

:html-orig (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :aside, :value "unwanted text"}]
   [{:tag :p, :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
   [{:tag :p, :value "Etc etc"}]]]]

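So we can see that the data has been loaded correctly. Now, we want to remove all of the unwanted nodes identified by find-paths-with. Then, we print the modified tree: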

      (doseq [path unwanted-node-paths]
        (remove-path-subtree path))
      (newline) (spyx-pretty :html-cleaned (hid->bush root-hid))

:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :p, :value "Some text"}]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :p, :value "Etc etc"}]]]]
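
The pruning step can also be illustrated without the library. The following is a minimal sketch in plain Clojure, not tupelo.forest's API: it assumes a bush-shaped vector of the form `[attrs-map child ...]` as printed above, and the names `unwanted-tags`, `prune-bush`, and `sample-bush` are hypothetical.

```clojure
;; Tags whose whole subtree should be dropped.
(def unwanted-tags #{:aside :nav :figure :dl})

(defn prune-bush
  "Returns the bush without any subtree whose root tag is in `unwanted`.
  A bush node is a vector: attribute map first, then child nodes."
  [unwanted [attrs & children]]
  (into [attrs]
        (comp (remove #(contains? unwanted (:tag (first %))))
              (map #(prune-bush unwanted %)))
        children))

;; A shortened, hand-written copy of the article bush.
(def sample-bush
  [{:tag :div :id "article"}
   [{:tag :aside :value "unwanted text"}]
   [{:tag :p :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2 :value "unwanted text"}]]]]
   [{:tag :h2 :value "A headline"}]])

(prune-bush unwanted-tags sample-bush)
;; => [{:tag :div, :id "article"}
;;     [{:tag :p, :value "Some text"}]
;;     [{:tag :h2, :value "A headline"}]]
```

Because the whole subtree is dropped at the unwanted root, the h2 nested inside nav disappears while the top-level h2 survives.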

At this point, we simply walk the tree and accumulate any surviving text nodes into a vector:

      (let [txt-accum (atom [])]
        (walk-tree root-hid
          {:enter (fn [path]
                    (let [hid   (last path)
                          node  (hid->node hid)
                          value (:value node)] ; may not be present
                      (when (string? value)
                        (swap! txt-accum append value))))})

To verify, we compare the text nodes we found (ignoring whitespace) with the desired result:

        (is-nonblank=  (str/join \space @txt-accum)
          "Some text
           More text
           A headline
           More text
           Here is a paragraph made of some raw text directly in the div
           Another paragraph of text
           More raw text and this one has an
           anchor tag
            inside
           Etc etc")))))
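
The accumulation step has a library-free counterpart as well. This is a sketch under the same assumption about the bush shape (`[attrs-map child ...]`); it uses `tree-seq` from clojure.core rather than tupelo.forest's walk-tree, and the names `cleaned-bush` and `bush-texts` are hypothetical.

```clojure
(require '[clojure.string :as str])

;; A shortened, hand-written copy of the cleaned bush.
;; :tupelo.forest/raw is used here only as a plain namespaced keyword.
(def cleaned-bush
  [{:tag :div :id "article"}
   [{:tag :p :value "Some text"}]
   [{:tag :h2 :value "A headline"}]
   [{:tag :tupelo.forest/raw :value " Raw text directly in the div "}]
   [{:tag :tupelo.forest/raw :value " More raw text and this one has an "}]
   [{:tag :a :value "anchor tag"}]
   [{:tag :tupelo.forest/raw :value " inside "}]])

(defn bush-texts
  "Collects the trimmed :value strings of a bush in document order.
  tree-seq visits nodes depth first, parents before children."
  [bush]
  (->> (tree-seq vector? rest bush)
       (keep (comp :value first))
       (mapv str/trim)))

(bush-texts cleaned-bush)
;; => ["Some text" "A headline" "Raw text directly in the div"
;;     "More raw text and this one has an" "anchor tag" "inside"]
```

This avoids the atom used in the unit test because the traversal is already a lazy sequence in document order.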

For more details, please see the README file and the API docs. Be sure to also view the Lightning Talk for an overview.
