Parse HTML in Enlive like in BeautifulSoup
I'm trying to use Enlive to get the links out of some HTML in Clojure. Can I get a list of all the links on a page? Can I iterate over them?
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link2">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
links = soup.find_all('a')
or
links = soup('a')
How can I do this in Clojure with Enlive?
That is quite simple:
(require '[net.cgrand.enlive-html :as enlive])
(let [data (enlive/html-resource (java.net.URL. "https://www.whosebug.com"))
      all-refs (enlive/select data [:a])]
  (first all-refs))
;;=> {:tag :a, :attrs {:href "https://whosebug.com", :class "-logo js-gps-track", :data-gps-track "top_nav.click({is_current:true, location:1, destination:8})"}, :content ("\n " {:tag :span, :attrs {:class "-img"}, :content ("Stack Overflow")} "\n ")}
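The all-refs collection contains every link on the page in Enlive's node representation: a map with :tag, :attrs and :content keys. For example, the following collects the href value of every link:
(let [data (enlive/html-resource (java.net.URL. "https://www.whosebug.com"))
      all-refs (enlive/select data [:a])]
  (map #(-> % :attrs :href) all-refs))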
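To try the same idea without making an HTTP request, here is a minimal sketch that parses an inline string instead. The sample-html string is a hypothetical fragment adapted from the markup in the question, and the names nodes and hrefs are illustrative only. It also shows one way to skip anchors that have no href attribute, which would otherwise come back as nil.

```clojure
(require '[net.cgrand.enlive-html :as enlive])

;; Hypothetical sample markup, adapted from the question's BeautifulSoup output.
;; The last anchor deliberately has no href attribute.
(def sample-html
  (str "<p class=\"story\">Once upon a time there were three little sisters: "
       "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>, "
       "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a> and "
       "<a name=\"no-href\">an anchor without href</a>.</p>"))

;; html-snippet parses a string into Enlive nodes, so no network access is needed.
(def nodes (enlive/html-snippet sample-html))

;; [:a] would match every <a>; wrapping it with (attr? :href) keeps only
;; the anchors that actually carry an href attribute.
(def hrefs
  (map #(get-in % [:attrs :href])
       (enlive/select nodes [[:a (enlive/attr? :href)]])))
```

Because hrefs is an ordinary Clojure sequence, it can be iterated with doseq, map, filter and so on, which covers the "can I iterate over them" part of the question.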
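First, you need to ingest some HTML using Enlive's html-resource function. We'll grab news.google.com:
;; the html/ alias used below assumes this require
(require '[net.cgrand.enlive-html :as html])
(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))
(def goog-news (fetch-url "https://news.google.com"))
To get all the <a> tags, use the select function with a simple selector (the second argument):
(html/select goog-news [:a])
This evaluates to a sequence of maps, one per <a> tag. Here is a sample <a> tag map from today's news: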
{:tag :a,
:attrs {:class "nuEeue hzdq5d ME7ew",
:target "_blank",
:href "https://www.vanityfair.com/hollywood/2018/01/first-black-panther-reviews",
:jsname "NV4Anc"},
:content ("The First Black Panther Reviews Are Here—and They're Ecstatic")}
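To get the inner text of each <a>, you can map Enlive's text function over the result, e.g. (map html/text *1), where *1 is the REPL's most recent result. To get each href, you can use (map (comp :href :attrs) *1).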
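Since *1 is only bound at the REPL, code in a source file needs to pass the selected nodes around explicitly. Below is a rough sketch of how the text and href could be collected together; link-summaries and page are hypothetical names, and the inline markup is made up for illustration.

```clojure
(require '[net.cgrand.enlive-html :as html])

(defn link-summaries
  "Returns a vector of {:text ... :href ...} maps, one per <a> found in nodes."
  [nodes]
  (mapv (fn [a]
          {:text (html/text a)                 ; concatenated text of the node and its children
           :href (get-in a [:attrs :href])})   ; nil when the anchor has no href
        (html/select nodes [:a])))

;; Hypothetical inline markup read through a Reader; html-resource also accepts
;; a java.net.URL, as in the fetch-url function from the answer.
(def page
  (html/html-resource
   (java.io.StringReader.
    (str "<html><body>"
         "<a href=\"https://example.com/a\">First <b>link</b></a>"
         "<a href=\"https://example.com/b\">Second</a>"
         "</body></html>"))))

(link-summaries page)
```

Returning plain maps keeps the result easy to filter or print, and mapv makes it eager, so nothing depends on the page still being open when the links are consumed.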