How to use the Common Lisp libraries dex, plump, and clss to extract the title of a web page?

Question

I am using Emacs, Slime, and SBCL to develop in Common Lisp on a desktop machine running NixOS.

I am also using the libraries dex, plump, and clss to extract the title of a web page. So I did:

CL-USER> (clss:select "title" (plump:parse  (dex:get "http://www.pdelfino.com.br")))
#(#<PLUMP-DOM:ELEMENT title {1009C488E3}>)

I expected: "Pedro Delfino".

Instead, I got an object:

#(#<PLUMP-DOM:ELEMENT title {1009C488E3}>)

Describing the object does not help me find the value I want:

CL-USER> (clss:select "title" (plump:parse  (dex:get "http://www.pdelfino.com.br")))
#(#<PLUMP-DOM:ELEMENT title {100A9888E3}>)
CL-USER> (describe *)
#(#<PLUMP-DOM:ELEMENT title {100A9888E3}>)
  [vector]

Element-type: T
Fill-pointer: 1
Size: 10
Adjustable: yes
Displaced: no
Storage vector: #<(SIMPLE-VECTOR 10) {100A9B65BF}>
; No value
CL-USER>

Where is the value I need?

Thanks.

Answer 1

The title text is in the element's child text node. The following expression returns the text in this example:

(plump:text (plump:first-child (aref (clss:select "title" (plump:parse (dex:get "http://www.pdelfino.com.br"))) 0)))
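
To make the node structure easier to see without a network request, here is a minimal sketch that parses an inline HTML string instead of calling dex:get. It assumes Quicklisp is available to load plump and clss; the names *demo-html* and title-text-node are made up for illustration.

```lisp
;; Assumption: Quicklisp is installed and can load plump and clss.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload '(:plump :clss) :silent t))

;; Hypothetical sample page, standing in for the string returned by dex:get.
(defparameter *demo-html*
  "<html><head><title>Pedro Delfino</title></head><body></body></html>")

(defun title-text-node (html)
  "Return the first child of the first <title> element found in HTML."
  ;; clss:select returns a vector of matching elements, so take element 0.
  (let ((title (aref (clss:select "title" (plump:parse html)) 0)))
    (plump:first-child title)))

;; The child is a text node; plump:text reads its string content.
(let ((node (title-text-node *demo-html*)))
  (format t "~a: ~s~%" (type-of node) (plump:text node)))
```

The element itself holds no string; the string lives in the text node underneath it, which is why describe on the result vector shows nothing useful.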

Answer 2

You can use plump:text to ask plump for the text inside an HTML node. It takes a single node, not an array (which is what clss:select returns), so you have to use aref to get the first element.

(plump:text (aref  
   (clss:select "title" (plump:parse  
     (dex:get "http://www.pdelfino.com.br"))) 
   0))

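plump:serialize outputs the node's HTML content (useful for checking results). By default it prints to standard output; passing NIL as the stream argument returns a string instead.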
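
As a rough sketch, the two calls can be wrapped in small helpers that also guard against a page with no title element. The helper names and the sample string are hypothetical, the example assumes Quicklisp can load plump and clss, and it parses an inline string so that it runs without network access.

```lisp
;; Assumption: Quicklisp is installed and can load plump and clss.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload '(:plump :clss) :silent t))

;; Hypothetical sample page, standing in for the string returned by dex:get.
(defparameter *sample-html*
  "<html><head><title>Pedro Delfino</title></head><body><p>Hi</p></body></html>")

(defun first-title-element (html)
  "Return the first <title> element in HTML, or NIL when there is none."
  (let ((matches (clss:select "title" (plump:parse html))))
    (when (plusp (length matches))
      (aref matches 0))))

(defun page-title (html)
  "Return the title text of HTML as a string, or NIL when there is no title."
  (let ((element (first-title-element html)))
    (when element
      (plump:text element))))

(defun page-title-markup (html)
  "Return the <title> element of HTML serialized to a string, or NIL."
  (let ((element (first-title-element html)))
    (when element
      ;; A NIL stream makes plump:serialize return a string instead of printing.
      (plump:serialize element nil))))

(page-title *sample-html*)
(page-title-markup *sample-html*)
```

For a live page, the same helpers should work on the string returned by (dex:get "http://www.pdelfino.com.br").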

You can also use CLSS and Plump together through lQuery: https://shinmera.github.io/lquery/. We need to parse the HTML with initialize, and then we use $, as in (lquery:$ <document> "selector"). We can add (text) or (serialize) as the last argument.

(defparameter *PDELFINO-PARSED* (lquery:$ (initialize (dex:get "http://www.pdelfino.com.br"))))

(lquery:$ *PDELFINO-PARSED* "title")
#(#<PLUMP-DOM:ELEMENT title {1008645923}>)

CIEL-USER> (lquery:$ *PDELFINO-PARSED* "title" (text))
#("Pedro Delfino")

CIEL-USER> (aref * 0)
"Pedro Delfino"

CIEL-USER> (lquery:$ *PDELFINO-PARSED* "title" (serialize))
#("<title>Pedro Delfino</title>")
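
The lQuery steps can probably be folded into one chained call. This is only a sketch: it assumes Quicklisp can load lquery, that initialize accepts an HTML string (as it does with the dex:get result above), and the function name title-via-lquery is invented for the example.

```lisp
;; Assumption: Quicklisp is installed and can load lquery.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :lquery :silent t))

(defun title-via-lquery (html)
  "Return the text of the first <title> in HTML using lQuery, or NIL."
  ;; Chain: parse the string, select the title elements, collect their text.
  ;; Note that initialize also sets lQuery's master document as a side effect.
  (let ((texts (lquery:$ (initialize html) "title" (text))))
    (when (plusp (length texts))
      (aref texts 0))))

(title-via-lquery
 "<html><head><title>Pedro Delfino</title></head><body></body></html>")
```

Like the REPL session above, the (text) step yields a vector of strings, so aref is still needed to get a single string.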

