lxml xpath - Can't find body tag

Question

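I am trying to write a Calibre plugin that checks epub documents for footnotes (basically by looking for a font size below some value). I need to get all the child tags of the html file that contain text (inside the <body> tag), but I have run into a problem.

lxml xpath cannot find <body> or anything inside it.

Below is the html, created by Calibre's own function, with <p>Hello World</p> inserted using etree.SubElement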

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Hero filtered</title>
  <link href="page_styles.css" rel="stylesheet" type="text/css"/>
  <link href="stylesheet.css" rel="stylesheet" type="text/css"/>
</head>

<body>

<p>Hello World</p></body>
</html>

These are the things I have tried

query = ".//body" # This doesn't work
query = "body" # This doesn't work
query = ".//*/body" # This doesn't work
query = ".//*//body" # This doesn't work
query = "./body" # This doesn't work
query = ".//body/*" # This doesn't work
query = ".//body/p" # This doesn't work

While these do work

query = "/*/*[2]/*[normalize-space(text())]" # this works
found= self.footnotes_file.find("{*}" + "body") # this works

I have been using the following lxml function

found = self.footnotes_file.xpath(query)

where self.footnotes_file is produced by the Calibre function parsed(self, name), which returns the root element of the html file passed to it

self.footnotes_file = current_container().parsed(footnote_file_name)

So the question is: what am I doing wrong?

Answer 1

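You seem to have run into a namespace problem. The xmlns declaration puts every element in the XHTML namespace, so an unprefixed name such as body in an XPath query matches nothing. There are several ways to handle it. Two simple ones: the first is to remove the namespace declaration from

<html xmlns="http://www.w3.org/1999/xhtml">

so that the tag is just <html>.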
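If editing the file itself is not practical, roughly the same effect can be had on the parsed tree. The following is only a sketch, assuming a standalone copy of the document from the question (shortened here) rather than the element Calibre returns; it rewrites each tag to its local name and then runs one of the queries that failed before.

```python
from lxml import etree

# Shortened copy of the document from the question (stylesheet links omitted).
doc = b"""<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hero filtered</title></head>
<body><p>Hello World</p></body>
</html>"""

root = etree.fromstring(doc)

for el in root.iter():
    # Comments and processing instructions have a non-string .tag; skip them.
    if isinstance(el.tag, str):
        # Replace "{namespace}name" with the bare local name.
        el.tag = etree.QName(el).localname

# Drop namespace declarations that are no longer used by any element.
etree.cleanup_namespaces(root)

bodies = root.xpath(".//body")
paragraphs = root.xpath(".//body/p")
```

Note that this changes the tree in place. Inside a plugin that would alter the book's markup if the file were written back, so it is probably only suitable on a throwaway copy.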

The other way is to change your query to

//*[local-name()="body"]

See if either of these works.
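To make the difference concrete, here is a small sketch run against a shortened copy of the document from the question. It compares an unprefixed query, the local-name() query, and a third option not mentioned above: passing a prefix map through the namespaces argument of xpath(). The prefix "h" is an arbitrary choice for illustration.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Shortened copy of the document from the question (stylesheet links omitted).
doc = b"""<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hero filtered</title></head>
<body><p>Hello World</p></body>
</html>"""

root = etree.fromstring(doc)

# Unprefixed names in XPath 1.0 only match elements in no namespace,
# so this finds nothing in a namespaced XHTML document.
plain = root.xpath(".//body")

# local-name() compares the tag name with the namespace stripped off.
by_local = root.xpath('//*[local-name()="body"]')

# Alternatively, bind a prefix to the XHTML namespace and use it in the query.
# The prefix name "h" is arbitrary; only the URI has to match.
ns = {"h": XHTML_NS}
by_prefix = root.xpath(".//h:body/h:p", namespaces=ns)

# Text of every direct child of <body>, using the prefix map.
texts = [el.text for el in root.xpath(".//h:body/*", namespaces=ns)]
```

The prefix map keeps the queries close to the ones originally tried, at the cost of prefixing every step; local-name() needs no setup but gets verbose for longer paths.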
