XPath expression to capture all nested text under a specific root
I have some HTML from which I want to extract text content using Python + lxml:
<html>
  <body>
    <p>Some text I DON'T want</p>
    <div class="container">
      <p>Some text I DO want</p>
      <span>
        <a href="#">A link I DO want</a>
      </span>
    </div>
  </body>
</html>
A couple of conditions:
I only want text nested under a specific root, div[@class='container']
I want all of the text nested under that root
So:
if __name__ == "__main__":
    import lxml.html
    # HTML is assumed to hold the markup shown above as a string
    doc = lxml.html.fromstring(HTML)
    root = doc.xpath("//div[@class='container']").pop()
    for xpath in ["p|a",
                  "//p|//a"]:
        print("%s -> %s" % (xpath,
                            "; ".join([el.text_content()
                                       for el in root.xpath(xpath)])))
Then:
$ python xpath_test.py
p|a -> Some text I DO want
//p|//a -> Some text I DON'T want; Some text I DO want; A link I DO want
So p|a captures too little (it misses the nested link), while //p|//a captures too much (tags I don't want).
What XPath expression returns only Some text I DO want; A link I DO want?
Use the following XPath (all text-node descendants of the specified div, excluding whitespace-only nodes):
//div[@class="container"]//text()[normalize-space()]
A code snippet:
import lxml.html

data = """
<html>
  <body>
    <p>Some text I DON'T want</p>
    <div class="container">
      <p>Some text I DO want</p>
      <span>
        <a href="#">A link I DO want</a>
      </span>
    </div>
  </body>
</html>
"""

tree = lxml.html.fromstring(data)
# //text() selects every text node below the div; the [normalize-space()]
# predicate drops nodes that contain only whitespace
print(tree.xpath('//div[@class="container"]//text()[normalize-space()]'))
Output:
['Some text I DO want', 'A link I DO want']
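As a follow-up sketch: the question already has the container element in hand as root, so the same idea can be written as a relative path from it. In XPath, a path starting with // is absolute and searches from the document root even when evaluated on a child element, which is why //p|//a picked up the unwanted paragraph; a leading .// keeps the search inside the context node. The variable names below (HTML, root, texts, joined, via_elements) are illustrative and not taken from the answer above.

```python
import lxml.html

HTML = """
<html>
  <body>
    <p>Some text I DON'T want</p>
    <div class="container">
      <p>Some text I DO want</p>
      <span>
        <a href="#">A link I DO want</a>
      </span>
    </div>
  </body>
</html>
"""

doc = lxml.html.fromstring(HTML)
root = doc.xpath("//div[@class='container']")[0]

# ".//" is relative to root, so only text nodes inside the container match;
# [normalize-space()] filters out whitespace-only text nodes.
texts = root.xpath(".//text()[normalize-space()]")
joined = "; ".join(t.strip() for t in texts)
print(joined)

# The element-based style from the question also works once the paths
# are made relative with a leading dot.
via_elements = "; ".join(el.text_content() for el in root.xpath(".//p|.//a"))
print(via_elements)
```

Both print statements should give Some text I DO want; A link I DO want for this markup. The text-node version is the more general of the two, since it does not depend on knowing in advance which tags hold the text.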