How to extract the text and the XPath of each element of an HTML page in Python
I am working on a Django project in which I need to extract every element that contains text, together with the XPath of that element.
For example:
<html>
<head>
<title>
The Demo page
</title>
</head>
<body>
<div>
<section>
<h1> Hello world
</h1>
</section>
<div>
<p>
Hope you all are doing well,
</p>
</div>
<div>
<p>
This is the example HTML
</p>
</div>
</div>
</body>
</html>
The output should look like this:
/head/title: The Demo Page
/body/div/section/h1: Hello world!
/body/div/div[1]/p: Hope you all are doing well,
/body/div/div[2]/p: This is the example HTML
Something like this should work:
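from lxml import etree

# Paste the HTML from the question here; etree.fromstring needs well-formed markup
html = """[your html above]"""
root = etree.fromstring(html)
# Select the parent element of every text node that is not whitespace-only
targets = root.xpath('//text()[normalize-space()]/..')
tree = etree.ElementTree(root)
for target in targets:
    # getpath() returns the absolute XPath of the element;
    # .text is None when the matched text follows a child element
    print(tree.getpath(target), (target.text or '').strip())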
Output:
/html/head/title The Demo page
/html/body/div/section/h1 Hello world
/html/body/div/div[1]/p Hope you all are doing well,
/html/body/div/div[2]/p This is the example HTML
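If the pages are not guaranteed to be well-formed XML, or if the paths should appear without the leading /html as in the question, a variant along these lines may be worth trying. It is only a sketch: the function name text_with_xpaths and the HTML constant are made up for illustration, it swaps etree.fromstring for lxml's lenient HTML parser (etree.HTML), and it only looks at text that sits directly at the start of an element, not text that follows a child element.

```python
from lxml import etree

# The sample document from the question.
HTML = """<html>
<head>
<title>
The Demo page
</title>
</head>
<body>
<div>
<section>
<h1> Hello world
</h1>
</section>
<div>
<p>
Hope you all are doing well,
</p>
</div>
<div>
<p>
This is the example HTML
</p>
</div>
</div>
</body>
</html>"""


def text_with_xpaths(markup):
    """Return (xpath, text) pairs for elements that directly hold text.

    Hypothetical helper, not part of lxml.
    """
    # etree.HTML uses lxml's HTML parser, which tolerates sloppy markup.
    root = etree.HTML(markup)
    tree = etree.ElementTree(root)
    pairs = []
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(element.tag, str):
            continue
        # Only the element's own leading text is checked; tail text is ignored.
        text = (element.text or "").strip()
        if not text:
            continue
        path = tree.getpath(element)
        # Drop the leading "/html" to match the format asked for in the question.
        if path.startswith("/html/"):
            path = path[len("/html"):]
        pairs.append((path, text))
    return pairs


for path, text in text_with_xpaths(HTML):
    print(f"{path}: {text}")
```

In a Django view the same function could be called on the body of a fetched or uploaded page; returning a list of tuples rather than printing keeps it easy to pass the result to a template.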