How to get concatenated child text nodes in lxml
Here is a sample of the HTML:
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;"><a href="http://somepage1.com">First text part </a></p>
<p style="text-align: center;"><a href="http://somepage2.com">Second text part </a></p>
<p style="text-align: center;"><a href="http://somepage3.com">Third text part</a></p>
</div>
</div>
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;"><a href="http://somepage4.com">First text part </a></p>
<p style="text-align: center;"><a href="http://somepage5.com">Second text part</a></p>
</div>
</div>
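With the following code
from lxml import html
tree = html.fromstring(html_sample)  # html_sample holds the HTML above
tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/p/a/text()')
I can get a list of the text values: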
['First text part ', 'Second text part ', 'Third text part', 'First text part ', 'Second text part']
However, I want all the values from each div as a single string, like
['First text part Second text part Third text part', 'First text part Second text part']
The expression
//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/normalize-space()
seems to be the exact XPath to solve the problem, but lxml does not support the /normalize-space() syntax:
lxml.etree.XPathEvalError: Invalid expression
So how can I get the required output in lxml?
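Solved with the following code (lxml implements XPath 1.0, where a function call cannot be used as a path step, so the whitespace is normalized in Python instead):
[" ".join(div.text_content().split()) for div in tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]')]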
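For reference, here is a minimal self-contained sketch of that fix, run against the sample HTML from the question. It also shows an alternative that is not in the original answer: calling the XPath 1.0 function normalize-space(.) on each wrapper element, which lxml does accept because the function is the whole expression rather than a path step. The variable names (html_sample, wrappers, joined, normalized) are illustrative assumptions.

```python
from lxml import html

# Sample markup copied from the question.
html_sample = """
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;"><a href="http://somepage1.com">First text part </a></p>
<p style="text-align: center;"><a href="http://somepage2.com">Second text part </a></p>
<p style="text-align: center;"><a href="http://somepage3.com">Third text part</a></p>
</div>
</div>
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;"><a href="http://somepage4.com">First text part </a></p>
<p style="text-align: center;"><a href="http://somepage5.com">Second text part</a></p>
</div>
</div>
"""

tree = html.fromstring(html_sample)

# Select each wrapper div once, then build one string per div.
wrappers = tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]')

# Approach from the answer: collect all descendant text, then collapse
# runs of whitespace (including newlines between tags) in Python.
joined = [" ".join(div.text_content().split()) for div in wrappers]

# Alternative (assumption, not from the original answer): evaluate
# normalize-space(.) with each div as the context node.
normalized = [str(div.xpath("normalize-space(.)")) for div in wrappers]

print(joined)
print(normalized)
```

Both lists come out the same for this sample. They can differ on other input: str.split() treats any Unicode whitespace (for example a non-breaking space) as a separator, whereas XPath's normalize-space only collapses spaces, tabs, carriage returns, and line feeds.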