Insert whitespace when stripping HTML tags using lxml

Question

I want to insert whitespace into the resulting text when stripping tags and extracting text with lxml.

I don't really know lxml. Based on this answer (which in turn seems to be based on a comment by @bluu on the same page), I have the following:

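import lxml.html  # "import lxml" alone does not load the html submodule

def strip_html(s):
    # text_content() concatenates all text nodes with no separator
    return str(lxml.html.fromstring(s).text_content())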

When I try this:

strip_html("<p>This what you want.</p><p>This what you get.</p>")

I get:

'This what you want.This what you get.'

But I want this:

'This what you want. This what you get.'

What I am really after is something like this:

from bs4 import BeautifulSoup

s = "<p>This what you want.</p><p>This what you get.</p>"

BeautifulSoup(s, "lxml").get_text(separator=" ")

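This does give the desired output, for all tags, but in this case I would like to avoid using BeautifulSoup (great as it is).

I would also like the solution to work for all tags, without having to spell out every tag, loop over the text looking for specific characters, and so on.

I looked at the code in bs4's element.py to try to adapt its separator handling, and found that this is not a simple matter.

I am also looking at lxml.html.clean and this answer.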

Answer 1

You can select all elements that contain text, iterate over them, and join() the results with a separator:

s = "<p>This what you want.</p><p>This what you get.</p>"
' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])

Example

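import lxml.html  # "import lxml" alone does not load the html submodule

def strip_html(s):
    # Select every element that has a text node as a direct child, then join their text with a space
    return ' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])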

strip_html("<p>This what you want.</p><p>This what you get.</p>")

Output

This what you want. This what you get.
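
One caveat with the XPath above: when elements that contain text are nested, the inner text is picked up once for each matching ancestor, so "<p>Some <b>bold</b> text.</p>" comes out as "Some bold text. bold". A possible variation, offered only as a sketch and not as part of the original answer, is to walk the text nodes with itertext(), which yields each text node once. The separator parameter and the stripping of each chunk are choices made for this illustration; the result is not guaranteed to match BeautifulSoup's get_text(separator=" ") on every input.

```python
import lxml.html


def strip_html(s, separator=" "):
    """Return the text of an HTML string, joining text nodes with `separator`.

    Illustrative sketch: `separator` is a name chosen here to mirror
    BeautifulSoup's get_text(separator=...); it is not an lxml option.
    """
    doc = lxml.html.document_fromstring(s)
    # itertext() yields each text node once, in document order,
    # so text inside nested elements is not repeated.
    chunks = (chunk.strip() for chunk in doc.itertext())
    # Drop chunks that were whitespace only, then join the rest.
    return separator.join(chunk for chunk in chunks if chunk)


print(strip_html("<p>This what you want.</p><p>This what you get.</p>"))
# This what you want. This what you get.
print(strip_html("<p>Some <b>bold</b> text.</p>"))
# Some bold text.
```

Because every chunk is stripped before joining, a separator is also inserted around inline tags, which may or may not be what you want for markup such as "wor<b>d</b>".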


html

python

lxml

beautifulsoup
