Stripping HTML tags from a string, keeping or removing the text in between

I want to clean up some HTML in Python 3. In it I used span tags to mark inserted text (shown in color) and deleted text (shown with strikethrough). An example:

<p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. <span class="inserted">
Lorem ipsum</span> Lorem ipsum dolor sit amet, consetetur 
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut 
labore et dolore magna aliquyam erat, sed diam voluptua. At 
vero eos et accusam et justo duo dolores et ea rebum. 
<span class="strikethrough">Lorem ipsum</span> lorem 
<span class="inserted">ipsum</span>. At vero eos et accusam et 
justo duo dolores et ea rebum. Stet clita kasd gubergren, 
no sea takimata sanctus est Lorem ipsum dolor sit amet.</p>

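What I want to do is remove the span tags, keeping the text inside spans with class 'inserted' and deleting the text inside spans with class 'strikethrough'.

I found this for stripping the tags while keeping the text in between: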

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let the base class set up parser state; 'strict' no longer exists in Python 3.5+
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

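But if a span tag has the special class ('strikethrough'), I want to delete the text inside it as well.

How can I do that?

You almost have it. You just need the handle_starttag() and handle_endtag() methods plus a variable to track the current state.

How about this: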

from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)

        self._forbidden = False  # True while inside a 'strikethrough' span
        self._result = []        # text chunks to keep

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'span' and ('class', 'strikethrough') in attrs:
            self._forbidden = True

    def handle_endtag(self, tag):
        # Only a closing </span> ends the forbidden region
        if tag == 'span':
            self._forbidden = False

    def handle_data(self, data):
        if not self._forbidden:
            self._result.append(data)


st = MLStripper()
st.feed('''
<p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. <span class="inserted">
Lorem ipsum</span> Lorem ipsum dolor sit amet, consetetur
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum.
<span class="strikethrough">Lorem ipsum</span> lorem
<span class="inserted">ipsum</span>. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.</p>
''')

print(''.join(st._result))

Result:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
Lorem ipsum Lorem ipsum dolor sit amet, consetetur
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum.
 lorem
ipsum. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.
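
A single boolean flag is enough for the sample input, but it loses track if spans are nested or if the class attribute holds several class names. The following is a minimal sketch of a more defensive variant, not a definitive implementation: the names SpanFilter, strip_spans and drop_class are made up for illustration, and it assumes that nested spans inside a dropped span should be dropped too.

```python
from html.parser import HTMLParser


class SpanFilter(HTMLParser):
    """Strip all tags; drop text inside spans carrying drop_class."""

    def __init__(self, drop_class="strikethrough"):
        super().__init__(convert_charrefs=True)
        self._drop_class = drop_class
        self._span_stack = []  # one bool per open <span>: is it a dropped span?
        self._dropped = 0      # how many dropped spans are currently open
        self._chunks = []      # text chunks to keep

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        # The class attribute may hold several space-separated names (or be absent)
        classes = (dict(attrs).get("class") or "").split()
        drop = self._drop_class in classes
        self._span_stack.append(drop)
        self._dropped += drop

    def handle_endtag(self, tag):
        # Ignore stray </span> tags that have no matching open span
        if tag == "span" and self._span_stack:
            self._dropped -= self._span_stack.pop()

    def handle_data(self, data):
        if not self._dropped:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def strip_spans(html):
    parser = SpanFilter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(strip_spans('<p>a <span class="strikethrough old">x <b>y</b> z</span>b</p>'))
```

Removing a span leaves the surrounding whitespace in place, as the stray space before "lorem" in the result above shows. If that matters, something like " ".join(strip_spans(html).split()) collapses every run of whitespace into a single space.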