Remove text from HTML files but keep the JavaScript and structure using Python

Question

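There are plenty of ways to extract text from an HTML file, but I'd like to do the opposite: remove the text while leaving the structure and the JavaScript code intact.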

For example, remove all

, while keeping

Is there an easy way to do this? Any help is greatly appreciated. Cheers

Answer 1

I would go with BeautifulSoup:

from copy import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def strip_content(in_tag):
    tag = copy(in_tag)  # work on a copy; remove this line if you don't care about your input
    if tag.name == 'script':
        # Do not mess with scripts
        return tag
    # Strip content from all child tags; NavigableString children (text, comments) are skipped
    children = [strip_content(child) for child in tag.children
                if not isinstance(child, NavigableString)]
    # Remove everything from the tag
    tag.clear()
    for child in children:
        # Add back the stripped children
        tag.append(child)
    return tag


def test(filename):
    # Name the parser explicitly and close the file once it has been parsed
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    cleaned_soup = strip_content(soup)
    print(cleaned_soup.prettify())


if __name__ == "__main__":
    test("myfile.html")
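
If copying the tree is not needed, the same idea can probably be done in place by removing the text nodes directly. The following is only a sketch under a few assumptions: `strip_text_in_place`, its `keep_tags` parameter and `SAMPLE_HTML` are hypothetical names made up for illustration, and, unlike the function above, this variant leaves comments and the doctype where they are.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype

# Hypothetical sample document, used only for this illustration
SAMPLE_HTML = """<html><head><title>Demo</title>
<script>var greeting = "hello";</script></head>
<body><h1>Heading</h1><p>Some <b>bold</b> text.</p></body></html>"""


def strip_text_in_place(soup, keep_tags=("script",)):
    """Remove text nodes from soup, except those directly inside keep_tags."""
    # find_all returns a list, so extracting nodes while looping is safe
    for node in soup.find_all(string=True):
        if isinstance(node, (Comment, Doctype)):
            # Assumption: comments and the doctype count as structure
            continue
        if node.parent is not None and node.parent.name in keep_tags:
            # Leave the JavaScript source untouched
            continue
        node.extract()  # detach the text node from the tree
    return soup


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stripped = strip_text_in_place(soup)
print(stripped)
```

Passing `keep_tags=("script", "style")` should also preserve inline CSS, if that counts as part of the structure you want to keep.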


Tags: html, python, extract, beautifulsoup