Remove text from HTML files but keep the JavaScript and structure using Python
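There are plenty of ways to extract the text from an HTML file, but I want to do the opposite: remove the text while leaving the structure and the JavaScript code untouched.
For example, remove all the visible text while keeping the tags and the script blocks.
Is there an easy way to do this? Any help is greatly appreciated.
Cheers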
I would go with BeautifulSoup:
from copy import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def strip_content(in_tag):
    """Return a copy of in_tag with all text removed; <script> tags are kept intact."""
    tag = copy(in_tag)  # remove this line if you don't care about your input
    if tag.name == 'script':
        # Do not mess with scripts
        return tag
    # Strip content from all child tags; text nodes are dropped here
    children = [strip_content(child) for child in tag.children
                if not isinstance(child, NavigableString)]
    # Remove everything from the tag
    tag.clear()
    for child in children:
        # Add back the stripped children
        tag.append(child)
    return tag


def test(filename):
    # Name the parser explicitly and close the file when done
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    cleaned_soup = strip_content(soup)
    print(cleaned_soup.prettify())


if __name__ == "__main__":
    test("myfile.html")
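
For comparison, here is a minimal sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It is not the approach from the answer above, and the names TextStripper and strip_text are made up for illustration. It assumes reasonably well-formed input. It re-emits every tag as it was written, keeps the contents of script elements, and drops all other text, including comments and the whitespace between tags.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Hypothetical helper: re-emit tags, drop text outside <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []          # pieces of the output document
        self.in_script = False   # True while inside a <script> element

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written
        self.parts.append(self.get_starttag_text())
        if tag == "script":
            self.in_script = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept unchanged
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep JavaScript source, discard every other text node
        if self.in_script:
            self.parts.append(data)

    def handle_decl(self, decl):
        # Keep declarations such as <!DOCTYPE html>
        self.parts.append(f"<!{decl}>")


def strip_text(html):
    """Return html with text removed but tags and scripts preserved."""
    parser = TextStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = ('<html><body><p class="x">Hello <b>world</b></p>'
              '<script>var a = 1;</script></body></html>')
    print(strip_text(sample))
```

Unlike the BeautifulSoup version, this sketch does not build a tree, so it cannot repair broken markup or pretty-print the result; it simply streams the tags back out in the order they were read.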