Python - remove markup tags and read HTML from a file?
I have a file called BBC_news_home.html, and I need to remove all the markup tags so that I am left with just the text. So far I have:
import codecs
import re

def clean_html(html):
    cleaned = ''
    line = html
    pattern = r'(<.*?>)'
    result = re.findall(pattern, line, re.S)
    if result:
        f = codecs.open("BBC_news_home.html", 'r', 'utf-8')
        print(f.read())
    else:
        print('Not cleaned.')
    return cleaned
I have confirmed on regex101.com that the pattern is correct. I am just not sure how to print the output so I can check that the markup tags are gone.
You really should use BeautifulSoup for this. Depending on your Python version, run pip3 install BeautifulSoup4 or pip install BeautifulSoup4. I have posted an answer to a similar question. For completeness:
from bs4 import BeautifulSoup

def cleanme(html):
    # Parse the HTML; naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are not visible page text
    for element in soup(["script", "style", "title"]):
        element.extract()
    # Join the remaining text nodes with single spaces, skipping whitespace-only ones
    return soup.get_text(separator=" ", strip=True)

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
cleaned = cleanme(testhtml)
print(cleaned)

The output of this is just: I need this text captured And this
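To tie this back to the file in the question, here is a minimal sketch that reads an HTML file and prints only its text. It assumes BBC_news_home.html is UTF-8 encoded and sits in the current directory. It also swaps in the standard library's html.parser module for BeautifulSoup, which may help if installing packages is not an option. The names TextExtractor, html_to_text and file_to_text are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside the SKIPPED tags."""

    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Return the text of an HTML string with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def file_to_text(path, encoding="utf-8"):
    """Read an HTML file (encoding is an assumption) and return its text."""
    return html_to_text(Path(path).read_text(encoding=encoding))


# Only runs if the file from the question is actually present.
path = Path("BBC_news_home.html")
if path.exists():
    print(file_to_text(path))
```

Printing the returned string is the check the question asks about: if no tags show up in the printed text, the markup is gone. Note that re.findall in the original attempt only lists the tags it finds and never removes them, which is why that function always returns an empty string.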