Preserve &nbsp; in a Beautiful Soup object
My sample.htm file is as follows:
<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp;Hello! he said.&nbsp;!</p>
</body>
</html>
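My Python code is as follows:
from bs4 import BeautifulSoup

# Read the source document and parse it.
with open('sample.htm', 'r', encoding='utf8') as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'html.parser')

# Write the parsed document back out.
with open("sample-output.htm", "w", encoding='utf-8') as file:
    file.write(str(soup))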
This reads sample.htm and writes to another file, sample-output.htm.
The output of the above:
<html><head>
<title>hello</title>
</head>
<body>
<p> Hello! he said. !</p>
</body>
</html>
How can I preserve &nbsp; after writing to the file?
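You can always use a regular expression:
import re
from bs4 import BeautifulSoup

text = '<p>&nbsp;Hello! he said.&nbsp;!</p>'
soup = BeautifulSoup(text, 'html.parser')
# str(soup) contains the decoded character U+00A0; turn it back into the entity.
text_str = re.sub(r"\xa0", "&nbsp;", str(soup))
This runs after the document has been through Beautiful Soup (post-soupify), but I hope it offers a different solution.
You can also use str.replace directly:
>>> str(soup).replace("\xa0", "&nbsp;")
'<p>&nbsp;Hello! he said.&nbsp;!</p>'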
Where in your code can you use it?
with open("sample-output.htm", "w", encoding='utf-8') as file:
    file.write(str(soup).replace("\xa0", "&nbsp;"))
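The replace approach above handles only the one character it names. As a rough sketch of how it could be generalized, the snippet below uses the standard library table html.entities.codepoint2name to turn every non-ASCII character that has an HTML 4 entity name back into that entity. The helper name restore_named_entities is hypothetical, and the choice to leave ASCII characters alone (so that <, > and & in the serialized markup stay intact) is an assumption of this sketch, not something the answers above prescribe.

```python
from html.entities import codepoint2name


def restore_named_entities(markup: str) -> str:
    """Replace non-ASCII characters with named HTML entities where one exists.

    ASCII characters are left untouched so that tags and already-escaped
    entities in the serialized markup are not damaged.
    """
    pieces = []
    for ch in markup:
        code = ord(ch)
        # Only look up non-ASCII code points; codepoint2name also lists
        # '<', '>', '&' and '"', which must not be rewritten here.
        name = codepoint2name.get(code) if code > 127 else None
        pieces.append(f"&{name};" if name else ch)
    return "".join(pieces)


# U+00A0 is what Beautiful Soup hands back for &nbsp; in str(soup).
print(restore_named_entities("<p>\xa0Hello! he said.\xa0!</p>"))
# prints <p>&nbsp;Hello! he said.&nbsp;!</p>
```

Characters without an entry in the table are passed through unchanged, so the output still needs to be written with an encoding such as UTF-8.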
Read and follow the basic documentation: Output formatters
If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters …
If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back …
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode() …
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:
soup_string = soup.prettify(formatter="html")
print(soup_string)
<html>
 <head>
  <title>
   hello
  </title>
 </head>
 <body>
  <p>
   &nbsp;Hello! he said.&nbsp;!
  </p>
 </body>
</html>
print(type(soup_string)) # for the sake of completeness
<class 'str'>
Another way (without prettifying):
print(soup.encode(formatter="html").decode())
<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp;Hello! he said.&nbsp;!</p>
</body>
</html>
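To tie the formatter argument back to the file-writing code in the question, here is a minimal sketch that parses a file and writes it out with decode(formatter="html"). It assumes the bs4 package is installed. The helper name copy_html_keeping_entities and the temporary directory are illustrative only, and the sample markup is a shortened stand-in for sample.htm.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def copy_html_keeping_entities(src: Path, dst: Path) -> str:
    """Parse src and write it to dst, re-encoding characters as HTML entities."""
    soup = BeautifulSoup(src.read_text(encoding="utf-8"), "html.parser")
    # decode() returns a str; formatter="html" asks Beautiful Soup to emit
    # named entities instead of raw Unicode characters where it can.
    html = soup.decode(formatter="html")
    dst.write_text(html, encoding="utf-8")
    return html


# Demonstration in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.htm"
    dst = Path(tmp) / "sample-output.htm"
    src.write_text("<p>&nbsp;Hello! he said.&nbsp;!</p>", encoding="utf-8")
    written = copy_html_keeping_entities(src, dst)
    on_disk = dst.read_text(encoding="utf-8")
    print(written)
```

Compared with a post-hoc string replace, this keeps the entity handling inside Beautiful Soup, so it is not limited to a single character.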