Preserve &nbsp; in a Beautiful Soup object

Question

My sample.htm file looks like this:

<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp; Hello! he said. &nbsp; !</p>
</body>
</html>

My Python code is as follows:

from bs4 import BeautifulSoup

# Read the source document and parse it.
with open('sample.htm', 'r', encoding='utf-8') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'html.parser')

with open("sample-output.htm", "w", encoding='utf-8') as file:
    file.write(str(soup))

This reads sample.htm and writes the result to a second file, sample-output.htm.

The output of the above:

<html><head>
<title>hello</title>
</head>
<body>
<p>  Hello! he said.   !</p>
</body>
</html>

How can I keep the &nbsp; entities when the document is written to the file?

Answer 1

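You can always use a regular expression:

import re
from bs4 import BeautifulSoup

text = '<p>&nbsp; Hello! he said. &nbsp; !</p>'
soup = BeautifulSoup(text, 'html.parser')
# str(soup) contains U+00A0 where the source had &nbsp;, so substitute the entity back in.
text_str = re.sub(r"\xa0", "&nbsp;", str(soup))

This works on the string after the soup has been built rather than on the soup itself, but I hope it offers an alternative solution.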
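For context, the substitution works because the parser decodes each &nbsp; entity into the single character U+00A0 (no-break space). The following is a minimal sketch of that round trip using only the standard library; html.unescape is used here as a stand-in for the decoding step and is an assumption for illustration, not a description of how Beautiful Soup is implemented.

```python
import html
import re

source = "<p>&nbsp; Hello! he said. &nbsp; !</p>"

# Decoding the entity yields the character U+00A0, which looks like a plain space.
decoded = html.unescape(source)
print(repr(decoded))

# Substituting the entity back in restores the original markup.
restored = re.sub("\xa0", "&nbsp;", decoded)
print(restored == source)
```

Note that this restores every U+00A0 as &nbsp;, including any that were written as literal characters in the source rather than as entities.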

Answer 2

You can simply use str.replace:

>>> text_str.replace("\xa0", "&nbsp;")
'<p>&nbsp; Hello! he said. &nbsp; !</p>'

Where can you use it in your code?

with open("sample-output.htm", "w", encoding='utf-8') as file:
    file.write(str(soup).replace("\xa0", "&nbsp;"))
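The write step can also be wrapped in a small helper and checked with a round trip through a temporary file. This is only a sketch: the helper name write_with_nbsp is hypothetical, and the parsed_text string is a hand-written stand-in for str(soup) so that the example runs without any third-party package.

```python
import os
import tempfile


def write_with_nbsp(markup: str, path: str) -> None:
    """Write markup to path, turning U+00A0 back into the &nbsp; entity."""
    with open(path, "w", encoding="utf-8") as file:
        file.write(markup.replace("\xa0", "&nbsp;"))


# Stand-in for str(soup): the paragraph as it looks after parsing.
parsed_text = "<p>\xa0 Hello! he said. \xa0 !</p>"

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sample-output.htm")
    write_with_nbsp(parsed_text, out_path)
    # Read the file back to confirm what was actually written.
    with open(out_path, encoding="utf-8") as file:
        written = file.read()

print(written)
```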

Answer 3

Read and follow the basic documentation: Output formatters

If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters
…
If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back
…
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()
…
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

soup_string = soup.prettify(formatter="html")
print(soup_string)

<html>
 <head>
  <title>
   hello
  </title>
 </head>
 <body>
  <p>
   &nbsp; Hello! he said. &nbsp; !
  </p>
 </body>
</html>

print(type(soup_string)) # for the sake of completeness

<class 'str'>

Another way (without prettifying):

print(soup.encode(formatter="html").decode())

<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp; Hello! he said. &nbsp; !</p>
</body>
</html>
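Applied to the file-writing step from the question, the formatter approach might look like the sketch below. It assumes the bs4 package is installed, and it parses an in-memory string instead of sample.htm so that it runs on its own. decode(formatter="html") is used here because it returns a str directly, which is equivalent to the encode(...).decode() call above.

```python
from bs4 import BeautifulSoup

markup = "<p>&nbsp; Hello! he said. &nbsp; !</p>"
soup = BeautifulSoup(markup, "html.parser")

# Default output: the no-break space is written as the raw character U+00A0.
default_output = str(soup)

# With formatter="html", characters that have a named HTML entity are written as entities.
entity_output = soup.decode(formatter="html")
print(entity_output)
```

To write the file, pass entity_output to file.write() in place of str(soup).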
