Convert HTML entities in plain text to characters

Question

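I scraped the titles and URLs of news articles and stored them as plain text in a TSV file. For some reason, the scraper I used converted some characters (e.g. €) to hex codes. I tried to change this on the scraper side, but without success. What I want is to change the hex codes back to the actual characters so that I can load the actual strings into a Postgres database.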

For example, the string Motorists could be charged for every mile they drive to raise &#x20AC;35bn should be stored in the database as Motorists could be charged for every mile they drive to raise €35bn

What I have tried so far is to find all the hex codes in the file, strip the &#x part, and then convert the hex code to the actual character. In the case of €:

s_decoded = bytes.fromhex("20AC").decode('ascii')

and

s_decoded = bytes.fromhex("20AC").decode('utf-8')

These give the following errors, respectively: UnicodeDecodeError: 'ascii' codec can't decode byte 0xac in position 1: ordinal not in range(128) and UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 1: invalid start byte.

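I have gone through a lot of earlier questions here, but I can't figure out why this happens. Apologies if this is a duplicate, but if someone could point me to something that solves my problem, it would be much appreciated.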

Answer 1

To decode the HTML entities in your example, you can use the following code.

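import html

html_encoded = 'Motorists could be charged for every mile they drive to raise &#x20AC;35bn'
# html.unescape converts named and numeric character references to characters
html_decoded = html.unescape(html_encoded)
print(html_decoded)  # Motorists could be charged for every mile they drive to raise €35bn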
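
As for why the attempts in the question fail: 20AC is the Unicode code point of € (U+20AC), not the bytes that encode it in ASCII or UTF-8, so bytes.fromhex("20AC") yields two bytes that neither codec can decode. The sketch below illustrates that, and then applies html.unescape to one line of a TSV file. The helper name decode_title, the assumption that the title is the first tab-separated column, and the example.com URL are all hypothetical and not taken from the question; adjust them to your file.

```python
import html

# A numeric character reference holds a code point, so convert it with chr().
print(chr(int("20AC", 16)))
# The UTF-8 encoding of the same character is three different bytes.
print("€".encode("utf-8"))


def decode_title(line, title_col=0):
    """Decode HTML entities in one column of a tab-separated line.

    Hypothetical helper: assumes a plain TSV without quoting and that the
    title sits in column `title_col`. The other columns are left untouched.
    """
    fields = line.rstrip("\n").split("\t")
    fields[title_col] = html.unescape(fields[title_col])
    return "\t".join(fields)


# Made-up sample row in the shape described in the question (title, URL).
row = ("Motorists could be charged for every mile they drive "
       "to raise &#x20AC;35bn\thttps://example.com/a?x=1&copy=2\n")
print(decode_title(row))
```

Only the title column is decoded here on purpose: html.unescape also recognises some legacy entity names without a trailing semicolon, so running it over a URL whose query string contains something like &copy=2 would alter the URL.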

python

ascii

utf-8