Is there a way to identify and convert unescaped four-digit Unicode characters within a string of normal characters?

Question

I am using requests.get to retrieve data from Google Ngram.

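I have run into a problem: when I query the site for a string that contains a special character such as an apostrophe (in this case I am searching for "marcher d'un pas lourd"), the information it returns is "marcher d&#39; un pas lourd".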

As you can see in the returned string, the apostrophe has been replaced with the four-digit Unicode code for an apostrophe.

This breaks the rest of my code, because I use the original query string ("marcher d'un pas lourd") to locate the data I need in the returned data.

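Is there a function or procedure that can search for and convert four-digit Unicode codes within a string of otherwise normal characters? Note that I do not want to remove these special characters; I want them to be represented correctly in my code.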

Answer 1

These are called HTML entities, and they can be unescaped like this:

>>> s="marcher d&#39; un pas lourd"
>>> import html
>>> html.unescape(s)
"marcher d' un pas lourd"
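
To tie this back to the lookup problem in the question, here is a minimal sketch of how the decoded string might be compared against the original query. The helper names `decode_entities` and `same_ignoring_spaces` are made up for illustration, and ignoring whitespace is only one possible way to cope with the extra space after the apostrophe in the returned string; it is an assumption, not something Google Ngram documents.

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML entities such as '&#39;' back into the characters they stand for."""
    return html.unescape(text)


def same_ignoring_spaces(a: str, b: str) -> bool:
    """Compare two strings after dropping all whitespace (hypothetical helper)."""
    return "".join(a.split()) == "".join(b.split())


query = "marcher d'un pas lourd"
returned = "marcher d&#39; un pas lourd"  # string as it appears in the response

decoded = decode_entities(returned)
print(decoded)                               # marcher d' un pas lourd
print(decoded == query)                      # False: the space after the apostrophe remains
print(same_ignoring_spaces(decoded, query))  # True
```

html.unescape also handles hexadecimal references such as `&#x27;` and named ones such as `&eacute;`, so accented characters returned as entities should be decoded the same way.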
