Django "surrogates not allowed" error on model.save() call when text includes emoji character

We are currently building a system that stores text in a PostgreSQL database through Django. The data is then extracted to Elasticsearch via PGSync.

We have run into the following issue in one of our test cases.

Error message:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 159-160: surrogates not allowed

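We identified the character that causes the problem. It is an emoji.

The text itself is a mix of Greek characters, "English characters", and what appears to be an emoji. The Greek is not displayed as Greek but in \u escape form.

The relevant text that causes the problem: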

\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag

\ud83d\ude9b translates to this emoji: 🚛 (U+1F69B)

As stated here: https://python-list.python.narkive.com/aKjK4Jje/encoding-of-surrogate-code-points-to-utf-8

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.
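
For illustration, here is a minimal sketch of how a UTF-16 surrogate pair maps to a single code point, using the pair from the text above. The function name combine_surrogates is hypothetical and not part of any library; the arithmetic follows the standard UTF-16 definition.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE9B)
print(hex(code_point))  # 0x1f69b
```

The resulting code point, U+1F69B, is a regular character that UTF-8 can encode. The two halves \ud83d and \ude9b on their own are not.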

PostgreSQL has the following encoding:

Is this a UTF-8 problem? Or is it specific to emoji? Is this a Django problem or a PostgreSQL problem?

To reproduce the problem:

x='\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag'
print(x)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 21-22: surrogates not allowed
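
To check whether a given string is affected before it reaches print() or the database driver, something like the following sketch may help. The helper name find_surrogates is made up for this example; it simply lists the positions of code points in the surrogate range U+D800 to U+DFFF.

```python
x = '\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag'


def find_surrogates(text: str) -> list:
    """Return the indexes of all surrogate code points in text."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


print(find_surrogates(x))  # [21, 22]

# Encoding to UTF-8 fails at the first surrogate, independent of Django.
try:
    x.encode('utf-8')
except UnicodeEncodeError as exc:
    print(exc.reason, 'starting at index', exc.start)
```

The positions match the ones reported in the traceback, which suggests the error comes from Python's UTF-8 encoder rather than from Django or PostgreSQL specifically.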

Solution: apply the raw_unicode_escape and unicode_escape codecs (see Python Specific Encodings) as follows:

y = x.encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE','surrogatepass').decode('utf-16_BE')
print(y)
με Some English Text 🚛
#SomeHashTag
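
A shorter variant that appears to give the same result for this input is to round-trip through UTF-16 with the surrogatepass error handler, which merges each high/low pair into one real character. This is only a sketch: repair_surrogates is a hypothetical helper name, and an unpaired surrogate would still raise an error on decode.

```python
def repair_surrogates(text: str) -> str:
    """Merge UTF-16 surrogate pairs stored as separate code points.

    'surrogatepass' lets the encoder write the surrogates out as-is;
    decoding the resulting bytes as UTF-16 then combines each pair.
    """
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


x = '\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag'
y = repair_surrogates(x)

print(len(x), len(y))        # the pair collapses into one character
print(y.encode('utf-8'))     # encoding to UTF-8 now succeeds

# Hypothetical use before saving (field and instance names are assumed):
# obj.text = repair_surrogates(obj.text)
# obj.save()
```

Cleaning the text at the point where it enters the system, rather than just before save(), may be the more robust choice, since the lone surrogates usually indicate that escaped UTF-16 data was decoded incorrectly upstream.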