Emoji crashed when uploading to BigQuery
I'm currently running into a problem uploading emoji data to BigQuery (using Python).
This is a sample of the data I want to upload to BQ:
{"emojiCharts":{"emoji_icon":"\ud83d\udc4d","repost": 4, "doc": 4, "engagement": 0, "reach": 0, "impression": 0}}
{"emojiCharts":{"emoji_icon":"\ud83d\udc49","repost": 4, "doc": 4, "engagement": 43, "reach": 722, "impression": 4816}}
{"emojiCharts":{"emoji_icon":"\u203c","repost": 4, "doc": 4, "engagement": 0, "reach": 0, "impression": 0}}
{"emojiCharts":{"emoji_icon":"\ud83c\udf89","repost": 5, "doc": 5, "engagement": 43, "reach": 829, "impression": 5529}}
{"emojiCharts":{"emoji_icon":"\ud83d\ude34","repost": 5, "doc": 5, "engagement": 222, "reach": 420, "impression": 2805}}
{"emojiCharts":{"emoji_icon":"\ud83d\ude31","repost": 3, "doc": 3, "engagement": 386, "reach": 2868, "impression": 19122}}
{"emojiCharts":{"emoji_icon":"\ud83d\udc4d\ud83c\udffb","repost": 5, "doc": 5, "engagement": 43, "reach": 1064, "impression": 7098}}
{"emojiCharts":{"emoji_icon":"\ud83d\ude3b","repost": 3, "doc": 3, "engagement": 93, "reach": 192, "impression": 1283}}
{"emojiCharts":{"emoji_icon":"\ud83d\ude2d","repost": 6, "doc": 6, "engagement": 212, "reach": 909, "impression": 6143}}
{"emojiCharts":{"emoji_icon":"\ud83e\udd84","repost": 8, "doc": 8, "engagement": 313, "reach": 402, "impression": 2681}}
{"emojiCharts":{"emoji_icon":"\ud83d\ude18","repost": 7, "doc": 7, "engagement": 0, "reach": 8454, "impression": 56366}}
{"emojiCharts":{"emoji_icon":"\ud83d\ude05","repost": 5, "doc": 5, "engagement": 74, "reach": 1582, "impression": 10550}}
{"emojiCharts":{"emoji_icon":"\ud83d\ude04","repost": 5, "doc": 5, "engagement": 73, "reach": 3329, "impression": 22206}}
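The problem is that BigQuery does not display any emoji written in this form (\ud83d\ude04) and only displays those written in this form (\u203c).
Even though the field is a STRING, it shows 2 black rhombuses. Why can't BQ display the emoji as a string, without converting it to the actual emoji?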
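Questions:
Is there any way to upload emoji to BigQuery so that they load correctly? (The data will be used in Google Data Studio.)
Should I manually (hard-code) change all the emoji codes to an acceptable form, and if so, what is the acceptable format?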
As user 'numeral' mentioned in their comment:
Check out charbase.com/1f618-unicode-face-throwing-a-kiss What you want is to convert the javascript escape characters to actual unicode data.
You need to change the encoding of the emoji so that each one is represented as exactly one character:
SELECT "\U0001f604 \U0001f4b8"
-- , "\ud83d\udcb8"
-- , "\ud83d\ude04"
Lines 2 and 3 (commented out above) fail with an error like "Illegal escape sequence: Unicode value \ud83d is invalid at [2:7]", but the first line displays correctly in both BigQuery and Data Studio.
Some other thoughts on this:
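Python string literals do not support the "surrogate pair" representation, in which a character above 0xFFFF (as some emoji are) is written as two UTF-16 code units. For example, the bank emoji 🏦 is written as \U0001f3e6 (UTF-32) in Python, while some other languages use \ud83c\udfe6. For values below 0xFFFF, Python and other languages use the same representation, such as \u3020 (〠). To fix the encoding, you can convert the emoji characters manually or consider a library such as https://github.com/hartwork/surrogates to convert them to UTF-32.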
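As an alternative to a third-party library, the conversion can be sketched with the standard library alone. This is a minimal illustration, not the method used by the surrogates package: combine_surrogates and bigquery_escape are hypothetical helper names, and the sketch assumes every surrogate in the input is part of a well-formed pair (a lone surrogate would raise UnicodeDecodeError).

```python
import json


def combine_surrogates(text):
    """Merge UTF-16 surrogate pairs held in a Python str into single code points."""
    # 'surrogatepass' lets the lone surrogates through the encoder; decoding
    # the resulting UTF-16 bytes then joins each pair into one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def bigquery_escape(text):
    """Hypothetical helper: write characters above 0xFFFF as 8-digit escapes."""
    # Produces the backslash-U form that the SELECT example above accepts.
    return "".join(
        f"\\U{ord(ch):08x}" if ord(ch) > 0xFFFF else ch for ch in text
    )


# A Python literal written as "\ud83d\udc4d" holds two separate surrogates.
raw = "\ud83d\udc4d"
fixed = combine_surrogates(raw)  # one character, U+1F44D

# json.loads already merges surrogate-pair escapes, so rows parsed from the
# sample data contain real characters and need no extra conversion.
row = json.loads(r'{"emojiCharts": {"emoji_icon": "\ud83d\ude04", "repost": 5}}')
icon = row["emojiCharts"]["emoji_icon"]

print(len(raw), len(fixed), bigquery_escape(icon))
```

If the rows are read with json.loads in the first place, only strings that were built from Python literals or other non-JSON sources should need combine_surrogates.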
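Also, load_table_from_json in the BigQuery Python client had a bug affecting characters with values above 0xFFFF, even when you use the correct UTF-32 representation. A new version that fixes it was released just a few days ago. Reference: https://github.com/googleapis/python-bigquery/releases/tag/v2.24.0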
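For client versions where upgrading is not an option, one possible workaround is to serialize the rows yourself as newline-delimited JSON in real UTF-8 and load that file instead. The sketch below is an assumption-laden illustration rather than the fix shipped in the release: rows_to_ndjson and load_rows are hypothetical names, the table ID is supplied by the caller, and schema autodetection is assumed to be acceptable.

```python
import io
import json


def rows_to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, keeping emoji as real UTF-8."""
    # ensure_ascii=False writes the characters themselves instead of \uXXXX
    # escapes, so no surrogate-pair escapes appear in the payload.
    lines = (json.dumps(row, ensure_ascii=False) for row in rows)
    return ("\n".join(lines) + "\n").encode("utf-8")


def load_rows(table_id, rows):
    """Hypothetical helper: load the rows into an existing or new table."""
    # Imported lazily so rows_to_ndjson can be used without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # assumption: let BigQuery infer the schema
    )
    job = client.load_table_from_file(
        io.BytesIO(rows_to_ndjson(rows)), table_id, job_config=job_config
    )
    return job.result()  # waits for the load job to finish


# Parsing one of the sample lines and re-serializing it.
sample = json.loads(r'{"emoji_icon": "\ud83d\udc4d", "repost": 4}')
payload = rows_to_ndjson([sample])
```

Calling load_rows requires credentials and the google-cloud-bigquery package; rows_to_ndjson can be checked locally without either.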
Some references for the bank emoji that list its different representations: