Convert HTML characters to Unicode in BigQuery
Is there a way in BigQuery to replace HTML entity characters with their corresponding Unicode characters?
For example, I have the following rows in a table:
id | text
1 | Hello World &#128540;
2 | Yes &#128540; It works great &#128540;
I would like to have:
id | text
1 | Hello World 😜
2 | Yes 😜 It works great 😜
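The following general technique works:
- Split the text into individual characters, treating an HTML entity such as &#128540; as a single character
- Use OFFSET to track each character's position
- Rejoin all the characters in order, replacing each HTML entity with its Unicode character by means of a few BigQuery STRING functions.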
SELECT
  id,
  ANY_VALUE(text) AS original,
  STRING_AGG(
    COALESCE(
      -- Support hex codepoints, e.g. &#x2014;
      CODE_POINTS_TO_STRING(
        [CAST(CONCAT('0x', REGEXP_EXTRACT(char, r'(?:&#x)(\w+)(?:;)')) AS INT64)]
      ),
      -- Support decimal codepoints, e.g. &#8212;
      CODE_POINTS_TO_STRING(
        [CAST(REGEXP_EXTRACT(char, r'(?:&#)(\d+)(?:;)') AS INT64)]
      ),
      -- Fall back to the character itself
      char
    ),
    '' ORDER BY char_position) AS text
FROM UNNEST([
  STRUCT(1 AS id, 'Hello World &#128540;' AS text),
  STRUCT(2 AS id, 'Yes &#128540; It works great &#128540;' AS text),
  STRUCT(3 AS id, '&#8212;' AS text),
  STRUCT(4 AS id, '&#x2014;' AS text)
])
CROSS JOIN
  -- Extract all characters individually, keeping each HTML entity as one element
  UNNEST(REGEXP_EXTRACT_ALL(text, r'(&#\w+;|.)')) char WITH OFFSET AS char_position
GROUP BY id
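If the same decoding is needed in several queries, the logic above could be wrapped in a temporary SQL UDF. The following is a minimal sketch under stated assumptions rather than a definitive implementation: the function name `decode_numeric_entities` and the aliases `html_text`, `token`, and `pos` are made up for illustration. It differs from the query above in two ways: the `(?s)` flag lets `.` match newlines, so multi-line text should keep its line breaks, and the hex pattern only accepts hex digits, so a malformed entity should fall through unchanged instead of failing the cast. Like the query above, it only handles numeric entities; named entities such as `&amp;` are left as they are.

```sql
-- Hypothetical helper: decodes numeric HTML entities (&#NNN; and &#xHHH;) in a string.
CREATE TEMP FUNCTION decode_numeric_entities(html_text STRING)
RETURNS STRING
AS ((
  SELECT
    STRING_AGG(
      COALESCE(
        -- Hex entity, e.g. &#x2014;
        CODE_POINTS_TO_STRING(
          [CAST(CONCAT('0x', REGEXP_EXTRACT(token, r'^&#[xX]([0-9A-Fa-f]+);$')) AS INT64)]
        ),
        -- Decimal entity, e.g. &#8212;
        CODE_POINTS_TO_STRING(
          [CAST(REGEXP_EXTRACT(token, r'^&#([0-9]+);$') AS INT64)]
        ),
        -- Anything else is kept as-is
        token
      ),
      '' ORDER BY pos)
  -- Split into single characters, keeping each entity as one token
  FROM UNNEST(REGEXP_EXTRACT_ALL(html_text, r'(?s)(&#\w+;|.)')) AS token WITH OFFSET AS pos
));

-- Example call on inline sample rows (sample data is illustrative)
SELECT
  id,
  text AS original,
  decode_numeric_entities(text) AS decoded
FROM UNNEST([
  STRUCT(1 AS id, 'Hello World &#128540;' AS text),
  STRUCT(2 AS id, 'Yes &#x1F61C; It works great &#128540;' AS text)
]);
```

Because the splitting and rejoining happen inside the function's scalar subquery, the calling query needs neither the CROSS JOIN nor the GROUP BY. The temporary function has to be defined in the same script as the query that calls it.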