Convert HTML characters to Unicode in BigQuery

Is there a way in BigQuery to replace HTML entity characters with their corresponding Unicode characters?

For example, I have the following rows in a table:

id | text
1  | Hello World &#128540;
2  | Yes &#128540; It works great &#128540;

I would like to get:

id | text
1  | Hello World 😜
2  | Yes 😜 It works great 😜

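The following general technique works:

  • Split the text into individual characters, treating an HTML entity such as &#128540; as a single character
  • Use WITH OFFSET to keep track of each character's position
  • Rejoin all the characters in their original order, using BigQuery STRING functions to replace each HTML entity with its Unicode character
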
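To make the first two steps concrete, here is a minimal sketch of the splitting step on its own. The sample string is made up for illustration; the regular expression is the same one used in the full query.

```sql
-- Split a string into tokens: each numeric HTML entity stays whole,
-- every other character becomes its own token.
SELECT
  char,
  char_position
FROM UNNEST(REGEXP_EXTRACT_ALL('Hi &#128540;!', r'(&#\w+;|.)')) AS char WITH OFFSET AS char_position
ORDER BY char_position;
-- Tokens, in order: 'H', 'i', ' ', '&#128540;', '!'
```

The offset is what lets STRING_AGG put the tokens back in their original order later, since row order is otherwise not guaranteed. The full query that follows combines this split with the decoding and re-aggregation steps.
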
SELECT
  id,
  ANY_VALUE(text) AS original,
  STRING_AGG(
    COALESCE(
      -- Support hex codepoints, e.g. &#x2014; (only hex digits are accepted,
      -- so a token like &#xyz; falls through instead of failing the CAST)
      CODE_POINTS_TO_STRING(
        [CAST(CONCAT('0x', REGEXP_EXTRACT(char, r'&#[xX]([0-9A-Fa-f]+);')) AS INT64)]
      ),
      -- Support decimal codepoints, e.g. &#8212;
      CODE_POINTS_TO_STRING(
        [CAST(REGEXP_EXTRACT(char, r'&#(\d+);') AS INT64)]
      ),
      -- Fall back to the character itself
      char
    ),
  '' ORDER BY char_position) AS text
FROM UNNEST([
  STRUCT(1 AS id, 'Hello World &#128540;' AS text),
  STRUCT(2 AS id, 'Yes &#128540; It works great &#128540;'),
  STRUCT(3 AS id, '&#x2014;' AS text),
  STRUCT(4 AS id, '&#8212;' AS text)
])
CROSS JOIN
  -- Split the text into single characters, keeping each numeric HTML entity
  -- as one token; (?s) lets "." match newlines so they are not dropped
  UNNEST(REGEXP_EXTRACT_ALL(text, r'(?s)(&#\w+;|.)')) char WITH OFFSET AS char_position
GROUP BY id
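
If the same decoding is needed in several queries, the logic above can be wrapped in a temporary SQL function so that no GROUP BY is required in the calling query. This is only a sketch: the function name `decode_numeric_entities` and its parameter `input_text` are hypothetical names chosen for illustration, and like the query above it handles numeric entities only (not named ones such as `&amp;`).

```sql
-- Hypothetical helper; the name is an assumption, not a BigQuery built-in.
CREATE TEMP FUNCTION decode_numeric_entities(input_text STRING) AS ((
  SELECT
    STRING_AGG(
      COALESCE(
        -- Hex form, e.g. &#x2014;
        CODE_POINTS_TO_STRING(
          [CAST(CONCAT('0x', REGEXP_EXTRACT(token, r'&#[xX]([0-9A-Fa-f]+);')) AS INT64)]
        ),
        -- Decimal form, e.g. &#8212;
        CODE_POINTS_TO_STRING(
          [CAST(REGEXP_EXTRACT(token, r'&#(\d+);') AS INT64)]
        ),
        -- Anything else is kept unchanged
        token
      ),
      '' ORDER BY token_position
    )
  FROM UNNEST(REGEXP_EXTRACT_ALL(input_text, r'(?s)(&#\w+;|.)')) AS token WITH OFFSET AS token_position
));

-- Usage on inline sample rows; replace the UNNEST with your own table.
SELECT
  id,
  text AS original,
  decode_numeric_entities(text) AS decoded
FROM UNNEST([
  STRUCT(1 AS id, 'Hello World &#128540;' AS text),
  STRUCT(2 AS id, 'Yes &#128540; It works great &#128540;')
])
ORDER BY id;
```

Because the aggregation happens inside the function's scalar subquery, each row is decoded independently. Note that an empty input string yields NULL here, since STRING_AGG has no tokens to aggregate.
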