What is an equivalent to OTRANSLATE in BigQuery?
I'm trying to convert a query that uses Teradata's OTRANSLATE function so that it runs in BigQuery. For example:
SELECT OTRANSLATE(text, 'ehlo', 'EHLO')
FROM (
  SELECT 'hello world' AS text UNION ALL
  SELECT 'elliott'
);
This should produce:
HELLO wOrLd
ELLiOtt
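Is there any way to express this function in BigQuery? There doesn't appear to be a direct equivalent.
Yes, you can do this using array operations on strings. Here is one solution: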
CREATE TEMP FUNCTION OTRANSLATE(s STRING, key STRING, value STRING) AS (
  (SELECT
     STRING_AGG(
       IFNULL(
         -- Character of value at the offset where c occurs in key (NULL if absent).
         (SELECT SPLIT(value, '')[OFFSET((
            SELECT o2 FROM UNNEST(SPLIT(key, '')) AS k WITH OFFSET o2
            WHERE k = c))]
         ),
         c),
       '' ORDER BY o1)
   FROM UNNEST(SPLIT(s, '')) AS c WITH OFFSET o1)
);
SELECT OTRANSLATE(text, 'ehlo', 'EHLO')
FROM (
  SELECT 'hello world' AS text UNION ALL
  SELECT 'elliott'
);
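The idea is to find the character in the value string at the same position as the matching character in the key string. If there is no matching character in the key string, we get a null offset, so the second argument to IFNULL causes it to return the unmapped character. We then aggregate back into a string, ordered by character offset.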
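For readers who find the nested subqueries hard to follow, the same positional lookup can be sketched in plain JavaScript. This is only an illustration of the idea, not part of the SQL solution: translateChars is a hypothetical name, and the sketch assumes key and value have the same length.

```javascript
// Sketch only: translateChars is a hypothetical name, not a BigQuery function.
// Assumes key and value have the same length.
function translateChars(s, key, value) {
  const keyChars = Array.from(key);     // plays the role of SPLIT(key, '')
  const valueChars = Array.from(value); // plays the role of SPLIT(value, '')
  return Array.from(s)
    .map((c) => {
      const offset = keyChars.indexOf(c); // -1 when c does not occur in key
      // No match: keep the unmapped character, like the IFNULL fallback.
      return offset === -1 ? c : valueChars[offset];
    })
    .join(''); // like STRING_AGG(..., '' ORDER BY o1)
}

console.log(translateChars('hello world', 'ehlo', 'EHLO')); // HELLO wOrLd
console.log(translateChars('elliott', 'ehlo', 'EHLO'));     // ELLiOtt
```

Mapping over the characters in order is what makes the explicit ORDER BY o1 unnecessary here; in SQL the aggregation order has to be stated.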
Edit: here is a variant that also handles differences in the lengths of the key and value:
CREATE TEMP FUNCTION otranslate(s STRING, key STRING, value STRING) AS (
  IF(LENGTH(key) < LENGTH(value) OR LENGTH(s) < LENGTH(key), s,
    (SELECT
       STRING_AGG(
         IFNULL(
           (SELECT ARRAY_CONCAT([c], SPLIT(value, ''))[SAFE_OFFSET((
              SELECT IFNULL(MIN(o2) + 1, 0) FROM UNNEST(SPLIT(key, '')) AS k WITH OFFSET o2
              WHERE k = c))]
           ),
           ''),
         '' ORDER BY o1)
     FROM UNNEST(SPLIT(s, '')) AS c WITH OFFSET o1
  ))
);
SELECT
  otranslate("hello world", "", "EHLO") AS empty_from, -- 'hello world'
  otranslate("hello world", "hello world1", "EHLO") AS larger_from_than_source, -- 'hello world'
  otranslate("hello world", "ehlo", "EHLO") AS equal_size_from_to, -- 'HELLO wOrLd'
  otranslate("hello world", "ehlo", "EH") AS larger_size_from, -- 'HE wrd'
  otranslate("hello world", "ehlo", "EHLOPQ") AS larger_size_to, -- 'hello world'
  otranslate("hello world", "ehlo", "") AS empty_to; -- 'wrd'
Another, slightly different approach (BigQuery Standard SQL):
#standardSQL
CREATE TEMP FUNCTION OTRANSLATE(text STRING, from_string STRING, to_string STRING) AS ((
  SELECT STRING_AGG(IFNULL(y, a), '' ORDER BY pos)
  FROM UNNEST(SPLIT(text, '')) a WITH OFFSET pos
  LEFT JOIN (
    SELECT x, y
    FROM UNNEST(SPLIT(from_string, '')) x WITH OFFSET
    JOIN UNNEST(SPLIT(to_string, '')) y WITH OFFSET
    USING(OFFSET)
  )
  ON a = x
));
WITH `project.dataset.table` AS (
  SELECT 'hello world' AS text UNION ALL
  SELECT 'elliott'
)
SELECT text, OTRANSLATE(text, 'ehlo', 'EHLO') AS new_text
FROM `project.dataset.table`
Output:
Row  text         new_text
1    hello world  HELLO wOrLd
2    elliott      ELLiOtt
Note: the version above assumes that the from and to strings are of equal length and that the from string contains no repeated characters.
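To make those two assumptions concrete, here is a rough JavaScript sketch of a lookup table that relaxes them. The names buildMapping and translateWithMapping are hypothetical, and the two rules it applies (the first occurrence of a repeated from character wins, and a from character with no counterpart in the to string is deleted) are choices made for this illustration rather than confirmed Teradata behavior.

```javascript
// Sketch only: buildMapping and translateWithMapping are hypothetical names.
function buildMapping(fromStr, toStr) {
  const toChars = Array.from(toStr);
  const mapping = new Map();
  Array.from(fromStr).forEach((ch, i) => {
    // Assumption: the first occurrence wins when fromStr repeats a character.
    if (!mapping.has(ch)) {
      // Assumption: a character with no counterpart in toStr maps to ''
      // and is therefore deleted from the output.
      mapping.set(ch, i < toChars.length ? toChars[i] : '');
    }
  });
  return mapping;
}

function translateWithMapping(text, fromStr, toStr) {
  const mapping = buildMapping(fromStr, toStr);
  return Array.from(text)
    .map((c) => (mapping.has(c) ? mapping.get(c) : c)) // unmapped chars pass through
    .join('');
}

console.log(translateWithMapping('hello world', 'ehlo', 'EH')); // HE wrd
console.log(translateWithMapping('banana', 'aa', 'xy'));        // bxnxnx
```

Building the table once and then doing a single pass over the text mirrors the LEFT JOIN shape of the SQL: one side is the text, the other is the from/to pairing.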
Update, following up on the "expanded expectations" for the BigQuery version of this function:
#standardSQL
CREATE TEMP FUNCTION OTRANSLATE(text STRING, from_string STRING, to_string STRING) AS ((
  SELECT STRING_AGG(IFNULL(y, a), '' ORDER BY pos)
  FROM UNNEST(SPLIT(text, '')) a WITH OFFSET pos
  LEFT JOIN (
    SELECT x, ARRAY_AGG(IFNULL(y, '') ORDER BY OFFSET LIMIT 1)[OFFSET(0)] y
    FROM UNNEST(SPLIT(from_string, '')) x WITH OFFSET
    LEFT JOIN UNNEST(SPLIT(to_string, '')) y WITH OFFSET
    USING(OFFSET)
    GROUP BY x
  )
  ON a = x
));
SELECT -- text, OTRANSLATE(text, 'ehlo', 'EHLO') as new_text
  OTRANSLATE("hello world", "", "EHLO") AS empty_from, -- 'hello world'
  OTRANSLATE("hello world", "hello world1", "EHLO") AS larger_from_than_source, -- 'EHLLL'
  OTRANSLATE("hello world", "ehlo", "EHLO") AS equal_size_from_to, -- 'HELLO wOrLd'
  OTRANSLATE("hello world", "ehlo", "EH") AS larger_size_from, -- 'HE wrd'
  OTRANSLATE("hello world", "ehlo", "EHLOPQ") AS larger_size_to, -- 'HELLO wOrLd'
  OTRANSLATE("hello world", "ehlo", "") AS empty_to; -- 'wrd'
Result:
Row  empty_from   larger_from_than_source  equal_size_from_to  larger_size_from  larger_size_to  empty_to
1    hello world  EHLLL                    HELLO wOrLd         HE wrd            HELLO wOrLd     wrd
Note: the Teradata version of this function is recursive, so the current implementation is not an exact implementation of Teradata's OTRANSLATE.
Usage Notes (from teradata documentation)
If the first character in from_string occurs in the source_string, all occurrences of it are replaced by the first character in to_string. This repeats for all characters in from_string and for all characters in from_string. The replacement is performed character-by-character, that is, the replacement of the second character is done on the string resulting from the replacement of the first character.
This can easily be implemented with a JS UDF, which I think is trivial, so I am not going in that direction :o)
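For anyone who does want to go the JS route, here is a minimal sketch of what the function body might look like, following the character-by-character wording of the usage notes quoted above. The name otranslateSequential is hypothetical, and deleting from characters that have no counterpart in the to string is an assumption taken from the expected results in this thread, not something checked against Teradata.

```javascript
// Sketch only: otranslateSequential is a hypothetical name. The body could be
// placed inside a BigQuery JavaScript UDF (CREATE TEMP FUNCTION ... LANGUAGE js).
function otranslateSequential(text, fromStr, toStr) {
  // BigQuery passes SQL NULL to JavaScript as null; propagate it.
  if (text === null || fromStr === null || toStr === null) return null;
  const toChars = Array.from(toStr);
  let result = text;
  Array.from(fromStr).forEach((ch, i) => {
    // Assumption: characters of fromStr beyond the end of toStr are removed.
    const replacement = i < toChars.length ? toChars[i] : '';
    // Each replacement runs on the output of the previous one.
    result = result.split(ch).join(replacement);
  });
  return result;
}

console.log(otranslateSequential('hello world', 'ehlo', 'EHLO')); // HELLO wOrLd
console.log(otranslateSequential('ab', 'ab', 'ba'));              // aa
```

The second call shows where sequential replacement and the positional lookup in the SQL versions disagree: replacing one character at a time turns 'ab' into 'aa', while a simultaneous mapping would give 'ba'. Which of the two Teradata actually returns for such inputs is worth checking before relying on either.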