Split and search comma separated column in Presto (AWS Athena)

Question

I have the following table my_table, where both columns are strings:

+------------+-------------+
|     user_id|        code |
+------------+-------------+
|      ABC123|  yyy,123,333|
|        John|  xxx,USA,555|
|      qwerty|  55A,AUS,666|
|      Thomas|  zzz,666,678|
+------------+-------------+

I need to fetch all user_ids that have yyy or 666 in the code column value. I tested the following query in an online MySQL simulator and it works fine, but it does not work in AWS Athena:

SELECT user_id FROM my_table WHERE CONCAT(",", code, ",") REGEXP ",(yyy|666),";

The result should be:

+------------+
|     user_id|
+------------+
|      ABC123|
|      qwerty|
|      Thomas|
+------------+

Answer 1

MySQL has a built-in function for this:

select t.*
from t
where find_in_set('666', code) > 0 or find_in_set('yyy', code) > 0;

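Although this function is available, I strongly recommend fixing your data model so that lists are not stored in strings. That is not the SQLish way to store things.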
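
As a rough illustration of what a normalized model could look like, the sketch below explodes each comma-separated string into one row per code, written in Presto/Athena syntax rather than MySQL. The inline VALUES data stands in for my_table, and the names user_codes and code_item are hypothetical, not from the question.

```sql
-- Stand-in for my_table, using the sample rows from the question.
WITH my_table AS (
    SELECT * FROM (
        VALUES
            ('ABC123', 'yyy,123,333'),
            ('John',   'xxx,USA,555'),
            ('qwerty', '55A,AUS,666'),
            ('Thomas', 'zzz,666,678')
    ) AS t (user_id, code)
),
-- Hypothetical normalized shape: one row per (user_id, code_item).
user_codes AS (
    SELECT user_id, code_item
    FROM my_table
    CROSS JOIN UNNEST(split(code, ',')) AS u (code_item)
)
-- With one code per row, the search becomes a plain equality filter.
SELECT DISTINCT user_id
FROM user_codes
WHERE code_item IN ('yyy', '666');
```

If the data were stored in this shape to begin with, the UNNEST step would not be needed at query time.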

Answer 2

Use regexp_like:

WHERE regexp_like(code, '(^|,)(xxx|yyy)(,|$)')

presto:default> SELECT regexp_like('yyy,123,333', '(^|,)(xxx|yyy)(,|$)');
 _col0
-------
 true
(1 row)

(Tested in Presto 322; also works in Athena.)

For a "more obviously correct" approach, I would recommend split + contains, although this may be less performant.
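
A minimal sketch of the split + contains idea, assuming the table and values from the question (the inline VALUES data only stands in for my_table so the query can be run on its own):

```sql
-- Stand-in for my_table, using the sample rows from the question.
WITH my_table AS (
    SELECT * FROM (
        VALUES
            ('ABC123', 'yyy,123,333'),
            ('John',   'xxx,USA,555'),
            ('qwerty', '55A,AUS,666'),
            ('Thomas', 'zzz,666,678')
    ) AS t (user_id, code)
)
SELECT user_id
FROM my_table
-- split() turns the string into an array; contains() tests exact membership,
-- so '666' matches only a whole element, never a substring of another code.
WHERE contains(split(code, ','), 'yyy')
   OR contains(split(code, ','), '666');
```

Because the comparison is on whole array elements, there are no regex anchors or special characters to get wrong, which is what makes this form easier to verify by eye.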

Answer 3

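You can use the regexp_like() function to flag the rows that satisfy the condition above. It returns a boolean value for each row, which you can then filter on with a WHERE clause.

Final query: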

WITH dataset AS (
    SELECT
        user_id,
        -- true when yyy or 666 appears as a whole comma-separated element
        regexp_like(code, '(^|,)(666|yyy)(,|$)') AS code_matches
    FROM my_table
)
SELECT user_id
FROM dataset
WHERE code_matches
