BigQuery mapping tables using LIKE with duplicate rows

Question

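Based on this question, I came up with a follow-up question that may end up with a completely different solution. That is why I am posting a new question instead of a comment.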

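I am referring to the solution posted by @jon-armstrong. After testing with different data, the problem remains that it does not work if there are duplicate rows in 'table1'. Of course this comes from the 'GROUP BY' statement, but without it the UPDATE query does not work and fails with the error message stated in my original question. It also does not work if I 'GROUP' by every value, or if I leave out the grouping as suggested here. I also had the idea of using 'PARTITION BY', but I get a syntax error in BigQuery.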

Both my 'table1' (the data) and my mapping table 'table2' may contain duplicates. So, to be very precise, this is my goal:

Table1 (data table)

textWithFoundItemInIt         | foundItem
-------------------------------------------
hallo Adam                    |  
Bert says hello               | 
Bert says byebye              | 
Want to find "Caesar"bdjehg   |
Want to find "Caesar"bdjehg   |
Want to find "Caesar"again    |
Want to find "Caesar"again and also Bert    | <== It is no problem, if only MAX()=Caesar or MIN()=Bert name is found. 
Want to find "CaesarCaesar"again and again | <== This is no problem, just finding one Caesar is enough

Table2 (mapping table)

mappingItem
------------
Adam
Bert
Caesar
Bert
Caesar
Adam

Expected result

textWithFoundItemInIt         | foundItem
--------------------------------------------
hallo Adam                    |  Adam
Bert says hello               |  Bert
Bert says byebye              |  Bert
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"again    |  Caesar
Want to find "Caesar"again and also Bert    | Caesar [or Bert]
Want to find "CaesarCaesar"again and again | Caesar

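It does not matter which Adam from Table2 is found and inserted into Table1; they are all the same. So it is fine if the first Adam is overwritten by the second Adam, or if the query stops searching as soon as one Adam is found.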

If I run Jon's 'SELECT' query, the result is:

textWithFoundItemInIt         | foundItem
--------------------------------------------
hallo Adam                    |  Adam
Bert says hello               |  Bert
Bert says byebye              |  Bert
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"again    |  Caesar
Want to find "Caesar"again and also Bert    | Caesar (if MAX() chosen)
Want to find "CaesarCaesar"again and again | Caesar

It (correctly) omits the second 'Want to find "Caesar"bdjehg' row, but unfortunately that is not what I need.

If it is easier, it would also be fine to return both names when two names appear in the same row:

textWithFoundItemInIt         | foundItem
---------------------------------------------
hallo Adam and Bert           |  Adam, Bert 
Bert says hello to Caesar     |  Bert, Caesar

or

textWithFoundItemInIt         | foundItem1      | foundItem2
---------------------------------------------------------------
hallo Adam and Bert           |  Adam           | Bert 
Bert says hello to Caesar     |  Bert           | Caesar

I hope this helps to understand my problem. In simple words: "It is just a mapping with multiple identical rows" ;-)

Thanks a lot :)

Answer 1

Consider the approach below:

select textWithFoundItemInIt, 
  regexp_extract(textWithFoundItemInIt, r'(?i)' || mappingItems) foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)

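Applied to the sample data in your question, it returns one row for every row of table1, duplicates included, with the first name matched in the text as foundItem.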
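
To try this without creating tables, the query can be run against inline copies of the sample data. The following is a minimal sketch in BigQuery Standard SQL: the two WITH clauses stand in for table1 and table2 from the question, and STRING_AGG(DISTINCT ...) is an addition here that only keeps the generated pattern short when table2 contains repeated names.

```sql
-- Inline stand-ins for the question's tables (sample data copied from the question).
WITH table1 AS (
  SELECT textWithFoundItemInIt
  FROM UNNEST([
    'hallo Adam',
    'Bert says hello',
    'Bert says byebye',
    'Want to find "Caesar"bdjehg',
    'Want to find "Caesar"bdjehg',
    'Want to find "Caesar"again',
    'Want to find "Caesar"again and also Bert',
    'Want to find "CaesarCaesar"again and again'
  ]) AS textWithFoundItemInIt
),
table2 AS (
  SELECT mappingItem
  FROM UNNEST(['Adam', 'Bert', 'Caesar', 'Bert', 'Caesar', 'Adam']) AS mappingItem
)
SELECT
  textWithFoundItemInIt,
  -- The pattern becomes something like (?i)Adam|Bert|Caesar; REGEXP_EXTRACT
  -- returns the first (leftmost) match, or NULL if no name occurs in the text.
  REGEXP_EXTRACT(textWithFoundItemInIt, r'(?i)' || mappingItems) AS foundItem
FROM table1,
  -- Single-row subquery: the cross join therefore never multiplies rows of table1.
  (SELECT STRING_AGG(DISTINCT mappingItem, '|') AS mappingItems FROM table2);
```

Because the mapping is collapsed into one row before the join, duplicates in table2 cannot multiply the rows of table1, and duplicate rows in table1 are simply returned as they are. Note that the mapping items are used as a regular expression, so names containing regex metacharacters would need escaping.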

Answer 2

The SELECT statement from @Mikhail works well. But when I put it into an UPDATE statement, I get the well-known error:

"UPDATE/MERGE must match at most one source row for each target row".

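The problem occurs because the SELECT statement correctly returns the duplicate rows, so a duplicated row in table1 matches more than one source row. A simple solution is SELECT DISTINCT: each distinct text then yields exactly one source row, and the error no longer occurs.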
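
For illustration, an UPDATE built this way might look like the sketch below. It assumes the table and column names from the question and that table1 and table2 resolve without a dataset prefix; in practice they may need to be qualified (for example as your_dataset.table1, a hypothetical name).

```sql
-- Sketch only: names are taken from the question, not from a real schema.
UPDATE table1 AS t
SET foundItem = s.foundItem
FROM (
  -- DISTINCT collapses duplicated texts, so every target row
  -- joins with at most one source row.
  SELECT DISTINCT
    textWithFoundItemInIt,
    REGEXP_EXTRACT(textWithFoundItemInIt, r'(?i)' || mappingItems) AS foundItem
  FROM table1,
    (SELECT STRING_AGG(DISTINCT mappingItem, '|') AS mappingItems FROM table2)
) AS s
WHERE t.textWithFoundItemInIt = s.textWithFoundItemInIt;
```

All copies of a duplicated row in table1 receive the same foundItem, because they all join to the same single source row. Rows whose text contains none of the names are set to NULL.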

If more than one match per row should be found, this query is helpful:

select textWithFoundItemInIt, 
ARRAY_TO_STRING(regexp_extract_all(textWithFoundItemInIt, r'(?i)' || mappingItems), " --- ") AS foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)
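
With the query above, a text such as 'Want to find "CaesarCaesar"again and again' would list Caesar twice, since REGEXP_EXTRACT_ALL returns every match. If each name should appear only once per row, one possible variation (a sketch, not part of the original answer) is to aggregate the matches with STRING_AGG(DISTINCT ...) instead of ARRAY_TO_STRING:

```sql
SELECT
  textWithFoundItemInIt,
  (
    -- Deduplicate the matches of this row before joining them into one string.
    -- Matches are compared as extracted, so 'caesar' and 'Caesar' count as different.
    SELECT STRING_AGG(DISTINCT item, ' --- ' ORDER BY item)
    FROM UNNEST(REGEXP_EXTRACT_ALL(textWithFoundItemInIt, r'(?i)' || mappingItems)) AS item
  ) AS foundItem
FROM table1,
  (SELECT STRING_AGG(DISTINCT mappingItem, '|') AS mappingItems FROM table2);
```

Rows without any match get NULL here, because the aggregate runs over an empty array.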

I hope my logic of using DISTINCT is sound and does not fail in some case. If anyone has comments, I would be glad to get feedback.
