Regexp in BigQuery
How can I search for values in BigQuery and group them, even when they are cluttered with stray semicolons and colons?
Sample data:
:Adidas
Adidas
Adidas;
null
adidas
7up
7UP
7UP;
:7UP
null
I want to group them and count them. This is the result I am after:
adidas 4
7up 4
null 2
At the moment GROUP BY does not help, so I do this work in Excel, where it goes quickly.
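First, you need to normalize the text so that only the valid word remains. The regular expression below is just a simple one; you will need to adapt and extend it to match your own logic.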
-- BigQuery legacy SQL: a comma between subqueries in FROM means UNION ALL.
SELECT normalized,
       count(1) AS c
FROM
  (SELECT label,
          -- Skip one optional leading punctuation character, capture the run of
          -- non-punctuation characters that follows, then lowercase the capture.
          lower(REGEXP_EXTRACT(label, r'[[:punct:]]?([[:^punct:]]*)')) AS normalized
   FROM
     (SELECT string(':Adidas') AS label),
     (SELECT string('Adidas') AS label),
     (SELECT string('Adidas;') AS label),
     (SELECT string(NULL) AS label),
     (SELECT string('adidas') AS label),
     (SELECT string('7up') AS label),
     (SELECT string('7UP') AS label),
     (SELECT string('7UP;') AS label),
     (SELECT string(':7UP') AS label),
     (SELECT string(NULL) AS label))
GROUP BY normalized
ORDER BY c DESC
This outputs:
+-----+------------+---+
| Row | normalized | c |
+-----+------------+---+
| 1   | adidas     | 4 |
| 2   | 7up        | 4 |
| 3   | null       | 2 |
+-----+------------+---+
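If you want to try the normalization rule outside BigQuery before extending it, the following is a minimal Python sketch of the same idea. It is not BigQuery code: Python's `re` module has no `[[:punct:]]` class, so the sketch assumes `string.punctuation` (ASCII punctuation) is a close enough stand-in, and the names `normalize` and `count_labels` are made up for this illustration.

```python
import re
import string
from collections import Counter

# Assumption: ASCII punctuation approximates the POSIX [[:punct:]] class
# used in the BigQuery pattern.
_PUNCT = re.escape(string.punctuation)

# One optional leading punctuation character, then capture the run of
# non-punctuation characters that follows.
_PATTERN = re.compile(rf"[{_PUNCT}]?([^{_PUNCT}]*)")


def normalize(label):
    """Return the lowercased core word of a label, or None for a missing label."""
    if label is None:
        return None
    # The pattern can match the empty string, so match() always succeeds here.
    return _PATTERN.match(label).group(1).lower()


def count_labels(labels):
    """Group labels by their normalized form and count each group."""
    return Counter(normalize(label) for label in labels)


labels = [":Adidas", "Adidas", "Adidas;", None, "adidas",
          "7up", "7UP", "7UP;", ":7UP", None]
counts = count_labels(labels)

for value, count in counts.most_common():
    print(value, count)
```

Like the query, this only strips a single leading punctuation character and stops at the first punctuation character after the word, so labels such as `::Adidas` or `Coca-Cola` would need a wider pattern.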