SQL/BigQuery text classification
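I need to implement a simple text classification using regular expressions. I tried a simple CASE WHEN statement, but I want every CASE branch to be evaluated rather than stopping at the first one whose condition is met.
For example: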
with `table` as(
SELECT 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
)
SELECT
CASE
WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI'
WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering'
WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning'
END as topic,
text
FROM `table`
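With this query the text is classified as AI, because that is the first condition that is met. It should instead be classified as AI, Engineering, and Deep Learning, either in an array or as 3 separate rows, because all 3 conditions are met.
How can I apply all of the regexes/conditions to classify the text?
One method is string concatenation: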
SELECT CONCAT(CASE WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI;' ELSE '' END,
CASE WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering;' ELSE '' END,
CASE WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning;' ELSE '' END
) as topics, text
FROM `table`;
In effect, this constructs a string. You can use similar logic to construct an array.
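A minimal sketch of the array variant, assuming BigQuery Standard SQL. The shortened sample text is only for illustration. Each condition yields either a topic name or NULL, and the NULLs are filtered out before the array is built:

```sql
WITH `table` AS (
  SELECT 'AI needs computational power for Deep Learning.' AS text
)
SELECT
  ARRAY(
    SELECT topic
    FROM UNNEST([
      -- One entry per rule: the topic name if the pattern matches, else NULL.
      IF(REGEXP_CONTAINS(text, r'(?i)ai'), 'AI', NULL),
      IF(REGEXP_CONTAINS(text, r'(?i)computational power'), 'Engineering', NULL),
      IF(REGEXP_CONTAINS(text, r'(?i)deep learning'), 'Deep Learning', NULL)
    ]) AS topic
    -- Drop the rules that did not match.
    WHERE topic IS NOT NULL
  ) AS topics,
  text
FROM `table`;
```

Wrapping the result in ARRAY_TO_STRING(..., ', ') would give a delimited string again, without the trailing separator that the CONCAT version leaves behind.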
The following is for BigQuery Standard SQL:
#standardSQL
select
array_to_string(array(select distinct lower(topic)
from unnest(regexp_extract_all(text, r'(?i)ai|computational power|deep learning')) topic
), ', ') topics,
text
from `table`
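If applied to the sample data in your question, the topics column contains the distinct lowercased matches: ai, computational power, deep learning.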
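A side note on the patterns used throughout this thread: an unanchored `(?i)ai` also matches the letters "ai" inside longer words. If only whole words should count, one option is to add word boundaries, as in the sketch below. The sample sentence is made up for illustration, and whether whole-word matching is wanted is an assumption about the use case.

```sql
WITH `table` AS (
  SELECT 'We maintain AI systems that need computational power.' AS text
)
SELECT
  -- Unanchored: 'ai' also matches inside "maintain".
  REGEXP_EXTRACT_ALL(text, r'(?i)ai|computational power|deep learning') AS loose_matches,
  -- \b requires a word boundary on both sides of the whole alternation.
  REGEXP_EXTRACT_ALL(text, r'(?i)\b(?:ai|computational power|deep learning)\b') AS whole_word_matches
FROM `table`;
```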
I feel the following is the most generic and reusable solution (BigQuery Standard SQL):
#standardSQL
with `table` as(
select 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
), classification as (
select 'ai' term, 'AI' topic union all
select 'computational power', 'Engineering' union all
select 'deep learning', 'Deep Learning'
), pattern as (
select r'(?i)' || string_agg(term, '|') as regexp_pattern
from classification
)
select
array_to_string(array(
select distinct topic
from unnest(regexp_extract_all(lower(text), regexp_pattern)) term
join classification using(term)
), ', ') topics,
text
from `table`, pattern
For the sample text, this yields the topics AI, Engineering, and Deep Learning.
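The question also mentions returning one row per topic instead of a single combined value. A minimal sketch of that variant, assuming the same kind of term-to-topic lookup table and a shortened sample text: cross join the text with the lookup and keep only the pairs where the term matches.

```sql
WITH `table` AS (
  SELECT 'AI needs computational power for Deep Learning.' AS text
), classification AS (
  SELECT 'ai' AS term, 'AI' AS topic UNION ALL
  SELECT 'computational power', 'Engineering' UNION ALL
  SELECT 'deep learning', 'Deep Learning'
)
SELECT c.topic, t.text
FROM `table` AS t
CROSS JOIN classification AS c
-- Keep only the (text, topic) pairs whose term occurs in the text, ignoring case.
WHERE REGEXP_CONTAINS(t.text, CONCAT(r'(?i)', c.term));
```

Each term is interpreted as a regular expression here, so any term containing regex metacharacters would need escaping before it goes into the lookup table.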