How to filter column values according to a list of strings

I am trying to filter a dataset by comparing a list of strings against the values in a column.

Using LIKE with a single string works fine and processes about 3 GB.

#standardSQL

SELECT substr(CAST((DATE) AS STRING), 0, 8) AS daydate,
  count(1) AS count,
  avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "") AS FLOAT64)) tone,
  avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') AS FLOAT64)) anew,
  sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') AS FLOAT64)) ridanxietycnt,
  sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') AS FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND V2Themes LIKE 'ECON_INFLATION'
GROUP BY daydate

However, as soon as I use LIKE with more than one string, the query suddenly processes far more data (8 TB).

#standardSQL

SELECT substr(CAST((DATE) AS STRING), 0, 8) AS daydate,
  count(1) AS count,
  avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "") AS FLOAT64)) tone,
  avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') AS FLOAT64)) anew,
  sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') AS FLOAT64)) ridanxietycnt,
  sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') AS FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND V2Themes LIKE 'ECON_INFLATION' OR V2Themes LIKE 'ECON_STOCKMARKET'
GROUP BY daydate

Is there a more efficient (and cheaper) way to compare column values against a list of strings? Any ideas would be much appreciated.

Watch out for the precedence of AND and OR.

This:

WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02') 
AND V2Themes LIKE 'ECON_INFLATION' 
OR V2Themes LIKE 'ECON_STOCKMARKET'

is not equivalent to:

WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02') 
AND (
  V2Themes LIKE 'ECON_INFLATION' 
  OR V2Themes LIKE 'ECON_STOCKMARKET'
)

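AND binds more tightly than OR, so the first is evaluated as (partition filter AND LIKE 'ECON_INFLATION') OR LIKE 'ECON_STOCKMARKET'. The OR branch is not restricted by _PARTITIONTIME, so the first query effectively has no partition filter and has to scan every partition, which would explain the jump to 8 TB. In the second, the partition filter applies to both LIKE conditions.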
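To see the difference without scanning the real table, here is a minimal sketch on four made-up rows. The CTE name `sample` and the columns `part_day` and `theme` are hypothetical stand-ins for `_PARTITIONTIME` and `V2Themes`, not part of the GDELT schema.

```sql
#standardSQL
-- Hypothetical toy data: two rows inside the wanted day, two outside it.
WITH sample AS (
  SELECT DATE '2019-02-02' AS part_day, 'ECON_INFLATION' AS theme UNION ALL
  SELECT DATE '2019-02-02', 'ECON_STOCKMARKET' UNION ALL
  SELECT DATE '2018-01-01', 'ECON_STOCKMARKET' UNION ALL
  SELECT DATE '2018-01-01', 'OTHER'
)
SELECT
  -- Parsed as (day AND inflation) OR stockmarket: the 2018 row slips through.
  COUNTIF(part_day = DATE '2019-02-02'
          AND theme LIKE 'ECON_INFLATION'
          OR theme LIKE 'ECON_STOCKMARKET') AS without_parens,
  -- The day filter now applies to both theme conditions.
  COUNTIF(part_day = DATE '2019-02-02'
          AND (theme LIKE 'ECON_INFLATION'
               OR theme LIKE 'ECON_STOCKMARKET')) AS with_parens
FROM sample
-- Result: without_parens = 3, with_parens = 2
```

The extra row counted by `without_parens` is the 2018 `ECON_STOCKMARKET` row, which is the toy equivalent of reading partitions outside the requested day.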