BigQuery: filter according to counts in nested field
I am trying to find records that mention "BE" or "Belgium" five or more times in a nested field.
The following query returns no results:
#standardSQL
SELECT
GKGRECORDID
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-10-09') AND TIMESTAMP('2019-10-09')
and (V2Themes LIKE "%WB_%GROWTH%")
group by GKGRECORDID
having count(V2Locations LIKE "%BE%" OR V2Locations LIKE "%Belgium%")>5
Any ideas would be much appreciated.
A few things here:
Instead of
COUNT(V2Locations LIKE "%BE%" OR V2Locations LIKE "%Belgium%")>5
you should use
COUNTIF(V2Locations LIKE "%BE%" OR V2Locations LIKE "%Belgium%")>5
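The reason is that COUNT counts every non-NULL value of its argument, and a LIKE comparison that evaluates to FALSE is still non-NULL, whereas COUNTIF counts only the rows where the condition is TRUE. A minimal sketch on made-up inline rows (the `sample` table, its `id` and `loc` columns, and the values are hypothetical, not taken from GDELT):

```sql
#standardSQL
-- Hypothetical inline rows, used only to compare the two aggregates.
WITH sample AS (
  SELECT 'a' AS id, 'Belgium#BE' AS loc UNION ALL
  SELECT 'a', 'France#FR' UNION ALL
  SELECT 'a', 'Germany#GM'
)
SELECT
  id,
  -- Counts all three rows: the boolean is never NULL here.
  COUNT(loc LIKE '%BE%') AS count_result,
  -- Counts only the row where the condition is TRUE.
  COUNTIF(loc LIKE '%BE%') AS countif_result
FROM sample
GROUP BY id
```

Here count_result comes out as 3 and countif_result as 1, since only the first row contains "BE".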
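Even with the fix above, you still won't get the result you expect: you are querying only one partition, and within that partition the maximum number of rows sharing the same GKGRECORDID is just 2, so there is clearly no way to output a GKGRECORDID with a count above 5.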
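One way to check this yourself is to list the largest per-GKGRECORDID row counts in the partition. This is a sketch that assumes the same partition and theme filter as the query in the question; the `row_count` alias is my own:

```sql
#standardSQL
-- Rows per GKGRECORDID, largest first. If the top value is below the
-- HAVING threshold, grouping by GKGRECORDID can never return anything.
SELECT
  GKGRECORDID,
  COUNT(*) AS row_count
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-10-09') AND TIMESTAMP('2019-10-09')
  AND V2Themes LIKE '%WB_%GROWTH%'
GROUP BY GKGRECORDID
ORDER BY row_count DESC
LIMIT 10
```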
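If my understanding of your data is correct, you are trying to count the occurrences of "BE" or "Belgium" within each V2Locations record. So in the following example, the count should be 4?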
1#Russia#RS#RS##60#100#RS#2475;1#Venezuela#VE#VE##8#-66#VE#471;1#Venezuela#VE#VE##8#-66#VE#1435;1#Venezuela#VE#VE##8#-66#VE#1521;1#Venezuela#VE#VE##8#-66#VE#2409;1#Russian#RS#RS##60#100#RS#2440;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#673;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#2342;4#Quito, Pichincha,
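If that is correct, one possible workaround is the one explained here. Adapting that solution to your needs (counting words instead of characters), I suggest using the SPLIT method to divide the string on a given delimiter and comparing the number of elements with and without the string you are searching for. This would be one solution to your problem: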
#standardSQL
SELECT
  GKGRECORDID,
  -- Removing each '#BE' or '#Belgium' also removes one '#' delimiter, so the
  -- drop in the number of SPLIT elements equals the number of occurrences.
  (ARRAY_LENGTH(SPLIT(V2Locations, '#'))
    - ARRAY_LENGTH(SPLIT(REPLACE(V2Locations, '#BE', ''), '#')))
  + (ARRAY_LENGTH(SPLIT(V2Locations, '#'))
    - ARRAY_LENGTH(SPLIT(REPLACE(V2Locations, '#Belgium', ''), '#'))) AS bel_num,
  V2Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-10-09') AND TIMESTAMP('2019-10-09')
  AND (V2Themes LIKE "%WB_%GROWTH%")
GROUP BY GKGRECORDID, V2Locations
-- Keep records with five or more mentions, as asked in the question.
HAVING bel_num >= 5
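Note that the pattern '#BE' also matches region codes such as '#BE11', so in the sample above each Belgian entry contributes 2 to bel_num. If what you want is the number of location entries whose country is Belgium, a different approach (not the one from the linked answer) is to split the field on ';' into entries and check the country code of each one. This is only a sketch: it assumes the country code is always the third '#'-separated field, which I inferred from the sample above, and the `sample` table and the `r1` id are made up:

```sql
#standardSQL
-- Hypothetical one-row table holding a shortened copy of the sample string.
WITH sample AS (
  SELECT
    'r1' AS GKGRECORDID,
    '1#Russia#RS#RS##60#100#RS#2475;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#673;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#2342' AS V2Locations
)
SELECT
  GKGRECORDID,
  (
    -- One row per ';'-separated location entry; count the entries whose
    -- third '#'-separated field (assumed to be the country code) is 'BE'.
    SELECT COUNTIF(SPLIT(loc, '#')[SAFE_OFFSET(2)] = 'BE')
    FROM UNNEST(SPLIT(V2Locations, ';')) AS loc
  ) AS bel_num
FROM sample
```

For this row bel_num is 2. Against the real table, replace `sample` with the partitioned table and its filters, then wrap the query in an outer SELECT and keep the rows where bel_num >= 5.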