bigquery standard sql = 从字符串中提取数据
bigquery standard sql = extracting data from strings
我希望提取指定字母后的字符串部分,例如从以下字符串中提取:
wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4
我想提取 275(用于 wc)、2(用于 c1.3)和 25(用于 c12.1)。
我尝试了以下但字段重新显示,ridanxietycnt 和 wordcount 只显示 "NULL"。
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "")AS FLOAT64)) tone,
avg(CAST(REGEXP_EXTRACT(GCAM, r'c1.3:([-d.]+)')AS FLOAT64)) anew,
sum(CAST(REGEXP_EXTRACT(GCAM, r'c12.1:([-d.]+)')AS FLOAT64)) ridanxietycnt,
sum(CAST(REGEXP_EXTRACT(GCAM, r'wc:(d+)')AS FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
group by daydate
我希望看到每一列的汇总数字。
我想知道问题是否与正则表达式有关?
这应该让你得到 2(重新)、25(ridanxietycnt)、275(字数),按此顺序
SELECT
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c1.3:(\d+)') as FLOAT64) anew,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c12.1:(\d+)') as FLOAT64) ridanxietycnt,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'wc:(\d+)') as FLOAT64) wordcount
您可以将它们视为简单的 key/value 对,然后在其之上进行聚合。类似于:
select
substr(CAST((DATE) AS STRING),0,8) as daydate,
split(x,':')[safe_offset(0)] as key,
cast(split(x,':')[safe_offset(1)] as float64) as value
from `gdelt-bq.gdeltv2.gkg_partitioned`,
unnest(split(GCAM, ',')) as x
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
希望对您有所帮助。
我希望提取指定字母后的字符串部分,例如从以下字符串中提取:
wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4
我想提取 275(用于 wc)、2(用于 c1.3)和 25(用于 c12.1)。
我尝试了以下但字段重新显示,ridanxietycnt 和 wordcount 只显示 "NULL"。
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "")AS FLOAT64)) tone,
avg(CAST(REGEXP_EXTRACT(GCAM, r'c1.3:([-d.]+)')AS FLOAT64)) anew,
sum(CAST(REGEXP_EXTRACT(GCAM, r'c12.1:([-d.]+)')AS FLOAT64)) ridanxietycnt,
sum(CAST(REGEXP_EXTRACT(GCAM, r'wc:(d+)')AS FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
group by daydate
我希望看到每一列的汇总数字。
我想知道问题是否与正则表达式有关?
这应该让你得到 2(重新)、25(ridanxietycnt)、275(字数),按此顺序
SELECT
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c1.3:(\d+)') as FLOAT64) anew,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c12.1:(\d+)') as FLOAT64) ridanxietycnt,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'wc:(\d+)') as FLOAT64) wordcount
您可以将它们视为简单的 key/value 对,然后在其之上进行聚合。类似于:
select
substr(CAST((DATE) AS STRING),0,8) as daydate,
split(x,':')[safe_offset(0)] as key,
cast(split(x,':')[safe_offset(1)] as float64) as value
from `gdelt-bq.gdeltv2.gkg_partitioned`,
unnest(split(GCAM, ',')) as x
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
希望对您有所帮助。