BigQuery - replicate rows with modified values

Question

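The title of this post may not accurately describe what I am trying to do. I have a BigQuery table with a userId column and a bunch of feature columns. Say the table looks like this.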

_____________________________
|userId| col1 | col2 | col3 |
-------|------|------|-------
|u1    | 0.3  | 0.0  | 0.0  |
|u2    | 0.0  | 0.1  | 0.6  |
-----------------------------

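Each row has a userId (userIds may or may not be distinct across rows) followed by some feature values. Most of the values are 0, with only a few exceptions.

Now, for each row, I want to create additional rows in which exactly one non-zero feature is replaced with 0. For the example above, the resulting table would look like this.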

_____________________________
|userId| col1 | col2 | col3 |
-------|------|------|-------
|u1    | 0.3  | 0.0  | 0.0  |
|u1    | 0.0* | 0.0  | 0.0  |
|u2    | 0.0  | 0.1  | 0.6  |
|u2    | 0.0  | 0.0* | 0.6  |
|u2    | 0.0  | 0.1  | 0.0* |
-----------------------------

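The starred values mark the columns where a non-zero value was set to 0. Since u1 has one non-zero feature, only one extra row is added, with col1 set to 0. u2 has two non-zero columns (col2 and col3), so two extra rows are added: one with col2 set to 0 and another with col3 set to 0.

The table has around 2000 columns and over 20 million rows.

Normally I would post a rough attempt of my own. In this case, though, I don't even know where to start. I did have one odd idea: join this table with an unpivoted version of itself. However, I don't know how to unpivot a BQ table.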

Answer 1

One approach is brute force:

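-- original rows, unchanged
select userid, col1, col2, col3
from t
union all
-- one extra row for each row where col1 is non-zero, with col1 zeroed
select userid, 0 as col1, col2, col3
from t
where col1 <> 0
union all
-- same for col2
select userid, col1, 0 as col2, col3
from t
where col2 <> 0
union all
-- same for col3
select userid, col1, col2, 0 as col3
from t
where col3 <> 0;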

This is verbose, and with hundreds of columns it gets unwieldy. I can't think of a simpler way.
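
To check the brute-force pattern on a small case, here is a minimal sketch that inlines the two sample rows from the question as a CTE. The CTE named t and the lowercase column name userid are assumptions standing in for the real table.

```sql
-- Hypothetical stand-in for the real table: the two sample rows from the question.
with t as (
  select 'u1' as userid, 0.3 as col1, 0.0 as col2, 0.0 as col3 union all
  select 'u2', 0.0, 0.1, 0.6
)
-- Original rows, unchanged.
select userid, col1, col2, col3 from t
union all
-- One extra row per row with a non-zero col1, with col1 zeroed.
select userid, 0.0, col2, col3 from t where col1 <> 0
union all
-- Same for col2.
select userid, col1, 0.0, col3 from t where col2 <> 0
union all
-- Same for col3.
select userid, col1, col2, 0.0 from t where col3 <> 0
order by userid;
```

This should return the five rows from the expected table in the question: two for u1 and three for u2. The order of rows within each userid is not guaranteed.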

Answer 2

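Below is for BigQuery Standard SQL.

It is generic enough: you don't need to specify column names or repeat the same chunk of code 2000 times!

Assuming your initial data is in the project.dataset.table table: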

#standardSQL
create temp table flatten as 
with temp as (
  select userid, offset, 
    split(col_kv, ':')[offset(0)] as col,
    cast(split(col_kv, ':')[offset(1)] as float64) as val
  from `project.dataset.table` t,
  unnest(split(translate(to_json_string(t), '{}"', ''))) col_kv with offset
  where split(col_kv, ':')[offset(0)] != 'userid'
), numbers as (
  select * from unnest((
  select generate_array(1, max(offset))
  from temp)) as grp
), targets as (
  select userid, grp from temp, numbers  
  where grp = offset and val != 0
), flatten_result as (
  select *, 0 as grp from temp union all
  select userid, offset, col, if(offset = grp, 0, val) as val, grp
  from temp left join targets using(userid)   
)
select * from flatten_result;

execute immediate '''create temp table pivot as 
select userid, ''' || (
  select string_agg(distinct "max(if(col = '" || col || "', val, null)) as " || col)
  from flatten
) || ''' from flatten group by userid, grp''';

select * from pivot order by userid;

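Your final output is in the temp table pivot.

If you apply the above to the sample data from your question, the script produces one result per statement, and the contents of the pivot table appear under the last VIEW RESULTS link.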
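
To see what the flattening step of the script produces before any rows are zeroed out, the first CTE can be run on its own against inlined sample data. This is only a sketch: the CTE named t is an assumption replacing project.dataset.table, and the column is assumed to be named userid in lowercase so that it matches the filter.

```sql
-- Hypothetical stand-in for project.dataset.table: the sample rows from the question.
with t as (
  select 'u1' as userid, 0.3 as col1, 0.0 as col2, 0.0 as col3 union all
  select 'u2', 0.0, 0.1, 0.6
)
select
  userid,
  offset,                                            -- position of the key within the row's JSON
  split(col_kv, ':')[offset(0)] as col,              -- column name
  cast(split(col_kv, ':')[offset(1)] as float64) as val  -- column value
from t,
  -- to_json_string(t) serializes the whole row; translate() strips braces and quotes,
  -- and split() breaks the remainder into key:value pairs on commas.
  unnest(split(translate(to_json_string(t), '{}"', ''))) as col_kv with offset
where split(col_kv, ':')[offset(0)] != 'userid'
order by userid, offset;
```

Each input row becomes one output row per feature column. Because the pairs are split on commas and colons, this approach presumably relies on userid values that contain neither character.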

Tags: sql, join, self-join, google-bigquery