Find every combination of 2 for repeated values on BigQuery

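I posted about this before and edited that post, but I wasn't clear enough, so here is a better example. I have a table in BigQuery with several columns; I'll list the relevant ones.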

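My goal is to find every possible combination of length 2 of markers.code and, for each combination, the total number of times it was seen and the number of unique devices it was seen on, per day, version, and device_type.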

Schema:

device_timestamp TIMESTAMP NULLABLE
, mac_address STRING NULLABLE
, version STRING NULLABLE
, device_type STRING NULLABLE
, markers RECORD REPEATED [{code STRING NULLABLE, value STRING NULLABLE}]

Sample data:

2020-01-01 00:00:15, "abcdefgh", "1.01", "android", {"power_off": 2, "buffer_error": 1, "out_of_memory": 1}
2020-01-01 00:00:25, "zasdpqld", "1.01", "android", {"failed_state": 5, "load_error": 2, "power_off": 1, "buffer_error": 1}
2020-01-01 00:53:13, "apelddsa", "1.02", "android", {"black_screen": 1, "kernel_crash": 1, "power_off": 1}

Desired output schema:

date DATE
, version STRING
, device_type STRING
, target_marker STRING
, secondary_marker STRING
, total_seen INT64
, unique_devices INT64

Sample desired output:

2020-01-01, "1.01", "android", "power_off", "buffer_error", 2, 2
2020-01-01, "1.01", "android", "power_off", "out_of_memory", 1, 1
2020-01-01, "1.01", "android", "buffer_error", "power_off", 2, 2
2020-01-01, "1.01", "android", "buffer_error", "out_of_memory", 1, 1 
2020-01-01, "1.01", "android", "out_of_memory", "power_off", 1, 1
2020-01-01, "1.01", "android", "out_of_memory", "buffer_error", 1, 1

2020-01-01, "1.01", "android", "power_off", "failed_state", 1, 1
2020-01-01, "1.01", "android", "power_off", "load_error", 1, 1
2020-01-01, "1.01", "android", "failed_state", "load_error", 1, 1
2020-01-01, "1.01", "android", "failed_state", "power_off", 1, 1
2020-01-01, "1.01", "android", "load_error", "power_off", 1, 1

2020-01-01, "1.02", "android", "black_screen", "kernel_crash", 1, 1
2020-01-01, "1.02", "android", "black_screen", "power_off", 1, 1
2020-01-01, "1.02", "android", "kernel_crash", "black_screen", 1, 1
2020-01-01, "1.02", "android", "kernel_crash", "power_off", 1, 1
2020-01-01, "1.02", "android", "power_off", "black_screen", 1, 1
2020-01-01, "1.02", "android", "power_off", "kernel_crash", 1, 1

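The above is a simple example that makes the complexity of the problem easy to see. The real dataset has many versions, MAC addresses, device types, and many combinations of marker codes. total_seen would be COUNT(*) and unique_devices would be COUNT(DISTINCT mac_address), grouped by date, version, device_type, target_marker, secondary_marker.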

I hope this makes sense; please leave a comment if more information is needed to complete this question.

Thanks!

Consider the following:

with flatten_data as (
  -- one row per (source row, marker); entry is the whole source row rendered as a string, used as a row key
  select date(device_timestamp) date, mac_address, version, device_type, code, value, format('%t', t) as entry
  from your_table t, t.markers
)
select date, version, device_type,
  t1.code as target_marker,
  t2.code as secondary_marker,
  count(*) as total_seen,
  count(distinct t2.mac_address) as unique_devices
from flatten_data t1
join flatten_data t2
using(entry, date, version, device_type)  -- pair markers that came from the same source row
where t1.code != t2.code                  -- drop a marker paired with itself
group by date, version, device_type, target_marker, secondary_marker

If applied to the sample data in your question, the output is (showing only the first few rows)
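
To try this without a real table, here is a minimal sketch that inlines the three sample rows from the question as a CTE. The name sample_table is hypothetical, and the marker values are stored as strings to match the schema. It also uses a slightly different technique from the query above: rather than a self-join on a row key, it cross-joins each row's markers array with itself, which should pair codes within the same row in the same way.

```sql
-- Hypothetical inline copy of the question's sample rows (sample_table is an assumed name).
with sample_table as (
  select timestamp '2020-01-01 00:00:15' as device_timestamp,
         'abcdefgh' as mac_address, '1.01' as version, 'android' as device_type,
         [struct('power_off' as code, '2' as value),
          struct('buffer_error' as code, '1' as value),
          struct('out_of_memory' as code, '1' as value)] as markers
  union all
  select timestamp '2020-01-01 00:00:25', 'zasdpqld', '1.01', 'android',
         [struct('failed_state' as code, '5' as value),
          struct('load_error' as code, '2' as value),
          struct('power_off' as code, '1' as value),
          struct('buffer_error' as code, '1' as value)]
  union all
  select timestamp '2020-01-01 00:53:13', 'apelddsa', '1.02', 'android',
         [struct('black_screen' as code, '1' as value),
          struct('kernel_crash' as code, '1' as value),
          struct('power_off' as code, '1' as value)]
)
select date(device_timestamp) as date, version, device_type,
  m1.code as target_marker,
  m2.code as secondary_marker,
  count(*) as total_seen,
  count(distinct mac_address) as unique_devices
-- each comma join unnests the markers array of the current row,
-- so m1 and m2 always come from the same source row
from sample_table t, t.markers m1, t.markers m2
where m1.code != m2.code  -- skip a marker paired with itself
group by date, version, device_type, target_marker, secondary_marker
order by version, target_marker, secondary_marker
```

On these three rows, the power_off / buffer_error pair for version 1.01 should come out with total_seen 2 and unique_devices 2, since both 1.01 devices reported both codes; that matches the first row of the desired output in the question.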