Find Every Combination of 2 for Repeated Value on BigQuery

Question

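I posted about this before, but I was not clear enough, so I am providing a better example in this post. I have a table in BigQuery with several columns; I will list the relevant ones.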

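My goal is to find all possible combinations of length 2 for markers.code, together with the total number of times each combination was seen and the number of unique devices it was seen on, per day, version, and device_type.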

Schema:

device_timestamp TIMESTAMP NULLABLE
, mac_address STRING NULLABLE
, version STRING NULLABLE
, device_type STRING NULLABLE
, markers RECORD REPEATED [{code STRING NULLABLE, value STRING NULLABLE}]

Sample data:

2020-01-01 00:00:15, "abcdefgh", "1.01", "android", {"power_off": 2, "buffer_error": 1, "out_of_memory": 1}
2020-01-01 00:00:25, "zasdpqld", "1.01", "android", {"failed_state": 5, "load_error": 2, "power_off": 1, "buffer_error": 1}
2020-01-01 00:53:13, "apelddsa", "1.02", "android", {"black_screen": 1, "kernel_crash": 1, "power_off": 1}

Desired output schema:

date DATE
, version STRING
, device_type STRING
, target_marker STRING
, secondary_marker STRING
, total_seen INT64
, unique_devices INT64

Example desired output:

2020-01-01, "1.01", "android", "power_off", "buffer_error", 2, 2
2020-01-01, "1.01", "android", "power_off", "out_of_memory", 1, 1
2020-01-01, "1.01", "android", "buffer_error", "power_off", 2, 2
2020-01-01, "1.01", "android", "buffer_error", "out_of_memory", 1, 1 
2020-01-01, "1.01", "android", "out_of_memory", "power_off", 1, 1
2020-01-01, "1.01", "android", "out_of_memory", "buffer_error", 1, 1

2020-01-01, "1.01", "android", "power_off", "failed_state", 1, 1
2020-01-01, "1.01", "android", "power_off", "load_error", 1, 1
2020-01-01, "1.01", "android", "failed_state", "load_error", 1, 1
2020-01-01, "1.01", "android", "failed_state", "power_off", 1, 1
2020-01-01, "1.01", "android", "load_error", "power_off", 1, 1

2020-01-01, "1.02", "android", "black_screen", "kernel_crash", 1, 1
2020-01-01, "1.02", "android", "black_screen", "power_off", 1, 1
2020-01-01, "1.02", "android", "kernel_crash", "black_screen", 1, 1
2020-01-01, "1.02", "android", "kernel_crash", "power_off", 1, 1
2020-01-01, "1.02", "android", "power_off", "black_screen", 1, 1
2020-01-01, "1.02", "android", "power_off", "kernel_crash", 1, 1

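The above is a simple example that makes the complexity of the problem easy to see. The real dataset will have many versions, MAC addresses, device types, and many combinations of marker codes. total_seen would be COUNT(*), unique_devices would be COUNT(DISTINCT mac_address), and both would be grouped by date, version, device_type, target_marker, secondary_marker.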

I hope this makes sense; if more information is needed to answer this question, please leave a comment.

Thanks!

Answer 1

Consider the following:

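with flatten_data as (
  -- one output row per (source row, marker); entry is the whole source row
  -- serialized to a string, used below as a key to match markers of the same row
  select date(device_timestamp) date, mac_address, version, device_type, code, value, format('%t', t) as entry
  from your_table t, t.markers
)
select date, version, device_type,
  t1.code as target_marker,
  t2.code as secondary_marker,
  count(*) as total_seen,
  count(distinct t2.mac_address) as unique_devices
from flatten_data t1
-- self-join: pair every marker with every other marker from the same source row
join flatten_data t2
using(entry, date, version, device_type)
-- drop self-pairs; both orderings (a, b) and (b, a) are kept
where t1.code != t2.code
group by date, version, device_type, target_marker, secondary_marker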

If applied to the sample data in your question, the output is (showing only the first few rows)
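To check the result without access to the real table, here is a minimal sketch that inlines the three sample rows from the question as a CTE. It is an alternative formulation, not the query above: instead of self-joining on the serialized row, it unnests `markers` twice within the same row. One possible reason to prefer it is that two byte-identical source rows would share the same `format('%t', t)` key and be paired with each other in the self-join, whereas unnesting twice always stays within a single row. The CTE name `your_table` is the same placeholder as above, and marker values are assumed to be strings, as the schema states.

```sql
-- Sample rows from the question, inlined so the query runs without a real table.
-- Assumption: marker values are stored as STRING, per the schema in the question.
with your_table as (
  select timestamp '2020-01-01 00:00:15' as device_timestamp,
    'abcdefgh' as mac_address, '1.01' as version, 'android' as device_type,
    array<struct<code string, value string>>[
      ('power_off', '2'), ('buffer_error', '1'), ('out_of_memory', '1')] as markers
  union all
  select timestamp '2020-01-01 00:00:25', 'zasdpqld', '1.01', 'android',
    array<struct<code string, value string>>[
      ('failed_state', '5'), ('load_error', '2'), ('power_off', '1'), ('buffer_error', '1')]
  union all
  select timestamp '2020-01-01 00:53:13', 'apelddsa', '1.02', 'android',
    array<struct<code string, value string>>[
      ('black_screen', '1'), ('kernel_crash', '1'), ('power_off', '1')]
)
select
  date(t.device_timestamp) as date,
  t.version,
  t.device_type,
  m1.code as target_marker,
  m2.code as secondary_marker,
  count(*) as total_seen,
  count(distinct t.mac_address) as unique_devices
from your_table t,
  t.markers m1,  -- first unnest: target marker taken from this row
  t.markers m2   -- second unnest: secondary marker taken from the same row
where m1.code != m2.code  -- drop self-pairs; both orderings of a pair are kept
group by date, version, device_type, target_marker, secondary_marker
order by date, version, target_marker, secondary_marker
```

On this sample, the ("power_off", "buffer_error") pair for version "1.01" comes out with total_seen = 2 and unique_devices = 2, since both "1.01" rows contain both codes. The query also returns pairs such as ("buffer_error", "failed_state") that the example output in the question does not list.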

