BigQuery: Retrieve rows that are unique across two columns, otherwise row with largest third column
I have a BigQuery table, my_table, that looks like this:
+---------+---------+-------+------------------+----------+--------+-----+--------+
| poll_id | user_id | count | timestamp | timezone | answer | age | gender |
+---------+---------+-------+------------------+----------+--------+-----+--------+
| 1 | 1 | 5 | 2019-08-06 11:00 | 1 | no | 25 | male |
| 1 | 1 | 10 | 2019-08-06 10:00 | 1 | no | 25 | male |
| 1 | 1 | 10 | 2019-08-06 10:30 | 1 | yes | 25 | male |
| 1 | 2 | 10 | 2019-08-06 11:00 | 1 | no | 35 | male |
| 1 | 2 | 20 | 2019-08-06 11:00 | 1 | no | 35 | male |
| 1 | 2 | 35 | 2019-08-06 11:00 | 1 | NULL | 35 | male |
| 2 | 1 | 10 | 2019-08-06 10:35 | 1 | no | 25 | male |
| 3 | 1 | 10 | 2019-08-06 10:35 | 1 | NULL | 25 | male |
+---------+---------+-------+------------------+----------+--------+-----+--------+
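I want to retrieve the rows that satisfy the following requirements:
- If a row has a unique combination of poll_id and user_id, include it if it has a non-NULL value in answer.
- If a row does not have a unique combination of poll_id and user_id:
  - Include the row that has the largest count and is not NULL in the answer column.
  - If two rows have the same count (and a non-NULL answer), include the one with the largest timestamp.
I would also like to be able to restrict the search to a specific date and timezone, e.g. date 2019-08-06 and timezone 1, and I do not want to retrieve rows that have a NULL value in user_id.
So far, I have tried the following standard SQL statement: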
SELECT
  t1.poll_id,
  t1.user_id,
  t1.count,
  t1.timestamp,
  t1.timezone,
  t1.answer,
  t1.age,
  t1.gender,
FROM
  `my_table` t1
LEFT JOIN
  `my_table` t2
ON
  t1.poll_id = t2.poll_id
  AND t1.user_id = t2.user_id
  AND t1.count < t2.count
  AND t2.answer IS NOT NULL
  AND DATE(t2.timestamp, "+1:00") = "2019-08-06"
WHERE
  t1.user_id IS NOT NULL
  AND t1.answer IS NOT NULL
  AND DATE(t1.timestamp, "+1:00") = "2019-08-06"
  AND t1.timezone = 1
  AND t2.count IS NULL
The expected result for the table shown is:
+---------+---------+-------+------------------+----------+--------+-----+--------+
| poll_id | user_id | count | timestamp | timezone | answer | age | gender |
+---------+---------+-------+------------------+----------+--------+-----+--------+
| 1 | 1 | 10 | 2019-08-06 10:30 | 1 | yes | 25 | male | // count = 10 and largest timestamp
| 1 | 2 | 20 | 2019-08-06 11:00 | 1 | no | 35 | male | // count = 20 (the 35 row had NULL in 'answer')
| 2 | 1 | 10 | 2019-08-06 10:35 | 1 | no | 25 | male | // unique 'poll_id', 'user_id' combination
+---------+---------+-------+------------------+----------+--------+-----+--------+
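However, there seem to be two problems:
- If multiple rows share the same (largest) count value, all of them are retrieved. In this example, that means both row 2 and row 3 are retrieved.
- If a poll_id, user_id combination has exactly two rows, neither row is retrieved, even if they have different count values.
At least that is how it appears. I am having a hard time tracking down the problem, and of course also working out the correct query.
Any help is greatly appreciated.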
For this type of query, row_number() is usually appropriate. I think this fits your description:
select t.*
from (select t.*,
             row_number() over (partition by poll_id, user_id order by count desc, timestamp desc) as seqnum
      from my_table t
      where answer is not null
     ) t
where seqnum = 1;
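As an aside, BigQuery also offers a QUALIFY clause that filters directly on a window function, so the same ranking can be written without the wrapping subquery. The following is a minimal sketch, assuming the table and column names from the question; it is an alternative spelling of the row_number() approach above, not something the original answer used:

```sql
#standardSQL
-- Sketch: keep one row per (poll_id, user_id), preferring the largest count
-- and breaking ties on the latest timestamp.
SELECT *
FROM my_table
WHERE answer IS NOT NULL
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY poll_id, user_id
  ORDER BY count DESC, timestamp DESC
) = 1
```

Because the window function appears only inside QUALIFY, no helper column such as seqnum ends up in the output. The user_id, timezone, and date conditions from the question can be added to the WHERE clause in the same way.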
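The following is for BigQuery Standard SQL:
#standardSQL
SELECT * EXCEPT(pos)
FROM (
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY poll_id, user_id ORDER BY count DESC, timestamp DESC) AS pos
  FROM `project.dataset.table`
  WHERE NOT answer IS NULL
  AND NOT user_id IS NULL
  AND timezone = 1
  -- SUBSTR works only if timestamp is stored as a STRING;
  -- for a TIMESTAMP column, use DATE(timestamp, "+1:00") = "2019-08-06" as in the question
  AND SUBSTR(timestamp, 1, 10) = '2019-08-06'
)
WHERE pos = 1
If applied to the sample data from your question, the result is: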
Row  poll_id  user_id  count  timestamp         timezone  answer  age  gender
1    1        1        10     2019-08-06 10:30  1         yes     25   male
2    1        2        20     2019-08-06 11:00  1         no      35   male
3    2        1        10     2019-08-06 10:35  1         no      25   male
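To try this without access to the real table, the sample rows from the question can be inlined in a WITH clause. The sketch below is only an illustration: it assumes that timestamp is a TIMESTAMP column holding UTC values (the question does not state the column type) and therefore filters with DATE(timestamp, '+1:00') as in the question's own attempt, rather than with SUBSTR.

```sql
#standardSQL
-- Sample rows copied from the question. The TIMESTAMP type and UTC values are assumptions.
WITH my_table AS (
  SELECT 1 AS poll_id, 1 AS user_id, 5 AS count, TIMESTAMP '2019-08-06 11:00:00' AS timestamp,
         1 AS timezone, 'no' AS answer, 25 AS age, 'male' AS gender UNION ALL
  SELECT 1, 1, 10, TIMESTAMP '2019-08-06 10:00:00', 1, 'no', 25, 'male' UNION ALL
  SELECT 1, 1, 10, TIMESTAMP '2019-08-06 10:30:00', 1, 'yes', 25, 'male' UNION ALL
  SELECT 1, 2, 10, TIMESTAMP '2019-08-06 11:00:00', 1, 'no', 35, 'male' UNION ALL
  SELECT 1, 2, 20, TIMESTAMP '2019-08-06 11:00:00', 1, 'no', 35, 'male' UNION ALL
  SELECT 1, 2, 35, TIMESTAMP '2019-08-06 11:00:00', 1, CAST(NULL AS STRING), 35, 'male' UNION ALL
  SELECT 2, 1, 10, TIMESTAMP '2019-08-06 10:35:00', 1, 'no', 25, 'male' UNION ALL
  SELECT 3, 1, 10, TIMESTAMP '2019-08-06 10:35:00', 1, CAST(NULL AS STRING), 25, 'male'
)
SELECT * EXCEPT(pos)
FROM (
  SELECT *,
    -- Rank rows within each (poll_id, user_id) group: largest count first,
    -- latest timestamp as the tie-breaker.
    ROW_NUMBER() OVER(PARTITION BY poll_id, user_id ORDER BY count DESC, timestamp DESC) AS pos
  FROM my_table
  WHERE answer IS NOT NULL
    AND user_id IS NOT NULL
    AND timezone = 1
    AND DATE(timestamp, '+1:00') = DATE '2019-08-06'
)
WHERE pos = 1
ORDER BY poll_id, user_id
```

The NULL-answer rows are filtered out before ranking, which is why the count = 35 row never competes with the count = 20 row, and why the (3, 1) combination produces no output row at all.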