How to avoid repeat occurrences in a column
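I have a table that describes agents who book tickets for different customers.
The data below describes the bookings for one customer.
Based on the above data, the output I expect is
The output means that I first want to group consecutive bookings: the agent booked some tickets to Singapore, then Austin, then Singapore again, and then Delhi.
How can we achieve this in SQL? Please help me.
It would also help if the output looked like the following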
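This is a gaps-and-islands problem. To solve it, you need to build groups of adjacent records. This is typically done by comparing row numbers computed over two different partitions.
Consider: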
select
    agent_id,
    travel_destination,
    min(date_of_booking) first_date_of_booking,
    max(date_of_booking) max_date_of_booking
from (
    select
        t.*,
        row_number()
            over(partition by agent_id order by date_of_booking) rn1,
        row_number()
            over(partition by agent_id, travel_destination order by date_of_booking) rn2
    from mytable t
) t
group by
    agent_id,
    rn1 - rn2,
    travel_destination
order by first_date_of_booking
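Note that I added the start and end date of each group to the answer, as I find it makes the answer more meaningful.
Side note: from your sample data it is unclear whether customer_id should be part of the group; I assumed not (if it should, you need to add that column to both partitions).
Given this (simplified) dataset: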
agent_id | travel_destination | customer_id | date_of_booking
:------- | :----------------- | :---------- | :--------------
A1001 | Singapore | C1001 | 2019-06-10
A1001 | Singapore | C1001 | 2019-06-11
A1001 | Austin | C1001 | 2019-06-12
A1001 | Singapore | C1001 | 2019-06-13
A1001 | Singapore | C1001 | 2019-06-14
A1001 | Dehli | C1001 | 2019-06-15
The query returns:
agent_id | travel_destination | first_date_of_booking | max_date_of_booking
:------- | :----------------- | :-------------------- | :------------------
A1001 | Singapore | 2019-06-10 | 2019-06-11
A1001 | Austin | 2019-06-12 | 2019-06-12
A1001 | Singapore | 2019-06-13 | 2019-06-14
A1001 | Dehli | 2019-06-15 | 2019-06-15
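To see why grouping by rn1 - rn2 isolates each run of adjacent bookings, it can help to look at the intermediate row numbers. The following is only a sketch: it loads the simplified dataset into an in-memory SQLite database through Python's sqlite3 module (an assumption made for convenience, it needs SQLite 3.25 or later for window functions) and prints rn1, rn2 and their difference for every row.

```python
import sqlite3

# Simplified dataset from the answer (values copied as given there).
rows = [
    ("A1001", "Singapore", "C1001", "2019-06-10"),
    ("A1001", "Singapore", "C1001", "2019-06-11"),
    ("A1001", "Austin",    "C1001", "2019-06-12"),
    ("A1001", "Singapore", "C1001", "2019-06-13"),
    ("A1001", "Singapore", "C1001", "2019-06-14"),
    ("A1001", "Dehli",     "C1001", "2019-06-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (agent_id TEXT, travel_destination TEXT, "
    "customer_id TEXT, date_of_booking TEXT)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)

# Inner query of the answer only: the two row numbers and their difference.
numbered = conn.execute(
    """
    SELECT travel_destination, date_of_booking, rn1, rn2, rn1 - rn2 AS grp
    FROM (
        SELECT
            t.*,
            row_number() OVER (PARTITION BY agent_id
                               ORDER BY date_of_booking) AS rn1,
            row_number() OVER (PARTITION BY agent_id, travel_destination
                               ORDER BY date_of_booking) AS rn2
        FROM mytable t
    )
    ORDER BY date_of_booking
    """
).fetchall()

for row in numbered:
    print(row)
```

Rows in the same run share the same difference (0 for the first Singapore run, 1 for the second), while the difference changes whenever another destination interrupts the run. The difference alone is not guaranteed to be unique across destinations, which is why travel_destination stays in the GROUP BY.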
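To get the second output that you showed, you can add another level of aggregation and use string_agg() (some databases, such as PostgreSQL, also require an explicit delimiter argument):
select
    agent_id,
    string_agg(travel_destination order by first_date_of_booking) travel_destination
from (
    -- above query
) t
group by agent_id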
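A minimal end-to-end sketch of the two aggregation levels, again assuming an in-memory SQLite database (3.25 or later) driven from Python. The island query runs in SQL; the final concatenation is done in Python so the sketch does not depend on any particular database's string aggregation function. The second agent, A1002, is hypothetical and only there to show that agents are kept apart.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(agent_id TEXT, travel_destination TEXT, date_of_booking TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        # A1001: simplified dataset from the answer (spelling as given there).
        ("A1001", "Singapore", "2019-06-10"),
        ("A1001", "Singapore", "2019-06-11"),
        ("A1001", "Austin",    "2019-06-12"),
        ("A1001", "Singapore", "2019-06-13"),
        ("A1001", "Singapore", "2019-06-14"),
        ("A1001", "Dehli",     "2019-06-15"),
        # A1002: hypothetical rows, not part of the original data.
        ("A1002", "Austin",    "2019-06-10"),
        ("A1002", "Austin",    "2019-06-12"),
        ("A1002", "Delhi",     "2019-06-13"),
    ],
)

# First level: one row per island, ordered by agent and island start date.
islands = conn.execute(
    """
    SELECT agent_id,
           travel_destination,
           min(date_of_booking) AS first_date_of_booking
    FROM (
        SELECT
            t.*,
            row_number() OVER (PARTITION BY agent_id
                               ORDER BY date_of_booking) AS rn1,
            row_number() OVER (PARTITION BY agent_id, travel_destination
                               ORDER BY date_of_booking) AS rn2
        FROM mytable t
    )
    GROUP BY agent_id, rn1 - rn2, travel_destination
    ORDER BY agent_id, first_date_of_booking
    """
).fetchall()

# Second level: concatenate the island destinations per agent, in date order.
per_agent = defaultdict(list)
for agent_id, destination, _first_date in islands:
    per_agent[agent_id].append(destination)

result = {agent: ",".join(dests) for agent, dests in per_agent.items()}
print(result)
```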
Try this, at least if your database has a function like LISTAGG in Vertica ...
WITH
-- this is your input - next time put it in so it can be
-- copy-pasted and formatted to the below ....
input(agent_id,travel_dest,cust_id,bookdt) AS (
          SELECT 'A1001','Singapore','C1001',DATE '2019-06-10'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-06-11'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-19'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-19'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-20'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-07-30'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-07-31'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-01'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-25'
)
-- real WITH clause starts here - substitute comma below with "WITH" ...
,
with_prev AS (
  SELECT
    agent_id
  , travel_dest
    -- the default '' keeps the first row of each agent comparable in the filter below
  , LAG(travel_dest,1,'') OVER (PARTITION BY agent_id ORDER BY bookdt) AS prev_dest
  FROM input
)
,
de_duped AS (
  SELECT
    agent_id
  , travel_dest
  FROM with_prev
  WHERE travel_dest <> prev_dest
)
SELECT
  agent_id
, LISTAGG(travel_dest) AS travel_dest
FROM de_duped
GROUP BY 1
;
You get:
 agent_id |           travel_dest
----------+----------------------------------
 A1001    | Singapore,Austin,Singapore,Delhi
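The default value passed to LAG matters for the filter: without it, the first row of each agent has a NULL previous destination, the comparison with <> evaluates to NULL, and WHERE drops that row. The sketch below illustrates this on a small hypothetical input, assuming SQLite 3.25 or later through Python's sqlite3 module; the table name bookings and the helper changes are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (agent_id TEXT, travel_dest TEXT, bookdt TEXT)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [
        # Hypothetical rows, chosen only to show the effect.
        ("A1001", "Singapore", "2019-06-10"),
        ("A1001", "Singapore", "2019-06-11"),
        ("A1001", "Austin",    "2019-06-12"),
        ("A1001", "Singapore", "2019-06-13"),
    ],
)


def changes(lag_expr):
    """Return the destinations kept by `travel_dest <> prev_dest`
    when prev_dest is computed with the given LAG expression."""
    sql = f"""
        SELECT travel_dest
        FROM (
            SELECT agent_id, travel_dest, bookdt,
                   {lag_expr} OVER (PARTITION BY agent_id
                                    ORDER BY bookdt) AS prev_dest
            FROM bookings
        )
        WHERE travel_dest <> prev_dest
        ORDER BY agent_id, bookdt
    """
    return [row[0] for row in conn.execute(sql)]


with_default = changes("LAG(travel_dest, 1, '')")   # first row compares against ''
without_default = changes("LAG(travel_dest)")       # first row compares against NULL

print(with_default)
print(without_default)
```

An explicit `prev_dest IS NULL OR ...` test in the WHERE clause is the other common way to keep the first row; it also avoids relying on '' never being a real destination.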
The following is for BigQuery Standard SQL
#standardSQL
SELECT agent_id,
  STRING_AGG(DISTINCT travel_destination) AS travel_destination
FROM `project.dataset.table`
GROUP BY agent_id
It will produce the following output
Row agent_id travel_destination
1 A1001 Singapore,Austin,Delhi
It looks like the expected output is Singapore,Austin,Singapore,Delhi. Below is another option that produces it:
#standardSQL
CREATE TEMP FUNCTION DedupConsecutive(line STRING) RETURNS STRING LANGUAGE js AS """
return line.split(",").filter(function(value,index,arr){return value != arr[index+1];}).join(",");
""";
SELECT agent_id,
DedupConsecutive(STRING_AGG(travel_destination ORDER BY date_of_booking)) destinations
FROM `project.dataset.table`
GROUP BY agent_id
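For comparison, the difference between the two variants (keeping distinct values versus collapsing only consecutive repeats) can be sketched in plain Python. This is an illustration of the idea, not BigQuery code; the function names dedup_consecutive and distinct_in_order are hypothetical.

```python
from itertools import groupby


def dedup_consecutive(line: str, sep: str = ",") -> str:
    """Collapse runs of equal neighbouring items into a single item."""
    # groupby yields one (key, run) pair per run of equal adjacent values.
    return sep.join(key for key, _run in groupby(line.split(sep)))


def distinct_in_order(line: str, sep: str = ",") -> str:
    """Keep only the first occurrence of each item, like a DISTINCT would."""
    # dict preserves insertion order, so fromkeys drops later repeats.
    return sep.join(dict.fromkeys(line.split(sep)))


bookings = "Singapore,Singapore,Austin,Singapore,Singapore,Delhi"
print(dedup_consecutive(bookings))
print(distinct_in_order(bookings))
```

Like the JavaScript function above, this assumes that no destination name contains the separator itself.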
Same as Gordon's point: I cannot think of a simpler solution :o)
I would use lag():
SELECT t.agent_id, t.travel_dest
FROM (SELECT t.*,
LAG(travel_dest) OVER (PARTITION BY agent_id ORDER BY bookdt) as prev_travel_dest
FROM t
) t
WHERE prev_travel_dest IS NULL OR prev_travel_dest <> travel_dest
ORDER BY agent_id, bookdt;
I cannot think of a simpler solution.