Remove duplicate values by taking latest data load
I am working with enterprise data that looks like this:
| load_number | id | time | slot_time | region | network |
|-------------|-----------|----------|-----------|--------|---------|
| 1692 | 641131146 | 00:20:00 | 00:20:00 | FX-4 | SBOB |
| 1692 | 641131146 | 00:20:00 | 00:20:30 | FX-4 | SBOB |
| 1442 | 570732257 | 00:20:00 | 00:20:00 | FX-4 | SBOB |
| 1442 | 570732257 | 00:20:00 | 00:20:30 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:00 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:30 | FX-4 | SBOB |
| 1442 | 570732258 | 00:55:00 | 00:55:00 | FX-4 | SBOB |
| 1442 | 570732258 | 00:55:00 | 00:55:30 | FX-4 | SBOB |
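The problem is that the company has poor data practices: it changes/reuses IDs between loads, and the only field that reliably gets updated is `load_number`.
How can I structure my SQL query to pull only the most recently loaded data, like this: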
| load_number | id | time | slot_time | region | network |
|-------------|-----------|----------|-----------|--------|---------|
| 1692 | 641131146 | 00:20:00 | 00:20:00 | FX-4 | SBOB |
| 1692 | 641131146 | 00:20:00 | 00:20:30 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:00 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:30 | FX-4 | SBOB |
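Apart from `id` and `load_number`, essentially every field will match. So, assuming every field other than these two matches, I could remove the 'duplicates' by taking the rows with the higher `load_number`.
I was thinking of some kind of descending `rank()` on `load_number`.
Any help is greatly appreciated!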
Try something like this:

    -- Latest load_number seen for each id
    WITH max_load_numbers_by_id AS (
        SELECT et.id, MAX(et.load_number) AS max_load_number
        FROM enterprise_table et
        GROUP BY et.id
    )
    -- Keep only the rows that belong to that latest load
    SELECT et.*
    FROM enterprise_table et
    JOIN max_load_numbers_by_id mlnbi
      ON et.id = mlnbi.id
     AND et.load_number = mlnbi.max_load_number;
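In the sample data each `id` appears under only one `load_number`, so grouping by `id` keeps every row. A variant that groups on the columns the question says always match (`time`, `slot_time`, `region`, `network`) may be closer to what is wanted. The sketch below is only an illustration: it loads the question's sample rows into an in-memory SQLite database from Python, and the helper name `latest_by_group` and the column types are assumptions, not part of the original schema.

```python
import sqlite3

# Sample rows copied from the question.
ROWS = [
    (1692, 641131146, "00:20:00", "00:20:00", "FX-4", "SBOB"),
    (1692, 641131146, "00:20:00", "00:20:30", "FX-4", "SBOB"),
    (1442, 570732257, "00:20:00", "00:20:00", "FX-4", "SBOB"),
    (1442, 570732257, "00:20:00", "00:20:30", "FX-4", "SBOB"),
    (1692, 641131147, "00:55:00", "00:55:00", "FX-4", "SBOB"),
    (1692, 641131147, "00:55:00", "00:55:30", "FX-4", "SBOB"),
    (1442, 570732258, "00:55:00", "00:55:00", "FX-4", "SBOB"),
    (1442, 570732258, "00:55:00", "00:55:30", "FX-4", "SBOB"),
]

# Group on the matching columns instead of id, then join back
# to keep the rows carrying the highest load_number per group.
QUERY = """
WITH max_load AS (
    SELECT time, slot_time, region, network,
           MAX(load_number) AS max_load_number
    FROM enterprise_table
    GROUP BY time, slot_time, region, network
)
SELECT et.*
FROM enterprise_table et
JOIN max_load ml
  ON  et.time = ml.time
  AND et.slot_time = ml.slot_time
  AND et.region = ml.region
  AND et.network = ml.network
  AND et.load_number = ml.max_load_number
ORDER BY et.time, et.slot_time
"""


def latest_by_group():
    """Run QUERY against an in-memory copy of the sample data."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the question does not state them.
    conn.execute(
        "CREATE TABLE enterprise_table (load_number INTEGER, id INTEGER, "
        "time TEXT, slot_time TEXT, region TEXT, network TEXT)"
    )
    conn.executemany("INSERT INTO enterprise_table VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_by_group():
        print(row)
```

On the sample data this keeps the four rows from load 1692 and drops the four from load 1442.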
You can use the window function `rank` or `dense_rank` to select the most recent `load_number`. Here is a demo.

    select
        load_number,
        id,
        time,
        slot_time,
        region,
        network
    from
    (
        select
            *,
            dense_rank() over (order by load_number desc) as rn
        from myTable
    ) subq
    where rn = 1;

Output:
| load_number | id | time | slot_time | region | network |
| ----------- | --------- | -------- | --------- | ------ | ------- |
| 1692 | 641131146 | 00:20:00 | 00:20:00 | FX-4 | SBOB |
| 1692 | 641131146 | 00:20:00 | 00:20:30 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:00 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:30 | FX-4 | SBOB |
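Because the window above has no `PARTITION BY`, it ranks the whole table and returns only rows from the single highest `load_number`. If some slots were last refreshed by an older load, partitioning by the matching columns would rank each group on its own. The sketch below is a hedged illustration run through Python's `sqlite3` (window functions need SQLite 3.25 or newer); the ninth row and the helper name `latest_per_partition` are hypothetical and were added only to show the difference.

```python
import sqlite3

ROWS = [
    (1692, 641131146, "00:20:00", "00:20:00", "FX-4", "SBOB"),
    (1692, 641131146, "00:20:00", "00:20:30", "FX-4", "SBOB"),
    (1442, 570732257, "00:20:00", "00:20:00", "FX-4", "SBOB"),
    (1442, 570732257, "00:20:00", "00:20:30", "FX-4", "SBOB"),
    (1692, 641131147, "00:55:00", "00:55:00", "FX-4", "SBOB"),
    (1692, 641131147, "00:55:00", "00:55:30", "FX-4", "SBOB"),
    (1442, 570732258, "00:55:00", "00:55:00", "FX-4", "SBOB"),
    (1442, 570732258, "00:55:00", "00:55:30", "FX-4", "SBOB"),
    # Hypothetical row, NOT in the question: a slot only load 1442 touched.
    (1442, 570732259, "01:30:00", "01:30:00", "FX-4", "SBOB"),
]

# Rank load_number separately inside each group of matching columns.
QUERY = """
SELECT load_number, id, time, slot_time, region, network
FROM (
    SELECT *,
           DENSE_RANK() OVER (
               PARTITION BY time, slot_time, region, network
               ORDER BY load_number DESC
           ) AS rn
    FROM myTable
) subq
WHERE rn = 1
ORDER BY time, slot_time
"""


def latest_per_partition():
    """Return the top-ranked rows of every (time, slot_time, region, network) group."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the question does not state them.
    conn.execute(
        "CREATE TABLE myTable (load_number INTEGER, id INTEGER, "
        "time TEXT, slot_time TEXT, region TEXT, network TEXT)"
    )
    conn.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_per_partition():
        print(row)
```

With the hypothetical row included, the partitioned query keeps the 01:30:00 slot from load 1442, whereas a table-wide rank would drop it.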
You can simply use `distinct on`:

    select distinct on (time, slot_time, region, network) t.*
    from mytable t
    order by time, slot_time, region, network, load_number desc

| load_number | id | time | slot_time | region | network |
| ----------: | --------: | :------- | :-------- | :----- | :------ |
| 1692 | 641131146 | 00:20:00 | 00:20:00 | FX-4 | SBOB |
| 1692 | 641131146 | 00:20:00 | 00:20:30 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:00 | FX-4 | SBOB |
| 1692 | 641131147 | 00:55:00 | 00:55:30 | FX-4 | SBOB |
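`DISTINCT ON` is a PostgreSQL extension, and it returns exactly one row per group even when two rows tie on `load_number`. On other engines, `ROW_NUMBER()` with a tie-breaker should give comparable behaviour. The sketch below assumes SQLite 3.25+ via Python's `sqlite3`; the tied row, the `id DESC` tie-breaker, and the helper name `one_row_per_group` are assumptions made for the illustration.

```python
import sqlite3

ROWS = [
    (1692, 641131146, "00:20:00", "00:20:00", "FX-4", "SBOB"),
    (1692, 641131146, "00:20:00", "00:20:30", "FX-4", "SBOB"),
    (1442, 570732257, "00:20:00", "00:20:00", "FX-4", "SBOB"),
    (1442, 570732257, "00:20:00", "00:20:30", "FX-4", "SBOB"),
    (1692, 641131147, "00:55:00", "00:55:00", "FX-4", "SBOB"),
    (1692, 641131147, "00:55:00", "00:55:30", "FX-4", "SBOB"),
    (1442, 570732258, "00:55:00", "00:55:00", "FX-4", "SBOB"),
    (1442, 570732258, "00:55:00", "00:55:30", "FX-4", "SBOB"),
    # Hypothetical row, NOT in the question: same load and slot, different id.
    (1692, 641131999, "00:20:00", "00:20:00", "FX-4", "SBOB"),
]

# ROW_NUMBER() numbers tied rows 1, 2, ... so only one survives rn = 1;
# RANK()/DENSE_RANK() would give both tied rows rank 1 and keep both.
QUERY = """
SELECT load_number, id, time, slot_time, region, network
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY time, slot_time, region, network
               ORDER BY load_number DESC, id DESC  -- id DESC: assumed tie-breaker
           ) AS rn
    FROM mytable
) subq
WHERE rn = 1
ORDER BY time, slot_time
"""


def one_row_per_group():
    """Return a single row for every (time, slot_time, region, network) group."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the question does not state them.
    conn.execute(
        "CREATE TABLE mytable (load_number INTEGER, id INTEGER, "
        "time TEXT, slot_time TEXT, region TEXT, network TEXT)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in one_row_per_group():
        print(row)
```

Whether to keep one row or all tied rows per group depends on the data; if ties within a load are meaningful, the ranking functions are the safer choice.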