How can I uniquely map a long string identifier to a numerical value in a single query (for bandwidth reasons)?
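I have a PostgreSQL database (technically Greenplum) that contains data on individuals over time. The table has three fields: user_id, monthly_date, and account_value. When I run a query I have to download the results from a remote server, so bandwidth is an issue. Because the user_id field is a very long string (about 50 characters), I would like to return a numerical value that corresponds 1:1 with each value of user_id, since that would take up much less space.
For example, the database might contain sample data like this: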
63a9364385350b13473279 Jan-2000
63a9364385350b13473279 Feb-2000
2066937e2887w206010393 Apr-2001
036686037e507d01764237 Mar-2003
036686037e507d01764237 Jun-2003
036686037e507d01764237 Jul-2003
036686037e507d01764237 Dec-2003
90829x098327549n286418 Apr-2004
90829x098327549n286418 Sep-2004
67518x834512306933u500 Nov-2000
I have been trying to work out a query using ROW_NUMBER() and various window functions (e.g. PARTITION BY) to get a result like this:
1 Jan-2000
1 Feb-2000
2 Apr-2001
3 Mar-2003
3 Jun-2003
3 Jul-2003
3 Dec-2003
4 Apr-2004
4 Sep-2004
5 Nov-2000
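I know these aren't the actual database formats; I'm only using them as sample data. Is this possible? I don't care if 63a9364385350b13473279 maps to 1 in one query and to 2 in the next (although a stable mapping would be nice and pretty neat to see), but within any given query 63a9364385350b13473279 should always map to the same value, regardless of the date. The mapped numbers don't need to be sequential or carry any meaning beyond being unique.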
If all you need is a unique number, this will do:
-- Number each id by subtracting its per-id row number from the overall row number.
-- Caveat: the difference is only constant per id when all rows of an id are
-- adjacent in all_window's ordering; ordering by year alone does not guarantee that.
SELECT
    id,
    split_part(t.d, '-', 2),
    row_number() OVER all_window - row_number() OVER group_window AS a_unique_number_by_id
FROM (
    VALUES
        ('63a9364385350b13473279','Jan-2000'),
        ('63a9364385350b13473279','Feb-2000'),
        ('2066937e2887w206010393','Apr-2001'),
        ('036686037e507d01764237','Mar-2003'),
        ('036686037e507d01764237','Jun-2003'),
        ('036686037e507d01764237','Jul-2003'),
        ('036686037e507d01764237','Dec-2003'),
        ('90829x098327549n286418','Apr-2004'),
        ('90829x098327549n286418','Sep-2004'),
        ('67518x834512306933u500','Nov-2000')
) AS t(id, d)
WINDOW group_window AS (
    PARTITION BY id
    ORDER BY split_part(t.d, '-', 2)
), all_window AS (
    ORDER BY split_part(t.d, '-', 2)
);
The result looks like this:
id | split_part | a_unique_number_by_id
------------------------+------------+-----------------------
63a9364385350b13473279 | 2000 | 0
63a9364385350b13473279 | 2000 | 0
67518x834512306933u500 | 2000 | 2
2066937e2887w206010393 | 2001 | 3
036686037e507d01764237 | 2003 | 4
036686037e507d01764237 | 2003 | 4
036686037e507d01764237 | 2003 | 4
036686037e507d01764237 | 2003 | 4
90829x098327549n286418 | 2004 | 8
90829x098327549n286418 | 2004 | 8
(10 rows)
You should re-sort by another column if you need to keep the original order.
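The query above sorts both windows by year only, so the numbers come out unique only when each id's rows happen to sit next to each other in that ordering. Here is a minimal sketch of a more deterministic variant of the same idea. It assumes each (id, d) pair occurs only once, and it sorts on the raw d text purely to get a repeatable order, not a chronological one.

```sql
-- Sketch only: same row_number() difference trick, but both windows share a
-- deterministic order, so every id's rows are adjacent in all_window.
-- Assumption: (id, d) pairs are unique.
SELECT
    id,
    d,
    row_number() OVER all_window - row_number() OVER group_window AS a_unique_number_by_id
FROM (
    VALUES
        ('63a9364385350b13473279','Jan-2000'),
        ('63a9364385350b13473279','Feb-2000'),
        ('2066937e2887w206010393','Apr-2001'),
        ('036686037e507d01764237','Mar-2003'),
        ('036686037e507d01764237','Jun-2003'),
        ('036686037e507d01764237','Jul-2003'),
        ('036686037e507d01764237','Dec-2003'),
        ('90829x098327549n286418','Apr-2004'),
        ('90829x098327549n286418','Sep-2004'),
        ('67518x834512306933u500','Nov-2000')
) AS t(id, d)
WINDOW group_window AS (PARTITION BY id ORDER BY d),
       all_window   AS (ORDER BY id, d);
```

With this ordering the difference for an id equals the number of rows belonging to smaller ids, so it is the same on every row of that id and differs between ids.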
Try the script below:
create table test_schema.source_data (id varchar(50), dt varchar(50));

insert into test_schema.source_data
values ('63a9364385350b13473279','Jan-2000'),
       ('63a9364385350b13473279','Feb-2000'),
       ('2066937e2887w206010393','Apr-2001'),
       ('036686037e507d01764237','Mar-2003'),
       ('036686037e507d01764237','Jun-2003'),
       ('036686037e507d01764237','Jul-2003'),
       ('036686037e507d01764237','Dec-2003'),
       ('90829x098327549n286418','Apr-2004'),
       ('90829x098327549n286418','Sep-2004'),
       ('67518x834512306933u500','Nov-2000');

-- One row per distinct id, numbered in id order.
create temporary table id_mapping
as
select t1.id, row_number() over(order by t1.id) rownum
from (
    select distinct id
    from test_schema.source_data
) t1;

-- Attach the number to every source row.
select t1.id, t1.dt, t2.rownum
from test_schema.source_data t1
join id_mapping t2
    on t1.id = t2.id;
Here is the result:
id | dt | rownum
------------------------+----------+--------
036686037e507d01764237 | Dec-2003 | 1
036686037e507d01764237 | Jul-2003 | 1
036686037e507d01764237 | Jun-2003 | 1
036686037e507d01764237 | Mar-2003 | 1
2066937e2887w206010393 | Apr-2001 | 2
63a9364385350b13473279 | Feb-2000 | 3
63a9364385350b13473279 | Jan-2000 | 3
67518x834512306933u500 | Nov-2000 | 4
90829x098327549n286418 | Sep-2004 | 5
90829x098327549n286418 | Apr-2004 | 5
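Since the question asks for a single query, the temporary table can probably be folded into a common table expression. This is a rough sketch rather than a tested Greenplum script: it assumes the test_schema.source_data table from the script above already exists and that your server version supports WITH. The aliases s, m, and distinct_ids are just illustrative names.

```sql
-- Sketch: the id_mapping temp table rewritten as a CTE, so the whole thing
-- runs as one statement. Assumes test_schema.source_data exists.
WITH id_mapping AS (
    SELECT id,
           row_number() OVER (ORDER BY id) AS rownum
    FROM (SELECT DISTINCT id FROM test_schema.source_data) AS distinct_ids
)
SELECT m.rownum,   -- small integer sent instead of the long id
       s.dt
FROM test_schema.source_data AS s
JOIN id_mapping AS m
    ON s.id = m.id;
```

Leaving id out of the final select list is what actually saves bandwidth; add it back if you need to check the mapping.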
I think you are looking for dense_rank().
create table sample_data
(userid varchar(50) not null,
 monthly_date date not null)
distributed by (userid);

insert into sample_data (userid, monthly_date) values
    ('63a9364385350b13473279','2000-01-01'),
    ('63a9364385350b13473279','2000-02-01'),
    ('2066937e2887w206010393','2001-04-01'),
    ('036686037e507d01764237','2003-03-01'),
    ('036686037e507d01764237','2003-06-01'),
    ('036686037e507d01764237','2003-07-01'),
    ('036686037e507d01764237','2003-12-01'),
    ('90829x098327549n286418','2004-04-01'),
    ('90829x098327549n286418','2004-09-01'),
    ('67518x834512306933u500','2000-11-01');

-- dense_rank() gives every row of the same userid the same number, with no gaps.
select dense_rank() over(order by userid) as new_userid, userid, monthly_date
from sample_data
order by userid;
new_userid | userid | monthly_date
------------+------------------------+--------------
1 | 036686037e507d01764237 | 2003-06-01
1 | 036686037e507d01764237 | 2003-07-01
1 | 036686037e507d01764237 | 2003-12-01
1 | 036686037e507d01764237 | 2003-03-01
2 | 2066937e2887w206010393 | 2001-04-01
3 | 63a9364385350b13473279 | 2000-02-01
3 | 63a9364385350b13473279 | 2000-01-01
4 | 67518x834512306933u500 | 2000-11-01
5 | 90829x098327549n286418 | 2004-09-01
5 | 90829x098327549n286418 | 2004-04-01
(10 rows)
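The output above still carries the long userid column, which is handy for checking the mapping but doesn't reduce the download size. As a sketch, assuming the sample_data table defined above and that the table does not change between the two statements, you could pull the compact rows and a small lookup separately:

```sql
-- Compact result set: send the small integer instead of the ~50-character userid.
SELECT dense_rank() OVER (ORDER BY userid) AS new_userid,
       monthly_date
FROM sample_data;

-- Optional lookup, fetched once, to translate new_userid back to userid.
-- Assumes sample_data is unchanged between the two queries.
SELECT DISTINCT
       dense_rank() OVER (ORDER BY userid) AS new_userid,
       userid
FROM sample_data;
```

The lookup has one row per user, so the long string crosses the wire once per user rather than once per monthly row.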