How can I uniquely map a long string identifier to a numerical value in a single query (for bandwidth reasons)?

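I have a PostgreSQL database (technically Greenplum) containing data on individuals over time. The table has three fields: user_id, monthly_date, and account_value. When I run a query, I have to download the results from a remote server, so bandwidth is an issue. Because the user_id field is a very long string (around 50 characters), I would like to return a numeric value that corresponds 1:1 with each value of user_id, since that would take up less space.

For example, the database might hold sample data like this: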

63a9364385350b13473279    Jan-2000
63a9364385350b13473279    Feb-2000
2066937e2887w206010393    Apr-2001
036686037e507d01764237    Mar-2003
036686037e507d01764237    Jun-2003
036686037e507d01764237    Jul-2003
036686037e507d01764237    Dec-2003
90829x098327549n286418    Apr-2004
90829x098327549n286418    Sep-2004
67518x834512306933u500    Nov-2000

I have been trying to work out a query using ROW_NUMBER() and various window functions (for example PARTITION BY) to get a result like this:

1    Jan-2000
1    Feb-2000
2    Apr-2001
3    Mar-2003
3    Jun-2003
3    Jul-2003
3    Dec-2003
4    Apr-2004
4    Sep-2004
5    Nov-2000

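I know these are not the actual database formats; I am just using them as sample data. Is this possible? I don't care if 63a9364385350b13473279 maps to 1 in one query and to 2 in the next (although a stable mapping would be nice and very neat), but within any given query 63a9364385350b13473279 should always map to the same value regardless of the date. The mapped numbers do not need to be sequential, and they do not need to carry any meaning beyond being unique.

If you only need a unique number, this will do: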

SELECT
        id,
        split_part(t.d, '-', 2),
        row_number() OVER all_window - row_number() OVER group_window AS a_unique_number_by_id
FROM (
VALUES
        ('63a9364385350b13473279','Jan-2000'),
        ('63a9364385350b13473279','Feb-2000'),
        ('2066937e2887w206010393','Apr-2001'),
        ('036686037e507d01764237','Mar-2003'),
        ('036686037e507d01764237','Jun-2003'),
        ('036686037e507d01764237','Jul-2003'),
        ('036686037e507d01764237','Dec-2003'),
        ('90829x098327549n286418','Apr-2004'),
        ('90829x098327549n286418','Sep-2004'),
        ('67518x834512306933u500','Nov-2000')
) as t(id, d)
WINDOW group_window AS (
        PARTITION BY id
        ORDER BY split_part(t.d, '-', 2)
), all_window AS (
        ORDER BY split_part(t.d, '-', 2)
);

The result is:

           id           | split_part | a_unique_number_by_id
------------------------+------------+-----------------------
 63a9364385350b13473279 | 2000       |                     0
 63a9364385350b13473279 | 2000       |                     0
 67518x834512306933u500 | 2000       |                     2
 2066937e2887w206010393 | 2001       |                     3
 036686037e507d01764237 | 2003       |                     4
 036686037e507d01764237 | 2003       |                     4
 036686037e507d01764237 | 2003       |                     4
 036686037e507d01764237 | 2003       |                     4
 90829x098327549n286418 | 2004       |                     8
 90829x098327549n286418 | 2004       |                     8
(10 rows)

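You should re-sort by another column to preserve the original order. Note that the subtraction only gives one consistent number per id when each id's rows end up contiguous in the overall ordering, as they happen to in the output above; if rows of different ids interleave within the ordering by year, the same id can receive more than one value.

Try the following script: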
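
A minimal sketch of a case where the subtraction above breaks down, using made-up ids 'A' and 'B' (hypothetical, not from the original data) whose rows interleave by year. The same id receives two different numbers here, so the trick seems safe only when each id's rows are contiguous in the overall ordering:

```sql
-- Hypothetical data: id 'A' appears in 2000 and 2002, id 'B' in 2001.
SELECT
        id,
        yr,
        row_number() OVER all_window - row_number() OVER group_window AS a_number_by_id
FROM (
VALUES
        ('A', '2000'),
        ('B', '2001'),
        ('A', '2002')
) AS t(id, yr)
WINDOW group_window AS (
        PARTITION BY id
        ORDER BY yr
), all_window AS (
        ORDER BY yr
);
-- A/2000 gets 1 - 1 = 0, B/2001 gets 2 - 1 = 1, A/2002 gets 3 - 2 = 1:
-- 'A' maps to both 0 and 1, and 'B' collides with 'A' on 1.
```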

create table test_schema.source_data (id varchar(50), dt varchar(50));

insert into test_schema.source_data
values ('63a9364385350b13473279','Jan-2000'),
    ('63a9364385350b13473279','Feb-2000'),
    ('2066937e2887w206010393','Apr-2001'),
    ('036686037e507d01764237','Mar-2003'),
    ('036686037e507d01764237','Jun-2003'),
    ('036686037e507d01764237','Jul-2003'),
    ('036686037e507d01764237','Dec-2003'),
    ('90829x098327549n286418','Apr-2004'),
    ('90829x098327549n286418','Sep-2004'),
    ('67518x834512306933u500','Nov-2000');


create temporary table id_mapping
as 
select t1.id, row_number() over(order by t1.id) rownum
from (
SELECT distinct id
FROM test_schema.source_data
 ) t1;

select t1.id, t1.dt, t2.rownum
from
test_schema.source_data t1
join id_mapping t2
on t1.id = t2.id;

Here is the result:

           id           |    dt    | rownum
------------------------+----------+--------
 036686037e507d01764237 | Dec-2003 |      1
 036686037e507d01764237 | Jul-2003 |      1
 036686037e507d01764237 | Jun-2003 |      1
 036686037e507d01764237 | Mar-2003 |      1
 2066937e2887w206010393 | Apr-2001 |      2
 63a9364385350b13473279 | Feb-2000 |      3
 63a9364385350b13473279 | Jan-2000 |      3
 67518x834512306933u500 | Nov-2000 |      4
 90829x098327549n286418 | Sep-2004 |      5
 90829x098327549n286418 | Apr-2004 |      5
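
Since the question asks for a single query, the temporary table can probably be folded into a common table expression. This is only a sketch, assuming your Greenplum version supports WITH; it reuses the test_schema.source_data table from the script above and leaves the long id out of the final select list, which is my reading of the bandwidth goal:

```sql
-- Same mapping as id_mapping above, but built inline instead of in a temp table.
WITH id_mapping AS (
    SELECT d.id, row_number() OVER (ORDER BY d.id) AS rownum
    FROM (SELECT DISTINCT id FROM test_schema.source_data) d
)
SELECT m.rownum, s.dt          -- the long id itself is not returned
FROM test_schema.source_data s
JOIN id_mapping m ON s.id = m.id;
```

Because the numbers come from row_number() over the distinct ids, they stay the same from one run to the next only while the set of ids in source_data is unchanged.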

I think you are looking for dense_rank().

create table sample_data
(userid varchar(50) not null,
 monthly_date date not null)
distributed by (userid);

insert into sample_data (userid, monthly_date) values 
('63a9364385350b13473279','2000-01-01'),
('63a9364385350b13473279','2000-02-01'),
('2066937e2887w206010393','2001-04-01'),
('036686037e507d01764237','2003-03-01'),
('036686037e507d01764237','2003-06-01'),
('036686037e507d01764237','2003-07-01'),
('036686037e507d01764237','2003-12-01'),
('90829x098327549n286418','2004-04-01'),
('90829x098327549n286418','2004-09-01'),
('67518x834512306933u500','2000-11-01');

select dense_rank() over(order by userid) as new_userid, userid, monthly_date 
from sample_data
order by 2;

 new_userid |         userid         | monthly_date 
------------+------------------------+--------------
      1     | 036686037e507d01764237 | 2003-06-01
      1     | 036686037e507d01764237 | 2003-07-01
      1     | 036686037e507d01764237 | 2003-12-01
      1     | 036686037e507d01764237 | 2003-03-01
      2     | 2066937e2887w206010393 | 2001-04-01
      3     | 63a9364385350b13473279 | 2000-02-01
      3     | 63a9364385350b13473279 | 2000-01-01
      4     | 67518x834512306933u500 | 2000-11-01
      5     | 90829x098327549n286418 | 2004-09-01
      5     | 90829x098327549n286418 | 2004-04-01
(10 rows)
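
To actually save bandwidth, the long userid column can presumably be dropped from the select list once the rank stands in for it. A sketch against the sample_data table above, not a tested Greenplum recipe:

```sql
-- Return only the compact surrogate number and the date;
-- the 50-character userid never leaves the server.
SELECT dense_rank() OVER (ORDER BY userid) AS new_userid,
       monthly_date
FROM sample_data
ORDER BY new_userid, monthly_date;
```

Within one query every row of a given userid gets the same new_userid. Across queries the numbers can shift whenever the set of userid values selected changes, which the question says is acceptable.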