Generate date-histogram over table

Question

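I wrote a query that returns a bar chart in the terminal using the Postgres CLI. The query is slow and inefficient, and I would like to change that.

At the base, we have a pretty simple query. We want each row to be a division of the total number of rows in our table. Let's say that our hard-coded number of rows is N_ROWS and our table is my_table.

Also, let's say N_ROWS equals 8.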

select
    (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level 
from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)

In my case, this returns the Y-axis of my chart:

 level
-------
 71760
 62790
 53820
 44850
 35880
 26910
 17940
  8970
     0

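You can already see the problems with that query.

Can I programmatically generate a number of rows using N_ROWS, rather than hard-coding each row value in VALUES? I also don't like how I perform a new count over my whole table for each row, obviously.

We now need our X-axis, and this is what I came up with: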

select
    r.level,
    case
        when (
            select count(id) from my_table where created_at_utc<= '2019-01-01 00:00:00'::timestamp without time zone
        ) >= r.level then true
    end as "2019-01-01"
from (
    select (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)
) as r;

This returns our first bucket:

 level | 2019-01-01
-------+------------
 71760 |
 62790 |
 53820 |
 44850 |
 35880 |
 26910 | t
 17940 | t
  8970 | t
     0 | t

I would rather not hard-code a case statement for each bucket but, of course, that is what I did. The results are exactly what I was looking for.

 level | 2019-01-01 | 2019-02-01 | 2019-03-01 | 2019-04-01 | 2019-05-01 | 2019-06-01 | 2019-07-01 | 2019-08-01 | 2019-09-01 | 2019-10-01 | 2019-11-01 | 2019-12-01
-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
 71760 |            |            |            |            |            |            |            |            |            |            |            | t
 62790 |            |            |            |            | t          | t          | t          | t          | t          | t          | t          | t
 53820 |            |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t
 44850 |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 35880 |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 26910 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 17940 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
  8970 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
     0 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t

Answer 1

There are certainly a few improvements we can make.

First, let's make a test table with some data:

CREATE TABLE test (id bigint, dt date);
-- Add 100,000 rows
insert into test select generate_series(1,100000, 1);
-- Add dates from 2019-01-01 to 2019-01-11
update test set dt='2019-01-01'::date + (id/10000)::int;

We can almost replace your first query, the one that finds the levels, with this much faster query:

   SELECT unnest(percentile_disc(
                  (
                      SELECT array_agg(x) 
                      FROM generate_series(0, 1, (1::numeric)/8) as g(x))
                  ) WITHIN GROUP (ORDER BY id)
                 ) as l
    FROM test;
   l
--------
      1
  12500
  25000
  37500
  50000
  62500
  75000
  87500
 100000
(9 rows)

Note that the first level is 1 instead of 0; the other levels should be the same.
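If levels that start at 0, as in the original query, are preferred, one possible sketch keeps the question's arithmetic but lets generate_series produce the row numbers and writes the count only once. The aliases c, g, total and num are illustrative names, and the literal 8 stands in for N_ROWS.

```sql
-- Sketch: same arithmetic as the question's first query, run against the test table.
-- 8 stands in for N_ROWS; c, g, total and num are illustrative names.
SELECT c.total / 8 * (8 - g.num) AS level
FROM (SELECT count(id) AS total FROM test) AS c   -- the count is written only once
CROSS JOIN generate_series(0, 8) AS g(num)        -- yields 0, 1, ..., 8
ORDER BY level DESC;
```

On the test table above, this should give the levels 100000 down to 0 in steps of 12500.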

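We can also use a few other tricks:

We'll generate our list of dates with generate_series.
We can group the data in test by day (or by date_trunc('day', ...) if you have timestamp data) and count the ids. Using a window function over this count gives us the cumulative sum of ids per day.
We can use \crosstabview in psql to pivot the resulting query.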
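The grouping and window-function step can be looked at on its own. The following is a minimal sketch against the test table; it applies the window function directly to the aggregate rather than through a subquery, and per_day is an illustrative column name.

```sql
-- Sketch: running total of rows per day in the test table.
SELECT dt,
       count(id) AS per_day,                         -- rows on that day
       sum(count(id)) OVER (ORDER BY dt) AS cum_sum  -- rows up to and including that day
FROM test
GROUP BY dt
ORDER BY dt;
```

Each cum_sum value is what gets compared against the levels to decide whether a cell in the chart is filled.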

WITH num_levels AS (
    SELECT 8 as num_levels
), levels as (
   SELECT unnest(percentile_disc(
                  (
                      SELECT array_agg(x) 
                      FROM num_levels
                      CROSS JOIN LATERAL generate_series(0, 1, (1::numeric)/num_levels.num_levels) as g(x))
                  ) WITHIN GROUP (ORDER BY id)
                 ) as l
    FROM test
), dates as (
  SELECT d
  FROM generate_series('2019-01-01T00:00:00'::timestamp, '2019-01-11T00:00:00'::timestamp, '1 day') as g(d)
), counts_per_day AS (
  SELECT dt, 
         sum(counts) OVER (ORDER BY dt) as cum_sum -- the cumulative count
  FROM (
    SELECT dt, 
    count(id) as counts -- The count per day
    FROM test
    GROUP BY dt
  ) sub
)
SELECT l, dt, CASE WHEN cum_sum >= l THEN true ELSE null END
FROM levels, dates
LEFT JOIN counts_per_day ON dt = d
ORDER BY l DESC, d asc
\crosstabview
   l    | 2019-01-01 | 2019-01-02 | 2019-01-03 | 2019-01-04 | 2019-01-05 | 2019-01-06 | 2019-01-07 | 2019-01-08 | 2019-01-09 | 2019-01-10 | 2019-01-11
--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
 100000 |            |            |            |            |            |            |            |            |            |            | t
  87500 |            |            |            |            |            |            |            |            | t          | t          | t
  75000 |            |            |            |            |            |            |            | t          | t          | t          | t
  62500 |            |            |            |            |            |            | t          | t          | t          | t          | t
  50000 |            |            |            |            |            | t          | t          | t          | t          | t          | t
  37500 |            |            |            | t          | t          | t          | t          | t          | t          | t          | t
  25000 |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t
  12500 |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
      1 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
(9 rows)

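That query ran in 40 ms on my laptop.

The dates could be selected from the max and min of the dates in the test table, and the interval could be changed from 1 day depending on how many columns are desired between the max and min.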
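As a rough sketch of that idea, the body of the dates CTE could be replaced with something like the query below. The aliases r, lo and hi are illustrative names, and the '1 day' step is the value to adjust.

```sql
-- Sketch: build the date list from the data instead of hard-coded endpoints.
-- r, lo and hi are illustrative names; use a larger interval for fewer columns.
SELECT g.d
FROM (SELECT min(dt) AS lo, max(dt) AS hi FROM test) AS r
CROSS JOIN LATERAL generate_series(r.lo::timestamp,
                                   r.hi::timestamp,
                                   interval '1 day') AS g(d);
```

With a step wider than one day, the join condition in the main query would also need to match each row to its bucket rather than to an exact date.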
