Count distinct by boolean value

Is there a better (prettier, more idiomatic, or even more performant) way to do the following?

Objective: count the distinct values of one column, split by the value of another boolean column.

Sample data:

id  | metadata_streaming_date | cols_exist |
--- | ----------------------- | -----------|
 1  | 2022-02-20              | true       |
 1  | 2022-02-20              | true       |
 2  | 2022-02-20              | true       |
 2  | 2022-02-20              | true       |
 3  | 2022-02-20              | false      |
 1  | 2022-02-19              | true       |
 2  | 2022-02-19              | false      |
 3  | 2022-02-19              | false      |
 4  | 2022-02-19              | false      |
 4  | 2022-02-19              | false      |

The expected result is a count of distinct id grouped by metadata_streaming_date, split into wanted (where cols_exist = false) and overall (all ids for each date, regardless of cols_exist).

Expected result table:

| metadata_streaming_date | wanted | overall |
| ----------------------- | -------| --------|
| 2022-02-20              | 1      | 3       |
| 2022-02-19              | 3      | 4       |

I can achieve it with two subqueries and an inner join on metadata_streaming_date:

select
  t1.metadata_streaming_date,
  overall,
  wanted,
  wanted / overall as perc
from
  (
    select
      metadata_streaming_date,
      count(distinct id) as overall
    from
      non_needed_fields_view
    where
      metadata_streaming_date >= '2022-02-19'
    group by
      metadata_streaming_date
  ) as t1
  inner join (
    select
      metadata_streaming_date,
      count(distinct id) as wanted
    from
      non_needed_fields_view
    where
      cols_exist is false
      and metadata_streaming_date >= '2022-02-19'
    group by
      metadata_streaming_date
  ) as t2 on t1.metadata_streaming_date = t2.metadata_streaming_date

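You can try conditional aggregation with DISTINCT, putting your logic inside a CASE WHEN expression. A CASE without an ELSE branch yields NULL for non-matching rows, and COUNT ignores NULLs, so only the ids with cols_exist = false are counted.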

SELECT metadata_streaming_date,
       COUNT(DISTINCT CASE WHEN cols_exist = false THEN id END) wanted ,
       COUNT(DISTINCT id) overall 
FROM non_needed_fields_view
WHERE metadata_streaming_date >= '2022-02-19'
GROUP BY metadata_streaming_date 
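
To check the query above against the sample data from the question, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine. This is an assumption for illustration only: the question does not say which engine is in use, the table is created here as a plain table rather than a view, and since SQLite has no boolean type, cols_exist is stored as 0/1. The function name count_by_flag is made up for this example.

```python
import sqlite3

# Sample rows copied from the question: (id, metadata_streaming_date, cols_exist).
# cols_exist is stored as 0/1 because SQLite has no dedicated boolean type.
ROWS = [
    (1, "2022-02-20", 1), (1, "2022-02-20", 1),
    (2, "2022-02-20", 1), (2, "2022-02-20", 1),
    (3, "2022-02-20", 0),
    (1, "2022-02-19", 1), (2, "2022-02-19", 0),
    (3, "2022-02-19", 0), (4, "2022-02-19", 0),
    (4, "2022-02-19", 0),
]

# Same shape as the query above; CASE without ELSE returns NULL,
# and COUNT(DISTINCT ...) skips NULLs.
QUERY = """
SELECT metadata_streaming_date,
       COUNT(DISTINCT CASE WHEN cols_exist = 0 THEN id END) AS wanted,
       COUNT(DISTINCT id) AS overall
FROM non_needed_fields_view
WHERE metadata_streaming_date >= '2022-02-19'
GROUP BY metadata_streaming_date
ORDER BY metadata_streaming_date
"""


def count_by_flag(rows):
    """Load rows into an in-memory table and run the conditional count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE non_needed_fields_view "
        "(id INTEGER, metadata_streaming_date TEXT, cols_exist INTEGER)"
    )
    conn.executemany("INSERT INTO non_needed_fields_view VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(count_by_flag(ROWS))
# [('2022-02-19', 3, 4), ('2022-02-20', 1, 3)]
```

The result matches the expected table in the question, with a single pass over the data and no self-join.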
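
  1. Aggregate functions have a cool FILTER syntax, currently supported by some RDBMSs / SQL engines, including Spark SQL, PostgreSQL & SQLite. As far as I know, it is part of the ISO SQL standard.
  2. The ISO syntax for a date literal in SQL is DATE 'yyyy-MM-dd'.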

select   metadata_streaming_date 
        ,count(distinct id) filter (where cols_exist = false) as wanted
        ,count(distinct id)                                   as overall
from     non_needed_fields_view
where    metadata_streaming_date >= date '2022-02-19'
group by metadata_streaming_date 

+-----------------------+------+-------+
|metadata_streaming_date|wanted|overall|
+-----------------------+------+-------+
|             2022-02-19|     3|      4|
|             2022-02-20|     1|      3|
+-----------------------+------+-------+
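
The FILTER form can also be tried locally. The sketch below again uses Python's sqlite3 module as a stand-in engine, which is an assumption, as are the helper names build_query and wanted_share. Two caveats: FILTER on aggregates needs SQLite 3.30 or later, so the code falls back to the CASE form on older builds, and SQLite does not accept the DATE '...' literal, so the date is compared as an ISO-formatted string. The sketch also adds the perc column from the question, multiplying by 1.0 because wanted / overall is integer division in engines such as PostgreSQL and SQLite.

```python
import sqlite3

# Sample rows from the question; cols_exist is stored as 0/1 for SQLite.
ROWS = [
    (1, "2022-02-20", 1), (1, "2022-02-20", 1),
    (2, "2022-02-20", 1), (2, "2022-02-20", 1),
    (3, "2022-02-20", 0),
    (1, "2022-02-19", 1), (2, "2022-02-19", 0),
    (3, "2022-02-19", 0), (4, "2022-02-19", 0),
    (4, "2022-02-19", 0),
]

WANTED_FILTER = "COUNT(DISTINCT id) FILTER (WHERE cols_exist = 0)"
WANTED_CASE = "COUNT(DISTINCT CASE WHEN cols_exist = 0 THEN id END)"


def build_query():
    """Use FILTER when the linked SQLite supports it, otherwise CASE."""
    has_filter = sqlite3.sqlite_version_info >= (3, 30, 0)
    wanted = WANTED_FILTER if has_filter else WANTED_CASE
    return f"""
        SELECT metadata_streaming_date, wanted, overall,
               1.0 * wanted / overall AS perc  -- force non-integer division
        FROM (
            SELECT metadata_streaming_date,
                   {wanted} AS wanted,
                   COUNT(DISTINCT id) AS overall
            FROM non_needed_fields_view
            WHERE metadata_streaming_date >= '2022-02-19'
            GROUP BY metadata_streaming_date
        ) AS t
        ORDER BY metadata_streaming_date
    """


def wanted_share(rows):
    """Return (date, wanted, overall, perc) tuples for the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE non_needed_fields_view "
        "(id INTEGER, metadata_streaming_date TEXT, cols_exist INTEGER)"
    )
    conn.executemany("INSERT INTO non_needed_fields_view VALUES (?, ?, ?)", rows)
    result = conn.execute(build_query()).fetchall()
    conn.close()
    return result


for row in wanted_share(ROWS):
    print(row)
```

One behavioural difference from the two-subquery version in the question: a date on which no id has cols_exist = false still appears here with wanted = 0, whereas the inner join would drop that date entirely.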