How to aggregate information from an indefinite number of groups

Question

How can I aggregate information from an indefinite number of groups in T-SQL? For example, we have a table with two columns, clients and regions.

Clients Regions
client1 45
client1 45
client1 45
client1 45
client1 43
client1 42
client1 41
client2 45
client2 45
client3 43
client3 43
client3 41
client3 41
client3 41
client3 41

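Each client can have any number of regions.

In the example above, client1 has 4 distinct regions, client2 has 1, and client3 has 2.

I want to compute the Gini impurity for each client, i.e. a measure of how varied the regions within a client are.

To do that, I want to apply the following formula to each client: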

1 - ((% of region1 among all the regions in the client) ^ 2 + 
     (% of region2 among all the regions in the client) ^ 2 + 
   … (% of regionN among all the regions in the client) ^ 2)

But the number of regions is indefinite (it may differ from client to client).

For the sample data, this should be calculated as:

client1 = 1 - ((4 / 7 ) ^ 2 + (1 / 7 ) ^ 2 + (1 / 7 ) ^ 2  + (1 / 7 ) ^ 2)
client2 = 1 - ((2 / 2 ) ^ 2)
client3 = 1 - ((2 / 6 ) ^ 2 +  (4 / 6 ) ^ 2)
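
For reference, the same calculation can be written compactly. The symbols here are notation introduced only for this sketch: n_i is the number of rows for region i within a client, N is the client's total row count, and k is its number of distinct regions.

```latex
\[
G = 1 - \sum_{i=1}^{k} p_i^{2}, \qquad p_i = \frac{n_i}{N}
\]
\[
G_{\text{client1}} = 1 - \frac{4^2 + 1^2 + 1^2 + 1^2}{7^2} = 1 - \frac{19}{49} = \frac{30}{49} \approx 0.612
\]
\[
G_{\text{client2}} = 1 - \frac{2^2}{2^2} = 0, \qquad
G_{\text{client3}} = 1 - \frac{2^2 + 4^2}{6^2} = 1 - \frac{20}{36} = \frac{16}{36} \approx 0.444
\]
```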

This is the desired output:

Clients Impurity
client1 61%
client2 0%
client3 44%

Could you suggest a way to approach this problem?

Answer 1

I think the formula can be expressed as a set-based query:

WITH cte AS (
    SELECT Clients
         , CAST(COUNT(*) AS DECIMAL(10, 0)) / SUM(COUNT(*)) OVER(PARTITION BY Clients) AS tmp
    FROM t
    GROUP BY Clients, Regions
)
SELECT Clients
     , 100 * (1 - SUM(tmp * tmp)) AS GI
FROM cte
GROUP BY Clients

The result on db<>fiddle appears to match the expected output.
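
To try the query locally, a setup script along these lines could be used. This is a minimal sketch: the table name `t` is taken from the query above, while the column types are assumptions, since the question does not state them.

```sql
-- Assumed schema: the question only names the two columns, not their types.
CREATE TABLE t (
    Clients VARCHAR(20) NOT NULL,
    Regions INT         NOT NULL
);

-- Sample rows copied from the question.
INSERT INTO t (Clients, Regions) VALUES
    ('client1', 45), ('client1', 45), ('client1', 45), ('client1', 45),
    ('client1', 43), ('client1', 42), ('client1', 41),
    ('client2', 45), ('client2', 45),
    ('client3', 43), ('client3', 43),
    ('client3', 41), ('client3', 41), ('client3', 41), ('client3', 41);
```

With this data, the `GI` column should come out near 61, 0, and 44 for the three clients, in line with the output the question asks for.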

Answer 2

Here is how I would approach it:

In a subquery, do count(*) as cnt ... group by clients, regions
In a subquery around that, do cast(cnt as float)/sum(cnt) over(partition by clients) as pcnt, and square it
In the outer query, do 1 - sum(pcnt) ... group by clients
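
The three steps above might be sketched as follows. This assumes the table is named `t` with columns `Clients` and `Regions`; the derived-table aliases `counts` and `shares` are made up for illustration.

```sql
SELECT Clients,
       1 - SUM(pcnt) AS Impurity               -- outer query: 1 minus the sum of squared shares
FROM (
    SELECT Clients,
           -- share of this region within its client, squared
           SQUARE(CAST(cnt AS float) / SUM(cnt) OVER (PARTITION BY Clients)) AS pcnt
    FROM (
        -- innermost subquery: number of rows per (client, region) pair
        SELECT Clients, Regions, COUNT(*) AS cnt
        FROM t
        GROUP BY Clients, Regions
    ) AS counts
) AS shares
GROUP BY Clients;
```

As written, this returns a ratio between 0 and 1 rather than a percentage.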

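There are ways to compress this so that it does not use two subqueries, but they probably would not make it more readable or easier to understand. It is not clear to me whether you mean percentages (out of 100) or ratios (out of 1), so you may need to add a * 100 where appropriate.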
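
One possible compression, which is not necessarily the one the answer has in mind, relies on the identity sum((cnt / N)^2) = sum(cnt^2) / N^2. That removes the window function and leaves a single subquery. The table name `t` and the alias `counts` are assumptions, as before.

```sql
SELECT Clients,
       -- drop the "100 *" to get a ratio (0 to 1) instead of a percentage
       100 * (1 - SUM(cnt * cnt) / SQUARE(SUM(cnt))) AS ImpurityPct
FROM (
    -- cast to float so the division above is not integer division
    SELECT Clients, Regions, CAST(COUNT(*) AS float) AS cnt
    FROM t
    GROUP BY Clients, Regions
) AS counts
GROUP BY Clients;
```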

Tags: sql, tsql, sql-server, gini