选择超过总值百分比的行子集

Selecting a subset of rows that exceed a percentage of total values

我有一个 table 的客户、用户和收入类似于以下内容(实际上有数千条记录):

Customer   User    Revenue
001        James   500
002        James   750
003        James   450
004        Sarah   100
005        Sarah   500
006        Sarah   150
007        Sarah   600
008        James   150
009        James   100

我想做的是return只有占用户总收入80%的消费最高的客户。

要手动执行此操作,我会按收入对 James 的客户进行排序,计算出总百分比和 运行 总百分比,然后只有 return 记录到 运行总点击率80%:

Customer    User    Revenue     % of total  Running Total %
002         James   750         0.38        0.38 
001         James   500         0.26        0.64 
003         James   450         0.23        0.87  <- Greater than 80%, last record
008         James   150         0.08        0.95 
009         James   100         0.05        1.00 

我试过使用 CTE,但到目前为止还是一片空白。有没有办法通过单个查询而不是在 Excel sheet?

中手动执行此操作

SQL Server 2012+

你可以使用窗口化 SUM:

WITH cte AS
(
   SELECT *,
          1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY [User]) AS percentile,
          1.0 * SUM(Revenue) OVER(PARTITION BY [User] ORDER BY [Revenue] DESC)
                /SUM(Revenue) OVER(PARTITION BY [User]) AS running_percentile
   FROM tab
)
SELECT *
FROM cte 
WHERE running_percentile <= 0.8;

LiveDemo


SQL 服务器 2008:

WITH cte AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
    FROM t    
), cte2 AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM cte c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]
            AND c2.rn <= c.rn) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT *
FROM cte2
WHERE running_percentile <= 0.8;

LiveDemo2

输出:

╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User  ║ Revenue ║   percentile   ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║        2 ║ James ║     750 ║ 0,384615384615 ║ 0,384615384615     ║
║        1 ║ James ║     500 ║ 0,256410256410 ║ 0,641025641025     ║
║        7 ║ Sarah ║     600 ║ 0,444444444444 ║ 0,444444444444     ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝

编辑 2:

That looks nearly there, the only niggle is it's missing the last row, the third row for James takes him over 0.80 but needs to be included.

WITH cte AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
    FROM t    
), cte2 AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM cte c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]
            AND c2.rn <= c.rn) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT a.*
FROM cte2 a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
             FROM cte2
             WHERE running_percentile >= 0.8
               AND cte2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp;

LiveDemo3

输出:

╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User  ║ Revenue ║   percentile   ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║        2 ║ James ║     750 ║ 0,384615384615 ║ 0,384615384615     ║
║        1 ║ James ║     500 ║ 0,256410256410 ║ 0,641025641025     ║
║        3 ║ James ║     450 ║ 0,230769230769 ║ 0,871794871794     ║
║        7 ║ Sarah ║     600 ║ 0,444444444444 ║ 0,444444444444     ║
║        5 ║ Sarah ║     500 ║ 0,370370370370 ║ 0,814814814814     ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝

Looks to be perfect, translated to my big table and returns what I need, spent a good 5 minutes working through it and still can't follow what you've done!

SQL Server 2008 不支持 OVER() 子句中的所有内容,但 ROW_NUMBER 支持。

首先cte只是计算组内的位置:

╔═══════════╦════════╦══════════╦════╗
║ Customer  ║ User   ║ Revenue  ║ rn ║
╠═══════════╬════════╬══════════╬════╣
║        2  ║ James  ║     750  ║  1 ║
║        1  ║ James  ║     500  ║  2 ║
║        3  ║ James  ║     450  ║  3 ║
║        8  ║ James  ║     150  ║  4 ║
║        9  ║ James  ║     100  ║  5 ║
║        7  ║ Sarah  ║     600  ║  1 ║
║        5  ║ Sarah  ║     500  ║  2 ║
║        6  ║ Sarah  ║     150  ║  3 ║
║        4  ║ Sarah  ║     100  ║  4 ║
╚═══════════╩════════╩══════════╩════╝

第二个 cte:

  • c2 子查询根据 ROW_NUMBER
  • 的排名计算 运行 总数
  • c3 计算每个用户的全部金额

在最终查询 s 中,子查询找到超过 80% 的最低 running 总数。

编辑 3:

使用ROW_NUMBER实际上是多余的。

WITH cte AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM t c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM t c2
          WHERE c.[User] = c2.[User]
            AND c2.Revenue >= c.Revenue) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM t c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT a.*
FROM cte a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
             FROM cte c2
             WHERE running_percentile >= 0.8
               AND c2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp
ORDER BY [User], Revenue DESC;

LiveDemo4

在 SQL Server 2012+ 中,您将使用累计和——效率更高。在 SQL Server 2008 中,您可以使用相关子查询或 cross apply:

select t.*,
       sum(t.Revenue*1.0) / sum(t.Revenue) over (partition by user) as [% of Total],
       sum(RunningRevenue*1.0) / sum(t.Revenue) over (partition by user) as [Running Total %]
from t cross apply
     (select sum(Revenue) as RunningRevenue
      from t t2
      where t2.Revenue >= t.Revenue and t2.user = t.user
     ) t2;

注意:*1.0 是为了以防万一 Revenue 存储为整数。 SQL 服务器进行整数除法,这将 return 0 用于几乎所有行的两列。

编辑:

如果您只想要 James 的结果,请添加 where user = 'James'