SQL Union "All Other" Row

I have a SQLite database containing nearly 500,000 rows of access-log data. I use it for aggregate information such as "number of times each IP has hit the site" or "percentage of hits that were POST".

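I wrote a SQL query that collects the number of times each IP address hit the site, keeping only the addresses whose hit count is greater than 1% of the total count of IP address entries.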

select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01

This returns about 7 significant IP addresses. How would I union an "All Others" row into the result set?

I tried UNIONing the logical opposite:

select "All Others", count(ip_address)
from records
group by ip_address
having count(ip_address) < (select count(ip_address) from records) * .01

But this returns multiple "All Others" rows, and the counts are sequential.

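Use union all, of course, but that doesn't address "the problem".

The problem is that the second query "returns multiple" rows (just like the first query), because the group by is on the IP address and there are many of them. That is, there is one result tuple per group, independent of whatever is done in the select output clause.

The desired result can probably be had by summing the counts in an outer select.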

-- append this to the first query with: union all
select 'All Others', sum(t.ct)
from (
   select count(ip_address) as ct
   from records
   group by ip_address
   -- note: <=, and not <, is inverse of >
   having count(ip_address) <= (select count(ip_address) from records) * .01
   ) t
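To see the summed outer select working end to end, here is a minimal sketch run against an in-memory SQLite database through Python's sqlite3 module. The sample rows (two busy addresses plus 200 one-hit addresses, 1000 rows in total) are made up for illustration and are not the asker's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample: two heavy hitters plus 200 addresses with one hit each.
rows = [("10.0.0.1",)] * 500 + [("10.0.0.2",)] * 300
rows += [(f"192.168.0.{i}",) for i in range(200)]
conn.executemany("insert into records values (?)", rows)

query = """
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
union all
select 'All Others', sum(t.ct)
from (
   select count(ip_address) as ct
   from records
   group by ip_address
   having count(ip_address) <= (select count(ip_address) from records) * .01
   ) t
"""
# One row per significant address, then a single combined row for the rest.
result = dict(conn.execute(query).fetchall())
print(result)
```

The inner select still produces one row per small group; it is the outer sum, which has no group by of its own, that collapses them into one row.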

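Of course, if 'total' and 'found' are known, then 'others' is simply 'total' - 'found'.

That the counts are sequential is an interesting observation, but it is irrelevant. Remember that SQL may return rows in any order when no order by is applied to the materialized result set (an order by inside a sub-select is not strictly guaranteed to carry through).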
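If a particular order matters, one option is to wrap the whole union in an outer select and sort there. The sketch below assumes you want the "All Others" row last and the addresses sorted by hit count; the aliases label and hits and the sample rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample data, inserted with the smaller heavy hitter first.
rows = [("10.0.0.2",)] * 300 + [("10.0.0.1",)] * 500
rows += [(f"192.168.0.{i}",) for i in range(200)]
conn.executemany("insert into records values (?)", rows)

query = """
select label, hits
from (
    select ip_address as label, count(ip_address) as hits
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .01
    union all
    select 'All Others', sum(t.ct)
    from (
        select count(ip_address) as ct
        from records
        group by ip_address
        having count(ip_address) <= (select count(ip_address) from records) * .01
    ) t
)
-- the comparison is 0 for real addresses and 1 for the summary row
order by label = 'All Others', hits desc
"""
ordered = conn.execute(query).fetchall()
print(ordered)
```

The order by sits on the outermost select, so it applies to the final result set and not to an intermediate sub-select.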

Could you use a variable to hold this information?

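-- T-SQL syntax; SQLite itself has no DECLARE/SET variables
DECLARE @num INT
-- total hits of the significant addresses (one value, not one per group)
SET @num = (select sum(t.ct)
            from (select count(*) as ct
                  from records
                  group by ip_address
                  having count(*) > (select count(ip_address) from records) * .01) t)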

Then run the regular query:

select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
UNION ALL
select 'All Others', count(ip_address) - @num
from records
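Since SQLite has no DECLARE, the variable can live in the calling program instead. Below is a rough sketch of the same 'total' - 'found' idea using Python's sqlite3 module: found plays the role of @num and is passed back in as a bound parameter. The sample data and the names top, found and report are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample: two heavy hitters plus 200 addresses with one hit each.
rows = [("10.0.0.1",)] * 500 + [("10.0.0.2",)] * 300
rows += [(f"192.168.0.{i}",) for i in range(200)]
conn.executemany("insert into records values (?)", rows)

top_query = """
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
"""
top = conn.execute(top_query).fetchall()

# 'found' stands in for @num: hits already attributed to the listed addresses.
found = sum(hits for _, hits in top)

# Bind the value as a parameter; 'others' is then total minus found.
others = conn.execute(
    "select 'All Others', count(ip_address) - ? from records", (found,)
).fetchone()

report = top + [others]
print(report)
```

This costs two round trips to the database instead of one, which is usually fine for an ad hoc report.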

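Without CTEs, this is probably the best option (I'm not sure what SQLite allows). Using not in saves you from having to write the inverse of your condition, which in other cases can be complicated by nulls or floating-point math: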

select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
union all
select 'All others', count(*)
from records
where ip_address not in (
    select ip_address /* assuming non-null ip_address */
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .01
)
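The "assuming non-null ip_address" comment matters because a NULL compared with not in against a non-empty list yields NULL, not true, so such rows are filtered out. A small sketch of that effect, with five invented NULL rows added to otherwise made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
rows = [("10.0.0.1",)] * 500 + [("10.0.0.2",)] * 300
rows += [(f"192.168.0.{i}",) for i in range(200)]
rows += [(None,)] * 5  # hypothetical log lines with no recorded address
conn.executemany("insert into records values (?)", rows)

others_query = """
select count(*)
from records
where ip_address not in (
    select ip_address
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .01
)
"""
(others,) = conn.execute(others_query).fetchone()
(total_rows,) = conn.execute("select count(*) from records").fetchone()
(null_rows,) = conn.execute(
    "select count(*) from records where ip_address is null"
).fetchone()
print(others, total_rows, null_rows)
```

With this sample the NULL rows land in neither the per-address rows nor "All others", so the report adds up to count(ip_address) and not to count(*). Whether that is what you want depends on how you intend to treat hits without an address.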

Otherwise, with a CTE:

with topPercent as (
    select ip_address, count(ip_address) as addr_cnt
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .01
)
select ip_address, addr_cnt from topPercent
union all
select 'All others', count(ip_address) - (select coalesce(sum(addr_cnt), 0) from topPercent)
from records
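On the question of what SQLite allows: common table expressions have been supported since SQLite 3.8.3. Here is a sketch of a CTE variant run through Python's sqlite3 module, in which the "All others" row is the total hit count minus the summed counts of the top addresses. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample: two heavy hitters plus 200 addresses with one hit each.
rows = [("10.0.0.1",)] * 500 + [("10.0.0.2",)] * 300
rows += [(f"192.168.0.{i}",) for i in range(200)]
conn.executemany("insert into records values (?)", rows)

query = """
with topPercent as (
    select ip_address, count(ip_address) as addr_cnt
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .01
)
select ip_address, addr_cnt from topPercent
union all
-- coalesce covers the case where no address clears the threshold
select 'All others',
       count(ip_address) - (select coalesce(sum(addr_cnt), 0) from topPercent)
from records
"""
result = dict(conn.execute(query).fetchall())
print(result)
```

The threshold condition is written only once here, so there is no inverse condition to keep in sync.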

If analytic (window) functions are available, a third option may be the fastest:

select case when pct > 0.01 then ip_address else 'All others' end, sum(addr_cnt)
from (
    select ip_address, addr_cnt, addr_cnt * 1.0 / sum(addr_cnt) over () as pct
    from (
        select ip_address, count(ip_address) as addr_cnt
        from records
        group by ip_address
    ) T1
) T2
group by case when pct > 0.01 then ip_address else 'All others' end
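SQLite gained window functions in version 3.25.0, so whether this option is available depends on the library your environment links against. The sketch below checks the version before running the query (with the multiplier written as 1.0) on made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample: two heavy hitters plus 200 addresses with one hit each.
rows = [("10.0.0.1",)] * 500 + [("10.0.0.2",)] * 300
rows += [(f"192.168.0.{i}",) for i in range(200)]
conn.executemany("insert into records values (?)", rows)

query = """
select case when pct > 0.01 then ip_address else 'All others' end, sum(addr_cnt)
from (
    select ip_address, addr_cnt, addr_cnt * 1.0 / sum(addr_cnt) over () as pct
    from (
        select ip_address, count(ip_address) as addr_cnt
        from records
        group by ip_address
    ) T1
) T2
group by case when pct > 0.01 then ip_address else 'All others' end
"""

# "over ()" needs SQLite 3.25.0 or newer; skip the query on older libraries.
if sqlite3.sqlite_version_info >= (3, 25, 0):
    result = dict(conn.execute(query).fetchall())
else:
    result = None
print(sqlite3.sqlite_version, result)
```

This form reads the table once for the grouping and derives the percentage from the grouped rows, which is why it may be faster; I have not measured it on a 500,000-row table.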