Find overlaps within categories in SQL (BigQuery)
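I'm trying to do something similar to this, but in BigQuery. I have several users, each of whom may have one or more categories. I need to find the overlaps between categories. Like this:
The result I want looks like this:
That is, for example, only one user has only category D (and no other category), two users have both categories 10 and 30, and so on.
The main problem is that I have a lot of categories (more than 40). Previously I did something like this: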
SELECT sum(cat1), sum(cat2), sum(cat3)
FROM table
where cat1 = 0 and cat2 = 1 and cat3 = 0
This approach works, but it is too manual, and it is no longer feasible now that I have so many categories.
I would like to use BigQuery if possible.
FWIW:
with mytable as (
select 'D' as Usr, '10' as Categories union all
select 'E', '10' union all
select 'E', '30' union all
select 'F', '30' union all
select 'G', '10' union all
select 'G', '50' union all
select 'H', '10' union all
select 'H', '30'
)
select grp, count(*) as cnt
from (
select Usr, string_agg(Categories order by Categories) as grp
from mytable
group by Usr
)
group by grp
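This is not exactly what you are looking for, but you can use this output as the source for an Excel pivot table or a BI tool to get what you want. Pivoting on 40+ columns in SQL is doable but not fun.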
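As a small illustration of the grouping idea above, the sketch below also reports how many categories each combination contains, which makes single-category users easy to filter downstream. The `per_user` CTE and the `n_categories` column are names introduced here for illustration, and the sketch assumes each (Usr, Categories) pair appears at most once.

```sql
-- Sample rows copied from the answer above.
with mytable as (
  select 'D' as Usr, '10' as Categories union all
  select 'E', '10' union all
  select 'E', '30' union all
  select 'F', '30' union all
  select 'G', '10' union all
  select 'G', '50' union all
  select 'H', '10' union all
  select 'H', '30'
),
-- One row per user: the sorted, comma-separated category list
-- and the number of categories (hypothetical helper column).
per_user as (
  select Usr,
         string_agg(Categories order by Categories) as grp,
         count(*) as n_categories
  from mytable
  group by Usr
)
-- One row per distinct combination, with the number of users sharing it.
select grp, n_categories, count(*) as cnt
from per_user
group by grp, n_categories
order by grp
-- Result worked out by hand for the sample rows:
-- grp    n_categories  cnt
-- 10     1             1
-- 10,30  2             2
-- 10,50  2             1
-- 30     1             1
```

Filtering on `n_categories = 1` would then isolate the users who belong to exactly one category.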
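-- Pairs of different categories shared by the same user
select a.categories,
b.categories as cross_categories,
count(distinct a.usr) as counts
from t a
join t b on a.usr=b.usr and a.categories<> b.categories
group by a.categories, b.categories
union all
-- Users with exactly one category, counted per category
-- (the inner query yields one row per such user)
select categories,
categories,
count(*)
from (
select max(categories) as categories
from t
group by usr
having count(distinct categories)=1
)
group by categories
order by 1,2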
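To try the self-join approach without creating a table, `t` can be defined inline. The sketch below assumes the sample rows from the earlier answer; the second branch aggregates single-category users per category so that each category appears only once in the result.

```sql
-- Sample rows from the earlier answer, exposed under the table name `t`.
with t as (
  select 'D' as usr, '10' as categories union all
  select 'E', '10' union all
  select 'E', '30' union all
  select 'F', '30' union all
  select 'G', '10' union all
  select 'G', '50' union all
  select 'H', '10' union all
  select 'H', '30'
)
-- Users holding both categories of each ordered pair.
select a.categories,
       b.categories as cross_categories,
       count(distinct a.usr) as counts
from t a
join t b on a.usr = b.usr and a.categories <> b.categories
group by a.categories, b.categories
union all
-- Users holding exactly one category, reported on the "diagonal".
select categories, categories, count(*)
from (
  select max(categories) as categories
  from t
  group by usr
  having count(distinct categories) = 1
)
group by categories
order by 1, 2
-- Result worked out by hand for the sample rows:
-- categories  cross_categories  counts
-- 10          10                1
-- 10          30                2
-- 10          50                1
-- 30          10                2
-- 30          30                1
-- 50          10                1
```

The output is in long format (one row per category pair), which is the shape a pivot table or BI tool usually expects as input.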
Excel pivot
The main problem is that I have a lot of categories (over 40).
Consider the following (BigQuery) approach, which works for any reasonable number of categories:
execute immediate (
select '''
select * from (
select distinct t1.usr,
t1.categories category, t2.categories category2
from `your_table` t1 left join `your_table` t2
on t1.usr = t2.usr and t1.categories != t2.categories
union all
select usr, any_value(categories) category, any_value(categories) category2
from `your_table`
group by usr
having count(1) = 1
)
pivot (count(usr) cat for category2 in (''' || list || '''))
order by category
'''
from (
select string_agg("'" || categories || "'" order by categories) list
from (select distinct categories from `your_table`)
)
)
If applied to the sample data in your question, the output is:
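To make the dynamic statement easier to follow, here is a sketch of the static query it would generate if the table held only the three categories from the sample rows in the earlier answer (10, 30 and 50). The inline `your_table` CTE is an assumption made so the query can run on its own; the `cat_10`, `cat_30` and `cat_50` column names follow from the `cat` alias in the `pivot` clause.

```sql
-- Sample rows from the earlier answer, standing in for the real table.
with your_table as (
  select 'D' as usr, '10' as categories union all
  select 'E', '10' union all
  select 'E', '30' union all
  select 'F', '30' union all
  select 'G', '10' union all
  select 'G', '50' union all
  select 'H', '10' union all
  select 'H', '30'
)
select * from (
  -- Every (user, category, other category) combination;
  -- single-category users get a NULL category2 from the left join.
  select distinct t1.usr,
    t1.categories category, t2.categories category2
  from your_table t1 left join your_table t2
  on t1.usr = t2.usr and t1.categories != t2.categories
  union all
  -- Single-category users are paired with their own category.
  select usr, any_value(categories) category, any_value(categories) category2
  from your_table
  group by usr
  having count(1) = 1
)
-- The list that the dynamic version builds with string_agg, written out by hand.
pivot (count(usr) cat for category2 in ('10', '30', '50'))
order by category
-- Result worked out by hand for the sample rows:
-- category  cat_10  cat_30  cat_50
-- 10        1       2       1
-- 30        2       1       0
-- 50        1       0       0
```

The diagonal cells count users who have only that category, and the off-diagonal cells count users who have both categories.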