Need help getting the desired output from this input: total_visits, most_visited_floor and resources_used
Input:
name | address | email | floor | resources |
---|---|---|---|---|
A | Bangalore | A@gmail.com | 1 | CPU |
A | Bangalore | A@gmail.com | 1 | CPU |
A | Bangalore | A@gmail.com | 2 | DESKTOP |
B | Bangalore | B1@gmail.com | 2 | DESKTOP |
B | Bangalore | B1@gmail.com | 2 | DESKTOP |
B | Bangalore | B1@gmail.com | 1 | MONITIOR |
Desired output:
name | total visits | most visited floor | resources used |
---|---|---|---|
A | 3 | 1 | CPU,DESKTOP |
B | 3 | 2 | DESKTOP,MONITIOR |
I came up with the following code and approach using spark-sql, but an answer in ms-sql or sql-server would be fine too.
select name, concat_ws(',', collect_set(resources)) as resources_used, count(*) as total_visits
from resources_table
group by name
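I am unable to compute the most_visited_floor column to get the desired output.
Thanks for the help.
What you are looking for is called the mode in statistics.
Search for Mode + SQL and you will find endless blog posts and threads.
There are multiple ways to get the mode.
Here is one option, assuming there is only a single mode value: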
with
t (name,address,email,floor,resources) as
(
select *
from values ('A' ,'Bangalore' ,'A@gmail.com' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'A@gmail.com' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'A@gmail.com' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'B1@gmail.com' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'B1@gmail.com' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'B1@gmail.com' ,1 ,'MONITIOR')
),
t1 as
(
select * ,count(*) over (partition by name, floor) as count_name_floor
from t
)
select name
,count(*) as total_visits
,max((count_name_floor,floor)).floor as most_visited_floor
,concat_ws(',',collect_set(resources)) as resources_used
from t1
group by name
name | total_visits | most_visited_floor | resources_used |
---|---|---|---|
B | 3 | 2 | MONITIOR,DESKTOP |
A | 3 | 1 | DESKTOP,CPU |
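The `max((count_name_floor,floor)).floor` expression relies on structs being compared field by field: the largest struct is the one with the highest count, and if two floors tie on count, the higher floor number wins. Below is a minimal Python sketch of the same idea, using tuples in place of structs. The `visits` list and the `summarize` helper are illustrative names, not part of the query above, and the resources are sorted here only to make the output deterministic (`collect_set` gives no ordering guarantee).

```python
from collections import Counter

# Sample rows from the question, reduced to (name, floor, resource).
visits = [
    ("A", 1, "CPU"), ("A", 1, "CPU"), ("A", 2, "DESKTOP"),
    ("B", 2, "DESKTOP"), ("B", 2, "DESKTOP"), ("B", 1, "MONITIOR"),
]

def summarize(rows):
    """Return {name: (total_visits, most_visited_floor, resources_used)}."""
    floors = {}     # name -> Counter of visits per floor
    resources = {}  # name -> set of distinct resources
    for name, floor, resource in rows:
        floors.setdefault(name, Counter())[floor] += 1
        resources.setdefault(name, set()).add(resource)

    result = {}
    for name, counter in floors.items():
        # Tuples compare element by element, like the (count, floor) struct:
        # the highest count wins, and a tie falls to the highest floor.
        _, top_floor = max((count, floor) for floor, count in counter.items())
        result[name] = (
            sum(counter.values()),
            top_floor,
            ",".join(sorted(resources[name])),
        )
    return result

print(summarize(visits))
```

Note the tie behaviour: if a name visited two floors equally often, this approach silently reports only the higher floor, which is why it assumes a single mode.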
Here is another option, assuming there may be multiple mode values.
I added 2 rows to the input to make it more interesting.
with
t (name,address,email,floor,resources) as
(
select *
from values ('A' ,'Bangalore' ,'A@gmail.com' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'A@gmail.com' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'A@gmail.com' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'B1@gmail.com' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'B1@gmail.com' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'B1@gmail.com' ,1 ,'MONITIOR')
,('B' ,'Bangalore' ,'B1@gmail.com' ,1 ,'MONITIOR')
,('B' ,'Bangalore' ,'B1@gmail.com' ,3 ,'MONITIOR')
),
t1 as
(
select * ,count(*) over (partition by name, floor) as count_name_floor
from t
),
t2 as
(
select * ,rank() over (partition by name order by count_name_floor desc) as rank_count_name_floor
from t1
)
select name
,count(*) as total_visits
,concat_ws(',',collect_set(case rank_count_name_floor when 1 then floor end)) as most_visited_floors
,concat_ws(',',collect_set(resources)) as resources_used
from t2
group by name
name | total_visits | most_visited_floors | resources_used |
---|---|---|---|
A | 3 | 1 | DESKTOP,CPU |
B | 5 | 1,2 | MONITIOR,DESKTOP |
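The `rank()` version keeps every floor whose visit count ties for first place. As a rough sketch of that logic outside SQL, the hypothetical helper `most_visited_floors` below returns all modes of a list of floor numbers; the sample list mirrors B's five visits in the extended input.

```python
from collections import Counter

def most_visited_floors(floors):
    """Return every floor that ties for the highest visit count, sorted."""
    counts = Counter(floors)
    if not counts:
        return []
    top = max(counts.values())
    # Equivalent to keeping the rows where rank() = 1 in the SQL above.
    return sorted(floor for floor, count in counts.items() if count == top)

# B's floors after the two extra rows were added: 2, 2, 1, 1, 3.
b_floors = [2, 2, 1, 1, 3]
print(most_visited_floors(b_floors))
```

Floors 1 and 2 both have two visits, so both are reported, matching the `1,2` in the result for B.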
Try this:
import spark.implicits._ // needed for toDF; already in scope in spark-shell
val df = Seq(
( "A", "Bangalore", "a*.com", 1, "cpu" ),
( "A", "Bangalore", "a*.com", 1, "cpu" ),
( "A", "Bangalore", "a*.com", 2, "desktop" ),
( "B", "Bangalore", "a*.com", 2, "desktop" ),
( "B", "Bangalore", "a*.com", 2, "desktop" ),
( "B", "Bangalore", "a*.com", 1, "monitor" ),
).toDF("name" ,"address", "email", "floor", "resource")
df.createOrReplaceTempView("R")
val res = spark.sql("""
select A.name, A.total_visits, B.floor, C.resources from (
select R.name, count(*) as total_visits
from R
group by R.name ) A,
(
select Z.name, Z.floor, Z.most_visited
from (
select X.*, rank() over (partition by X.name order by X.most_visited desc) as RANK
from (
select R.name, R.floor, count(R.floor) as most_visited
from R
group by R.name, R.floor) X ) Z
where Z.RANK = 1 ) B,
(
select R.name, array_sort(collect_set(resource)) as resources
from R
group by R.name ) C
where A.name = B.name and B.name = C.name
""")
res.show(false)
It returns:
+----+------------+-----+------------------+
|name|total_visits|floor|resources         |
+----+------------+-----+------------------+
|A   |3           |1    |[cpu, desktop]    |
|B   |3           |2    |[desktop, monitor]|
+----+------------+-----+------------------+
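Since the question says any SQL dialect is fine, here is a hedged sketch of the same rank-and-join idea written with CTEs and run through Python's built-in `sqlite3` module, so it can be tried without a Spark session. It assumes SQLite 3.25 or later (for window functions); the table name `R` follows the Scala example, and `group_concat(distinct ...)` stands in for `collect_set`, so the order of the resources within the string is not guaranteed.

```python
import sqlite3

rows = [
    ("A", "Bangalore", "A@gmail.com", 1, "CPU"),
    ("A", "Bangalore", "A@gmail.com", 1, "CPU"),
    ("A", "Bangalore", "A@gmail.com", 2, "DESKTOP"),
    ("B", "Bangalore", "B1@gmail.com", 2, "DESKTOP"),
    ("B", "Bangalore", "B1@gmail.com", 2, "DESKTOP"),
    ("B", "Bangalore", "B1@gmail.com", 1, "MONITIOR"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table R (name text, address text, email text, floor integer, resource text)"
)
conn.executemany("insert into R values (?, ?, ?, ?, ?)", rows)

query = """
with per_floor as (          -- visits per (name, floor)
    select name, floor, count(*) as visits
    from R
    group by name, floor
),
ranked as (                  -- rank floors within each name by visits
    select name, floor,
           rank() over (partition by name order by visits desc) as rnk
    from per_floor
),
totals as (                  -- total visits and distinct resources per name
    select name,
           count(*) as total_visits,
           group_concat(distinct resource) as resources_used
    from R
    group by name
)
select t.name, t.total_visits, r.floor, t.resources_used
from totals t
join ranked r on r.name = t.name and r.rnk = 1
order by t.name, r.floor
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

As with the Spark version, a name with tied floors produces one output row per tied floor, because every tied floor gets rank 1.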