Need help getting the desired output from this input: total_visits, most_visited_floor, and resources_used

Input:

name address email floor resources
A Bangalore A@gmail.com 1 CPU
A Bangalore A@gmail.com 1 CPU
A Bangalore A@gmail.com 2 DESKTOP
B Bangalore B1@gmail.com 2 DESKTOP
B Bangalore B1@gmail.com 2 DESKTOP
B Bangalore B1@gmail.com 1 MONITIOR

Desired output:

name total_visits most_visited_floor resources_used
A 3 1 CPU,DESKTOP
B 3 2 DESKTOP,MONITIOR

I came up with the following code and approach using spark-sql, but an answer in ms-sql or sql-server is fine too; anything works.

select name, concat_ws(',', collect_set(resources)) as resources_used, count(*) as total_visits 
from resources_table 
group by name

I can't work out how to compute the most_visited_floor column to get the desired output.

Thanks for the help.

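What you are looking for is called the mode in statistics: the value that occurs most often.
Search for Mode + SQL and you will find endless blogs and posts.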

There are several ways to get the mode.

Here is one option, assuming there is only a single mode value per name:

with 
t (name,address,email,floor,resources) as
(
    select  *
    from    values   ('A' ,'Bangalore' ,'A@gmail.com'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'A@gmail.com'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'A@gmail.com'  ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,1  ,'MONITIOR')
),
t1 as
(
  select  * ,count(*) over (partition by name, floor) as count_name_floor
  from    t
)
select   name
        ,count(*)                              as total_visits
        ,max((count_name_floor,floor)).floor   as most_visited_floor
        ,concat_ws(',',collect_set(resources)) as resources_used
from     t1
group by name

name total_visits most_visited_floor resources_used
B 3 2 MONITIOR,DESKTOP
A 3 1 DESKTOP,CPU
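
The max over a (count, floor) pair can be illustrated outside Spark. The following is a minimal sketch on plain Scala collections, not the query itself: the object name `SingleModeSketch` and the function `mostVisitedFloor` are made up for illustration, and only the name and floor columns of the question's input are kept. It assumes the pair is compared field by field, count first, which is how Scala orders tuples and, as far as I understand, how Spark orders structs.

```scala
object SingleModeSketch {
  // (name, floor) pairs from the question's input; the other columns are omitted.
  val visits: Seq[(String, Int)] = Seq(
    ("A", 1), ("A", 1), ("A", 2),
    ("B", 2), ("B", 2), ("B", 1)
  )

  // For each name: count the visits per floor, then take the largest
  // (count, floor) pair. The count decides first; the floor only breaks ties.
  def mostVisitedFloor(rows: Seq[(String, Int)]): Map[String, Int] =
    rows.groupBy(_._1).map { case (name, group) =>
      // toSeq keeps every (count, floor) pair, even when two floors share a count.
      val countAndFloor = group.groupBy(_._2).toSeq.map { case (floor, g) => (g.size, floor) }
      name -> countAndFloor.max._2
    }
}
```

One consequence of this comparison: if two floors tie on count, the higher floor number wins, so the result is still a single value but not the only mode.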

Here is another option, assuming there may be multiple mode values.
I added 2 rows to the input to make it more interesting.

with 
t (name,address,email,floor,resources) as
(
    select  *
    from    values   ('A' ,'Bangalore' ,'A@gmail.com'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'A@gmail.com'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'A@gmail.com'  ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,1  ,'MONITIOR')
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,1  ,'MONITIOR')
                    ,('B' ,'Bangalore' ,'B1@gmail.com' ,3  ,'MONITIOR')
),
t1 as
(
  select  * ,count(*) over (partition by name, floor) as count_name_floor
  from    t
),
t2 as
(
  select  * ,rank() over (partition by name order by count_name_floor desc) as rank_count_name_floor
  from    t1
)
select   name
        ,count(*)                                                                      as total_visits
        ,concat_ws(',',collect_set(case rank_count_name_floor when 1 then floor end))  as most_visited_floors
        ,concat_ws(',',collect_set(resources))                                         as resources_used
from     t2
group by name

name total_visits most_visited_floors resources_used
A 3 1 DESKTOP,CPU
B 5 1,2 MONITIOR,DESKTOP
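
As a cross-check of the tie handling, here is a small sketch of the same idea on plain Scala collections rather than Spark. The names `AllModesSketch` and `mostVisitedFloors` are hypothetical, and only the name and floor columns of the extended input are kept. It keeps every floor whose visit count equals the maximum for that name, which is the set of rows that rank 1 selects.

```scala
object AllModesSketch {
  // (name, floor) pairs from the extended input, including the 2 added rows for B.
  val visits: Seq[(String, Int)] = Seq(
    ("A", 1), ("A", 1), ("A", 2),
    ("B", 2), ("B", 2), ("B", 1), ("B", 1), ("B", 3)
  )

  // For each name: count the visits per floor, find the highest count,
  // and return every floor that reaches it, in ascending order.
  def mostVisitedFloors(rows: Seq[(String, Int)]): Map[String, Seq[Int]] =
    rows.groupBy(_._1).map { case (name, group) =>
      val counts: Map[Int, Int] =
        group.groupBy(_._2).map { case (floor, g) => floor -> g.size }
      val top = counts.values.max
      name -> counts.collect { case (floor, n) if n == top => floor }.toSeq.sorted
    }
}
```

For B, floors 1 and 2 both have 2 visits and floor 3 has 1, so both 1 and 2 are returned.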

Try this:

import spark.implicits._  // provides toDF; already in scope in spark-shell

val df = Seq( 
              ( "A", "Bangalore", "a*.com", 1, "cpu" ),
              ( "A", "Bangalore", "a*.com", 1, "cpu" ),
              ( "A", "Bangalore", "a*.com", 2, "desktop" ),
              ( "B", "Bangalore", "a*.com", 2, "desktop" ),
              ( "B", "Bangalore", "a*.com", 2, "desktop" ),
              ( "B", "Bangalore", "a*.com", 1, "monitor" ),
             ).toDF("name" ,"address", "email", "floor", "resource")

df.createOrReplaceTempView("R")

val res = spark.sql(""" 

                      select A.name, A.total_visits, B.floor, C.resources from (  
                        select R.name, count(*) as total_visits 
                          from R
                      group by R.name  ) A,
               
                        (
                        select Z.name, Z.floor, Z.most_visited
                          from (
                        select X.*, rank() over (partition by X.name order by X.most_visited desc) as RANK
                          from (
                                select R.name, R.floor, count(R.floor) as most_visited 
                                  from R
                              group by R.name, R.floor) X ) Z     
                        where Z.RANK = 1 ) B, 

                        (
                        select R.name, array_sort(collect_set(resource)) as resources 
                          from R
                      group by R.name ) C
                    where A.name = B.name and B.name = C.name
                         
                    """)
res.show(false)

It returns:

+----+------------+-----+------------------+
|name|total_visits|floor|resources         |
+----+------------+-----+------------------+
|A   |3           |1    |[cpu, desktop]    |
|B   |3           |2    |[desktop, monitor]|
+----+------------+-----+------------------+
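
The query builds each output column in its own subquery and joins them on name. To sanity-check the expected result without a Spark session, the same per-name summary can be sketched on plain Scala collections. Everything named here (`VisitSummarySketch`, `Visit`, `Summary`, `summarize`) is hypothetical and only meant as an illustration. One difference to be aware of: the rank-based subquery returns one row per tied floor, while this sketch picks a single floor (the higher one on a tie).

```scala
object VisitSummarySketch {
  final case class Visit(name: String, floor: Int, resource: String)
  final case class Summary(totalVisits: Int, mostVisitedFloor: Int, resourcesUsed: Seq[String])

  // Same rows as the DataFrame above, without the address and email columns.
  val visits: Seq[Visit] = Seq(
    Visit("A", 1, "cpu"), Visit("A", 1, "cpu"), Visit("A", 2, "desktop"),
    Visit("B", 2, "desktop"), Visit("B", 2, "desktop"), Visit("B", 1, "monitor")
  )

  def summarize(rows: Seq[Visit]): Map[String, Summary] =
    rows.groupBy(_.name).map { case (name, group) =>
      // Floor with the highest visit count; ties go to the higher floor number.
      val topFloor =
        group.groupBy(_.floor).toSeq.map { case (floor, g) => (g.size, floor) }.max._2
      // Distinct resources in sorted order, like array_sort(collect_set(resource)).
      val resources = group.map(_.resource).distinct.sorted
      name -> Summary(group.size, topFloor, resources)
    }
}
```

Calling `VisitSummarySketch.summarize(VisitSummarySketch.visits)` gives 3 visits, floor 1, and cpu, desktop for A, and 3 visits, floor 2, and desktop, monitor for B, matching the table above.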