Hive counting same field many times

Question

I need to count how many students come from each college, but when I use the following query

select college ,COUNT(*) from students group by college ;

I get this result

The result shows several different counts for the same college. What should I do here to get the correct count per college?

Answer 1

It looks like the same college appears under several different spellings, for example:

JIIT
"JIIT
jiit

Try normalizing them (convert to upper case and remove '"') so that after the group by they all collapse into the same value, JIIT:

 select case when college = 'BSA' then 'BSA College of Technology'
        --add other cases
        else --rule for others
            trim(upper(regexp_replace(college,'"',''))) 
         end as college 
       ,COUNT(*)                                    as cnt 
   from students 
  group by 
        case when college = 'BSA' then 'BSA College of Technology'
        --add other cases
        else --rule for others
            trim(upper(regexp_replace(college,'"',''))) 
         end --the same expression must be repeated in group by, or use a subquery instead
;
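
The subquery variant mentioned in the last comment could look like the sketch below. It assumes the same students table and college column as in the question; the alias s and the column name college_norm are made up for illustration. The normalization expression is written only once, in the inner query, and the outer query groups by its result.

```sql
-- Sketch only: same normalization rule as above, written once in a subquery.
-- "s" and "college_norm" are illustrative names, not part of the original schema.
select s.college_norm as college
      ,count(*)       as cnt
  from (select case when college = 'BSA' then 'BSA College of Technology'
                    --add other cases
                    else trim(upper(regexp_replace(college,'"','')))
                end as college_norm
          from students
       ) s
 group by s.college_norm
;
```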

Use additional case branches to map more complex variants (such as MJP ROHILKHAND and M J P ROHILKHAND) to the same string.
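
As an illustration of such a branch, the sketch below folds both spellings into one value by comparing the name with everything except the letters A-Z stripped out. The two inline rows stand in for the students table, and choosing 'MJP ROHILKHAND' as the canonical spelling is an assumption. It also assumes a Hive version that allows select without a from clause (0.13 or later).

```sql
-- Sketch only: the inner "t" holds two sample rows instead of the real table.
select n.college_norm as college
      ,count(*)       as cnt
  from (select case when regexp_replace(upper(college),'[^A-Z]','') = 'MJPROHILKHAND'
                    then 'MJP ROHILKHAND'          --assumed canonical spelling
                    else trim(upper(regexp_replace(college,'"','')))
                end as college_norm
          from (select 'MJP ROHILKHAND'   as college
                union all
                select 'M J P ROHILKHAND' as college
               ) t
       ) n
 group by n.college_norm
;
```

Both sample rows reduce to MJPROHILKHAND once the spaces are removed, so they land in a single group.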

This happens because the database is not normalized and input to the College column of the college dimension is not restricted.

hive

hiveql

hadoop2