Hive: How to output total row count as a variable

I have a dataset that I am deduplicating with the following code:

select session_id, sol_id, id, session_context_code, date
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
        order by session_id, sol_id, date

I want to add a variable that stores the total number of rows after deduplication, so I tried using count(*):

select session_id, sol_id, id, session_context_code, date,count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
        order by session_id, sol_id, date

The error I get:

ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 1:44 Expression not in GROUP BY key 'session_id'

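I just want to output a single count as a variable: after removing duplicates by row number, count all distinct records by session_id and sol_id. How can I incorporate this into the code?

Following Gomz's suggestion, I now get this error: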

ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:614 missing EOF at 'group' near 'nifi_date'

Code:

select session_id, solicit_id, nifi_date,id, session_context_code,count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id) as rn,
        substr(case_id,2,9) as id
        from df.t1_data
         )undup
        where undup.rn =1 and 
        session_context_code in ("4","3") and
        order by session_id, sol_id, nifi_date
        group by session_id, sol_id, nifi_date,id, session_context_code

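A Hive query that selects COUNT(*) alongside other columns must list those columns in a GROUP BY clause, which goes after the WHERE clause and before any ORDER BY.

Some examples: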

SELECT COUNT(*) FROM employees;

SELECT id, name, COUNT(*) FROM employees GROUP BY id, name;

In your scenario, the query should look like this:

select session_id, sol_id, id, session_context_code, count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
        substr(case_id,2,9) as id
        from df.t1_data
         ) undup
        where undup.rn = 1
    group by session_id, sol_id, id, session_context_code
    -- order only by grouped columns; date is not in the GROUP BY
    order by session_id, sol_id

You can read more HERE.
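The ParseException in the follow-up ("missing EOF at 'group'") appears to come from two things in that query: a dangling `and` after the last WHERE condition, and ORDER BY placed before GROUP BY. Below is a sketch of the same query with the clauses reordered. It assumes the column is called sol_id everywhere (the posted query mixes solicit_id and sol_id), so adjust the name to match the real table.

```sql
-- Sketch only: the asker's second query with the clause order fixed.
-- Assumption: the column is sol_id throughout (the post mixes solicit_id and sol_id).
SELECT session_id, sol_id, nifi_date, id, session_context_code,
       COUNT(*) AS total
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY session_id, sol_id) AS rn,
           substr(case_id, 2, 9) AS id
    FROM df.t1_data
) undup
WHERE undup.rn = 1
  AND session_context_code IN ('4', '3')   -- no trailing AND after the last condition
GROUP BY session_id, sol_id, nifi_date, id, session_context_code
ORDER BY session_id, sol_id, nifi_date;    -- ORDER BY comes last
```

Note that grouping by every selected column gives one count per distinct combination of those columns, not one grand total.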

Update: if you only want to count all distinct records by session_id and sol_id, the query can be written as follows:

select session_id, sol_id, count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
        substr(case_id,2,9) as id
        from df.t1_data
         ) undup
        where undup.rn = 1
    group by session_id, sol_id
    -- order only by grouped columns; date is not in the GROUP BY
    order by session_id, sol_id;

As mentioned above, the SELECT and GROUP BY clauses should contain only the columns you want to count by.
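If the goal is a single grand total of the deduplicated rows rather than a count per group, one possible alternative (not part of the original suggestion) is to skip GROUP BY altogether. The sketch below shows two variants against the table from the question: a plain aggregate that returns one number, and a window count with an empty OVER () that repeats the same total on every detail row.

```sql
-- Variant 1: a single row holding the number of rows left after deduplication.
SELECT COUNT(*) AS total
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) AS rn
    FROM df.t1_data
) undup
WHERE undup.rn = 1;

-- Variant 2: keep every detail column and attach the same total to each row.
-- The window is evaluated after the WHERE filter, so it counts only rn = 1 rows.
SELECT session_id, sol_id, id, session_context_code, date,
       COUNT(*) OVER () AS total
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) AS rn,
           substr(case_id, 2, 9) AS id
    FROM df.t1_data
) undup
WHERE undup.rn = 1
ORDER BY session_id, sol_id, date;
```

Variant 2 avoids the "Expression not in GROUP BY key" error because no GROUP BY is involved; the trade-off is that the total is duplicated on every output row.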

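If the result needs more columns than the ones you are counting by, you can create a temporary table that contains only the counted columns and join it back to the original table. For example, if you need the count by columns a and b but also want columns c, d, e, and f from the table, you can do the following: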

CREATE TABLE tmp AS
SELECT a, b, count(*) AS total
FROM table1
GROUP BY a, b;

Then JOIN tmp and table1 on columns a and b:

SELECT y.a, y.b, y.total, x.c, x.d, x.e, x.f
FROM tmp y
JOIN table1 x
  ON y.a = x.a
 AND y.b = x.b;
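As a possible alternative to the temporary table and join, a window function can attach the per-group count to each row in one pass. This is only a sketch using the same placeholder names (table1, a through f) as the example above.

```sql
-- Sketch: per-(a, b) count on every row, without a temporary table or join.
SELECT a, b, c, d, e, f,
       COUNT(*) OVER (PARTITION BY a, b) AS total
FROM table1;
```

One behavioral difference to be aware of: the join version drops rows where a or b is NULL, because NULL never equals NULL in a join condition, whereas the window version keeps those rows and counts them together as one partition.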

Hope this helps!