Hive query counts of fields where fields are populated

Question

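I have a huge Hive table consisting of ten product fields, a purchase date field, and an identifier. The product fields are named prod1, prod2, ..., prod10 and refer to the ten most recently purchased products. For most IDs, we do not have purchase history going back ten products.

I'd like to build a distribution of population rates for each prod<X> field, i.e. the share of rows in which that field is filled in, to show the breakdown of purchase history across the whole dataset.

Currently, I run a bash script that runs ten consecutive queries against the table, such as: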

hive -e "select count(1) from db.tbl where prod<X> != '';"

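... and saves the output to a file. This seems clunky and inefficient. Is there a better way to specify Hive counts over a range of fields, with a condition on each field? I tried to come up with a strategy using group by, or even mapping over a range of fields, but I can't quite work out how to specify the != '' condition for each field.

Thanks in advance for any guidance.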

Answer 1

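-- Count populated values of each product field in a single scan.
-- "prod<X> != ''" is NULL for NULL values, so NULLs fall through to "else 0",
-- matching the "where prod<X> != ''" filter in the question.
-- With "group by id" the counts are per id; drop "id" and the "group by"
-- clause to get totals for the whole dataset.
select id,
sum(case when prod1 != '' then 1 else 0 end) as prod1_cnt,
sum(case when prod2 != '' then 1 else 0 end) as prod2_cnt,
sum(case when prod3 != '' then 1 else 0 end) as prod3_cnt,
sum(case when prod4 != '' then 1 else 0 end) as prod4_cnt,
sum(case when prod5 != '' then 1 else 0 end) as prod5_cnt,
sum(case when prod6 != '' then 1 else 0 end) as prod6_cnt,
sum(case when prod7 != '' then 1 else 0 end) as prod7_cnt,
sum(case when prod8 != '' then 1 else 0 end) as prod8_cnt,
sum(case when prod9 != '' then 1 else 0 end) as prod9_cnt,
sum(case when prod10 != '' then 1 else 0 end) as prod10_cnt
from db.tbl group by id;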
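
Since the question already drives Hive from a shell script, the same conditional-sum idea could be generated rather than typed out ten times. The following is only a sketch under some assumptions: build_query, the prod<X>_populated aliases, the total_rows column and the counts.tsv file name are made-up names for illustration, and db.tbl is the placeholder table from the question. It omits the id grouping so that the result is a single row of dataset-wide counts, with total_rows available as the denominator for the population rates.

```shell
#!/bin/sh
# Print one HiveQL statement that counts populated values of
# prod1..prod<n> in a single pass over the table.
# Usage: build_query <table> <number_of_prod_fields>
build_query() {
  table="$1"
  n="$2"
  i=1
  echo "select"
  while [ "$i" -le "$n" ]; do
    # NULL != '' evaluates to NULL, so NULL values land in "else 0".
    echo "  sum(case when prod${i} != '' then 1 else 0 end) as prod${i}_populated,"
    i=$((i + 1))
  done
  # Total row count, usable as the denominator for each field's rate.
  echo "  count(1) as total_rows"
  echo "from ${table};"
}

# Show the generated statement. To run it against Hive instead, one option is:
#   hive -e "$(build_query db.tbl 10)" > counts.tsv
build_query db.tbl 10
```

Dividing each prod<X>_populated value by total_rows then gives the population rate for that field, from one query instead of ten.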

hadoop

hive

hiveql

apache-hive