Hive TRANSFORM receives NULL for concatenated array values

Question

I have a Hive table of the form:

    col1        col2     col3
    a1          b1       c1
    a1          b1       c2
    a1          b2       c2
    a1          b2       c3
    a2          b3       c1
    a2          b4       c1
    a2          b4       c2
    a2          b4       c3
    .
    .

Each value of col1 can have multiple values of col2, and each (col1, col2) pair can have multiple values of col3.

I am running the query [Q]:

select col1, col2, collect_list(col3) from {table} group by col1, col2;

which returns:

a1   b1   [c1, c2]
a1   b2   [c2, c3]
a2   b3   [c1]
a2   b4   [c1, c2, c3]

I want to apply some transformations with a Python UDF, so I use the TRANSFORM clause to pass all of these columns to the UDF:

select TRANSFORM ( * ) using 'python udf.py' FROM 
(
select col1, col2, concat_ws('\t', collect_list(col3)) from {table} group by col1, col2;
)

I am using concat_ws to convert the array output of collect_list into a string joined by a delimiter. I get col1 and col2 in the result, but not col3.

+---------+---------+
|      key|    value|
+---------+---------+
|a1       | b1      |
|         |     null|
|a1       | b2      |
|         |     null|
|a2       | b3      |
|         |     null|
|a2       | b4      |
|         |     null|
+---------+---------+

My UDF contains only a print statement that prints each line received from standard input.

import sys
for line in sys.stdin:
    try:
        print line
    except Exception as e:
        continue

Can anyone help me figure out why col3 does not show up in my UDF?

Answer 1

First, you need to parse each line in the Python UDF, for example:

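import sys

for line in sys.stdin:
    try:
        # Drop the trailing newline, then split the tab-delimited columns.
        line = line.strip('\n')
        col1, col2, col3 = line.split('\t')
        # A parenthesized print works on both Python 2 and Python 3.
        print('\t'.join([col1, col2, col3]))
    except Exception as e:
        # Skip rows that fail to parse (e.g. not exactly three columns).
        continue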
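One caveat with the loop above: the bare `except ... continue` silently drops any row that does not split into exactly three fields, which makes this kind of problem hard to spot. Below is a minimal sketch of a variant that reports such rows on stderr instead. The names `parse_row` and `main` are made up for this illustration, and it assumes the script receives one tab-delimited row per line.

```python
import sys


def parse_row(line, expected=3):
    """Split one tab-delimited input line into exactly `expected` fields."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != expected:
        raise ValueError('expected %d fields, got %d: %r'
                         % (expected, len(fields), fields))
    return fields


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Echo well-formed rows to stdout; report malformed rows on stderr."""
    for line in stdin:
        try:
            col1, col2, col3 = parse_row(line)
        except ValueError as e:
            # Make the dropped row visible instead of skipping it silently.
            stderr.write('skipping row: %s\n' % e)
            continue
        stdout.write('\t'.join([col1, col2, col3]) + '\n')
```

To use this as `udf.py`, end the file with a call to `main()`. Writing with `stdout.write` also avoids the doubled newline that `print line` produces when the line still carries its own trailing `'\n'`.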

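Second, it is better to use something other than \t as the separator in concat_ws. By default TRANSFORM passes the columns to the script as a tab-separated line, so a tab inside col3 cannot be told apart from the tabs between columns: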

select TRANSFORM ( * ) using 'python udf.py' as (col1, col2, col3)
FROM
(
select col1, col2, concat_ws(',', collect_list(col3)) as col3 from {table} group by col1, col2
) t;
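To see why the separator matters, here is a small sketch that imitates how one row would reach the script. It assumes the default TRANSFORM behavior of sending columns as a single tab-separated line; `serialize_row` is a hypothetical helper written for this illustration, not a Hive API.

```python
def serialize_row(col1, col2, col3):
    # Assumption: columns are joined with tabs and terminated by a newline,
    # as in the default TRANSFORM input format.
    return '\t'.join([col1, col2, col3]) + '\n'


# col3 built with concat_ws('\t', ...) versus concat_ws(',', ...)
tab_row = serialize_row('a1', 'b1', '\t'.join(['c1', 'c2']))
comma_row = serialize_row('a1', 'b1', ','.join(['c1', 'c2']))

# With a tab inside col3 the line splits into four fields, so
# "col1, col2, col3 = ..." would raise ValueError for this row.
tab_fields = tab_row.rstrip('\n').split('\t')

# With a comma inside col3 the line splits into exactly three fields.
comma_fields = comma_row.rstrip('\n').split('\t')
col1, col2, col3 = comma_fields

# The original list can then be recovered inside the UDF.
values = col3.split(',')
```

If the values in col3 can themselves contain commas, pick a separator that cannot occur in the data.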

Tags: hive, user-defined-functions, hiveql, apache-spark, hive-udf