How to query multiple columns when using a Python UDF in Hive?

Question

I am trying to run this query:

add FILE /home/user1/test/test_udf.py;

SELECT a.hash_code, col2
FROM (SELECT transform (col2, col3) using 'python test_udf.py' as hash_code, col2
      FROM sample_table) a ;

The UDF generates hash_code successfully, but the other column (col2) is populated with NULL.

Sample output:

sjhfshhalksjlkfj128798172jasjhas   NULL
ajsdlkja982988290819189089089889   NULL
jhsad817982mnsandkjsahj982398290   NULL

Answer 1

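I see what is wrong with your HiveQL.

In transform (col2, col3) using 'python test_udf.py' as hash_code, col2 FROM sample_table, the values of both hash_code and col2 are parsed from the output of transform (col2, col3). The col2 after "as" is an output column name of the script, not the table column.

Because the script emits only one field per row, there is nothing left to parse into col2, so it comes out as NULL.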
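
To make this concrete, here is a minimal Python sketch that imitates how a line of script output is mapped onto the names listed after "as". It is a simplified model, not Hive's actual implementation; the function name parse_transform_output is hypothetical, and None stands in for NULL.

```python
def parse_transform_output(line, column_names):
    """Simplified model: split one script output line on tabs and
    pad any column that has no matching field with None (NULL)."""
    fields = line.rstrip("\n").split("\t")
    # Fewer fields than names: the remaining names get None.
    # Extra fields are simply ignored in this sketch.
    padded = fields + [None] * (len(column_names) - len(fields))
    return dict(zip(column_names, padded))


# The script prints only the hash, so col2 has no field to take its value from.
row = parse_transform_output("sjhfshhalksjlkfj128798172jasjhas\n",
                             ["hash_code", "col2"])
print(row)
```

With a single field on the line, the second name has nothing to bind to, which matches the NULL column in the sample output.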

I read the Transform documentation and found the following relevant information.

Transform/Map-Reduce syntax

SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?

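It is best not to mix transform with other select expressions, because the syntax does not support it.

Update:

There is a trick that gets you what you want: have test_udf.py output hash_code\tcol2 (the two values separated by a tab), so that both hash_code and col2 can be parsed from the script's output. That will solve your problem.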
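
A minimal sketch of what such a test_udf.py could look like is below. The original script is not shown in the question, so the hashing logic here (MD5 over col2 and col3 concatenated) is purely an assumption for illustration, as are the helper names transform_line and main. The sketch also assumes every input line carries exactly two tab-separated fields.

```python
import hashlib
import sys


def transform_line(line):
    """Turn one tab-separated input row (col2, col3) into 'hash_code<TAB>col2'."""
    col2, col3 = line.rstrip("\n").split("\t")
    # Hypothetical hash: MD5 of the two columns concatenated.
    # Replace this with whatever the real UDF computes.
    hash_code = hashlib.md5((col2 + col3).encode("utf-8")).hexdigest()
    # Echo col2 back so Hive can parse it as the second output column.
    return hash_code + "\t" + col2


def main(stream_in, stream_out):
    # Hive streams rows to the script on stdin, one row per line.
    for line in stream_in:
        stream_out.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main(sys.stdin, sys.stdout)
```

With a script along these lines, the outer SELECT is no longer needed and the query can be written as SELECT TRANSFORM (col2, col3) USING 'python test_udf.py' AS (hash_code, col2) FROM sample_table;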


python

hadoop

hive

udf