ImportError Python Hive UDF

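I want to put some constants in one Python file and import them into another. I created two files, one holding the constants and one importing them, and everything works fine locally: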

constants.py:

CONST = "hi guy"

test_constants.py:

from constants import CONST
import sys

for line in sys.stdin:
    print(CONST)

Local test:

$ echo "dummy" | python test_constants.py
hi guy

Test with Hive (Beeline):

hive> add file hdfs://path/.../test_constants.py;
No rows affected (0.191 seconds)
hive> add file hdfs://path/.../constants.py;
No rows affected (0.049 seconds)
hive> list files;
resource
/tmp/bb09f878-7e36-4aa2-8566-a30950072bcb_resources/test_constants.py
/tmp/bb09f878-7e36-4aa2-8566-a30950072bcb_resources/constants.py
2 rows selected (0.179 seconds)
hive> with t as (select 1 as dummy) 
  select transform (dummy) 
  using 'python test_constants.py' 
  as dummy_out 
  from t;
Error: org.apache.hive.service.cli.HiveSQLException: 
Error while processing statement: FAILED: 
Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. 
Vertex failed, vertexName=Map 1, vertexId=vertex_1535407036047_170618_1_00, diagnostics=[Task failed, taskId=task_1535407036047_170618_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1535407036047_170618_1_00_000000_0:
java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators

The log looks like this:

Log Type: stderr
Log Upload Time: Mon Oct 29 15:50:42 -0700 2018
Log Length: 251

2018-10-29 15:45:16 Starting to run new task attempt: attempt_1535407036047_170618_1_00_000000_3
Traceback (most recent call last):
  File "test_constants.py", line 1, in <module>
    from constants import CONST
ImportError: No module named constants

Both files appear to be in the same folder, so the import should work, but it does not.
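
To check where the task actually runs, a small diagnostic transform can help. The sketch below is only an illustration: the names describe_environment and run_transform are hypothetical and not part of the original scripts. It writes the working directory, sys.path[0], and the files visible in the working directory to stderr, which should end up in the same task log as the traceback above, and passes every input row through unchanged.

```python
import os
import sys


def describe_environment():
    """Return a short report: working directory, sys.path[0], files in cwd."""
    cwd = os.getcwd()
    entries = sorted(os.listdir(cwd))
    return "\n".join([
        "cwd: " + cwd,
        "sys.path[0]: " + (sys.path[0] if sys.path else "<empty>"),
        "files in cwd: " + ", ".join(entries),
    ])


def run_transform(stdin, stdout, stderr):
    """Echo each input row unchanged; the report goes to stderr only."""
    # Hive reads stdout as the transform result, so keep diagnostics off it.
    stderr.write(describe_environment() + "\n")
    for line in stdin:
        stdout.write(line)
```

Calling run_transform(sys.stdin, sys.stdout, sys.stderr) at the bottom of the script turns it into a transform that can be added with add file and run in place of test_constants.py. Comparing sys.path[0] with the location of constants.py in that report shows whether Python is searching the folder you expect.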

Added 2018-10-30:

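@serge_k's answer works. However, I had trouble at first because the path where my Python UDFs lived was not accessible to Hive. After moving all the files to /tmp on HDFS, everything worked as expected.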

hive> add file hdfs://dev/tmp/transforms;
No rows affected (0.108 seconds)
hive> list files;
resource
/tmp/61ecb363-ead6-4679-8f58-3611db9487b2_resources/transforms
1 row selected (0.202 seconds)
hive> select transform (col) using 'python transforms/test_constants.py' as dummy_out from dummy.test;
dummy_out
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
10 rows selected (63.734 seconds)

Put your Python scripts in a folder, e.g. files, add the whole folder to the distributed cache, and call the script as python files/script_name.py:

hive> add file ./files;
Added resources: [./files]
hive> with t as (select 1 as dummy) select transform (dummy) 
      using 'python files/test_constants.py' as dummy_out from t;

OK
hi guy
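
Before adding the folder to Hive, the same layout can be checked locally. The sketch below is a hypothetical helper (run_from_folder is not part of the original post, and it assumes Python 3.7+ for the helper itself). It recreates the two files from the question inside a temporary files folder and runs the script with a path relative to the parent directory, the same shape as the using clause above.

```python
import os
import subprocess
import sys
import tempfile


def run_from_folder(folder_name="files"):
    """Create <tmp>/<folder_name>/ with both scripts and run the transform from <tmp>."""
    with tempfile.TemporaryDirectory() as workdir:
        folder = os.path.join(workdir, folder_name)
        os.mkdir(folder)
        # The two files from the question, written out unchanged.
        with open(os.path.join(folder, "constants.py"), "w") as f:
            f.write('CONST = "hi guy"\n')
        with open(os.path.join(folder, "test_constants.py"), "w") as f:
            f.write(
                "from constants import CONST\n"
                "import sys\n"
                "\n"
                "for line in sys.stdin:\n"
                "    print(CONST)\n"
            )
        # Script path is relative to the working directory, as in the Hive call.
        result = subprocess.run(
            [sys.executable, os.path.join(folder_name, "test_constants.py")],
            input="dummy\n",
            capture_output=True,
            text=True,
            cwd=workdir,
            check=True,
        )
        return result.stdout
```

run_from_folder() returns "hi guy" followed by a newline, because Python puts the directory containing the script on sys.path, so constants.py is found next to test_constants.py. This only checks the folder layout locally; it does not exercise Hive or the distributed cache.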