hive spark yarn-cluster 作业失败并显示:"ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
我正在尝试 运行 BigInsights on Cloud 4.2 Enterprise 上访问 Hive table 的 pyspark 脚本 table。
首先我创建配置单元 table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
然后我创建一个简单的 pyspark 脚本:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
我尝试执行:
spark-submit --master yarn-cluster test_pokes.py
但是,我遇到错误:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 683, in _ssql_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 692, in _get_hive_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
...
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
...
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
我看过很多关于其他 Hadoop 发行版的类似帖子,但没有看到关于 BigInsights on Cloud 的帖子。
此错误的解决方案是添加罐子:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
然而,我得到了一个不同的错误:
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
我在这里添加了另一个问题:
最终解决方案在此处捕获:
我正在尝试 运行 BigInsights on Cloud 4.2 Enterprise 上访问 Hive table 的 pyspark 脚本 table。
首先我创建配置单元 table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
然后我创建一个简单的 pyspark 脚本:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
我尝试执行:
spark-submit --master yarn-cluster test_pokes.py
但是,我遇到错误:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 683, in _ssql_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 692, in _get_hive_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
...
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
...
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
我看过很多关于其他 Hadoop 发行版的类似帖子,但没有看到关于 BigInsights on Cloud 的帖子。
此错误的解决方案是添加罐子:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
然而,我得到了一个不同的错误:
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
我在这里添加了另一个问题:
最终解决方案在此处捕获: