How to use Spark Riak connector with pyspark?
I'm following the instructions at https://github.com/basho/spark-riak-connector, running Spark 2.0.2-hadoop2.7.
Things I've tried:
1) pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0
2) pyspark --driver-class-path /path/to/spark-riak-connector_2.11-1.6.0-uber.jar
3) Adding spark.driver.extraClassPath /path/to/jars/* to the master's spark-defaults.conf
4) Trying older versions of the connector (1.5.0 and 1.5.1)
I can verify in the master's web UI that the Riak jar is loaded in the pyspark application's environment. I also double-checked that Spark's Scala version is 2.11.
But no matter what I do, I cannot import pyspark_riak:
>>> import pyspark_riak
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_riak
How do I fix this?
When trying option #1, the jar is loaded and the resolution report looks fine:
:: modules in use:
com.basho.riak#riak-client;2.0.7 from central in [default]
com.basho.riak#spark-riak-connector_2.11;1.6.0 from central in [default]
com.fasterxml.jackson.core#jackson-annotations;2.8.0 from central in [default]
com.fasterxml.jackson.core#jackson-core;2.8.0 from central in [default]
com.fasterxml.jackson.core#jackson-databind;2.8.0 from central in [default]
com.fasterxml.jackson.datatype#jackson-datatype-joda;2.4.4 from central in [default]
com.fasterxml.jackson.module#jackson-module-scala_2.11;2.4.4 from central in [default]
com.google.guava#guava;14.0.1 from central in [default]
joda-time#joda-time;2.2 from central in [default]
org.erlang.otp#jinterface;1.6.1 from central in [default]
org.scala-lang#scala-reflect;2.11.2 from central in [default]
:: evicted modules:
com.fasterxml.jackson.core#jackson-core;2.4.4 by [com.fasterxml.jackson.core#jackson-core;2.8.0] in [default]
com.fasterxml.jackson.core#jackson-annotations;2.4.4 by [com.fasterxml.jackson.core#jackson-annotations;2.8.0] in [default]
com.fasterxml.jackson.core#jackson-databind;2.4.4 by [com.fasterxml.jackson.core#jackson-databind;2.8.0] in [default]
com.fasterxml.jackson.core#jackson-annotations;2.4.0 by [com.fasterxml.jackson.core#jackson-annotations;2.8.0] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 15 | 11 | 11 | 4 || 11 | 11 |
---------------------------------------------------------------------
Also, if I print sys.path, I can see /tmp/spark-b2396e0a-f329-4066-b3b1-4e8c21944a66/userFiles-7e423d94-5aa2-4fe4-935a-e06ab2d423ae/com.basho.riak_spark-riak-connector_2.11-1.6.0.jar (which I have verified exists).
The spark-riak-connector published in the repository does not ship pyspark support. However, you can build the Python package yourself and attach it to pyspark:
git clone https://github.com/basho/spark-riak-connector.git
cd spark-riak-connector/
python connector/python/setup.py bdist_egg # creates egg file inside connector/python/dist/
Then add the newly created egg to the Python path:
pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0
>>> import sys
>>> sys.path.append('connector/python/dist/pyspark_riak-1.0.0-py2.7.egg')
>>> import pyspark_riak
>>>
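As an alternative to appending to sys.path inside the REPL, the same egg should also be usable via pyspark's --py-files flag, which additionally distributes it to the executors (the path below is the one produced by the build step above; adjust it to your actual output):
pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0 --py-files connector/python/dist/pyspark_riak-1.0.0-py2.7.egg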
But be careful when using the spark-riak-connector with Spark 2.0.2: the latest package version appears to have been tested with Spark 1.6.2, so the API may not work as expected.
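Once the import succeeds, here is a minimal sketch of how the connector is typically driven from the pyspark shell (where sc already exists). riak_context, riakBucket, and queryAll are the names used in the connector's README, and test-bucket with the default bucket type is a placeholder, so adjust them to your cluster and connector version:
>>> import pyspark_riak
>>> pyspark_riak.riak_context(sc)                             # patch sc with the Riak-specific methods
>>> rdd = sc.riakBucket("test-bucket", "default").queryAll()  # read the whole KV bucket into an RDD
>>> rdd.collect()
The Riak connection endpoint is usually supplied at launch, e.g. with --conf spark.riak.connection.host=127.0.0.1:8087.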