java.lang.AbstractMethodError: com/ibm/stocator/fs/common/IStoreClient.setStocatorPath(Lcom/ibm/stocator/fs/common/StocatorPath;)V
I am trying to access data on IBM COS from Data Science Experience, following this blog post.
First, I selected the 1.0.8 version of stocator ...
!pip install --user --upgrade pixiedust
import pixiedust
pixiedust.installPackage("com.ibm.stocator:stocator:1.0.8")
Restarted the kernel, and then ...
access_key = 'xxxx'
secret_key = 'xxxx'
bucket = 'xxxx'
host = 'lon.ibmselect.objstor.com'
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", "http://" + host)
hconf.set("fs.s3d.service.access.key", access_key)
hconf.set("fs.s3d.service.secret.key", secret_key)
file = 'mydata_file.tsv.gz'
inputDataset = "s3d://{}.service/{}".format(bucket, file)
lines = sc.textFile(inputDataset, 1)
lines.count()
However, this results in the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.AbstractMethodError: com/ibm/stocator/fs/common/IStoreClient.setStocatorPath(Lcom/ibm/stocator/fs/common/StocatorPath;)V
at com.ibm.stocator.fs.ObjectStoreFileSystem.initialize(ObjectStoreFileSystem.java:104)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:249)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:249)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:249)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
at org.apache.spark.rdd.RDD$$anonfun$collect.apply(RDD.scala:932)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:378)
at org.apache.spark.rdd.RDD.collect(RDD.scala:931)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:507)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:785)
Note: my first attempt to connect to IBM COS failed with a different error. That attempt is captured here: No FileSystem for scheme: cos
Chris, I usually don't use 'http://' in the endpoint, and that works for me. Not sure whether that is the problem here.
Here is how I access COS objects from a DSX notebook:
endpoint = "s3-api.dal-us-geo.objectstorage.softlayer.net"
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", endpoint)
hconf.set("fs.s3d.service.access.key", Access_Key_ID)
hconf.set("fs.s3d.service.secret.key", Secret_Access_Key)
inputObject = "s3d://<bucket>.service/<file>"
myRDD = sc.textFile(inputObject,1)
DSX preinstalls a version of stocator on the classpath of the Spark 2.0 and Spark 2.1 kernels. The version you installed into your instance is likely conflicting with the preinstalled one.
There is no need to install stocator; it is already there. As Roland said, a new install will most likely clash with the preinstalled one and cause conflicts.
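If you already ran the pixiedust install above, a quick way back is to delete the user-installed jar and restart the kernel. Here is a minimal sketch; the `~/data/libs` path is an assumption based on PixieDust's default install location, so confirm it against the output `installPackage` printed on your instance:

```python
import glob
import os


def find_user_stocator_jars(lib_dir=os.path.expanduser("~/data/libs")):
    """Return any stocator jars in the user's PixieDust library directory.

    ~/data/libs is an assumed default; check pixiedust.installPackage's
    output for the actual path on your instance.
    """
    return sorted(glob.glob(os.path.join(lib_dir, "*stocator*.jar")))


# Removing the jars it finds and restarting the kernel falls back to
# the preinstalled Stocator on the Spark classpath:
# for jar in find_user_stocator_jars():
#     os.remove(jar)
```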
Try ibmos2spark:
Let me know if you still run into problems.
Don't force-install a new Stocator unless you have a very good reason to.
I highly recommend the Spark aaS documentation at:
https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic1.html#genTopProcId2
Please pick the correct COS endpoint from:
https://ibm-public-cos.github.io/crs-docs/endpoints
If you are working inside IBM Cloud, use the private endpoint. It will be faster and cheaper.
The documentation has examples of accessing COS data with all the helpers. It boils down to:
import ibmos2spark
credentials = {
    'endpoint': 's3-api.us-geo.objectstorage.service.networklayer.com',  # just an example; your URL might be different
    'access_key': 'my access key',
    'secret_key': 'my secret key'
}
bucket_name = 'my bucket name'
object_name = 'mydata_file.tsv.gz'
cos = ibmos2spark.CloudObjectStorage(sc, credentials)
lines = sc.textFile(cos.url(object_name, bucket_name),1)
lines.count()
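As a rough illustration of the public/private endpoint naming difference mentioned above: the two endpoint forms that appear in this thread differ only in their network domain. The mapping below is a hypothetical helper based solely on those two examples; always verify the result against the official endpoints list before using it.

```python
def to_private_endpoint(endpoint):
    """Map a public COS endpoint name to its private counterpart.

    Illustrative only: swaps the public 'softlayer.net' domain for the
    private 'service.networklayer.com' one, matching the two endpoint
    forms used elsewhere in this thread. Verify against the official
    endpoints list before relying on it.
    """
    return endpoint.replace("objectstorage.softlayer.net",
                            "objectstorage.service.networklayer.com")


public = "s3-api.us-geo.objectstorage.softlayer.net"
print(to_private_endpoint(public))
# s3-api.us-geo.objectstorage.service.networklayer.com
```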