Cassandra with PySpark and Python >=3.6

I'm new to Cassandra and PySpark. I initially installed Cassandra 3.11.1, OpenJDK 1.8, PySpark 3.x, and Scala 1.12. After running my Python server, I got a lot of errors, as shown below:

raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.NoClassDefFoundError: scala/Product$class
        at com.datastax.spark.connector.util.ConfigParameter.<init>(ConfigParameter.scala:7)
        at com.datastax.spark.connector.rdd.ReadConf$.<init>(ReadConf.scala:33)
        at com.datastax.spark.connector.rdd.ReadConf$.<clinit>(ReadConf.scala)
        at org.apache.spark.sql.cassandra.DefaultSource$.<init>(DefaultSource.scala:134)
        at org.apache.spark.sql.cassandra.DefaultSource$.<clinit>(DefaultSource.scala)
        at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load(DataFrameReader.scala:307)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 23 more
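The NoClassDefFoundError for scala/Product$class above is the classic symptom of a Scala binary-version mismatch: a connector JAR built for Scala 2.11 was loaded into a Spark build that uses Scala 2.12 (that class no longer exists in 2.12). The connector's Maven artifact id carries the Scala version as a suffix, and that suffix has to match the Scala version of your Spark build. A small helper, as a sketch, to build the coordinate (the function name is hypothetical; the coordinate format matches the connector's published artifacts):

```python
def connector_coordinate(scala_binary: str, connector_version: str) -> str:
    """Build the spark-cassandra-connector Maven coordinate.

    The artifact id carries the Scala binary version (e.g. _2.12).
    It must match the Scala version your Spark build was compiled
    against; otherwise classloading fails with errors like
    NoClassDefFoundError: scala/Product$class.
    """
    return f"com.datastax.spark:spark-cassandra-connector_{scala_binary}:{connector_version}"

# Spark 3.x is built against Scala 2.12, so the suffix must be _2.12:
coord = connector_coordinate("2.12", "3.1.0")
print(coord)  # com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
```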

I didn't know exactly what this error meant, but after some research I realized there was a problem with the PySpark-Cassandra connection, so I checked the versions too. In my research I found that Cassandra versions other than 4.x are not compatible with Python 3.9. I uninstalled Cassandra and tried to install the Cassandra 4 distribution, but that gave me another set of errors after running the command:

wget http://mirror.cogentco.com/pub/apache/cassandra/4.0-beta2/apache-cassandra-4.0-beta2-bin.tar.gz

Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cassandra : Depends: python3 (>= 3.6) but 3.5.1-3 is to be installed
             Recommends: ntp but it is not going to be installed or
                         time-daemon
E: Unable to correct problems, you have held broken packages.
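The apt failure above is a dependency pin, not a Cassandra bug: the cassandra package declares python3 (>= 3.6), but the machine's default python3 is 3.5.1. A quick sanity check of the interpreter version before installing, as a sketch:

```python
import sys

# Cassandra 4.x packaging requires Python >= 3.6, per the apt error above.
major, minor = sys.version_info[:2]
compatible = (major, minor) >= (3, 6)
print(f"python {major}.{minor}: {'ok' if compatible else 'too old, upgrade python3 first'}")
```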

Can someone help me understand this problem? How do I install Cassandra and PySpark with Python 3.9? Is it a version incompatibility?

Updating the question based on the answer:

I have updated my versions on another machine.

Currently, I am using the following versions: PySpark 3.0.1, Cassandra 4.0, cqlsh 5.0.1, Python 3.6, Scala 2.12.

I tried connector versions 3.0.0 and 3.1.0; both give me errors:

UNRESOLVED DEPENDENCY: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1389)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
        at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:1007)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


.......
        raise Exception("Java gateway process exited before sending its port number")
    Exception: Java gateway process exited before sending its port number

The connection string used was: --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell, since the PySpark version is now 3.0.1.

You are using the wrong version of the Cassandra connector: if you are using PySpark 3.x, you need to get the corresponding connector version, 3.0 or 3.1. Your version is for older versions of Spark:

pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
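The same flags can also be set programmatically before the session is created; a sketch assuming PySpark 3.x and a Cassandra node on localhost (the keyspace and table names in the comments are placeholders):

```python
import os

# Mirror of the command line above; must be set before the JVM starts.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 "
    "--conf spark.cassandra.connection.host=127.0.0.1 "
    "pyspark-shell"
)

# With a live cluster, the read would then look like this (not executed here):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("cassandra-demo").getOrCreate()
# df = (spark.read.format("org.apache.spark.sql.cassandra")
#       .options(keyspace="my_keyspace", table="my_table")  # placeholder names
#       .load())
```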

P.S. Cassandra 4.0 has also been released, so there is no point in using beta2 anymore.