Error connecting to mongodb with mongo-spark-connector
I'm new to spark/mongodb, and I'm trying to connect to mongo from pyspark with the mongo-spark-connector, following the instructions here. I start pyspark with the command:
```
pyspark \
--conf 'spark.mongodb.input.uri=mongodb://127.0.0.1/mydb.mytable?readPreference=primaryPreferred' \
--conf 'spark.mongodb.output.uri=mongodb://127.0.0.1/mydb.mytable' \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
```
On startup it prints the following:
```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark-2.4.4-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Ivy Default Cache set to: /home/mmr/.ivy2/cache
The jars for the packages stored in: /home/user_name/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-18ec2360-9f44-414c-a1de-11f629819aec;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 in central
    found org.mongodb#mongo-java-driver;3.10.2 in central
    [3.10.2] org.mongodb#mongo-java-driver;[3.10,3.11)
:: resolution report :: resolve 1360ms :: artifacts dl 3ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.10.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   1   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-18ec2360-9f44-414c-a1de-11f629819aec
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/4ms)
20/01/24 00:21:29 WARN Utils: Your hostname, user_name-Machine resolves to a loopback address: 127.0.1.1; using 192.168.1.18 instead (on interface wlan0)
20/01/24 00:21:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/01/24 00:21:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
```
When I run `>>> df = spark.read.format("mongo").load()` I get the following error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.load.
: java.lang.NoSuchMethodError: com.mongodb.MongoClient.<init>(Lcom/mongodb/MongoClientURI;Lcom/mongodb/MongoDriverInformation;)V
    at com.mongodb.spark.connection.DefaultMongoClientFactory.create(DefaultMongoClientFactory.scala:49)
    at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:55)
    at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:242)
    at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:155)
    at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:174)
    at com.mongodb.spark.MongoConnector.hasSampleAggregateOperator(MongoConnector.scala:237)
    at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator$lzycompute(MongoRDD.scala:221)
    at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator(MongoRDD.scala:221)
    at com.mongodb.spark.sql.MongoInferSchema$.apply(MongoInferSchema.scala:68)
    at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:97)
    at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
```
Specs:
OS: Ubuntu 18.04
java: openjdk 8
spark: 2.4.4
mongo: 4.2.2
scala: 2.11.12
mongo java driver: 3.12
I've tried using Oracle java 8, and switching the mongo driver to 3.10.2.
The first error happens because of an slf4j logger dependency conflict. The spark mongo connector jar lists slf4j as a dependency; see the maven package info. However, this is only a warning, and spark just picks the first binding available. It looks like this jar is installed twice on your system, once from the spark package and once from hadoop. The mongo-connector lists it as a provided dependency, so spark uses whichever one is on the system.
Jars can usually be excluded with

```
--exclude-packages
    Comma-separated list of groupId:artifactId, to exclude while
    resolving the dependencies provided in --packages to avoid
    dependency conflicts.
```

for example:

```
--exclude-packages org.slf4j:slf4j-api
```

I don't think that's the problem here, though.
The second error tells us that no such MongoClient constructor exists. MongoClient comes from the mongo java driver, a dependency of the mongo spark connector. Either that jar isn't being loaded correctly at all, or you are somehow passing the conf options incorrectly, which ends up calling the MongoClient constructor with the wrong arguments (a different number of them, or the wrong types).
I see you are mixing different quotes and backticks around your command. You also wrote that you tried installing the mongo java driver; did you put a jar somewhere on the classpath? That isn't necessary. The `--packages` argument resolves dependencies from maven, and mongo-spark-connector depends on mongo-java-driver, which should be resolved for you; see the maven info and source. That dependency is included (in contrast to slf4j, which is provided).
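To rule out a mistyped conf key, you can also pass the uri per read with `.option()` rather than the startup `--conf` flags. A minimal sketch, assuming the same local mydb.mytable database as in your command:

```python
# Sketch: give the reader the connection uri directly instead of relying on
# spark.mongodb.input.uri from --conf (same local database/collection as above).
df = (spark.read.format("mongo")
      .option("uri", "mongodb://127.0.0.1/mydb.mytable?readPreference=primaryPreferred")
      .load())
df.printSchema()
```

If the NoSuchMethodError still appears with an explicit uri, the problem is the driver jar on the classpath rather than the conf options.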
Try pasting the exact command below into your shell. Don't install the mongo java driver manually.
```
pyspark \
--conf "spark.mongodb.input.uri=mongodb://127.0.0.1/mydb.mytable?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/mydb.mytable" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
```
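As a rough check from inside that pyspark session, you can print which jars spark actually resolved. This is only a sketch relying on the `spark.jars` setting that `--packages` populates in client mode; the exact paths will differ on your machine:

```python
# Print the jars Spark resolved for --packages; the list should include the
# mongo-spark-connector_2.11-2.4.1 and mongo-java-driver-3.10.2 jars.
print(spark.sparkContext.getConf().get("spark.jars", "not set"))
```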
When I run this command, 2 jars are installed automatically into ~/.ivy2/cache:

```
org.mongodb.spark_mongo-spark-connector_2.11-2.4.1.jar
org.mongodb_mongo-java-driver-3.10.2.jar
```
No conflicting slf4j gets installed, and these jars don't bundle any code from other dependencies either. You can inspect the classes inside them with `unzip -l <jar-file-name>.jar`.
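If only those two jars resolve, your original read should go through. For example, assuming mydb.mytable exists on the local mongod as in the URIs above:

```python
# Uses spark.mongodb.input.uri set at launch; once the correct driver jar is
# loaded this should no longer raise the MongoClient NoSuchMethodError.
df = spark.read.format("mongo").load()
df.printSchema()
df.show(5)
```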