Getting BusyPoolException from com.datastax.spark.connector.writer.QueryExecutor: what am I doing wrong?

Question

I am using spark-sql-2.4.1, spark-cassandra-connector_2.11-2.4.1, Java 8, and Apache Cassandra 3.0.

I run the job with spark-submit on a Spark cluster, configured as shown below, to load 2 billion records.

--executor-cores 3 
--executor-memory 9g 
--num-executors 5 
--driver-cores 2 
--driver-memory 4g

I am using the following configuration:

cassandra.concurrent.writes=1500
cassandra.output.batch.size.rows=10
cassandra.output.batch.size.bytes=2048
cassandra.output.batch.grouping.key=partition 
cassandra.output.consistency.level=LOCAL_QUORUM
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128

The job takes about 2 hours, which is really long.

When I check the logs, I see the warning com.datastax.spark.connector.writer.QueryExecutor - BusyPoolException.

How can I fix this?

Answer 1

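Your value for cassandra.concurrent.writes is wrong: it means you are sending 1500 concurrent batches at the same time, but by default the Java driver allows 1024 simultaneous requests. In general, setting this parameter too high can overload the nodes and, as a result, cause tasks to be retried.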
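To get a feel for the numbers, here is a rough back-of-the-envelope sketch. It assumes the concurrent-writes setting applies per Spark task (that is how I read the connector reference) and that each executor core runs one task at a time. The function name is made up for illustration, and the value 5 is only an example of a small setting, not a tuned recommendation.

```python
def max_in_flight_batches(num_executors, executor_cores, concurrent_writes):
    """Upper bound on batches in flight across the whole job.

    Simplifying assumptions: one running task per executor core, and
    concurrent_writes is applied independently by every task.
    """
    concurrent_tasks = num_executors * executor_cores
    return concurrent_tasks * concurrent_writes


# Values from the question: 5 executors x 3 cores, 1500 concurrent writes.
print(max_in_flight_batches(5, 3, 1500))  # 22500

# Same cluster with a much smaller setting (illustrative value only).
print(max_in_flight_batches(5, 3, 5))  # 75
```

Even as a loose upper bound, tens of thousands of in-flight batches against a per-connection limit of about a thousand requests makes a busy pool unsurprising.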

Other settings are also incorrect: if you specify cassandra.output.batch.size.rows, its value overrides the value of cassandra.output.batch.size.bytes. See the corresponding section of the Spark Cassandra Connector reference for details.
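The precedence rule can be modelled in a few lines. This is only a sketch of the rule as described above, not the connector's actual code. The helper name is hypothetical, and the fallback values ("auto" for rows, 1024 for bytes) are what I believe the defaults to be, so check them against the reference for your connector version.

```python
ROWS_KEY = "spark.cassandra.output.batch.size.rows"
BYTES_KEY = "spark.cassandra.output.batch.size.bytes"


def effective_batch_limit(conf):
    """Return which batch-size setting wins under the precedence rule.

    Assumed defaults: rows is "auto" (unset), bytes is 1024.
    """
    rows = conf.get(ROWS_KEY, "auto")
    if rows != "auto":
        # An explicit row count takes priority; the byte limit is ignored.
        return ("rows", int(rows))
    return ("bytes", int(conf.get(BYTES_KEY, "1024")))


# Both set, as in the question: the byte limit has no effect.
print(effective_batch_limit({ROWS_KEY: "10", BYTES_KEY: "2048"}))  # ('rows', 10)

# Only the byte limit set: it is the one that applies.
print(effective_batch_limit({BYTES_KEY: "2048"}))  # ('bytes', 2048)
```

So pick one of the two settings rather than specifying both.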

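One aspect of performance tuning is having the right number of Spark partitions so that you reach good parallelism, but this really depends on your code, the number of nodes in the Cassandra cluster, and so on.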

P.S. Also note that the configuration parameters should start with spark.cassandra., not just cassandra.; if you specify them in that form, they are ignored and the defaults are used.
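As a sketch of what the corrected settings could look like, the snippet below lists the write options under their full property names and turns them into --conf arguments for spark-submit. The property names are the ones I know from the 2.4.x connector reference. The helper function is hypothetical, and the values are placeholders carried over from the question (with concurrent writes lowered and batch.size.rows dropped), not tuned recommendations.

```python
# Write settings under their full connector property names.
# Values are illustrative placeholders, not recommendations.
write_conf = {
    "spark.cassandra.output.concurrent.writes": "5",
    "spark.cassandra.output.batch.size.bytes": "2048",
    "spark.cassandra.output.batch.grouping.key": "partition",
    "spark.cassandra.output.consistency.level": "LOCAL_QUORUM",
    "spark.cassandra.output.batch.grouping.buffer.size": "3000",
    "spark.cassandra.output.throughput_mb_per_sec": "128",
}


def to_submit_args(conf):
    """Build spark-submit --conf arguments, rejecting unprefixed keys."""
    args = []
    for key, value in conf.items():
        if not key.startswith("spark.cassandra."):
            # Keys without the prefix would be silently ignored by the connector.
            raise ValueError("missing spark.cassandra. prefix: " + key)
        args += ["--conf", key + "=" + value]
    return args


print(" ".join(to_submit_args(write_conf)))
```

The same keys can also be set on the SparkConf in code; the point is only that every one of them carries the spark.cassandra. prefix.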

cassandra

datastax-java-driver

apache-spark

apache-spark-sql

spark-cassandra-connector