"avoid multiple Kudu clients per cluster" 是什么意思?
What does "avoid multiple Kudu clients per cluster" mean?
I'm reading the Kudu documentation. Below is part of the description of kudu-spark.
https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster
Avoid multiple Kudu clients per cluster.

One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.

To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
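To make that last paragraph concrete, here is a minimal sketch (my own illustration, not from the docs) of the per-task anti-pattern it describes, assuming a DataFrame df and a Kudu master address like the ones in the examples below:

import org.apache.kudu.client.KuduClient

// Anti-pattern: every partition builds its own KuduClient, so each Spark
// task hits the master with fresh GetTableLocations/GetTabletLocations calls
df.rdd.foreachPartition { _ =>
  val extraClient = new KuduClient.KuduClientBuilder("kudu.master:7051").build()
  // ... writes through extraClient ...
  extraClient.close()
}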
Does this mean I can only run one kudu-spark job at a time?
If I have a Spark Streaming program that is always writing data to Kudu,
how can other Spark programs connect to Kudu?
In a non-Spark program, you access Kudu with a Kudu client. For a Spark application, you use the KuduContext, which already owns such a client for that Kudu cluster.
A simple Java program requires a KuduClient, using the Java API and a Maven build:
KuduClient kuduClient = new KuduClient.KuduClientBuilder("kudu-master-hostname").build();
See http://harshj.com/writing-a-simple-kudu-java-api-program/
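For completeness, here is a rough Scala sketch of the same standalone-client flow using the Kudu Java API (the table and column names are hypothetical, and the client must be closed when done):

import org.apache.kudu.client.KuduClient

val client = new KuduClient.KuduClientBuilder("kudu-master-hostname").build()
try {
  val table = client.openTable("test_table")   // hypothetical table
  val session = client.newSession()
  val insert = table.newInsert()
  insert.getRow.addInt("key", 1)               // hypothetical column
  session.apply(insert)
  session.close()                              // flushes pending operations
} finally {
  client.close()
}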
Many Spark/Scala programs can run at the same time against the same cluster using the Spark Kudu integration. The snippet below is borrowed from the official guide, since it has been quite some time since I last looked at this:
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu.KuduContext
import collection.JavaConverters._

// Read a table from Kudu
val df = spark.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .format("kudu").load

// Query using the Spark API...
df.select("id").filter("id >= 5").show()

// ...or register a temporary view and use SQL
df.createOrReplaceTempView("kudu_table")
val filteredDF = spark.sql("select id from kudu_table where id >= 5")
filteredDF.show()

// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("key").asJava, 3))

// Insert data
kuduContext.insertRows(df, "test_table")
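Since the question mentions Spark Streaming: the same rule applies there. Below is a rough, self-contained sketch (with a hypothetical socket source, schema, and "events" table) of creating one KuduContext on the driver and reusing it for every micro-batch:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder.appName("stream-to-kudu").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

// One KuduContext (and therefore one KuduClient) for the whole application
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
lines.foreachRDD { rdd =>
  import spark.implicits._
  val df = rdd.toDF("value")                         // hypothetical schema
  // Reuses the context's single client; no new client per batch or task
  kuduContext.insertRows(df, "events")               // hypothetical table
}

ssc.start()
ssc.awaitTermination()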
"avoid multiple Kudu clients per cluster"更明确的说法是"avoid multiple Kudu clients per spark application".
Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
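A minimal sketch of what that looks like in code, assuming a kuduContext built as in the answer above (tableExists is just an example of an operation that KuduContext itself does not wrap):

import org.apache.kudu.client.KuduClient

// Reuse the client owned by the KuduContext instead of building a new one
val client: KuduClient = kuduContext.syncClient
val exists = client.tableExists("test_table")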