"avoid multiple Kudu clients per cluster" 是什么意思?
What does "avoid multiple Kudu clients per cluster" mean?
I'm reading the Kudu documentation. Below is part of the description of kudu-spark.
https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster
Avoid multiple Kudu clients per cluster.

One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.

To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
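To make that last paragraph concrete, here is a minimal sketch (my own illustration, not from the docs) of the per-task anti-pattern it describes, assuming a DataFrame df and a Kudu master address like the ones in the examples below:

import org.apache.kudu.client.KuduClient

// Anti-pattern: every partition builds its own KuduClient, so each Spark
// task hits the master with fresh GetTableLocations/GetTabletLocations calls
df.rdd.foreachPartition { _ =>
  val extraClient = new KuduClient.KuduClientBuilder("kudu.master:7051").build()
  // ... writes through extraClient ...
  extraClient.close()
}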
Does this mean I can only run one kudu-spark job at a time?
If I have a Spark Streaming program that is always writing data to Kudu,
how can other Spark programs connect to Kudu?
In a non-Spark program, you access Kudu with a Kudu client. For a Spark application, you use the KuduContext, which already owns such a client for that Kudu cluster.
A simple Java program requires a KuduClient, using the Java API and a Maven build:
KuduClient kuduClient = new KuduClient.KuduClientBuilder("kudu-master-hostname").build();
See http://harshj.com/writing-a-simple-kudu-java-api-program/
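For completeness, here is a rough Scala sketch of the same standalone-client flow using the Kudu Java API (the table and column names are hypothetical, and the client must be closed when done):

import org.apache.kudu.client.KuduClient

val client = new KuduClient.KuduClientBuilder("kudu-master-hostname").build()
try {
  val table = client.openTable("test_table")   // hypothetical table
  val session = client.newSession()
  val insert = table.newInsert()
  insert.getRow.addInt("key", 1)               // hypothetical column
  session.apply(insert)
  session.close()                              // flushes pending operations
} finally {
  client.close()
}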
Many Spark/Scala programs can run at the same time against the same cluster using the Spark Kudu integration. The snippet below is borrowed from the official guide, since it has been quite some time since I last looked at this:
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu.KuduContext
import collection.JavaConverters._

// Read a table from Kudu
val df = spark.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .format("kudu").load

// Query using the Spark API...
df.select("id").filter("id >= 5").show()

// ...or register a temporary view and use SQL
df.createOrReplaceTempView("kudu_table")
val filteredDF = spark.sql("select id from kudu_table where id >= 5")
filteredDF.show()

// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("key").asJava, 3))

// Insert data
kuduContext.insertRows(df, "test_table")
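Since the question mentions Spark Streaming: the same rule applies there. Below is a rough, self-contained sketch (with a hypothetical socket source, schema, and "events" table) of creating one KuduContext on the driver and reusing it for every micro-batch:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder.appName("stream-to-kudu").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

// One KuduContext (and therefore one KuduClient) for the whole application
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
lines.foreachRDD { rdd =>
  import spark.implicits._
  val df = rdd.toDF("value")                         // hypothetical schema
  // Reuses the context's single client; no new client per batch or task
  kuduContext.insertRows(df, "events")               // hypothetical table
}

ssc.start()
ssc.awaitTermination()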
"avoid multiple Kudu clients per cluster"更明确的说法是"avoid multiple Kudu clients per spark application".
Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
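A minimal sketch of what that looks like in code, assuming a kuduContext built as in the answer above (tableExists is just an example of an operation that KuduContext itself does not wrap):

import org.apache.kudu.client.KuduClient

// Reuse the client owned by the KuduContext instead of building a new one
val client: KuduClient = kuduContext.syncClient
val exists = client.tableExists("test_table")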