Spark Cassandra append dataset to table with null values
I use the DataStax Spark Cassandra Connector to fill a Cassandra cluster and to process data in different jobs (because Spark does not support some operations in streaming processing, such as double aggregation). So I want to store the data from different jobs in the same table. Suppose a first streaming job inserts a row into this table (using a foreach writer, because the connector does not support streamed writing yet):
INSERT INTO keyspace_name.table_name (id, col1, col2) VALUES ('test', 1, null);
What happens if I append (upsert) a dataset containing null columns where the row in Cassandra already has non-null values?
// One row of the dataset = "test", null, 2
dataset.write
.format("org.apache.spark.sql.cassandra")
.option("keyspace", keyspace)
.option("table", table)
.mode(SaveMode.Append)
.save()
If I understand the docs correctly, the previous non-null values will be overwritten by the new null values. If so, is there a way to keep the existing non-null values, or do I have to store each job's data in a separate table?
Yes, non-null values will be overwritten with nulls.

To avoid this behavior, use spark.cassandra.output.ignoreNulls = true. This causes all null values to be left as unset rather than bound, so the existing values in Cassandra are not overwritten. See:
Write Tuning Parameters
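As a minimal sketch, the write from the question could set this option directly on the DataFrameWriter (assuming the same `dataset`, `keyspace`, and `table` values as above; the option can also be set once in the Spark configuration instead):

```scala
import org.apache.spark.sql.SaveMode

// Same append as in the question, but with ignoreNulls enabled:
// nulls in the dataset are written as "unset" values, so existing
// non-null cells in Cassandra are left untouched instead of deleted.
dataset.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", keyspace)
  .option("table", table)
  .option("spark.cassandra.output.ignoreNulls", "true") // leave nulls unset
  .mode(SaveMode.Append)
  .save()

// Alternatively, set it globally for the session:
// spark.conf.set("spark.cassandra.output.ignoreNulls", "true")
```

With this in place, writing the row ("test", null, 2) should update col2 to 2 while leaving the existing col1 = 1 intact, rather than deleting it.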