Why does the Spark Cassandra connector add ALLOW FILTERING even when I query a table by partition key using the DataFrame API?
Given the Cassandra table:
CREATE TABLE data_storage.stack_overflow_test_table (
id int,
text_id text,
clustering date,
some_other text,
PRIMARY KEY (( id, text_id ), clustering)
)
the following is a valid query:
select * from data_storage.test_table_filtering where id=4 and text_id='2';
because it includes every column of the partition key.
Consider the following code:
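import org.apache.spark.sql.functions.col

// Read the table through the connector and filter on the full partition key
val ds = session
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stack_overflow_test_table", "keyspace" -> "data_storage"))
  .load()
  .where(col("id") === 4 && col("text_id") === "2")

ds.show(10)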
Since the spark-cassandra connector pushes predicates down to Cassandra, I expected Spark to send Cassandra a query like
SELECT "id", "text_id", "clustering", "some_other" FROM "data_storage"."stack_overflow_test_table" WHERE "id" = ? AND "text_id" = ?
However, this is what I see in the logs:
18/04/09 15:38:09 TRACE Connection: Connection[localhost/127.0.0.1:9042-2, inFlight=1, closed=false], stream 256, writing request PREPARE SELECT "id", "text_id", "clustering", "some_other" FROM "data_storage"."stack_overflow_test_table" WHERE "id" = ? AND "text_id" = ? ALLOW FILTERING
which means that spark-cassandra-connector adds ALLOW FILTERING to the query.
So I have two questions:
- Does this affect performance?
- Is there a workaround?
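The connector documentation states that ALLOW FILTERING is added implicitly. See here. Note how it warns that not all predicates are supported by the actual database.
"Does this affect performance?"
The documentation says: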
Note: Although the ALLOW FILTERING clause is implicitly added to the generated CQL query, not all predicates are currently allowed by the Cassandra engine. This limitation is going to be addressed in the future Cassandra releases. Currently, ALLOW FILTERING works well with columns indexed by clustering columns.
I read this as saying that the implicit ALLOW FILTERING does not affect performance.
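If you want to check what the connector does with your predicates, a minimal sketch along these lines may help. It reuses the table and keyspace from the question. The local master, the Cassandra node on localhost, and the object name PushdownCheck are my assumptions, not anything from your setup. explain(true) prints the logical and physical plans, and the scan node should show which filters Spark handed to the data source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical driver object, for illustration only.
object PushdownCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: Cassandra is reachable on localhost and the
    // spark-cassandra-connector jar is on the classpath.
    val session = SparkSession.builder()
      .appName("pushdown-check")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    // Same read as in the question: filter on both partition key columns.
    val df = session.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "stack_overflow_test_table", "keyspace" -> "data_storage"))
      .load()
      .where(col("id") === 4 && col("text_id") === "2")

    // Print the parsed, analyzed, optimized and physical plans.
    // Inspect the scan node to see which filters reached the data source.
    df.explain(true)

    session.stop()
  }
}
```

If both equality filters appear as pushed down, Cassandra receives a query restricted to a single partition, and the trailing ALLOW FILTERING has no extra filtering to permit.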
"Is there a workaround?"
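A workaround to speed up the query, or to stop 'allow filtering' from being sent? The simple answer is that no "workaround" is needed. Send a predicate that forms a valid Cassandra query, as in your case, and the database engine will choose the best execution plan.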