How to ensure data consistency in Cassandra on different tables?
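I'm new to Cassandra, and I've read that Cassandra encourages denormalization and duplication of data. This leaves me a little confused.
Let us imagine the following scenario:
I have a keyspace with four tables: A, B, C and D.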
CREATE TABLE A (
tableID int,
column1 int,
column2 varchar,
column3 varchar,
column4 varchar,
column5 varchar,
PRIMARY KEY (column1, tableID)
);
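Let us suppose that the other tables (B, C, D) have the same structure and the same data as table A, only with a different primary key, in order to answer other queries.
If I update a row in table A, how can I ensure consistency of the data in the other tables that hold the same data?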
Cassandra provides BATCH for this purpose. From the documentation:
A BATCH statement combines multiple data modification language (DML) statements (INSERT, UPDATE, DELETE) into a single logical operation, and sets a client-supplied timestamp for all columns written by the statements in the batch. Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see "Cassandra: Batch loading without the Batch keyword."
Batches are atomic by default. In the context of a Cassandra batch operation, atomic means that if any of the batch succeeds, all of it will. To achieve atomicity, Cassandra first writes the serialized batch to the batchlog system table that consumes the serialized batch as blob data. When the rows in the batch have been successfully written and persisted (or hinted) the batchlog data is removed. There is a performance penalty for atomicity. If you do not want to incur this penalty, prevent Cassandra from writing to the batchlog system by using the UNLOGGED option: BEGIN UNLOGGED BATCH
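UNLOGGED BATCH is almost always undesirable, and I believe it may be removed in a future version. Normal (logged) batches provide the functionality you want.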
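As a minimal sketch of how a logged batch could keep two of the denormalized tables in step: the example below assumes a hypothetical table B with the same columns as A but PRIMARY KEY (column2, tableID), and the key values are made up for illustration.

```cql
-- Assumption: B mirrors A but is keyed by (column2, tableID).
-- Both updates go through the batchlog, so if one is applied,
-- the other will eventually be applied as well.
BEGIN BATCH
  UPDATE A SET column3 = 'new value' WHERE column1 = 1 AND tableID = 42;
  UPDATE B SET column3 = 'new value' WHERE column2 = 'abc' AND tableID = 42;
APPLY BATCH;
```

If the column being changed is part of the primary key in one of the other tables, an UPDATE cannot modify it; in that case the batch would need a DELETE of the old row and an INSERT of the new one for that table. The batch is atomic but not isolated across partitions, so a reader may briefly see one table updated before the other.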
You can also explore a new feature in Cassandra 3.0 called materialized views:
Basic rules of data modeling in Cassandra involve manually denormalizing data into separate tables based on the queries that will be run against that table. Currently, the only way to query a column without specifying the partition key is to use secondary indexes, but they are not a substitute for the denormalization of data into new tables as they are not fit for high cardinality data. High cardinality secondary index queries often require responses from all of the nodes in the ring, which adds latency to each request. Instead, client-side denormalization and multiple independent tables are used, which means that the same code is rewritten for many different users.
In 3.0, Cassandra will introduce a new feature called Materialized Views. Materialized views handle automated server-side denormalization, removing the need for client side handling of this denormalization and ensuring eventual consistency between the base and view data. This denormalization allows for very fast lookups of data in each view using the normal Cassandra read path.
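This is exactly the same idea that Jeff Jirsa suggested, but it does not require you to handle all of the multi-table consistency logic in your application; Cassandra does it for you automatically.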
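A rough sketch of what this could look like for table A, assuming the extra query needs to look rows up by column2. The view name A_by_column2 is hypothetical. A view's primary key must contain every primary key column of the base table, plus at most one other column, and each of them must be restricted with IS NOT NULL.

```cql
-- Hypothetical view that replaces a hand-maintained copy of A keyed by column2.
CREATE MATERIALIZED VIEW A_by_column2 AS
  SELECT * FROM A
  WHERE column2 IS NOT NULL AND column1 IS NOT NULL AND tableID IS NOT NULL
  PRIMARY KEY (column2, column1, tableID);

-- Writes go only to A; the view is read like an ordinary table.
SELECT * FROM A_by_column2 WHERE column2 = 'abc';
```

Later Cassandra releases mark materialized views as experimental, so it is worth checking the status in your version before relying on them.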