How to copy data from a Cassandra table to another structure for better performance

Question

In several places it is advised to design Cassandra tables around the queries we are going to run against them. In this article by DataScale they state this:

The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.

[...]

If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.

I understand that. My question is: given that I have an existing table, say

CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
);

But I want to query by year and type instead, so I'd like to have something like

CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
);

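With type_invoice as the partition key and year as the clustering key, what is the preferred way to copy the data from one table to the other so that I can run optimized queries later?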

My Cassandra version:

user@cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]

Answer 1

You can use the cqlsh COPY command.
To export your invoices data to a CSV file, use:

COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';

Then import the CSV file into the target table, in your case invoices_yr, using:

COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';

If you have a very large amount of data, you can write SSTables with the SSTable writer and load them with sstableloader, which is faster. http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
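
One thing worth checking before the import: writes in Cassandra are upserts, so any exported rows that share the same primary key in the target table overwrite each other during COPY FROM. With PRIMARY KEY (type_invoice, year), every invoice of the same type and year maps to a single row. The following is a minimal Python sketch, not part of the original answers, that counts how many distinct target keys an exported CSV contains. The helper name count_key_collisions and the sample rows are made up for illustration, and it assumes the file was written without a header row, in the column order of the COPY statement above.

```python
import csv
import os
import tempfile


def count_key_collisions(csv_path, key_columns):
    """Return (total_rows, distinct_keys) for the given key column indexes."""
    keys = set()
    total = 0
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            total += 1
            keys.add(tuple(row[i] for i in key_columns))
    return total, len(keys)


# Demo with made-up rows in the column order used by the COPY statements:
# id_invoice, year, id_client, type_invoice
sample_rows = [
    ["1", "2016", "10", "A"],
    ["2", "2016", "11", "A"],  # same (type_invoice, year) as invoice 1
    ["3", "2017", "10", "A"],
    ["4", "2016", "12", "B"],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "invoices.csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(sample_rows)
    # type_invoice is column 3 and year is column 1
    total, distinct = count_key_collisions(path, key_columns=(3, 1))

print(f"{total} rows exported, {distinct} distinct (type_invoice, year) keys")
if distinct < total:
    print(f"{total - distinct} rows would be overwritten on import")
```

If the number of distinct keys is lower than the number of rows, adding id_invoice as a further clustering column in the target table keeps every invoice as its own row.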

Answer 2

To echo what was said about the COPY command: it is a great solution for something like this.

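However, I disagree with what was said about the Bulk Loader, as it is much harder to use. Specifically, you need to run it on every node, whereas COPY only needs to be run on a single node.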

To help COPY scale to larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.

COPY invoices(id_invoice, year, id_client, type_invoice) 
  TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;

Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.

For more information, check out the article titled: New options and better performance in cqlsh copy.

Answer 3

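An alternative to using the COPY command (see the other answers for examples) or Spark to migrate the data is to create a materialized view that does the denormalization for you.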

CREATE MATERIALIZED VIEW invoices_yr AS
       SELECT * FROM invoices
       WHERE type_invoice IS NOT NULL AND year IS NOT NULL AND id_client IS NOT NULL
       PRIMARY KEY ((type_invoice), year, id_client)
       WITH CLUSTERING ORDER BY (year DESC);

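Cassandra will populate the table for you, so you won't have to migrate the data yourself. On 3.5, be aware that repairs don't work well (see CASSANDRA-12888). Also note that Cassandra requires a view's primary key to include every primary key column of the base table (here id_invoice) and allows at most one column from outside the base primary key, so the definition above needs adjusting to fit the invoices table from the question.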

Note: materialized views are probably not the best choice, as they have since been changed to "experimental" status.
