In Cassandra, why create two tables for users to search by username and email instead of adding an index?

I was reading the article Basic Rules of Cassandra Data Modeling, and they say that if you want to query users by both email and username, you should create two tables:

CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    email text,
    age int
);

CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    username text,
    age int
);

Why would you do it this way? Doesn't that make the data harder to manage for something so small? Why not create a single table and add an index?

-- A table holding the user info
CREATE TABLE users (
    username text,
    email text,
    age int,
    PRIMARY KEY((username),email)
);

-- An index that gives good performance on email searching
CREATE INDEX user_email ON users (email);

You should create two tables because of the high-cardinality problem with indexes:

If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In a table with a billion emails, looking up a user by email (a value that is typically unique for each user) is likely to be very inefficient.

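When you run a query by email, Cassandra executes it on every node; each node looks in its local index and sends back a response. The merged result is a single user. You are querying every node to get a single row, which is very inefficient.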
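
To make the fan-out concrete, here is a rough sketch of the two kinds of lookup against the single `users` table and the `user_email` index from the question. The username and email values are made up for illustration.

```cql
-- Lookup through the secondary index: no partition key is given, so the
-- coordinator cannot tell which node owns the row and has to ask the
-- nodes to check their local index.
SELECT username, email, age
FROM users
WHERE email = 'alice@example.com';

-- For contrast, a lookup by partition key: the key is hashed to a token,
-- so the request only goes to the replicas that own that partition.
SELECT username, email, age
FROM users
WHERE username = 'alice';
```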

If instead you create a separate table of users keyed by email and run the query against it, Cassandra only needs to use the partition key email to go to a single node.
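
As a sketch of what that looks like with the `users_by_username` and `users_by_email` tables from the question (the sample values are made up): reads always name a partition key, and the application takes on the job of writing both tables. A logged batch is one common way to keep the two copies from drifting apart, since it ensures that both inserts are eventually applied together, though it gives no isolation and is not the only option.

```cql
-- Read path: each query names the partition key of the table it hits,
-- so it is routed to the replicas that own that key.
SELECT username, age FROM users_by_email WHERE email = 'alice@example.com';
SELECT email, age FROM users_by_username WHERE username = 'alice';

-- Write path: the same user is written to both tables.
-- The logged batch keeps the two inserts together.
BEGIN BATCH
    INSERT INTO users_by_username (username, email, age)
    VALUES ('alice', 'alice@example.com', 30);
    INSERT INTO users_by_email (email, username, age)
    VALUES ('alice@example.com', 'alice', 30);
APPLY BATCH;
```

The cost is the extra write and the duplicated storage, which is the trade-off Cassandra data modeling generally accepts in exchange for single-partition reads.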

Alternatively, if you are on Cassandra 3.0 or later, you can use Materialized Views to maintain the denormalized copy automatically.
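
A minimal sketch of that approach, assuming `users_by_username` from the question is kept as the base table. The view name `users_by_email_mv` is hypothetical, chosen so it does not clash with the hand-maintained `users_by_email` table.

```cql
-- The view's primary key must contain every primary key column of the
-- base table (username) and may add one other column (email).
-- Every view primary key column needs an IS NOT NULL restriction.
CREATE MATERIALIZED VIEW users_by_email_mv AS
    SELECT email, username, age
    FROM users_by_username
    WHERE email IS NOT NULL AND username IS NOT NULL
    PRIMARY KEY (email, username);

-- Writes go only to users_by_username; Cassandra updates the view.
-- Reads by email then use the view's partition key.
SELECT username, age FROM users_by_email_mv WHERE email = 'alice@example.com';
```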

Source: http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html