Data modeling in Cassandra with columns that can be text or numbers

Question

I have a table with 5 columns.

    1. ID - number, but it can be stored as text or number
    2. name - text
    3. date - date value, but it can be stored as date or text
    4. time - number, but it can be stored as text or number
    5. rating - number, but it can be stored as text or number

I want to find out which data types will make writes to my table faster. How can I find that out? Is there a cassandra-stress YAML for this?

Answer 1

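Regarding the answer that @BryceAtNetwork23 provided, it will be the same with Cassandra 2.1 or Cassandra 2.2 (but Cassandra 3.0 will probably be a different story, as the team is currently rewriting the storage engine; see CASSANDRA-8099). The stored data is still stored in binary form.

There is more to say, though. You may need to consider the actual data being stored, the performance your project needs to achieve, the queries per second, and so on.

Depending on those goals or constraints, one interesting approach is to look at the size of the serialized data for a given Cassandra type.

If the data is a number, for example a Java long, which is 8 bytes, it matches the Cassandra bigint type in size. That means there is no real cost when serializing; a plain copy is enough. It also has the benefit that the key is small enough that it won't stress the Cassandra key cache.
If the data is a piece of text, such as a Java String, it is encoded as UTF-16 at runtime, but when serialized in Cassandra with the text type it is encoded as UTF-8. UTF-16 uses 2 bytes per character, and 4 bytes for characters outside the Basic Multilingual Plane, whereas UTF-8 is space efficient and, depending on the character, uses 1, 2, 3 or 4 bytes.

This means the CPU has to do encoding/decoding work to serialize such data. Also, depending on the text, for example 158786464563, the data will be stored in 12 bytes, compared with 8 bytes for the same value as a bigint. That means more space and more IO.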
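The size comparison above can be checked with a small Java sketch. It is only an illustration, not Cassandra's own serializer code: the class and method names are hypothetical, and it assumes that a bigint occupies 8 bytes (the width of a Java long) and that text is plain UTF-8, as described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class SerializedSizeDemo {
    // Bytes needed when the value is encoded as UTF-8 (the encoding used by text).
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed when the value is encoded as UTF-16 (big-endian, no byte order mark).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Fixed 8-byte encoding of a long, the same width as a bigint.
    static byte[] longBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    public static void main(String[] args) {
        String digits = "158786464563";
        System.out.println(longBytes(Long.parseLong(digits)).length); // 8
        System.out.println(utf8Size(digits));                         // 12
        System.out.println(utf16Size(digits));                        // 24

        // UTF-8 width grows with the code point: 1, 2, 3 and 4 bytes.
        String[] samples = {"a", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(utf8Size(s) + " bytes in UTF-8, " + utf16Size(s) + " in UTF-16");
        }
    }
}
```

The same 12-digit value costs 50% more bytes as text than as a number, before any per-cell overhead that Cassandra adds on top.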

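Note that Cassandra offers the ascii type, which follows the US-ASCII character set and always uses 1 byte per character.
If the data is a UUID (a 128-bit value), the Java UUID type uses 2 longs, so it is 16 bytes long, and Cassandra also stores it as 16 bytes (it uses the Java UUID type).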
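As a rough illustration of the UUID point, the sketch below writes the two longs of a java.util.UUID into a 16-byte buffer and compares that with the 36-character string form. The class and method names are hypothetical, and the byte layout (most significant bits first) is an assumption made for the example rather than a statement about Cassandra internals.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper class, for illustration only.
class UuidSizeDemo {
    // Pack the two 64-bit halves of a UUID into 16 bytes.
    static byte[] uuidBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Rebuild the UUID from the 16 bytes produced by uuidBytes.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong());
    }

    // Bytes needed when a pure US-ASCII string is stored one byte per character.
    static int asciiSize(String s) {
        return s.getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(uuidBytes(id).length);                // 16
        System.out.println(asciiSize(id.toString()));            // 36
        System.out.println(fromBytes(uuidBytes(id)).equals(id)); // true
    }
}
```

Storing the same identifier as text or ascii would take 36 bytes instead of 16.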

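Again, it always depends on the project's mileage, goals and existing constraints. But here is my uneducated opinion:

If the data to insert is always a number within the long range [−9,223,372,036,854,775,808 ; +9,223,372,036,854,775,807], I would go with the bigint type.
UUID is fine.
If the cluster is not under heavy load (e.g. 100k queries per second) and space is not an issue, then text is not a problem; but if it is, or if usage may grow, I would avoid using text as keys if possible.

Another option is the blob type, i.e. a binary type, which lets you store any data the way you want according to the needs of your software. This can allow storage that is space efficient, IO efficient, and CPU efficient. Depending on the needs, however, a lot may have to be managed in client code, such as ordering, serialization, comparison, mapping, and so on.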
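To give an idea of what managing serialization in client code might look like with a blob, here is a minimal sketch that packs the five fields from the question into one ByteBuffer. Everything about it is an assumption made for illustration: the class name, the field widths, and the layout (id, then date as days since the epoch, then time as seconds of the day, then rating, then the name as UTF-8) are hypothetical and not something Cassandra prescribes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side codec, for illustration only.
class BlobPackingDemo {
    // Assumed layout: id (8) | epochDay (4) | secondOfDay (4) | rating (1) | name (UTF-8, rest)
    static final int HEADER_SIZE = 8 + 4 + 4 + 1;

    static ByteBuffer pack(long id, int epochDay, int secondOfDay, byte rating, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + nameBytes.length);
        buf.putLong(id).putInt(epochDay).putInt(secondOfDay).put(rating).put(nameBytes);
        buf.flip(); // switch from writing to reading
        return buf;
    }

    // Absolute read, so the buffer position is left untouched.
    static long unpackId(ByteBuffer blob) {
        return blob.getLong(blob.position());
    }

    static String unpackName(ByteBuffer blob) {
        ByteBuffer view = blob.duplicate(); // independent position, shared content
        view.position(view.position() + HEADER_SIZE);
        byte[] nameBytes = new byte[view.remaining()];
        view.get(nameBytes);
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer blob = pack(158786464563L, 16600, 3600, (byte) 4, "alice");
        System.out.println(blob.remaining());  // 22
        System.out.println(unpackId(blob));    // 158786464563
        System.out.println(unpackName(blob));  // alice
    }
}
```

The trade-off mentioned above shows up immediately: the row is compact, but the database can no longer interpret, order, or compare the individual fields, so every reader has to share this layout.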


cassandra

datastax-enterprise