Flat data with struct type vs document store

Question

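I know this is a 'soft' question, which is usually frowned upon on SO, but I have been using BigQuery for data analytics on (ostensibly) flat data that contains structs and repeated fields. Let's use a very basic example, where a row might look like this:

ID
Title (str)
ReleaseYear (int)
Genres (str[])
Credits (struct[])

An example record might look like this: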

{
    "ID": "T-1997",
    "Title": "Titanic",
    "ReleaseYear": 1997,
    "Genres": ["Drama", "Romance"],
    "Credits": {
        "Actors": ["Leonardo DiCaprio", "Kate Winslet"],
        "Directors": ["James Cameron"]
    }
}

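My question is basically what type of operations or queries can be done in a native document store, such as MongoDB or CouchBase, that couldn't be done in a relational DB that supports arbitrarily-nested data. In other words, my assumption (and I hope I'm wrong or misguided) is that as long as a DB supports structs, it can do everything that a document-store can do. If not, what are some places where it is either: (1) something that can be done in MongoDB (or any other document-store) that cannot be done in BigQuery (or any other database that supports structs)? and (2) something that can be done much more easily in MongoDB than in a relational DB?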

Answer 1

Disclaimer: I have no experience with MongoDB or CouchBase. My answer is based on what BigQuery can do with STRUCT.

Performance

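BigQuery's STRUCT is optimized for querying. For example, for a query such as select a.nested_b.nested_c.nested_d from table_t, the query scans only the data for the leaf STRUCT field nested_d, which makes it fast and cheap.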
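
To make the "only the leaf is scanned" point concrete, here is a minimal Python sketch of a toy columnar layout. It is not BigQuery's actual storage format; the `shred` helper and the bulky `nested_e` sibling field are made up for illustration. The idea is that each leaf path gets its own column, so reading `nested_d` never touches `nested_e`.

```python
from collections import defaultdict


def shred(rows):
    """Flatten nested dicts into one column per leaf path (toy columnar layout)."""
    columns = defaultdict(list)

    def walk(prefix, value, row_index):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(prefix + (key,), child, row_index)
        else:
            # Leaf value: store it in the column named by its dotted path.
            columns[".".join(prefix)].append((row_index, value))

    for i, row in enumerate(rows):
        walk((), row, i)
    return dict(columns)


# nested_e is a hypothetical bulky sibling field, added only for contrast.
rows = [
    {"a": {"nested_b": {"nested_c": {"nested_d": 1, "nested_e": "x" * 1000}}}},
    {"a": {"nested_b": {"nested_c": {"nested_d": 2, "nested_e": "y" * 1000}}}},
]
columns = shred(rows)

# Reading one leaf touches only that leaf's column; nested_e is never read.
nested_d = [value for _, value in columns["a.nested_b.nested_c.nested_d"]]
print(nested_d)
```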

Usability

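If your data is write-once or append-only, a STRUCT column is comparable to a document store, AFAIK.

But if you later want to update just one nested field, nested STRUCTs make that hard, because there is no way to update a single item in a REPEATED field. You have to load the whole array, scan and change it, and repack it to update the column. You would write something like this: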

UPDATE table
SET Credits.Actors = (SELECT ARRAY_AGG(...) FROM UNNEST(Credits.Actors) WHERE ...)
WHERE ...

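It can become a bigger problem when you have an array of structs of arrays (or even more levels of nesting). From my understanding of document stores, updating a single nested field of a document should be easier than this. Basically, this is the price you pay for the performance benefit mentioned earlier.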
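
As a rough illustration of the difference, here is a Python sketch on plain dicts, with no database involved. `update_by_rebuild` mimics the rebuild-the-whole-array pattern from the UPDATE statement above, while `update_in_place` addresses one element through a dotted path, similar in spirit to the dot notation a document store such as MongoDB accepts. Both helper names and the replacement value "K. Winslet" are hypothetical.

```python
import copy

movie = {
    "ID": "T-1997",
    "Credits": {
        "Actors": ["Leonardo DiCaprio", "Kate Winslet"],
        "Directors": ["James Cameron"],
    },
}


def update_by_rebuild(doc, old, new):
    """Struct-column style: read the whole array, map it, write it all back."""
    updated = copy.deepcopy(doc)
    updated["Credits"]["Actors"] = [
        new if actor == old else actor for actor in doc["Credits"]["Actors"]
    ]
    return updated


def update_in_place(doc, path, new):
    """Document-store style: address one element by a dotted path."""
    *parents, last = path.split(".")
    target = doc
    for part in parents:
        target = target[int(part)] if isinstance(target, list) else target[part]
    if isinstance(target, list):
        target[int(last)] = new
    else:
        target[last] = new


rebuilt = update_by_rebuild(movie, "Kate Winslet", "K. Winslet")

in_place = copy.deepcopy(movie)
update_in_place(in_place, "Credits.Actors.1", "K. Winslet")

print(rebuilt["Credits"]["Actors"])
print(in_place["Credits"]["Actors"])
```

Both calls end with the same data. The difference is how much has to be read and rewritten to get there, and that gap widens with every extra level of nesting.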

Answer 2

what type of operations or queries can be done in a native document store, such as MongoDB or CouchBase, that couldn't be done in a relational DB that supports arbitrarily-nested data.

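Even though it supports arbitrarily nested data, BigQuery allows only limited nesting compared with MongoDB, which supports more levels. In BigQuery, your schema cannot contain more than 15 levels of nested STRUCTs. MongoDB supports up to 100 levels of nesting for BSON documents.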
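
If you want to check a document against those two limits before loading it, a small depth counter is enough. This is only a sketch: `struct_depth` and `make_nested` are hypothetical helpers, and I am assuming that only dict (struct) levels count and that lists are transparent. Each system may count arrays differently, so check its documentation before relying on this.

```python
BIGQUERY_MAX_DEPTH = 15   # limit quoted in this answer
MONGODB_MAX_DEPTH = 100   # limit quoted in this answer


def struct_depth(value):
    """Count nested dict levels; lists are treated as transparent (assumption)."""
    if isinstance(value, dict):
        return 1 + max((struct_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((struct_depth(v) for v in value), default=0)
    return 0


def make_nested(levels):
    """Build a synthetic document with exactly `levels` nested dicts."""
    doc = {"leaf": 0}
    for _ in range(levels - 1):
        doc = {"child": doc}
    return doc


movie = {
    "ID": "T-1997",
    "Genres": ["Drama", "Romance"],
    "Credits": {"Actors": ["Leonardo DiCaprio"], "Directors": ["James Cameron"]},
}
deep = make_nested(20)

for name, doc in [("movie", movie), ("deep", deep)]:
    depth = struct_depth(doc)
    print(name, depth, depth <= BIGQUERY_MAX_DEPTH, depth <= MONGODB_MAX_DEPTH)
```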

In other words, my assumption (and I hope I'm wrong or misguided) is that as long as a DB supports structs, it can do everything that a document-store can do.

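Not exactly: nested columns are columns within columns. But compared with a NoSQL database like Mongo, sharding in an RDBMS is a complicated job. Technically you can do it, but it was not designed for the same purpose. It is like using a wrench as a hammer: you certainly can, but it is meant for something else. You should use the right tool for the right purpose.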

If not, what are some places where it is either: (1) something that can be done in MongoDB (or any other document-store) that cannot be done in BigQuery (or any other database that supports structs)? and (2) something that can be done much more easily in MongoDB than in a relational DB?

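The crux of the matter is that an RDBMS may add features that "technically" let you do some of the things you can do in a NoSQL database. That does not mean it will be equally efficient. For example, because of the features that make an RDBMS an RDBMS (ACID compliance, transactions, and so on), there will always be an extra performance penalty compared with a NoSQL database. And if an RDBMS dropped those features, it would no longer be an RDBMS!

This answer explains how MongoDB gets better performance because it does not need to support RDBMS features: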

https://softwareengineering.stackexchange.com/questions/54373/when-would-someone-use-mongodb-or-similar-over-a-relational-dbms

MongoDB has a lower latency per query & spends less CPU time per query because it is doing a lot less work (e.g. no joins, transactions).

As a result, it can handle a higher load in terms of queries per second and is thus often used if you have a massive # of users.

MongoDB is easier to shard (use in a cluster) because it doesn't have to worry about transactions and consistency.

MongoDB has a faster write speed because it does not have to worry about transactions or rollbacks (and thus does not have to worry about locking).

MongoDB does not have a schema in case you have a special use case that can take advantage of that.

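Another feature is sharding. Sharding is easier with MongoDB because it does not need to support many of the features that make an RDBMS an RDBMS, such as ACID compliance. By contrast, sharding is complicated for an RDBMS because it has to stay ACID compliant.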
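
To show why single-document operations shard so easily, here is a toy hash-based router in Python. It is not how MongoDB implements sharding; `shard_for`, `put`, and `get` are hypothetical names, and the shards are just in-memory dicts. Each document lives on exactly one shard chosen from its key, so a read or write of one document never needs to coordinate with the other shards.

```python
import hashlib

SHARD_COUNT = 3
shards = [dict() for _ in range(SHARD_COUNT)]  # each dict stands in for a server


def shard_for(key, shard_count=SHARD_COUNT):
    """Map a document key to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def put(doc):
    """Write one document; only the owning shard is touched."""
    shards[shard_for(doc["ID"])][doc["ID"]] = doc


def get(doc_id):
    """Read one document; only the owning shard is consulted."""
    return shards[shard_for(doc_id)].get(doc_id)


put({"ID": "T-1997", "Title": "Titanic", "ReleaseYear": 1997})
print(shard_for("T-1997"), get("T-1997")["Title"])
```

The hard part starts when one logical operation has to change documents on several shards atomically. That is the coordination an ACID-compliant system cannot skip.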

Take a look at the following two pictures:

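The speedboat scores 10/10 for performance in water compared with the "amphibious car". The amphibious car can technically travel in water, but it was not designed for that, so it is much slower and poorly suited to the job.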

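Likewise, look at the difference in aerodynamics between the speedboat and this cute car. Even if you put wheels on the boat, it will not perform as well on land as the car does. (As an analogy, you could say a NoSQL database does not do joins and you have to implement them yourself. But for join-heavy operations, will it perform better than an RDBMS?)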
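
For a sense of what "implement them yourself" means, here is a minimal application-side hash join in Python. The `DirectorID` field, the `directors` collection, and the second movie record are hypothetical, invented only so there is something to join on. This is the work an RDBMS query planner would otherwise do for you.

```python
# Hypothetical collections; DirectorID is not part of the question's schema.
movies = [
    {"ID": "T-1997", "Title": "Titanic", "DirectorID": "D-1"},
    {"ID": "X-0000", "Title": "Untitled", "DirectorID": "D-404"},
]
directors = [{"ID": "D-1", "Name": "James Cameron"}]


def hash_join(left, right, left_key, right_key):
    """Inner join: index the right side by key, then probe it for each left row."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [
        (left_row, right_row)
        for left_row in left
        for right_row in index.get(left_row[left_key], [])
    ]


pairs = [
    (movie["Title"], director["Name"])
    for movie, director in hash_join(movies, directors, "DirectorID", "ID")
]
print(pairs)
```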

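The point of my analogy is that each kind of database was originally designed for a specific goal. Over time, features were added to try to solve problems it was not designed for, so it handles them less well than something built for that purpose.

So, for your question: even if BigQuery or some RDBMS can do something, that does not mean you should use it for the job. The same applies to NoSQL databases. You should use the best tool for the job.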
