Apache Samza 本地存储 - OrientDB / Neo4J 图而不是 KV 存储

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

Apache Samza 使用 RocksDB 作为本地存储的存储引擎。这允许有状态的流处理和 here's a very good overview.

我的用例：

我希望处理来自 Apache Kafka 等系统的多个事件流。
这些事件创建状态 - 我希望跟踪的状态基于之前收到的消息。
我希望根据计算出的状态生成新的流事件。
输入流事件高度相关，OrientDB / Neo4J 等图形是查询数据以创建新流事件的理想媒介。

我的问题：

是否可以使用非 KV 存储作为 Samza 的本地存储？有没有人用 OrientDB / Neo4J 做过这个，有人知道一个例子吗？

我一直在评估 Samza，我绝不是专家，但我建议您阅读 the official documentation，甚至通读源代码——除了它在Scala，它非常平易近人。

在这种特殊情况下，the documentation's page on State Management 的底部是这样的：

Other storage engines

Samza’s fault-tolerance mechanism (sending a local store’s writes to a replicated changelog) is completely decoupled from the storage engine’s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the StorageEngine interface. Samza’s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task.

Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), approximate algorithms such as bloom filters and hyperloglog, or full-text indexes such as Lucene. (Patches accepted!)

实际上，我大约在两周前通读了默认 StorageEngine 实现的代码，以便更好地了解它的工作原理。我的知识肯定不够多，无法聪明地谈论它，但我可以指出它：

主要的实施问题似乎是：

记录对主题的所有更改，以便在任务失败时可以恢复商店的状态。
以高性能方式恢复商店的状态
批量写入和缓存频繁读取，以节省前往原始存储的次数。
关于商店使用情况的报告指标。

输入流事件是定义一个全局图，还是为每个匹配的 Kafka/Samza 分区定义多个图？这很重要，因为 Samza 状态是局部的而不是全局的。

如果它是一个全局图，您可以 update/query 从 Samza 任务处理方法中分离出一个单独的图系统。 Cassandra 上的 Titan 就是这样一个图形系统。

如果是多个单独的图，可以使用当前的RocksDB KV存储来模拟图数据库操作。 Cassandra 上的 Titan 就是这样做的——使用 Cassandra KV 存储来存储和查询图形。图通过矩阵（如果连接，则将 [i,j] 设置为 1）或边列表存储。对于每个节点，将其用作键并将其邻居集存储为值。

Apache Samza 本地存储 - OrientDB / Neo4J 图而不是 KV 存储

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

stream-processing

neo4j

orientdb

apache-kafka

apache-samza