Spark Structured Streaming 流-流连接问题

Question

场景与经典的流-流连接略有不同

streamA: 交易流：transTS, userid, productid,...

streamB：创建的新产品流：productid, productname, createTS, ...)

我想加入带有 productIds 的交易，但找不到实现此目的的 watermarks/join 条件组合。

streamA_wm = streamA.withWatermark("transTS", "3 minutes")
streamB_wm = streamB.withWatermark("createTS", "1 day")

streamA_wm
   .join(streamB_wm, "productId AND transTS >= createTS", "leftOuter")

结果为空。

我做错了什么？

Answer 1

我认为您可能在这里采取了错误的方法。虽然产品在创建和更新时是事务性的，但它们是相对于其他事务流的元数据。

我建议如下：

将事务流加入参考数据产品 - 不受流处理。
不要缓存产品，这样可以确保您找到源代码。
为产品使用镶木地板、KUDU。

但是产品流可能是有原因的，但是...如果不再对产品进行更新并且您通过交易流再次获取该产品的数据会怎样？

Spark Structured Streaming 流-流连接问题

Spark Structured Streaming stream-stream join question

apache-spark

spark-structured-streaming