Add one row from one Dataset to another Dataset in Spark Scala

Question

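I have two DataFrames: a training set and a test set. I want to run an algorithm repeatedly (call it AAA; it takes an RDD as input), each time on the training set plus one row of the test set, following the steps below.

In the Spark documentation I read that RDDs and DataFrames are immutable, so the following cannot be used: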

Testset.map( x => AAA(Trainset.union(x)) )

I also tried:

Testset.map( x => AAA(Trainset.union(Array(x.get(0).toString.toDouble, x.get(1).toString.toDouble, ... x.get(19).toString.toDouble))) )

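However, it does not work :(. Is there a way to make the steps above possible? If you have a good idea for this problem, please help me.

// Edit: additional constraint

Because of the running time, I need parallel computation, so I cannot use a 'for loop'. Thanks.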

Answer 1

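Not sure how good this idea is, but how about:

1) Create a new column named helper on the training DataFrame with the value -1

2) Create a new column named helper on the test DataFrame, like so: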

test.withColumn("helper", monotonically_increasing_id())

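3) Write the output of 2) to disk so that the ids never change

4) Union 1) with 3) read back in, then cache/persist/write the result to disk and read it back in

5) Write a loop that filters the unioned DataFrame and runs the logic, like so: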

val data = unioned.filter($"helper" === lit(-1) || $"helper" === lit(n))
val result = logic(data)

where n is the value you are looping over, starting from 0 for the first row of the test set
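
The steps above could be put together roughly as follows. This is only a sketch under stated assumptions: the two-column toy data stands in for the real 20-column sets, `logic` is a hypothetical placeholder for AAA that just counts rows, and the temporary Parquet path is made up for illustration. Note that `monotonically_increasing_id` guarantees unique, increasing ids but not consecutive ones once the data spans several partitions, so the sketch loops over the ids actually present in the test set instead of assuming 0, 1, 2, and so on.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, monotonically_increasing_id}

object TrainPlusOneTestRow {

  // Hypothetical placeholder for AAA. A real implementation would
  // probably start from data.rdd, since AAA expects an RDD.
  def logic(data: DataFrame): Long = data.count()

  def run(spark: SparkSession): Array[(Long, Long)] = {
    import spark.implicits._

    // Toy data (assumption): the real sets have 20 numeric columns.
    val train = Seq((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).toDF("f0", "f1")
    val test  = Seq((7.0, 8.0), (9.0, 10.0)).toDF("f0", "f1")

    // Step 1: tag every training row with helper = -1 (as a Long, so the
    // column type matches the ids generated for the test set).
    val trainTagged = train.withColumn("helper", lit(-1L))

    // Step 2: give every test row a unique id.
    val testTagged = test.withColumn("helper", monotonically_increasing_id())

    // Step 3: write to disk and read back so the ids are fixed and cannot
    // be regenerated differently on recomputation.
    val path = Files.createTempDirectory("helper-ids").resolve("test").toString
    testTagged.write.parquet(path)
    val testStable = spark.read.parquet(path)

    // Step 4: union (columns are matched by position) and cache.
    val unioned = trainTagged.union(testStable).cache()

    // Step 5: loop over the ids that actually exist in the test set.
    val ids = testStable.select("helper").as[Long].collect().sorted
    ids.map { n =>
      val data = unioned
        .filter(col("helper") === -1L || col("helper") === n)
        .drop("helper")
      (n, logic(data))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("train-plus-one-test-row")
      .master("local[*]")
      .getOrCreate()
    run(spark).foreach { case (n, result) => println(s"helper=$n result=$result") }
    spark.stop()
  }
}
```

The loop itself runs on the driver, but each call to `logic` is still an ordinary distributed Spark job over the filtered data. If the iterations are independent, they could presumably also be submitted from several driver threads, since Spark accepts concurrent job submission from one application; whether that helps depends on the cluster and on what AAA does.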
