在RDD中添加递增变量

Question

假设我有以下 RDD：

test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)

如何创建以下 rdd：

((1,'trial1',[1,2]),(2,'trial2',[3,4]))

我尝试使用累加器，但它不起作用，因为无法在任务中访问累加器：

def increm(keyvalue):
    global acc
    acc +=1
    return (acc.value,keyvalue[0],keyvalue[1])


acc = sc.accumulator(0)
test1RDD.map(lambda x: increm(x)).collect()

知道如何做到这一点吗？

Answer 1

您可以使用 zipWithIndex

zipWithIndex()

Zips this RDD with its element indices.

The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.

This method needs to trigger a spark job when this RDD contains more than one partitions.

  >>> sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]

并用map改造RDD，让索引在新RDD

前面

这是未经测试的，因为我没有任何环境：

test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
test1RDD.zipWithIndex().map(lambda x : (x[1],x[0]))

在RDD中添加递增变量

Add incrementing variable in RDD

accumulator

apache-spark

pyspark