在RDD中添加递增变量
Add incrementing variable in RDD
假设我有以下 RDD:
test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
如何创建以下 rdd:
((1,'trial1',[1,2]),(2,'trial2',[3,4]))
我尝试使用累加器,但它不起作用,因为无法在任务中访问累加器:
def increm(keyvalue):
global acc
acc +=1
return (acc.value,keyvalue[0],keyvalue[1])
acc = sc.accumulator(0)
test1RDD.map(lambda x: increm(x)).collect()
知道如何做到这一点吗?
您可以使用 zipWithIndex
zipWithIndex()
Zips this RDD with its element indices.
The ordering is first based on the partition index and then the
ordering of items within each partition. So the first item in the
first partition gets index 0, and the last item in the last partition
receives the largest index.
This method needs to trigger a spark job when this RDD contains more
than one partitions.
>>> sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
并用map
改造RDD,让索引在新RDD
前面
这是未经测试的,因为我没有任何环境:
test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
test1RDD.zipWithIndex().map(lambda x : (x[1],x[0]))
假设我有以下 RDD:
test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
如何创建以下 rdd:
((1,'trial1',[1,2]),(2,'trial2',[3,4]))
我尝试使用累加器,但它不起作用,因为无法在任务中访问累加器:
def increm(keyvalue):
global acc
acc +=1
return (acc.value,keyvalue[0],keyvalue[1])
acc = sc.accumulator(0)
test1RDD.map(lambda x: increm(x)).collect()
知道如何做到这一点吗?
您可以使用 zipWithIndex
zipWithIndex()
Zips this RDD with its element indices.
The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.
This method needs to trigger a spark job when this RDD contains more than one partitions.
>>> sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
并用map
改造RDD,让索引在新RDD
这是未经测试的,因为我没有任何环境:
test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
test1RDD.zipWithIndex().map(lambda x : (x[1],x[0]))