RDD pyspark partitionBy - TypeError: 'int' object is not subscriptable

Question

list_1 = [[6, [3, 8, 7]], [5, [9, 7, 3]], [6, [7, 8, 5]], [5, [6, 7, 2]]]

rdd1 = sc.parallelize(list_1)
newpairRDD = rdd1.partitionBy(2,lambda k: int(k[0]))
print("Partitions structure: {}".format(newpairRDD.glom().collect()))

I want to partition by key.

I get:

TypeError: 'int' object is not subscriptable

What am I doing wrong?

Answer 1

So, the keys should be integers inside quotes, that is, strings:

list_1 = [["6", [3, 8, 7]], ["5", [9, 7, 3]], ["6", [7, 8, 5]], ["5", [6, 7, 2]]]

This works, because k[0] now indexes into a string and int() converts the result back to a number.
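One caveat with this workaround, sketched in plain Python without Spark: indexing a string returns only its first character, so the lambda from the question would see just the first digit of a multi-digit key. The wrapper name `original_partition_func` and the key "16" are illustrative assumptions and do not appear in the question.

```python
def original_partition_func(k):
    # Same body as the lambda in the question: int(k[0]).
    return int(k[0])

# For a one-character string key, k[0] is the whole key.
single_digit = original_partition_func("6")

# For a longer string key, k[0] is only the first character ("1"),
# so the key "16" would be partitioned as if it were 1.
multi_digit = original_partition_func("16")

print(single_digit, multi_digit)
```

If the keys can have more than one digit, partitioning on the integer key directly avoids this.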

Answer 2

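The partition function passed to partitionBy operates on the key of each RDD entry, that is, the first element of each entry. So you are calling lambda k: int(k[0]) on an integer key, which causes the error you are seeing.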

If you just want to partition by key, your lambda function should be the identity function, for example:

newpairRDD = rdd1.partitionBy(2, lambda x: x)
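To see the difference without a Spark cluster, here is a minimal plain-Python sketch of the bucketing step. It assumes that each record is unpacked into a key and a value and that the target partition is the partition function's result modulo the number of partitions. The helper `simulate_partition_by` is hypothetical and only imitates that step; it is not part of PySpark.

```python
def simulate_partition_by(records, num_partitions, partition_func):
    """Hypothetical stand-in for the bucketing done by RDD.partitionBy."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        # The partition function receives only the key, never the whole record.
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

list_1 = [[6, [3, 8, 7]], [5, [9, 7, 3]], [6, [7, 8, 5]], [5, [6, 7, 2]]]

# The lambda from the question indexes into the key, which is already an int.
error_message = None
try:
    simulate_partition_by(list_1, 2, lambda k: int(k[0]))
except TypeError as exc:
    error_message = str(exc)

# The identity function uses the integer key itself.
partitions = simulate_partition_by(list_1, 2, lambda x: x)

print(error_message)
print(partitions)
```

In this sketch the records with key 6 land in bucket 0 and the records with key 5 land in bucket 1, since 6 % 2 is 0 and 5 % 2 is 1.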

partitioning

apache-spark

rdd

pyspark