PySpark - Convert an RDD into a key value pair RDD, with the values being in a List
I have an RDD whose elements are tuples of the form:
[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...
What I want is to convert it into a key/value pair RDD, where the first field is the key (the first string) and the second field is a list of the remaining strings (the value), i.e. I want to turn it into the form:
[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...
>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
Explanation of
lambda x: (x[0], list(x[1:]))
:
x[0]
takes the first element of the tuple, which becomes the key.
x[1:]
takes every element except the first, which becomes the value.
list(x[1:])
forces the value to be a list, because slicing a tuple would otherwise yield a tuple by default.
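Since the lambda is plain Python, the slicing logic can be sketched and checked on ordinary tuples without a SparkContext; `rdd.map` simply applies the same function to every element. A minimal sketch, using the same sample data as above:

```python
# Sample rows, mirroring the tuples in the original RDD.
rows = [("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")]

# The same function passed to rdd.map: first element as key,
# remaining elements coerced from a tuple slice into a list.
to_pair = lambda x: (x[0], list(x[1:]))

pairs = [to_pair(row) for row in rows]
print(pairs)
# [('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
```

Without the `list(...)` call, the second field would stay a tuple, e.g. `('a1', ('b1', 'c1', 'd1', 'e1'))`.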