PySpark - Convert an RDD into a key value pair RDD, with the values being in a List

I have an RDD whose tuples are of the form:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...

What I want is to convert this into a key-value pair RDD, where the first field is the key (the first string) and the second field is a list of strings (the value), i.e. I want to convert it into the form:

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...

>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])

>>> result = rdd.map(lambda x: (x[0], list(x[1:])))

>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]

Explanation of `lambda x: (x[0], list(x[1:]))`:

  1. x[0] takes the first element of the tuple as the key of the output pair
  2. x[1:] takes every element except the first as the value
  3. list(x[1:]) forces the value to be a list, because slicing a tuple yields a tuple by default
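The third point can be checked in plain Python, without Spark, since `map` applies the lambda to each tuple independently. A minimal sketch (the variable names here are illustrative):

```python
# Slicing a tuple yields another tuple, so list() is needed
# if the value should be a list rather than a tuple.
row = ("a1", "b1", "c1", "d1", "e1")

pair_tuple = (row[0], row[1:])       # value stays a tuple
pair_list = (row[0], list(row[1:]))  # value forced into a list

print(pair_tuple)  # ('a1', ('b1', 'c1', 'd1', 'e1'))
print(pair_list)   # ('a1', ['b1', 'c1', 'd1', 'e1'])
```

The same lambda would behave identically inside `rdd.map(...)`, since slicing semantics do not change on workers.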