pyspark generate key value pairs RDD from comma separated lines

I need to read the lines with comma-separated values provided below and generate a key-value pair RDD as shown in the output. I am new to this, so any guidance is appreciated.

Input:

    R-001, A1, 10, A2, 20, A3, 30

    R-002, X1, 20, Y2, 10

    R-003, Z4, 30, Z10, 5, N12, 38

Output:

    R-001, A1
    R-001, A2
    R-001, A3
    R-002, X1
    R-002, Y2
    R-003, Z4
    R-003, Z10
    R-003, N12

Code:

    lines = spark.sparkContext.parallelize([
        "R-001, A1, 10, A2, 20, A3, 30",
        "R-002, X1, 20, Y2, 10",
        "R-003, Z4, 30, Z10, 5, N12, 38"])

You can `flatMap` over the `lines` RDD and extract the key and the value names from each line by splitting on `,`:

    from typing import Tuple, List

    lines = spark.sparkContext.parallelize([
        "R-001, A1, 10, A2, 20, A3, 30",
        "R-002, X1, 20, Y2, 10",
        "R-003, Z4, 30, Z10, 5, N12, 38"])

    def processor(line: str) -> List[Tuple[str, str]]:
        # The first token is the key; every second token after it is a value name.
        tokens = line.split(",")
        key = tokens[0].strip()
        return [(key, v.strip()) for v in tokens[1::2]]

    # flatMap flattens the per-line lists into a single RDD of (key, value) pairs.
    lines.flatMap(processor).collect()

Output:

    [('R-001', 'A1'),
     ('R-001', 'A2'),
     ('R-001', 'A3'),
     ('R-002', 'X1'),
     ('R-002', 'Y2'),
     ('R-003', 'Z4'),
     ('R-003', 'Z10'),
     ('R-003', 'N12')]
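
If you need the result formatted exactly like the output shown in the question (one `key, value` string per line) rather than a list of tuples, a small `map` after the `flatMap` is enough. This is just a sketch built on the code above:

    # Optional: join each (key, value) pair back into a "key, value" string,
    # matching the format requested in the question.
    formatted = lines.flatMap(processor).map(lambda kv: f"{kv[0]}, {kv[1]}")
    formatted.collect()
    # ['R-001, A1', 'R-001, A2', 'R-001, A3', ...]

If the rows actually live in a text file rather than a hard-coded list, the same `processor` can be applied to an RDD created with `spark.sparkContext.textFile(...)` (path left out here) instead of `parallelize`.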