PySpark: create a DataFrame from an RDD of key/value pairs

Question

If I have an RDD of key/value pairs (where the key is the column index), is it possible to load it into a DataFrame? For example:

(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)

so that the DataFrame looks like this:

1,2,18
1,10,18
2,20,18

Answer 1

Yes, it's possible (tested with Spark 1.3.1):

>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]

Answer 2

rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
# toDF() is attached to RDDs once a SQLContext (or SparkSession) has been created
df = rdd.toDF(['id', 'score'])
df.show()

The output is:

+---+-----+
| id|score|
+---+-----+
|  0|    1|
|  0|    1|
|  0|    2|
|  1|    2|
|  1|   10|
|  1|   20|
|  3|   18|
|  3|   18|
|  3|   18|
+---+-----+
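
Note that both answers above produce a two-column (id, score) DataFrame, not the one-column-per-key layout shown in the question. A minimal sketch of one way to get that layout follows. It assumes the data is small enough to collect to the driver and that the order of values within each key is the order in which they appear. The helper names `pivot_pairs` and `pairs_rdd_to_dataframe` and the `c0`, `c1`, ... column names are made up for illustration; they are not part of Spark.

```python
from collections import defaultdict
from itertools import zip_longest


def pivot_pairs(pairs):
    """Turn (column_index, value) pairs into rows, one column per distinct key.

    Values keep the order in which they appear for each key; columns with
    fewer values are padded with None.
    """
    columns = defaultdict(list)
    for key, value in pairs:
        columns[key].append(value)
    keys = sorted(columns)
    # Transpose the per-key value lists into row tuples.
    rows = [tuple(row) for row in zip_longest(*(columns[k] for k in keys))]
    return keys, rows


def pairs_rdd_to_dataframe(sqlContext, rdd):
    """Hypothetical helper: collects the whole RDD to the driver first."""
    keys, rows = pivot_pairs(rdd.collect())
    names = ["c{}".format(k) for k in keys]  # assumed naming scheme
    return sqlContext.createDataFrame(rows, names)


pairs = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20), (3, 18), (3, 18), (3, 18)]
keys, rows = pivot_pairs(pairs)
print(keys)  # [0, 1, 3]
print(rows)  # [(1, 2, 18), (1, 10, 18), (2, 20, 18)]
```

With a live context, `pairs_rdd_to_dataframe(sqlContext, rdd).show()` should then display the three rows from the question. For data too large to collect, the same reshaping would have to be done inside Spark instead, and the row order would need an explicit index because grouped RDD values carry no guaranteed order.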

Tags: apache-spark, pyspark