PySpark: convert RDD[DenseVector] to dataframe
I have the following RDD:
rdd.take(5) gives me:
[DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]),
DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699])]
I want to turn it into a DataFrame that should look like this:
-------------------------------------------------------------------
| features |
-------------------------------------------------------------------
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0] |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
Is that possible? I tried df_new = sqlContext.createDataFrame(rdd, ['features']), but it didn't work. Does anyone have any suggestions? Thanks!
createDataFrame (and toDF) expects row-like records such as tuples, not bare vectors, so map to tuples first:
rdd.map(lambda x: (x, )).toDF(["features"])
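A minimal runnable sketch of this, assuming an active SparkSession named spark (the sample vectors mirror the question's data):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import DenseVector

spark = SparkSession.builder.getOrCreate()

# Sample RDD[DenseVector] like the one in the question
rdd = spark.sparkContext.parallelize([
    DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
    DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]),
])

# Wrap each vector in a one-element tuple so Spark can infer
# a single "features" column of VectorUDT
df = rdd.map(lambda x: (x,)).toDF(["features"])
df.show(truncate=False)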
Keep in mind that, as of Spark 2.0, there are two different Vector implementations, and the ml algorithms require pyspark.ml.linalg.Vector.
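If the RDD holds the legacy pyspark.mllib.linalg vectors, a sketch of converting them to the ml implementation before building the DataFrame (the OldVectors/NewVectors names are just local aliases; spark is the session from above):

from pyspark.mllib.linalg import Vectors as OldVectors
from pyspark.ml.linalg import Vectors as NewVectors

# RDD of legacy mllib vectors
legacy_rdd = spark.sparkContext.parallelize([
    OldVectors.dense([1.0, 2.0, 3.0]),
])

# Rebuild each vector with the ml implementation
# (equivalently, v.asML() since Spark 2.0)
ml_df = legacy_rdd.map(lambda v: (NewVectors.dense(v.toArray()),)).toDF(["features"])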