How to transform multiple dataframe columns into one numpy array column
I have a dataframe like the one below:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import numpy as np

# SparkConf's first argument is loadDefaults, so set the master explicitly
config = SparkConf().setMaster("local")
sc = SparkContext(conf=config)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)],
    ["doc", "word1", "word2", "word3"],
)
Now I need to keep the first column and collapse the remaining columns into a single numpy array (two columns: "doc" and one numpy array column).
I know that
sdf = np.array(df.select([c for c in df.columns if c not in {'doc'}]).collect())
print(sdf)
converts all of the columns into a numpy array, but how do I attach the first column to that array? Any help is appreciated.
Unfortunately, you cannot create a numpy.array column in a pyspark dataframe, but you can use a regular Python list instead and convert it when you read the rows back:
>>> df = sqlContext.createDataFrame(
...     [("doc_3", [1, 3, 9]), ("doc_1", [9, 6, 0]), ("doc_2", [9, 9, 3])],
...     ["doc", "words"],
... )
>>> df.show()
+-----+---------+
| doc| words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
>>> df
DataFrame[doc: string, words: array<bigint>]
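When the rows come back to the driver, each list converts to a numpy array directly. A minimal sketch (the rows and doc_vectors names are just for illustration):

>>> import numpy as np
>>> rows = df.collect()  # list of Row(doc=..., words=[...])
>>> doc_vectors = {r["doc"]: np.array(r["words"]) for r in rows}
>>> doc_vectors["doc_3"]
array([1, 3, 9])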
To get to this from the 4 columns you have, you can:
>>> from pyspark.sql.functions import array
>>> df2 = df.select("doc", array("word1", "word2", "word3").alias("words"))
>>> df2
DataFrame[doc: string, words: array<bigint>]
>>> df2.show()
+-----+---------+
| doc| words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
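If you don't want to list the word columns by hand, the same filter from your question works inside array(). A sketch, assuming every column other than "doc" is a word column:

>>> word_cols = [c for c in df.columns if c != "doc"]
>>> df2 = df.select("doc", array(*word_cols).alias("words"))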