Convert list of lists to pyspark dataframe?
I can't convert the following list of lists to a pyspark dataframe.
lst = [[1, 'A', 'aa'], [2, 'B', 'bb'], [3, 'C', 'cc']]
cols = ['col1', 'col2', 'col3']
Desired output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|  aa|
|   2|   B|  bb|
|   3|   C|  cc|
+----+----+----+
I'm essentially looking for the pyspark equivalent of this pandas call:
df = pd.DataFrame(data=lst, columns=cols)
If you have the pandas package installed, you can use spark.createDataFrame to bring a pandas dataframe into pyspark:
import pandas as pd
from pyspark.sql import SparkSession

lst = [[1, 'A', 'aa'], [2, 'B', 'bb'], [3, 'C', 'cc']]
cols = ['col1', 'col2', 'col3']

# Build the pandas DataFrame first
df = pd.DataFrame(data=lst, columns=cols)

# Create a local PySpark SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("spark") \
    .getOrCreate()

# Create a PySpark DataFrame from the pandas DataFrame
sparkDF = spark.createDataFrame(df)
sparkDF.printSchema()
sparkDF.show()
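Here createDataFrame infers the Spark column types from the pandas dtypes, so for this data printSchema() should print something like the following (pandas int64 maps to long; exact types may vary with your pandas and Spark versions):

root
 |-- col1: long (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)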
Alternatively, you can do it without pandas:
from pyspark.sql import SparkSession

lst = [[1, 'A', 'aa'], [2, 'B', 'bb'], [3, 'C', 'cc']]
cols = ['col1', 'col2', 'col3']

# Create a local PySpark SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("spark") \
    .getOrCreate()

# Create the DataFrame straight from the list of lists,
# then rename the columns with toDF
df = spark.createDataFrame(lst).toDF(*cols)
df.printSchema()
df.show()
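A shorter equivalent of the toDF version is to pass the column names directly as the schema argument: spark.createDataFrame(lst, cols). If you want to control the column types yourself instead of relying on inference, you can pass an explicit StructType. A minimal sketch, assuming the same lst and spark session as above:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema: col1 is an integer, col2 and col3 are strings
schema = StructType([
    StructField('col1', IntegerType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])

df = spark.createDataFrame(lst, schema=schema)
df.printSchema()
df.show()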