Programmatically specifying the schema in PySpark
I am trying to create a DataFrame from an RDD, and I want to specify the schema explicitly. Below is the code snippet I tried.
from pyspark.sql.types import StructField, StructType , LongType, StringType
stringJsonRdd_new = sc.parallelize(('{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown" }',\
'{ "id": "234","name": "Michael", "age": 22, "eyeColor": "green" }',\
'{ "id": "345", "name": "Simone", "age": 23, "eyeColor": "blue" }'))
mySchema = StructType([StructField("id", LongType(), True), StructField("age", LongType(), True), StructField("eyeColor", StringType(), True), StructField("name", StringType(),True)])
new_df = sqlContext.createDataFrame(stringJsonRdd_new,mySchema)
new_df.printSchema()
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- eyeColor: string (nullable = true)
|-- name: string (nullable = true)
When I try new_df.show(), I get the error:
ValueError: Unexpected tuple '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown" }' with StructType
Can anyone help me?
PS: I am able to cast the type explicitly and create a new df from an existing df using the following:
casted_df = stringJsonDf.select(stringJsonDf.age,stringJsonDf.eyeColor, stringJsonDf.name,stringJsonDf.id.cast('int').alias('new_id'))
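You are giving the DataFrame strings as input instead of dictionaries, so it cannot map them onto the types you defined.
Modify your code as follows (and also change "id" in the data to a number rather than a string, or alternatively change the struct type of "id" from LongType to StringType):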
from pyspark.sql.types import StructField, StructType, LongType, StringType
# give dictionaries instead of strings:
stringJsonRdd_new = sc.parallelize((
    {"id": 123, "name": "Katie", "age": 19, "eyeColor": "brown"},
    {"id": 234, "name": "Michael", "age": 22, "eyeColor": "green"},
    {"id": 345, "name": "Simone", "age": 23, "eyeColor": "blue"}))
mySchema = StructType([StructField("id", LongType(), True), StructField("age", LongType(), True), StructField("eyeColor", StringType(), True), StructField("name", StringType(), True)])
new_df = sqlContext.createDataFrame(stringJsonRdd_new, mySchema)
new_df.printSchema()
new_df.show()
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- eyeColor: string (nullable = true)
|-- name: string (nullable = true)
+---+---+--------+-------+
| id|age|eyeColor|   name|
+---+---+--------+-------+
|123| 19|   brown|  Katie|
|234| 22|   green|Michael|
|345| 23|    blue| Simone|
+---+---+--------+-------+
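If the records really do arrive as JSON strings, as in the question, one option is to parse each string into a dictionary before building the DataFrame. The following is only a minimal sketch in plain Python: `parse_record` is a hypothetical helper name, and it assumes every record carries an "id" that can be converted to an integer so that it fits the LongType field.

```python
import json


def parse_record(line):
    """Turn one JSON string into a dict whose "id" is an int (hypothetical helper)."""
    record = json.loads(line)
    # The source data stores "id" as a string; LongType expects an integer.
    record["id"] = int(record["id"])
    return record


# Same sample records as in the question, still as raw JSON strings.
raw = [
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown" }',
    '{ "id": "234","name": "Michael", "age": 22, "eyeColor": "green" }',
    '{ "id": "345", "name": "Simone", "age": 23, "eyeColor": "blue" }',
]

records = [parse_record(line) for line in raw]
print(records)
```

In the Spark job, the same function could be applied to the original RDD with stringJsonRdd_new.map(parse_record), and the resulting RDD passed to createDataFrame together with mySchema.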
Hope this helps, good luck!