Programmatically specifying the schema in PySpark

I am trying to create a DataFrame from an RDD, and I want to specify the schema explicitly. Below is the code snippet I tried.

from pyspark.sql.types import StructField, StructType , LongType, StringType

stringJsonRdd_new = sc.parallelize(('{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"  }',\
'{ "id": "234","name": "Michael", "age": 22, "eyeColor": "green"  }',\
'{ "id": "345", "name": "Simone", "age": 23, "eyeColor": "blue" }'))

mySchema = StructType([StructField("id", LongType(), True), StructField("age", LongType(), True), StructField("eyeColor", StringType(), True), StructField("name", StringType(),True)])
new_df = sqlContext.createDataFrame(stringJsonRdd_new,mySchema)
new_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
 |-- name: string (nullable = true)

When I call new_df.show(), I get this error:

ValueError: Unexpected tuple '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"  }' with StructType

Can anyone help me?

PS: I am able to cast the types explicitly and create a new df from an existing df with the following:

casted_df = stringJsonDf.select(stringJsonDf.age,stringJsonDf.eyeColor, stringJsonDf.name,stringJsonDf.id.cast('int').alias('new_id'))

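You are passing strings to createDataFrame rather than dictionaries, so it cannot map each record onto the fields and types you defined in the schema.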

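It works if you modify your code as below. Note that "id" in the data also has to become a number instead of a string (alternatively, change the type of "id" in the schema from LongType to StringType):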

from pyspark.sql.types import StructField, StructType , LongType, StringType

# give dictionaries instead of strings:
stringJsonRdd_new = sc.parallelize((
{"id": 123, "name": "Katie", "age": 19, "eyeColor": "brown"  },\
{ "id": 234,"name": "Michael", "age": 22, "eyeColor": "green"  },\
{ "id": 345, "name": "Simone", "age": 23, "eyeColor": "blue" }))

mySchema = StructType([StructField("id", LongType(), True), StructField("age", LongType(), True), StructField("eyeColor", StringType(), True), StructField("name", StringType(),True)])

new_df = sqlContext.createDataFrame(stringJsonRdd_new,mySchema)
new_df.printSchema()
new_df.show()


root
 |-- id: long (nullable = true)
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
 |-- name: string (nullable = true)

+---+---+--------+-------+
| id|age|eyeColor|   name|
+---+---+--------+-------+
|123| 19|   brown|  Katie|
|234| 22|   green|Michael|
|345| 23|    blue| Simone|
+---+---+--------+-------+
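If the records really do arrive as JSON strings, as in the question, one option is to parse each string into a dictionary before applying the schema. The following is only a minimal sketch: parse_record and json_rdd_to_df are hypothetical helper names that are not part of PySpark, and the sketch assumes every record has an "id" that can be converted to an integer.

```python
import json

# The same records as in the question, still as JSON strings.
raw_lines = [
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"}',
    '{"id": "234", "name": "Michael", "age": 22, "eyeColor": "green"}',
    '{"id": "345", "name": "Simone", "age": 23, "eyeColor": "blue"}',
]


def parse_record(line):
    """Hypothetical helper: turn one JSON string into a dict that fits mySchema."""
    record = json.loads(line)
    # "id" is a string in the raw data; LongType expects an integer.
    record["id"] = int(record["id"])
    return record


def json_rdd_to_df(sql_context, json_rdd, schema):
    """Hypothetical helper: parse every string, then apply the explicit schema."""
    return sql_context.createDataFrame(json_rdd.map(parse_record), schema)


# The parsing step can be checked locally without a Spark session.
parsed = [parse_record(line) for line in raw_lines]
print(parsed[0])
```

With the names from the question, this would be called as new_df = json_rdd_to_df(sqlContext, stringJsonRdd_new, mySchema).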

Hope this helps, good luck!