Issues changing the datatype of an array from double to int

I have a dataset, and I am trying to write a Python program that changes the datatype at the schema level when loading the file in Databricks. I keep getting an error when changing the datatype of the array from DOUBLE to INT.

Schema

root
 |-- _id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- loc: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- pop: long (nullable = true)
 |-- state: string (nullable = true)

My code

from pyspark.sql.types import *

s = StructType([
    StructField("_id", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("loc", ArrayType(), True),
    StructField("element", DoubleType(), True),
    StructField("pop", LongType(), True),
    StructField("state", StringType(), True)
])

filepath= "/FileStore/tables/zips.json"
df2= spark.read.format("json").load(filepath, schema=s)
df2.show()

Error

TypeError: __init__() missing 1 required positional argument: 'elementType'
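
For reference, the exception comes straight from the ArrayType constructor, which requires an elementType argument; it can be reproduced in isolation:

from pyspark.sql.types import ArrayType

ArrayType()  # TypeError: __init__() missing 1 required positional argument: 'elementType'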

Sample data
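
(The file itself is not shown in the question; zips.json is presumably the well-known MongoDB ZIP-code sample dataset, in which a record looks like this:)

{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" }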

Your schema definition is incorrect. ArrayType requires an elementType argument, and `element` in the printed schema is not a standalone field; it is the element type of the `loc` array, so it does not need its own StructField.

from pyspark.sql import functions as F
from pyspark.sql.types import *

s = StructType([
    StructField("_id", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("loc", ArrayType(DoubleType()), True),
    StructField("pop", LongType(), True),
    StructField("state", StringType(), True)
])

# "flatten" `lat` and `lon` from `loc` array
filepath= "/FileStore/tables/zips.json"
df2= (spark
    .read.format("json").load(filepath, schema=s)
    .withColumn('loc', F.array(
      F.col('loc')[0].cast('int'),
      F.col('loc')[1].cast('int')
    ))
)
df2.show()

# +---+----+--------+-----+-----+
# |_id|city|     loc|  pop|state|
# +---+----+--------+-----+-----+
# |  1|  CC|[77, 77]|12345|   SS|
# +---+----+--------+-----+-----+
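
As a side note, building the array element by element hard-codes its length; the whole column can also be cast in one step. A minimal sketch, reusing the `s` and `filepath` defined above:

df3 = (spark
    .read.format("json").load(filepath, schema=s)
    .withColumn('loc', F.col('loc').cast(ArrayType(IntegerType())))
)
df3.printSchema()
# root
#  |-- _id: integer (nullable = true)
#  |-- city: string (nullable = true)
#  |-- loc: array (nullable = true)
#  |    |-- element: integer (containsNull = true)
#  |-- pop: long (nullable = true)
#  |-- state: string (nullable = true)

In recent Spark versions the DDL string form `.cast('array<int>')` should work as well.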

You missed passing an argument to ArrayType(elementType).

Error: elementType should be a DataType.

from pyspark.sql.types import *

ArrayType(IntegerType())
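
ArrayType also takes an optional containsNull flag that defaults to True, which is where the `containsNull = true` in the printed schema comes from:

from pyspark.sql.types import ArrayType, IntegerType

ArrayType(IntegerType())         # array of int, containsNull defaults to True
ArrayType(IntegerType(), False)  # array of int that disallows null elements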

Read more here: Documentation