Changing the data types of multiple columns to different types in PySpark
I have a DataFrame (df3) with more than 50 columns of various data types, for example:
df3.printSchema()
root
|-- CtpJobId: string (nullable = true)
|-- TransformJobStateId: string (nullable = true)
|-- LastError: string (nullable = true)
|-- PriorityDate: string (nullable = true)
|-- QueuedTime: string (nullable = true)
|-- AccurateAsOf: string (nullable = true)
|-- SentToDevice: string (nullable = true)
|-- StartedAtDevice: string (nullable = true)
|-- ProcessStart: string (nullable = true)
|-- LastProgressAt: string (nullable = true)
|-- ProcessEnd: string (nullable = true)
|-- ClipFirstFrameNumber: string (nullable = true)
|-- ClipLastFrameNumber: double (nullable = true)
|-- SourceNamedLocation: string (nullable = true)
|-- TargetId: string (nullable = true)
|-- TargetNamedLocation: string (nullable = true)
|-- TargetDirectory: string (nullable = true)
|-- TargetFilename: string (nullable = true)
|-- Description: string (nullable = true)
|-- AssignedDeviceId: string (nullable = true)
|-- DeviceResourceId: string (nullable = true)
|-- DeviceName: string (nullable = true)
|-- srcDropFrame: string (nullable = true)
|-- srcDuration: double (nullable = true)
|-- srcFrameRate: double (nullable = true)
|-- srcHeight: double (nullable = true)
|-- srcMediaFormat: string (nullable = true)
|-- srcWidth: double (nullable = true)
Now I want to change all the columns that should share a type in one go, for example:
timestamp_type = [
'PriorityDate', 'QueuedTime', 'AccurateAsOf', 'SentToDevice',
'StartedAtDevice', 'ProcessStart', 'LastProgressAt', 'ProcessEnd'
]
integer_type = [
'ClipFirstFrameNumber', 'ClipLastFrameNumber', 'TargetId', 'srcHeight',
'srcMediaFormat', 'srcWidth'
]
I know how to do it one column at a time, like this:
from pyspark.sql.types import IntegerType, TimestampType

df3 = df3.withColumn("PriorityDate", df3["PriorityDate"].cast(TimestampType()))
df3 = df3.withColumn("QueuedTime", df3["QueuedTime"].cast(TimestampType()))
df3 = df3.withColumn("AccurateAsOf", df3["AccurateAsOf"].cast(TimestampType()))
df3 = df3.withColumn("srcMediaFormat", df3["srcMediaFormat"].cast(IntegerType()))
df3 = df3.withColumn("DeviceResourceId", df3["DeviceResourceId"].cast(IntegerType()))
df3 = df3.withColumn("AssignedDeviceId", df3["AssignedDeviceId"].cast(IntegerType()))
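But this looks ugly, and it is easy to miss a column that I want to change. Is there a way to write a function that casts a whole list of columns to the same type, so that I can implement something like convert_data_type and just pass in the column names?
Thanks in advance.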
Instead of enumerating every column, use a loop:
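for c in timestamp_type:
    # Cast each listed column to a timestamp, replacing it in place
    df3 = df3.withColumn(c, df3[c].cast(TimestampType()))
for c in integer_type:
    # Cast each listed column to an integer, replacing it in place
    df3 = df3.withColumn(c, df3[c].cast(IntegerType()))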
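The loop above can be wrapped in the kind of convert_data_type helper the question asks for. The following is only a sketch: the parameter names and the up-front check for missing columns are assumptions added for illustration, not part of the original answer. It relies on Column.cast accepting either a DataType instance or a type-name string such as "timestamp", so the helper itself needs no imports.

```python
def convert_data_type(df, columns, data_type):
    """Return a new DataFrame with every column in `columns` cast to `data_type`.

    `data_type` may be a pyspark.sql.types.DataType instance such as
    TimestampType(), or a type-name string such as "timestamp" or "int".
    """
    # Hypothetical safety check: report all misspelled or missing names at once.
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")
    for c in columns:
        # withColumn with an existing name replaces that column
        df = df.withColumn(c, df[c].cast(data_type))
    return df


# Usage with the lists from the question (assumes df3 already exists):
# df3 = convert_data_type(df3, timestamp_type, "timestamp")
# df3 = convert_data_type(df3, integer_type, "int")
```

The input DataFrame is not modified; the result has to be assigned back, as in the commented usage lines.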
Or, equivalently, you can use functools.reduce:
from functools import reduce  # not needed in Python 2

df3 = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(TimestampType())),
    timestamp_type,
    df3
)
df3 = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(IntegerType())),
    integer_type,
    df3
)
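With 50+ columns, a variation worth considering is to do all the casts in a single select rather than one withColumn call per column, since each withColumn adds a projection to the query plan. This is a different technique from the loop and reduce versions above, shown only as a sketch: the function name cast_columns and the dict-based mapping are hypothetical and chosen for illustration.

```python
def cast_columns(df, type_by_column):
    """Cast the columns named in `type_by_column` in one select.

    `type_by_column` maps column name -> target type (a DataType instance
    or a type-name string). Columns not in the mapping pass through
    unchanged, and the original column order is preserved.
    """
    unknown = sorted(set(type_by_column) - set(df.columns))
    if unknown:
        raise ValueError(f"Columns not found in DataFrame: {unknown}")
    return df.select([
        # alias keeps the original column name after the cast
        df[c].cast(type_by_column[c]).alias(c) if c in type_by_column else df[c]
        for c in df.columns
    ])


# Building the mapping from the question's lists (assumes df3 already exists):
# type_by_column = {c: "timestamp" for c in timestamp_type}
# type_by_column.update({c: "int" for c in integer_type})
# df3 = cast_columns(df3, type_by_column)
```

Keeping every cast in one mapping also makes it easier to see at a glance which columns are converted and to what.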