Populate a pyspark dataframe with DATE sample data

I am trying to create and populate a pyspark dataframe with date values.

Columns = ["EmployeeNo", "Name", "EmployeeID", "ValidFrom", "ValidTo"]
Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),
       ]

DfEmployee = spark.createDataFrame(Data, Columns)
DfEmployee.show()

This gives:

+----------+----------------+----------+----------+----------+
|EmployeeNo|            Name|EmployeeID| ValidFrom|   ValidTo|
+----------+----------------+----------+----------+----------+
|       100|   Hilmar Buchta|        HB|2000-01-01|2999-12-31|
+----------+----------------+----------+----------+----------+

This looks right, but the ValidFrom and ValidTo values are strings, not dates. How can I populate the df columns with date-typed values in one step?

I searched Stack Overflow for a while and tried this:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType, BooleanType

Schema = StructType([
                      StructField('EmployeeNo', IntegerType(), False),
                      StructField('Name', StringType(), False),
                      StructField('EmployeeID', StringType(), False),
                      StructField('ValidFrom', DateType(), False),
                      StructField('ValidTo', DateType(), False)
                  ])

Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),]
DfEmployee = spark.createDataFrame(Data, Schema)

This gives:

TypeError: field ValidFrom: DateType can not accept object '2000-01-01' in type <class 'str'>

So I tried:

Data = [(100, "Hilmar Buchta", "HB", F.to_date("2000-01-01", "yyyy-MM-dd"), F.to_date("2999-12-31", "yyyy-MM-dd")),]
DfEmployee = spark.createDataFrame(Data, Schema)

This gives:

TypeError: field ValidFrom: DateType can not accept object Column<b"to_date(2000-01-01, 'yyyy-MM-dd')"> in type <class 'pyspark.sql.column.Column'>

You can pass Python datetime.date objects instead of strings:

import datetime

Data = [
    (100, "Hilmar Buchta", "HB", datetime.date(2000, 1, 1), datetime.date(2999, 12, 31)),
]

DfEmployee = spark.createDataFrame(Data, Columns)

DfEmployee.printSchema()

#root
# |-- EmployeeNo: long (nullable = true)
# |-- Name: string (nullable = true)
# |-- EmployeeID: string (nullable = true)
# |-- ValidFrom: date (nullable = true)
# |-- ValidTo: date (nullable = true)

Or convert the strings to Python date objects:

from datetime import datetime

# Note: the month directive is %m; %M means minutes.
Data = [
    (100, "Hilmar Buchta", "HB",
     datetime.strptime("2000-01-01", "%Y-%m-%d").date(),
     datetime.strptime("2999-12-31", "%Y-%m-%d").date()),
]
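
If the sample rows already exist as tuples of strings, the conversion can be applied to the whole list before calling createDataFrame. The following is only a minimal sketch: the helper name with_dates, its date_positions parameter, and the RawData variable are made up for illustration and are not part of PySpark. It assumes the strings are in ISO "YYYY-MM-DD" form and uses date.fromisoformat (Python 3.7+).

```python
from datetime import date

# Hypothetical helper (not part of PySpark or the original post): convert the
# values at the given tuple positions from ISO "YYYY-MM-DD" strings to
# datetime.date objects and leave all other values unchanged.
def with_dates(rows, date_positions):
    converted = []
    for row in rows:
        converted.append(tuple(
            date.fromisoformat(value) if i in date_positions else value
            for i, value in enumerate(row)
        ))
    return converted

# Sample rows as plain strings, as in the question.
RawData = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31")]

# Positions 3 and 4 hold ValidFrom and ValidTo.
Data = with_dates(RawData, date_positions={3, 4})
```

The resulting Data holds datetime.date objects in the ValidFrom and ValidTo positions, so it can be passed to spark.createDataFrame in the same way as the hand-written list above.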

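The function F.to_date only works on DataFrame columns: it returns a Column expression, not a Python value, so it cannot be used inside the data passed to createDataFrame. You can convert the strings to dates after creating the df, for example: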

df = df.withColumn("ValidFrom", F.to_date("ValidFrom", "yyyy-MM-dd"))