Fix Dates in Pyspark DataFrame - set to minimum value

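I have a DataFrame with a timestamp field, RECEIPTDATEREQUESTED: timestamp. For some reason, some of the dates are earlier than 1900-01-01. I don't want these. For every value in the column where RECEIPTDATEREQUESTED < '1900-01-01 00:00:00', I want to set the timestamp to either 1900-01-01 or null. I have tried a few ways to do this, but it seems there must be a simpler approach. I thought something like this might work, but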

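import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType

# Oldest timestamp allowed; the format string must also cover the time part
OLDEST = datetime.datetime.strptime('1900-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')

def testdate(date_value):
    # Null timestamps reach the UDF as None; pass them through unchanged
    if date_value is None:
        return None
    # Raise anything earlier than the floor up to the floor
    if date_value < OLDEST:
        return OLDEST
    return date_value

udf_testdate = udf(testdate, TimestampType())
bdf = olddf.withColumn("RECEIPTDATEREQUESTED", udf_testdate(col("RECEIPTDATEREQUESTED")))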

You can use conditional evaluation with when and otherwise to set RECEIPTDATEREQUESTED to either null or 1900-01-01 00:00:00 when the value is < '1900-01-01 00:00:00'.


from pyspark.sql import functions as F

data = [("1000-01-01 00:00:00",), 
        ("1899-12-31 23:59:59",),
        ("1900-01-01 00:00:00",), 
        ("1901-01-01 00:00:00",)]

df = spark.createDataFrame(data, ("RECEIPTDATEREQUESTED",))\
          .withColumn("RECEIPTDATEREQUESTED", F.to_timestamp(F.col("RECEIPTDATEREQUESTED")))


# Fill null

df.withColumn("RECEIPTDATEREQUESTED", 
              F.when(F.col("RECEIPTDATEREQUESTED") < "1900-01-01 00:00:00", F.lit(None))
               .otherwise(F.col("RECEIPTDATEREQUESTED")))\
  .show(200, False)

# Fill default value

df.withColumn("RECEIPTDATEREQUESTED", 
              F.when(F.col("RECEIPTDATEREQUESTED") < "1900-01-01 00:00:00", F.lit("1900-01-01 00:00:00").cast("timestamp"))
               .otherwise(F.col("RECEIPTDATEREQUESTED")))\
  .show(200, False)

Output

Fill with null

+--------------------+
|RECEIPTDATEREQUESTED|
+--------------------+
|null                |
|null                |
|1900-01-01 00:00:00 |
|1901-01-01 00:00:00 |
+--------------------+

Fill with 1900-01-01 00:00:00

+--------------------+
|RECEIPTDATEREQUESTED|
+--------------------+
|1900-01-01 00:00:00 |
|1900-01-01 00:00:00 |
|1900-01-01 00:00:00 |
|1901-01-01 00:00:00 |
+--------------------+