Fix Dates in Pyspark DataFrame - set to minimum value
I have a dataframe with a timestamp field - RECEIPTDATEREQUESTED:timestamp
For some reason, some of the dates are earlier than 1900-01-01. I don't want these; what I'd like to do is, for every value in the column where RECEIPTDATEREQUESTED < '1900-01-01 00:00:00', set the timestamp to 1900-01-01 or to null.
I have tried several ways to do this, but it seems there must be a simpler approach. I thought something like the following might work, but there must be a better way:
import datetime

from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType

def testdate(date_value):
    # The format string has to match the full timestamp, not just the date part,
    # or strptime raises ValueError before the try block is ever entered
    oldest = datetime.datetime.strptime('1900-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
    try:
        if date_value < oldest:
            return oldest
        else:
            return date_value
    except (ValueError, TypeError):  # TypeError covers a null (None) input
        return oldest

udf_testdate = udf(testdate, TimestampType())

bdf = olddf.withColumn("RECEIPTDATEREQUESTED", udf_testdate(col("RECEIPTDATEREQUESTED")))
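For reference, a minimal way to exercise the corrected UDF might look like the following; the two-row test frame is hypothetical and a live SparkSession named spark is assumed:

from pyspark.sql.functions import to_timestamp

# Hypothetical two-row test frame; the column name matches the real data
test_df = spark.createDataFrame(
    [("1850-06-01 12:00:00",), ("1975-03-15 08:30:00",)],
    ("RECEIPTDATEREQUESTED",),
).withColumn("RECEIPTDATEREQUESTED", to_timestamp(col("RECEIPTDATEREQUESTED")))

# The 1850 row should come back clamped to 1900-01-01 00:00:00
test_df.withColumn("RECEIPTDATEREQUESTED",
                   udf_testdate(col("RECEIPTDATEREQUESTED"))).show(truncate=False)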
You can use conditional evaluation with when and otherwise to set RECEIPTDATEREQUESTED to null or to 1900-01-01 00:00:00 whenever the value is < '1900-01-01 00:00:00'.
from pyspark.sql import functions as F

data = [("1000-01-01 00:00:00",),
        ("1899-12-31 23:59:59",),
        ("1900-01-01 00:00:00",),
        ("1901-01-01 00:00:00",)]

df = spark.createDataFrame(data, ("RECEIPTDATEREQUESTED",))\
          .withColumn("RECEIPTDATEREQUESTED", F.to_timestamp(F.col("RECEIPTDATEREQUESTED")))

# Fill null
df.withColumn("RECEIPTDATEREQUESTED",
              F.when(F.col("RECEIPTDATEREQUESTED") < "1900-01-01 00:00:00", F.lit(None))
               .otherwise(F.col("RECEIPTDATEREQUESTED")))\
  .show(200, False)

# Fill default value
df.withColumn("RECEIPTDATEREQUESTED",
              F.when(F.col("RECEIPTDATEREQUESTED") < "1900-01-01 00:00:00", F.lit("1900-01-01 00:00:00").cast("timestamp"))
               .otherwise(F.col("RECEIPTDATEREQUESTED")))\
  .show(200, False)
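This when/otherwise version is preferable to a Python UDF here: the conditional compiles down to a native Spark SQL expression, so rows never have to be serialized out to a Python worker, and nulls are handled by the comparison itself (null < '1900-01-01 00:00:00' evaluates to null, which is treated as false and falls through to otherwise, leaving the null in place).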
Output
Fill with null
+--------------------+
|RECEIPTDATEREQUESTED|
+--------------------+
|null |
|null |
|1900-01-01 00:00:00 |
|1901-01-01 00:00:00 |
+--------------------+
Fill with 1900-01-01 00:00:00
+--------------------+
|RECEIPTDATEREQUESTED|
+--------------------+
|1900-01-01 00:00:00 |
|1900-01-01 00:00:00 |
|1900-01-01 00:00:00 |
|1901-01-01 00:00:00 |
+--------------------+
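As a side note, the clamp-to-minimum variant can also be written without a conditional at all; the following is a sketch using F.greatest on the same df:

# Equivalent clamp for the "fill with default" case: greatest() returns the
# later of the two timestamps, so anything before 1900-01-01 is raised to it.
# Caveat: greatest() skips nulls, so a null input comes back as 1900-01-01
# rather than staying null, unlike the when/otherwise version above.
df.withColumn("RECEIPTDATEREQUESTED",
              F.greatest(F.col("RECEIPTDATEREQUESTED"),
                         F.lit("1900-01-01 00:00:00").cast("timestamp")))\
  .show(200, False)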