Unable to concatenate with the Apache Spark function to_timestamp() and add a column on Databricks using PySpark

Question

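I am trying to use concat together with to_timestamp() on an Apache Spark table and add the result as a column with the .withColumn function, but it does not work.

The code is as follows: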

DIM_WORK_ORDER.withColumn("LAST_MODIFICATION_DT", to_timestamp(concat(col('LAST_MOD_DATE'), lit(' '), col('LAST_MOD_TIME')), 'yyyyMMdd HHmmss'))

I expect to see a result like

LAST_MODIFICATION_DT | WORK_ORDER

However, I get the following result:

Some data to work with:

WORK_ORDER  LAST_MOD_TIME
10000008    null
11358186    142254
10000007    193402
10000009    null

Any ideas?

Answer 1

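As far as I know, DataFrames in Spark are immutable, so once a DataFrame is created it cannot be changed. withColumn does not modify DIM_WORK_ORDER in place; it returns a new DataFrame, and that result has to be assigned to a variable.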

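%python
from pyspark.sql.functions import col, concat, lit, to_timestamp

# Read the CSV with a header row; without a schema every column is read as a string.
df = spark.read.option("header", "true").csv("<input file path>")

# withColumn returns a new DataFrame, so keep the result in df1.
df1 = df.withColumn(
    "LAST_MODIFICATION_DT",
    to_timestamp(concat(col("LAST_MOD_DATE"), lit(" "), col("LAST_MOD_TIME")), "yyyyMMdd HHmmss"),
)
display(df1)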

With this I get the output I expected. If this is not what you were expecting, please provide more information.

apache-spark

pyspark

azure-databricks