Changing string to timestamp in Pyspark
I am trying to convert string columns to timestamp columns, with data in the following format:
c1 | c2
---|---
2019-12-10 10:07:54.000 | 2019-12-13 10:07:54.000
2020-06-08 15:14:49.000 | 2020-06-18 10:07:54.000
from pyspark.sql.functions import col, udf, to_timestamp
joined_df.select(to_timestamp(joined_df.c1, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
joined_df.select(to_timestamp(joined_df.c2, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
After converting the dates, I want a new column with the date difference, obtained by subtracting c2 - c1.
In Python (pandas) I am doing:
from datetime import datetime

df['c1'] = df['c1'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['c2'] = df['c2'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['days'] = (df['c2'] - df['c1']).apply(lambda x: x.days)
Can anyone help with how to convert this to PySpark?
If you want to get the date difference, you can use datediff:
import pyspark.sql.functions as F

# Cast the string columns to timestamps (the default format matches these strings)
df = df.withColumn('c1', F.col('c1').cast('timestamp')).withColumn('c2', F.col('c2').cast('timestamp'))

# datediff(end, start) returns the number of days between the two dates
result = df.withColumn('days', F.datediff(F.col('c2'), F.col('c1')))
result.show(truncate=False)
+-----------------------+-----------------------+----+
|c1 |c2 |days|
+-----------------------+-----------------------+----+
|2019-12-10 10:07:54.000|2019-12-13 10:07:54.000|3 |
|2020-06-08 15:14:49.000|2020-06-18 10:07:54.000|10 |
+-----------------------+-----------------------+----+
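A note on the to_timestamp attempt in the question: Spark's to_timestamp expects a Java-style datetime pattern, not Python strftime codes, so '%Y-%m-%d %H:%M:%S.%SSSS' yields null. A minimal sketch with the equivalent Spark pattern, reusing joined_df and the column names from the question:

from pyspark.sql.functions import col, to_timestamp

# Spark's pattern for '2019-12-10 10:07:54.000' is 'yyyy-MM-dd HH:mm:ss.SSS'
joined_df = joined_df \
    .withColumn('c1', to_timestamp(col('c1'), 'yyyy-MM-dd HH:mm:ss.SSS')) \
    .withColumn('c2', to_timestamp(col('c2'), 'yyyy-MM-dd HH:mm:ss.SSS'))

For strings in this default ISO-like layout, the plain cast('timestamp') above works just as well; an explicit pattern only matters for non-standard formats.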
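One caveat: datediff compares calendar dates and ignores the time of day, whereas the pandas (c2 - c1).days counts whole 24-hour periods, so the second row gives 9 in pandas but 10 from datediff. If you need the pandas semantics, a sketch based on unix_timestamp (assuming c1 and c2 have already been cast to timestamps):

import pyspark.sql.functions as F

# Whole elapsed 24-hour periods, matching pandas' timedelta.days
df = df.withColumn('days', F.floor((F.unix_timestamp('c2') - F.unix_timestamp('c1')) / 86400))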