Order by timestamp is not working for a date-time column in Scala Spark
This is my DataFrame:
+-------------+-------------------------+--------------+--------+---------+--------------------+------------------+----------------+----------------------------------+--------------------+-----------------------+-----------------------+-----------+-----------------------------------+--------------------------------+------------------------------+------------+
|DataPartition|TimeStamp |OrganizationID|SourceID|AuditorID|AuditorEnumerationId|AuditorOpinionCode|AuditorOpinionId|AuditorOpinionOnInternalControlsId|IsPlayingAuditorRole|IsPlayingCSRAuditorRole|IsPlayingTaxAdvisorRole|FFAction|!||AuditorOpinionOnInternalControlCode|AuditorOpinionOnGoingConcernCode|AuditorOpinionOnGoingConcernId|tobefiltered|
+-------------+-------------------------+--------------+--------+---------+--------------------+------------------+----------------+----------------------------------+--------------------+-----------------------+-----------------------+-----------+-----------------------------------+--------------------------------+------------------------------+------------+
|Japan |2018-04-04T09:53:35+00:00|4295877275 |181 |3185 |3023399 |UNQ |3010546 |3010546 |true |false |false |O|!| |null |null |null |O|!| |
|Japan |2018-04-04T08:36:57+00:00|4295877275 |189 |3185 |3023399 |UNQ |3010546 |3010546 |true |false |false |O|!| |null |null |null |O|!| |
|Japan |2018-04-04T08:39:19+00:00|4295877275 |173 |3185 |3023399 |UNQ |3010546 |3010546 |true |false |false |O|!| |null |null |null |O|!| |
|Japan |2018-04-04T08:24:17+00:00|4295877275 |196 |5913 |3026579 |UWE |3010547 |null |true |false |false |I|!| |null |null |null |I|!| |
|Japan |2018-04-04T08:24:17+00:00|4295877275 |196 |3185 |3023399 |UNQ |3010546 |3010546 |true |false |false |I|!| |null |null |null |I|!| |
|Japan |2018-04-04T09:53:35+00:00|4295877275 |196 |null |null |null |null |null |null |null |null |D|!| |null |null |null |I|!| |
+-------------+-------------------------+--------------+--------+---------+--------------------+------------------+----------------+----------------------------------+--------------------+-----------------------+-----------------------+-----------+-----------------------------------+--------------------------------+------------------------------+------------+
This is what I am doing to get the latest record for each combination of two key columns:
val windowSpec3 = Window.partitionBy("OrganizationID", "SourceID").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd HH:mm:ss.SSS").cast("timestamp").desc)
val latestForEachKey3 = latestForEachKey.withColumn("rank", row_number.over(windowSpec3)).filter($"rank" === 1).drop("rank").drop("tobefiltered", "TimeStamp")
latestForEachKey3.show(false)
This gives me the output below:
+-------------+--------------+--------+---------+--------------------+------------------+----------------+----------------------------------+--------------------+-----------------------+-----------------------+-----------+-----------------------------------+--------------------------------+------------------------------+
|DataPartition|OrganizationID|SourceID|AuditorID|AuditorEnumerationId|AuditorOpinionCode|AuditorOpinionId|AuditorOpinionOnInternalControlsId|IsPlayingAuditorRole|IsPlayingCSRAuditorRole|IsPlayingTaxAdvisorRole|FFAction|!||AuditorOpinionOnInternalControlCode|AuditorOpinionOnGoingConcernCode|AuditorOpinionOnGoingConcernId|
+-------------+--------------+--------+---------+--------------------+------------------+----------------+----------------------------------+--------------------+-----------------------+-----------------------+-----------+-----------------------------------+--------------------------------+------------------------------+
|Japan |4295877275 |181 |3185 |3023399 |UNQ |3010546 |3010546 |true |false |false |O|!| |null |null |null |
|Japan |4295877275 |189 |3185 |3023399 |UNQ |3010546 |3010546 |true |false |false |O|!| |null |null |null |
|Japan |4295877275 |173 |3185 |3023399 |UNQ |3010546 |3010546 |true |false |false |O|!| |null |null |null |
|Japan |4295877275 |196 |5913 |3026579 |UWE |3010547 |null |true |false |false |I|!| |null |null |null |
+-------------+--------------+--------+---------+--------------------+------------------+----------------+----------------------------------+--------------------+-----------------------+-----------------------+-----------+-----------------------------------+--------------------------------+------------------------------+
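So, according to this logic, out of the three rows that share the same key (SourceID 196) I should get the row with the following timestamp: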
2018-04-04T09:53:35+00:00|4295877275 |196 |null |null
The problem is that I do get the rank, but .orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd HH:mm:ss.SSS").cast("timestamp").desc) is not working as expected.
I also tried the format YYYY-MM-DDThh:mm:ssTZD, but the result is the same.
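The timestamp format you are using is wrong. It does not match values such as 2018-04-04T09:53:35+00:00, which have a literal T between the date and the time and no milliseconds. When the pattern does not match, unix_timestamp normally returns null, so every row in a partition gets the same sort key and row_number picks an arbitrary row.
Instead of
"yyyy-MM-dd HH:mm:ss.SSS"
use
"yyyy-MM-dd'T'HH:mm:ss"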