Spark timestamp type is not accepted by Hive timestamp

Question

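I have a Spark DataFrame that contains a timestamp field. I write the DataFrame to an HDFS location on which a Hive external table is created. The Hive table has a field of type timestamp. However, when reading the data from the external location, Hive populates the timestamp field with blank values. My Spark DataFrame query: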

df.select(
  $"ipAddress", $"clientIdentd", $"userId",
  to_timestamp(unix_timestamp($"dateTime", "dd/MMM/yyyy:HH:mm:ss Z").cast("timestamp")).as("dateTime"),
  $"method", $"endpoint", $"protocol", $"responseCode",
  $"contentSize", $"referrerURL", $"browserInfo")

Hive create table statement:

CREATE EXTERNAL TABLE `finalweblogs3`(
   `ipAddress` string,
   `clientIdentd` string,
   `userId` string,
   `dateTime` timestamp,
   `method` string,
   `endpoint` string,
   `protocol` string,
   `responseCode` string,
   `contentSize` string,
   `referrerURL` string,
   `browserInfo` string)
 ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
 WITH SERDEPROPERTIES (
   'field.delim'=',',
   'serialization.format'=',')
 STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
   'hdfs://localhost:9000/streaming/spark/finalweblogs3'

I don't understand why this happens.

Answer 1

I solved this by changing the storage format to Parquet. I still don't know why it does not work with the CSV format.
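A likely explanation, offered as an assumption because the question does not show how the files were written: if the DataFrame was saved with Spark's CSV writer and default options, timestamps are written in an ISO 8601 layout with a `T` separator and a zone offset, while Hive's text SerDe expects `yyyy-MM-dd HH:mm:ss[.fffffffff]` and returns NULL for values it cannot parse. The sketch below sets the writer's `timestampFormat` option so the CSV output uses the layout Hive reads by default. The object name, the sample row, and the output path are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp}

object WriteWeblogsForHive {
  // Layout that Hive's text SerDe accepts for TIMESTAMP columns by default.
  val HiveTimestampFormat = "yyyy-MM-dd HH:mm:ss"

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("weblogs-to-hive")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical one-row stand-in for the parsed web log DataFrame.
    val df = Seq(("127.0.0.1", "10/Oct/2000:13:55:36 -0700"))
      .toDF("ipAddress", "dateTime")
      .withColumn("dateTime", to_timestamp(col("dateTime"), "dd/MMM/yyyy:HH:mm:ss Z"))

    // Hypothetical output directory; in the question this would be the table LOCATION.
    val outputPath = if (args.nonEmpty) args(0) else "/tmp/finalweblogs3_csv"

    df.write
      .mode(SaveMode.Overwrite)
      .option("timestampFormat", HiveTimestampFormat) // instead of the ISO 8601 default
      .csv(outputPath)

    spark.stop()
  }
}
```

Parquet avoids the problem because the timestamp is stored as a typed value rather than as text, so no string layout has to match between Spark and Hive.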

hive

apache-spark

spark-dataframe