Create Hive table on Parquet data partitioned by year, month, day

I created a DataFrame like this:

import spark.implicits._  // for toDF, given a SparkSession named spark

val df = Seq(
  (1,27,"bob",2020,9,3),
  (1,27,"jack",2020,9,3),
  (3,31,"tom",2020,9,4)
).toDF("id","age","nom","year","month","day")

which gives the following DataFrame:

+---+---+----+----+-----+---+
|id |age|nom |year|month|day|
+---+---+----+----+-----+---+
|1  |27 |bob |2020|9    |3  |
|1  |27 |jack|2020|9    |3  |
|3  |31 |tom |2020|9    |4  |
+---+---+----+----+-----+---+

Then I write the DataFrame to HDFS with partitionBy on year, month, and day:

import org.apache.spark.sql.SaveMode

df.write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "day")
  .parquet(s"$outputPath/test_hive")

The data lands on HDFS under Hive-style partition directories:

outputPath/test_hive/year=2020/month=9/day=3/
outputPath/test_hive/year=2020/month=9/day=4/

I would like to know how to create an external Hive table at location outputPath/test_hive that takes the year/month/day subdirectories into account.

I tried the following CREATE TABLE statements, but they don't work:

CREATE EXTERNAL TABLE test1 (id INT, age INT, nom STRING, year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 'outputPath/test_hive';

+-----------+------------+------------+--+
| test1.id  | test1.age  | test1.nom  |
+-----------+------------+------------+--+
| 1         | 27         | bob        |
| 1         | 27         | jack       |
| 3         | 31         | tom        |
+-----------+------------+------------+--+

CREATE EXTERNAL TABLE test2 (id INT, age INT, nom STRING)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 'outputPath/test_hive';

+-----------+------------+------------+-------------+--------------+------------+--+
| test2.id  | test2.age  | test2.nom  | test2.year  | test2.month  | test2.day  |
+-----------+------------+------------+-------------+--------------+------------+--+
+-----------+------------+------------+-------------+--------------+------------+--+

CREATE EXTERNAL TABLE test3(id int, age int, nom string) STORED AS PARQUET LOCATION 'outputPath/test_hive' PARTITIONED BY(year INT, month int , day INT);

Error while compiling statement: FAILED: ParseException line 1:138 missing EOF at 'PARTITIONED' near ''/outputPath/test_hive'' (state=42000,code=40000)

Your test2 definition is the right one: partition columns must be declared in PARTITIONED BY rather than in the column list (which is why test1 cannot surface year/month/day, since partitionBy does not write those columns into the Parquet files), and PARTITIONED BY has to come before STORED AS and LOCATION (which is why test3 fails to parse). The table comes back empty only because the Hive metastore does not yet know about the partition directories. Run msck repair table, or add the partitions to the table explicitly.

Example:

From Hive:

hive> msck repair table test2;

--or

hive> ALTER TABLE test2 ADD PARTITION (year='2020', month='9', day='3') location '/outputPath/test_hive/year=2020/month=9/day=3';
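
Note that ALTER TABLE ... ADD PARTITION registers one partition at a time, so the day=4 directory from the sample data would need its own statement, whereas msck repair table discovers all partition directories in one pass:

hive> ALTER TABLE test2 ADD PARTITION (year='2020', month='9', day='4') location '/outputPath/test_hive/year=2020/month=9/day=4';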

From Spark:

spark.sql("ALTER TABLE test3 ADD PARTITION (year='2020', month='9',day='3') location '/outputPath/test_hive/year=2020/month=9/day=3'")

//or

spark.sql("msck repair table test3")