How to define a partitioned external table for a nested directory structure
For a set of data files stored in hdfs in a year/*.csv structure, like the following:
$ hdfs dfs -ls air/
Found 21 items
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2000
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2001
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2002
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2003
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2004
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2005
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2006
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2007
drwxr-xr-x - hadoop hadoop 0 2019-03-08 01:45 air/year=2008
There are 12 csv files inside - one for each month. Since our queries don't care about month granularity, all the months of a year can simply go into a single directory. Here are the contents for one of the years; note that these are .csv files:
[hadoop@ip-172-31-25-82 ~]$ hdfs dfs -ls air/year=2008
Found 10 items
-rw-r--r-- 2 hadoop hadoop 193893785 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_1.csv
-rw-r--r-- 2 hadoop hadoop 199126288 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_10.csv
-rw-r--r-- 2 hadoop hadoop 182225240 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_2.csv
-rw-r--r-- 2 hadoop hadoop 197399305 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_3.csv
-rw-r--r-- 2 hadoop hadoop 191321415 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_4.csv
-rw-r--r-- 2 hadoop hadoop 194141438 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_5.csv
-rw-r--r-- 2 hadoop hadoop 195477306 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_6.csv
-rw-r--r-- 2 hadoop hadoop 201148079 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_7.csv
-rw-r--r-- 2 hadoop hadoop 219060870 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_8.csv
-rw-r--r-- 2 hadoop hadoop 172127584 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_9.csv
The header and one data row look like this:
hdfs dfs -cat airlines/2008/On_Time_On_Time_Performance_2008_4.csv | head -n 2
"Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac","Dest","DestCityName","DestState","DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups","DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15","ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime","Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay",
2008,2,4,3,4,2008-04-03,"WN",19393,"WN","N601WN","3599","MAF","Midland/Odessa, TX","TX","48","Texas",74,"DAL","Dallas, TX","TX","48","Texas",74,"1115","1112",-3.00,0.00,0.00,-1,"1100-1159",10.00,"1122","1218",6.00,"1220","1224",4.00,4.00,0.00,0,"1200-1259",0.00,"",0.00,65.00,72.00,56.00,1.00,319.00,2,,,,,,
The question is: how do I "convince" hive / spark to read these correctly? The approach is:
- Because of the partitioning, hive will automatically pick up the trailing year partition column
- The first column YearIn will be a placeholder: its value will be read in, but my application code will ignore it in favor of the year partition column (see the query sketch after this list)
- All other fields are handled without any special consideration
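As a rough sketch (the query itself is hypothetical; the table and column names come from the DDL below and the CSV header), downstream queries would filter and group on the partition column year rather than on YearIn:
-- hypothetical query against the intended table; year is the partition column
select UniqueCarrier, count(*) as flights
from air
where year = 2008   -- prunes the scan to air/year=2008/
group by UniqueCarrier;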
Here is my attempt:
create external table air (
YearIn string,Quarter string,Month string,
.. _long list of columns_ ..)
partitioned by (year int)
row format delimited fields terminated by ',' location '/user/hadoop/air/';
The result is:
- The table is created and is accessible from both hive and spark
- But the table is empty - as reported by both hive and spark
What is wrong with this process?
The table definition looks fine, except for the headers. If the headers are not skipped, the header rows come back in the dataset, and for columns that are not strings the header values are selected as NULLs. To have the headers skipped, add tblproperties("skip.header.line.count"="1") at the end of the table DDL. This property is supported in Hive only; also read this workaround: https://whosebug.com/a/54542483/2700344
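A minimal sketch of the amended DDL, keeping the question's abbreviated column list as a placeholder:
create external table air (
YearIn string, Quarter string, Month string,
.. _long list of columns_ ..)
partitioned by (year int)
row format delimited fields terminated by ','
location '/user/hadoop/air/'
tblproperties("skip.header.line.count"="1");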
In addition to creating the table, you also need to create the partitions.
Use the MSCK [REPAIR] TABLE Air; command.
The equivalent command on the Amazon Elastic MapReduce (EMR) version of Hive is ALTER TABLE Air RECOVER PARTITIONS.
This adds the Hive partition metadata. See the manual here: RECOVER PARTITIONS
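As a sketch, once the table exists the partition metadata can be registered and then verified like this (the SHOW PARTITIONS and count query are just illustrative checks):
-- register the existing year=... directories as partitions
MSCK REPAIR TABLE air;
-- on the EMR flavor of Hive, the equivalent is:
-- ALTER TABLE air RECOVER PARTITIONS;
-- verify that partitions and data are now visible
SHOW PARTITIONS air;
select year, count(*) from air group by year;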