Impala + parquet file
I created parquet files and then tried to import them into an Impala table.
I created the table as follows:
CREATE EXTERNAL TABLE `user_daily` (
`user_id` BIGINT COMMENT 'User ID',
`master_id` BIGINT,
`walletAgency` BOOLEAN,
`zone_id` BIGINT COMMENT 'Zone ID',
`day` STRING COMMENT 'The stats are aggregated for single days',
`clicks` BIGINT COMMENT 'The number of clicks',
`impressions` BIGINT COMMENT 'The number of impressions',
`avg_position` BIGINT COMMENT 'The average position * 100',
`money` BIGINT COMMENT 'The cost of the clicks, in hellers',
`web_id` BIGINT COMMENT 'Web ID',
`discarded_clicks` BIGINT COMMENT 'Number of discarded clicks from column "clicks"',
`impression_money` BIGINT COMMENT 'The cost of the impressions, in hellers'
)
PARTITIONED BY (
year BIGINT,
month BIGINT
)
STORED AS PARQUET
LOCATION '/warehouse/impala/contextstat.db/user_daily/';
Then I copied in files with this schema:
parquet-tools schema user_daily/year\=2016/month\=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet
message spark_schema {
optional int32 user_id;
optional int32 web_id (INT_16);
optional int32 zone_id;
required int32 master_id;
required boolean walletagency;
optional int64 impressions;
optional int64 clicks;
optional int64 money;
optional int64 avg_position;
optional double impression_money;
required binary day (UTF8);
}
Then, when I try to view the entries with
SELECT * FROM user_daily;
I get
File 'hdfs://.../warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00000-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet'
has an incompatible Parquet schema for column 'contextstat.user_daily.user_id'.
Column type: BIGINT, Parquet schema:
optional int32 user_id [i:0 d:1 r:0]
Do you know how to solve this? I thought BIGINT was the same as int_32. Should I change the schema of the table, or of the generated parquet files?
BIGINT is int64, which is why it complains. But you don't have to figure out the exact types to use yourself; Impala can do that for you. Just use the CREATE TABLE LIKE PARQUET variant:
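A minimal sketch of that variant, using the file path from the question (the partition columns still have to be declared explicitly, since they are not stored inside the parquet file itself):

```sql
-- Let Impala derive the column names and types from an existing
-- Parquet file already in HDFS, instead of writing them by hand.
CREATE EXTERNAL TABLE user_daily
LIKE PARQUET '/warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet'
PARTITIONED BY (year BIGINT, month BIGINT)
STORED AS PARQUET
LOCATION '/warehouse/impala/contextstat.db/user_daily/';
```

With this form, int32 columns in the file become INT in the table definition, so the type mismatch from the error message cannot occur.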
The variation CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file' lets you skip the column definitions of the CREATE TABLE statement. The column names and data types are automatically configured based on the organization of the specified Parquet data file, which must already reside in HDFS.
I used CAST(... AS BIGINT), which changed the parquet schema from int32 to int64. Then I had to reorder the columns, because Impala does not match them by name. After that it worked.
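The fix above can be sketched as a Spark SQL select applied before writing the parquet files. The source view name `stats` is an assumption for illustration; the casts and the column order mirror the Impala table definition from the question, since Impala resolves parquet columns by position, not by name:

```sql
-- Cast the int32 columns to BIGINT (int64) and emit all columns in the
-- same order as the Impala table definition. `stats` is a hypothetical
-- source view standing in for the original Spark dataset.
SELECT
  CAST(user_id          AS BIGINT) AS user_id,
  CAST(master_id        AS BIGINT) AS master_id,
  walletagency,
  CAST(zone_id          AS BIGINT) AS zone_id,
  day,
  clicks,
  impressions,
  avg_position,
  money,
  CAST(web_id           AS BIGINT) AS web_id,
  discarded_clicks,
  CAST(impression_money AS BIGINT) AS impression_money
FROM stats;
```

Writing the result of this query as parquet produces int64 columns that match the BIGINT declarations in the table.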