Export the data from a Hive/Impala table with a few conditions into a file

Question

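What is an efficient way to export data from a Hive/Impala table into a file (the data will be large, close to 10 GB)? The Hive table is stored as Parquet with Snappy compression, and the output file should be CSV.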

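The table is partitioned by day, and the data needs to be extracted every day. I would like to know whether either of the following

1) Impala approach

impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'

2) Hive approach

insert overwrite directory '/path' select * from table where year_month_date=$$$$$$$$

would be efficient.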
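For a daily job, the partition value has to be passed into the query somehow. The following is a minimal sketch of one way to wrap the Impala command above, not a tested production script. The function names `daily_query` and `impala_export` are hypothetical, the `YYYYMMDD` partition format is an assumption, and a comma is used as the delimiter only because the target file is CSV. The query is built inside double quotes so that the shell substitutes the day value.

```shell
#!/usr/bin/env bash
# Sketch only: table name, server, and partition format are placeholders.

# Hypothetical helper: print the query for one daily partition.
daily_query() {
  local day="$1"   # partition value, e.g. 20240131 (format is an assumption)
  printf 'select * from table where year_month_date=%s' "$day"
}

# Hypothetical wrapper around the impala-shell call from the question.
# -B gives plain delimited output, -o names the local output file.
impala_export() {
  local day="$1" out_file="$2"
  impala-shell -k -i servername:portname -B \
    -q "$(daily_query "$day")" \
    -o "$out_file" '--output_delimiter=,'
}
```

It could be called as `impala_export 20240131 export_20240131.csv`. Note that values containing commas are not quoted by this approach.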

Answer 1

Assume tbl is your Hive Parquet table and condition is your filter condition.

CTAS command:

 CREATE TABLE tbl_text ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/data' AS select * from tbl where condition;

You will find your CSV text files (delimited by ',') in /tmp/data on HDFS.

If needed, you can copy the files to the local file system with:

hadoop fs -get /tmp/data
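Since the extract runs every day, the two commands above can be wrapped in a small script. This is a sketch under assumptions: the function names `build_ctas_sql` and `export_day` are hypothetical, the filter uses the `year_month_date` partition column from the question, and the text table is dropped first so that the CTAS can be re-run each day. `hadoop fs -getmerge` is used in place of `-get` to concatenate the part files into a single local file.

```shell
#!/usr/bin/env bash
# Sketch only: table names, partition column, and paths are placeholders.

# Print the HiveQL that rewrites one day's partition as comma-delimited text.
build_ctas_sql() {
  local day="$1"        # partition value, e.g. 20240131 (format is an assumption)
  local hdfs_dir="$2"   # HDFS directory that will hold the text files
  cat <<EOF
DROP TABLE IF EXISTS tbl_text;
CREATE TABLE tbl_text
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hdfs_dir}'
AS SELECT * FROM tbl WHERE year_month_date=${day};
EOF
}

# Run the CTAS, then merge the resulting part files into one local CSV.
export_day() {
  local day="$1" hdfs_dir="$2" local_csv="$3"
  hive -e "$(build_ctas_sql "$day" "$hdfs_dir")" &&
    hadoop fs -getmerge "$hdfs_dir" "$local_csv"
}
```

A call such as `export_day 20240131 /tmp/data ./export_20240131.csv` would then produce one local file per day.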

Answer 2

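Try using dynamic partitioning for your Hive/Impala table so that data can be exported conditionally and efficiently.

Partition your table on the columns you filter by, chosen according to your queries, for best results.

Step 1: Create a temporary Hive table TmpTable and load the raw data into it

Step 2: Set the Hive parameters that enable dynamic partitioning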

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

Step 3: Create the main Hive table with a partition column, for example:

CREATE TABLE employee (
 emp_id int,
 emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;

Step 4: Load the data from the temporary table into the employee table (the main table)

insert overwrite table employee  partition(location)  
select emp_id,emp_name, location from TmpTable;

Step 5: Export the data from Hive with a condition

INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee  WHERE location='CALIFORNIA';
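By default, INSERT OVERWRITE DIRECTORY writes text files whose columns are separated by Ctrl-A (\001) rather than commas, so a conversion step may be needed to get CSV. The following is a rough sketch of one way to do that while copying the files locally. The function names `ctrl_a_to_csv` and `fetch_as_csv` are hypothetical, and the conversion assumes that no field value itself contains a comma or a Ctrl-A character.

```shell
#!/usr/bin/env bash
# Translate Hive's default Ctrl-A (\001) field separator to commas.
# Caveat: values that themselves contain commas are not quoted or escaped.
ctrl_a_to_csv() {
  tr '\001' ','
}

# Hypothetical wrapper: stream every file in the HDFS output directory
# through the converter into one local CSV file.
fetch_as_csv() {
  local hdfs_dir="$1" local_csv="$2"
  # The glob is quoted so that hadoop, not the local shell, expands it.
  hadoop fs -cat "${hdfs_dir}/*" | ctrl_a_to_csv > "$local_csv"
}
```

For the example above this could be run as `fetch_as_csv /path/to/output/dir ./california.csv`.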

Please refer to this link:

Hope this helps.
