Hive results not being saved into S3 bucket

Question

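I am unable to save my Hive output to S3. I have tried SSHing into the master node and running my command in Hive, but it doesn't save the output. I have also tried running the command in Hue from the EMR console in AWS, and it still doesn't save to S3. I also added the script as a step, and it still doesn't save. The only way I am able to get my results is to run the query in Hue, click to view the results, download them that way, and then push them to S3 myself. I am not sure why this is happening. This is the query I am running: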

with temp as (
select /*+ streamtable(l) */ a.id, a.name, a.page
from my_table a
join my_other_table l on (a.id = l.id)
group by a.page, a.id, a.name)
insert overwrite directory 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7/'
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;

Note that I need the solution to work when the script is added as a step, because I plan to add x number of steps sequentially.

Answer 1

The normal way I have seen Amazon EMR write output to Amazon S3 is to CREATE EXTERNAL TABLE with a LOCATION in Amazon S3.

For example:

CREATE EXTERNAL TABLE IF NOT EXISTS output_table
(gram string, year int, ratio double, increase double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/directory';

Then, simply INSERT the data into the table:

INSERT OVERWRITE TABLE output_table
SELECT gram FROM table...
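
As a rough sketch of how this approach might look for the query in the question: the table name page_counts, the column names and types, and the S3 location below are all assumptions made for illustration, not something the question specifies. The aggregation is also written as a plain GROUP BY rather than the original windowed form.

```sql
-- Hypothetical output table; its name, columns, and location are assumed.
CREATE EXTERNAL TABLE IF NOT EXISTS page_counts
(page string, id_count bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket/folder/page_counts';

-- Write one row per page with its number of distinct ids.
-- Hive stores the resulting files under the LOCATION above.
INSERT OVERWRITE TABLE page_counts
SELECT a.page, count(DISTINCT a.id)
FROM my_table a
JOIN my_other_table l ON (a.id = l.id)
GROUP BY a.page;
```

Because both statements are ordinary HiveQL, they could go in the same script that is submitted as a step.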

Answer 2

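I found the solution.

The problem was the trailing slash on the S3 location. The base path of the directory you want to overwrite should not end with a trailing slash.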
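
Applied to the query from the question, that would look roughly like this. The only change is that the final / is dropped from the directory path; the bucket and folder names are the placeholders used in the question.

```sql
-- Same query as in the question; only the trailing slash on the S3 path is removed.
with temp as (
select /*+ streamtable(l) */ a.id, a.name, a.page
from my_table a
join my_other_table l on (a.id = l.id)
group by a.page, a.id, a.name)
insert overwrite directory 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7'
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;
```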

