Excluded folder in Glue crawler throws HIVE_BAD_DATA error in Athena

Question

I'm trying to create an AWS Glue crawler that crawls a specific path pattern. I have the following paths:

bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet

The same pattern repeats every day, so in addition to the above we have:

bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*

I only want to crawl the contents of the **/predictions folder for each day. I have set up a Glue crawler pointing at bucket/inference/ with the following exclude patterns:

**/modelling/**
**/extract/**

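The crawler logs correctly show that bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet are excluded, and the DDL metadata shows that the crawler picked up the correct number of objects and rows.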

However, when I run SELECT * against the table in Athena, I get the following error:

HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1

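I have tried every combination of the exclude patterns above, but the query always seems to pick up the contents of the modelling folder, even though the logs clearly show it being excluded. Am I missing something here?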

Many thanks.

Answer 1

This is a known issue with Athena. From the AWS troubleshooting documentation:

Athena does not recognize exclude patterns that you specify an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)
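
One way to follow that advice is to move everything except the prediction files out of the table location. Below is a minimal sketch, assuming the day-based layout from the question (YYYY/MM/DD/<folder>/<file>). The destination prefix inference_other/, the constant names, and both function names are hypothetical, chosen only for illustration. The S3 client is passed in as a parameter, so the key-mapping logic can be checked without AWS access.

```python
from typing import Optional

# Hypothetical names, chosen for illustration only.
TABLE_PREFIX = "inference/"           # location the crawler / Athena table points at
EXCLUDED_PREFIX = "inference_other/"  # sibling prefix outside the table location
KEEP_FOLDER = "prediction"            # the only folder Athena should read


def relocated_key(key: str) -> Optional[str]:
    """Return the new key for an object that should leave the table
    location, or None if the object should stay where it is."""
    if not key.startswith(TABLE_PREFIX):
        return None
    parts = key[len(TABLE_PREFIX):].split("/")
    # Assumed layout: YYYY/MM/DD/<folder>/<file...>
    if len(parts) < 5 or parts[3] == KEEP_FOLDER:
        return None
    return EXCLUDED_PREFIX + "/".join(parts)


def move_excluded_objects(s3_client, bucket: str) -> int:
    """Copy, then delete, every object that relocated_key() maps to a new
    key. Returns the number of objects moved."""
    moved = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=TABLE_PREFIX):
        for obj in page.get("Contents", []):
            new_key = relocated_key(obj["Key"])
            if new_key is None:
                continue
            s3_client.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
            s3_client.delete_object(Bucket=bucket, Key=obj["Key"])
            moved += 1
    return moved
```

With boto3 installed and credentials configured, this could be called as move_excluded_objects(boto3.client("s3"), "bucket"). Note that copy_object only handles objects up to 5 GB, so larger files would need a multipart copy. After the move, the table location contains only the prediction files, so Athena no longer tries to read metadata.tar.gz as Parquet.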
