Crawler not correctly sorting data when a column only occurs in some of the CSVs

Question

I'm trying to set up a data lake. We have dumped a bunch of CSVs into S3 with the following layout (oversimplified for the question):

bucket
|-- report_type1
    |-- file1.csv
    |-- file2.csv
|-- report_type2
    |-- file3.csv
    |-- file4.csv
    |-- file5.csv
|-- report_type3
 etc . . .

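I want to crawl this lake, push some of the data into Redshift, and make the lake queryable from Athena. To do that, I start a crawler through Glue. The crawler runs, creates a bunch of tables (report_type1, report_type2, report_type3, ...) and finishes.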

However, when I query Athena to check that it worked, I find that some columns are not assigned correctly. For example, file1.csv looks like this:

col0, col1, col2
0, 1, 3
2, 4, 9
2, 1, 7

But file2.csv looks like this:

col0, new_col, col1, col2
1, 3, 12, 8
3,  , 10, 2
7,  , 0,

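So when the second data set was processed, an additional column showed up. It is a legitimate column that should be recorded... but because it was not in the first file's data, it was not added there.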

What I end up seeing is that when I run a query like this:

SELECT col1 FROM report_type1

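in Athena, I see that col2 has shifted over, and I now see the values of col2 under the col1 name. My assumption is that this has to do with the extra column coming in. I have tried running the crawler with each of the different "what to do when you find a new column" options: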

* Update the table definition in the data catalog.
* Add new columns only.
* Ignore the change and don't update the table in the data catalog

None of them has fixed the problem. Is there a setting I can use that won't break this way?

Answer 1

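The structure of a file is immutable. Athena is just a query service over the files in S3. "Add new columns only" works when the new columns are appended at the end, not when they are inserted anywhere else.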
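One way to act on this is to rewrite the CSVs before crawling so that every file shares one column order, with any new column appended at the end. The following is a minimal sketch in Python, not something taken from the original post. The name `normalize_csv` and the `CANONICAL` list are hypothetical. It assumes the files have a header row and are small enough to process in memory.

```python
import csv
import io

# Hypothetical canonical order: the original columns first, new columns last.
CANONICAL = ["col0", "col1", "col2", "new_col"]


def normalize_csv(text, canonical_columns):
    """Return CSV text whose columns follow canonical_columns.

    Columns missing from the input are written as empty fields.
    """
    # skipinitialspace drops the blanks after each comma in the sample files.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=canonical_columns,
        restval="",            # value used for columns the file does not have
        extrasaction="raise",  # fail loudly on a column not in the canonical list
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        if None in row:
            raise ValueError("row has more fields than the header")
        writer.writerow({name: (value or "").strip() for name, value in row.items()})
    return out.getvalue()


file2 = "col0, new_col, col1, col2\n1, 3, 12, 8\n3,  , 10, 2\n7,  , 0,\n"
print(normalize_csv(file2, CANONICAL))
```

After this rewrite, file1.csv and file2.csv both carry col0, col1, col2, new_col in that order, so reading the fields by position gives consistent results across files.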

Answer 2

In the crawler, select the checkbox

Update all new and existing partitions with metadata from the table.
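If the crawler is managed through the API rather than the console, the same option can probably be set with the crawler's `Configuration` JSON. The sketch below assumes that the checkbox corresponds to the `InheritFromTable` partition behavior. The helper names and the crawler name passed in are hypothetical, and `apply_to_crawler` needs boto3 and AWS credentials to run.

```python
import json


def build_crawler_configuration():
    """Build the Configuration string that makes partitions inherit table metadata."""
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            # Assumed API equivalent of the console checkbox named above.
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
    }
    return json.dumps(config)


def apply_to_crawler(crawler_name):
    """Apply the configuration to an existing crawler (calls AWS)."""
    import boto3  # third-party; imported here so the builder works without it

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=build_crawler_configuration(),
    )


print(build_crawler_configuration())
```

Calling `apply_to_crawler("my-report-crawler")` with the name of a real crawler would update it. The crawler then has to be run again before the partitions pick up the table's metadata.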


amazon-web-services

aws-glue