Glue S3 Target Path matching two level specific sub folder

Question

bucket/
├── seoul/
│   ├── weather/
│   │   └── data.json
│   └── gdp/
│       └── data.json
├── tokyo/
│   ├── weather/
│   │   └── data.json
│   ├── gdp/
│   │   └── data.json
│   └── transit/
│       └── data.json
├── seattle/
│   ├── weather/
│   │   └── data.json
│   └── cost-of-living/
│       └── data.json
├ ....

I want to crawl all of the weather data in the bucket. As described in the AWS docs, I set the S3 target path to

s3://bucket/*/weather

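But the Glue crawler does not match any data and creates 0 tables. How should I set the Glue target so that it collects all of the weather data?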

Answer 1

Exclude patterns support glob patterns. So in your case, try setting the target to s3://bucket/ and adding exclusions for */gdp/**,*/transit/**,*/cost-of-living/**
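As a rough sketch of the suggestion above, the same target and exclusions could be set up with boto3 instead of the console. The crawler name, IAM role and database name below are placeholders invented for illustration, not values from the question.

```python
def build_crawler_request(bucket, excluded_folders, name, role, database):
    """Build the keyword arguments for boto3's glue create_crawler call."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {
                    # Crawl from the bucket root ...
                    "Path": f"s3://{bucket}/",
                    # ... and skip every second-level folder that is not weather.
                    "Exclusions": [f"*/{folder}/**" for folder in excluded_folders],
                }
            ]
        },
    }


def create_crawler(request):
    """Send the request to Glue (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("glue").create_crawler(**request)


request = build_crawler_request(
    bucket="bucket",
    excluded_folders=["gdp", "transit", "cost-of-living"],
    name="weather-crawler",        # hypothetical crawler name
    role="GlueCrawlerRole",        # hypothetical IAM role
    database="weather_db",         # hypothetical Glue database
)
# create_crawler(request)  # uncomment to actually create the crawler
```

Every new non-weather folder has to be added to excluded_folders by hand, which is the limitation the next answer runs into.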

Answer 2

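If there are not many folders to exclude, @Yuriy Bondaruk's answer works well. In my case, however, there are many folders to exclude, and there is no guarantee that the current file tree will stay fixed.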

So I am going to build nested CloudFormation stacks.

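BASE CloudFormation template: takes a city as input and runs the crawler.
Very long CloudFormation template: takes the city names as parameters and calls the BASE CloudFormation template.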
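One possible way to avoid writing the long parent template by hand is to generate it. The sketch below is only a guess at the shape of such a setup: it emits one AWS::CloudFormation::Stack resource per city, each pointing at the BASE template. The template URL and the City parameter name are assumptions made for illustration; the BASE template itself is not shown in this answer.

```python
import json


def build_parent_template(cities, base_template_url):
    """Return a parent template with one nested stack per city."""
    resources = {}
    for city in cities:
        # CloudFormation logical IDs must be alphanumeric.
        suffix = "".join(ch for ch in city.title() if ch.isalnum())
        resources["Crawler" + suffix] = {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                "TemplateURL": base_template_url,
                # "City" is an assumed parameter name on the BASE template.
                "Parameters": {"City": city},
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


template = build_parent_template(
    cities=["seoul", "tokyo", "seattle"],
    # Hypothetical location of the BASE template.
    base_template_url="https://example.com/base-crawler.yaml",
)
print(json.dumps(template, indent=2))
```

Regenerating the parent template whenever a city is added keeps the crawler targets in step with the bucket layout, without maintaining an exclusion list.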


aws-glue