AWS Crawler S3 Target Path Changes But Old Path Tables Included

Question

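I have an AWS Glue crawler, and I am changing its S3 target path in order to switch the underlying source of a table. The problem is that tables end up being created from both targets:

Configuration: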

aws glue get-crawler --name sand-main 
{
    "Crawler": {
        "Name": "sand-main",
        "Role": "Crawler-sand",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://sand-main-green/main",
                    "Exclusions": [
                        "checkpoints/**",
                        "IsActive.txt",
                        "isactive.txt"
                    ]
                }
            ],
            "JdbcTargets": [],
            "MongoDBTargets": [],
            "DynamoDBTargets": [],
            "CatalogTargets": []
        },
        "DatabaseName": "sand_main",
        "Description": "",
        "Classifiers": [],
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_EVERYTHING"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DELETE_FROM_DATABASE"
        },
        "LineageConfiguration": {
            "CrawlerLineageSettings": "DISABLE"
        },
        "State": "READY",
        "CrawlElapsedTime": 0,
        "CreationTime": "2020-09-30T14:07:25-06:00",
        "LastUpdated": "2021-01-28T11:32:15-07:00",
        "LastCrawl": {
            "Status": "SUCCEEDED",
            "LogGroup": "/aws-glue/crawlers",
            "LogStream": "sand-main",
            "MessagePrefix": "5bb1907d-2847-46ef-8712-3a50deb2b7a0",
            "StartTime": "2021-01-28T11:32:35-07:00"
        },
        "Version": 24,
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
    }
}

I have a Lambda that switches the path from "Path": "s3://sand-main-green/main" to "Path": "s3://sand-main-blue/main".

But I end up with these tables:
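For reference, a minimal sketch of what such a Lambda might look like with boto3. The question does not show the actual Lambda code, so the function name switch_s3_target and the event keys (crawler, old_path, new_path) are hypothetical. The Glue calls used are get_crawler and update_crawler; the crawler is assumed to be in the READY state when the update is made.

```python
def switch_s3_target(glue, crawler_name, old_path, new_path):
    """Point the crawler's S3 target at new_path, keeping its Exclusions.

    `glue` is a boto3 Glue client (or any object with the same methods).
    """
    crawler = glue.get_crawler(Name=crawler_name)["Crawler"]

    # Keep only the target types that are actually in use (non-empty lists).
    targets = {kind: items for kind, items in crawler["Targets"].items() if items}

    updated = []
    for target in targets.get("S3Targets", []):
        target = dict(target)  # copy so the response is not mutated
        if target.get("Path") == old_path:
            target["Path"] = new_path
        updated.append(target)
    targets["S3Targets"] = updated

    # update_crawler fails if the crawler is currently running.
    glue.update_crawler(Name=crawler_name, Targets=targets)
    return updated


def lambda_handler(event, context):
    # Hypothetical event shape:
    # {"crawler": "sand-main",
    #  "old_path": "s3://sand-main-green/main",
    #  "new_path": "s3://sand-main-blue/main"}
    import boto3  # imported here so the helper above can be exercised without AWS

    glue = boto3.client("glue")
    return switch_s3_target(
        glue, event["crawler"], event["old_path"], event["new_path"]
    )
```

Note that this only changes the crawler's target. It does not touch tables that already exist in the Data Catalog.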

Name -> Location
test -> s3://sand-main-blue/main/test

test_2398l50df -> s3://sand-main-green/main/test

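I have DeleteBehavior set to DELETE_FROM_DATABASE, so I expected the table for the old S3 path to be deleted. It feels like the crawler keeps a history of its S3 targets. I don't want this behavior.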

Answer 1

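Normally the crawler names a table after the last part of the file path ("test" in your example). If a table with that name already exists in the database, it creates a new table with random characters appended as a suffix (test_2398l50df in your example).

If you want the table "test" to point to the new path, perform the steps in this order:

1. Run the crawler with location s3://sand-main-blue/main/test (this creates the "test" table).
2. Delete the "test" table in the database.
3. Update the crawler with the new path (s3://sand-main-green/main/test).
4. Run the crawler (this creates the "test" table with the new path).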
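The delete, update, and run steps can be sketched with boto3 roughly as follows. This is an illustration rather than a tested recipe: the helper names tables_under_path and repoint_crawler are hypothetical, the crawler is assumed to have only S3 targets, and instead of deleting a single named table the sketch deletes every catalog table whose Location lies under the old path.

```python
def tables_under_path(glue, database, path_prefix):
    """Return the names of catalog tables whose Location is under path_prefix."""
    prefix = path_prefix.rstrip("/") + "/"
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page.get("TableList", []):
            location = table.get("StorageDescriptor", {}).get("Location", "")
            # Compare with a trailing slash so ".../main2" does not match ".../main".
            if (location.rstrip("/") + "/").startswith(prefix):
                names.append(table["Name"])
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


def repoint_crawler(glue, crawler_name, database, old_path, new_path):
    """Delete tables built from old_path, retarget the crawler, then run it."""
    # Delete the existing tables first so their names are free again.
    stale = tables_under_path(glue, database, old_path)
    for name in stale:
        glue.delete_table(DatabaseName=database, Name=name)

    # Update the crawler with the new path, keeping any Exclusions.
    crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
    s3_targets = [
        {**t, "Path": new_path} if t.get("Path") == old_path else dict(t)
        for t in crawler["Targets"].get("S3Targets", [])
    ]
    glue.update_crawler(Name=crawler_name, Targets={"S3Targets": s3_targets})

    # Run the crawler so the tables are recreated from the new path.
    glue.start_crawler(Name=crawler_name)
    return stale
```

A call might look like repoint_crawler(boto3.client("glue"), "sand-main", "sand_main", old_path, new_path), with the database and crawler names taken from the configuration in the question.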

aws-glue

aws-glue-data-catalog