Renaming S3 object at the same location for removal of Glue bookmark

Question

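I have a specific use case where I want to upload an object to S3 under a specific prefix. A file already exists at that prefix, and I want to replace it with the new one. I am using boto3 to do this. Bucket versioning is turned off, so I expected the file to be overwritten, but I get the following error.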

{
  "errorMessage": "An error occurred (InvalidRequest) when calling the CopyObject operation: This copy request is illegal because it is trying to copy an object to itself without changing the object's metadata, storage class, website redirect location or encryption attributes.",
  "errorType": "ClientError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 25, in lambda_handler\n    s3.Object(bucket,product_key).copy_from(CopySource=bucket + '/' + product_key)\n",
    "  File \"/var/runtime/boto3/resources/factory.py\", line 520, in do_action\n    response = action(self, *args, **kwargs)\n",
    "  File \"/var/runtime/boto3/resources/action.py\", line 83, in __call__\n    response = getattr(parent.meta.client, operation_name)(*args, **params)\n",
    "  File \"/var/runtime/botocore/client.py\", line 386, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n",
    "  File \"/var/runtime/botocore/client.py\", line 705, in _make_api_call\n    raise error_class(parsed_response, operation_name)\n"
  ]
}

This is what I have tried so far.

import boto3
import os
import tempfile


print('Loading function')
s3 = boto3.resource('s3')
glue = boto3.client('glue')

bucket='my-bucket'
bucket_prefix='my-prefix'

def lambda_handler(_event, _context):
    
    my_bucket = s3.Bucket(bucket)
    # Code to find the object name. There is going to be only one file. 
    for object_summary in my_bucket.objects.filter(Prefix=bucket_prefix):
        product_key= object_summary.key
        print(product_key)
    
    #Using product_key variable I am trying to copy the same file name to the same location, which is when I get an error.
    s3.Object(bucket,product_key).copy_from(CopySource=bucket + '/' + product_key)
    # Maybe the following line is not required
    s3.Object(bucket,bucket_prefix).delete()

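I have a very specific reason for copying the same file to the same location. AWS Glue does not pick up the same file again once it has bookmarked it. By copying the file again, I am hoping that the Glue bookmark will be dropped and the Glue job will treat it as a new file.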

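I am not too tied to the name. If you can help me modify the code above to generate a new file at the same prefix level, that would work as well. There always has to be one file here, though. Consider this file a static list of products that has been brought to S3 from a relational database.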

Thanks

Answer 1

From Tracking Processed Data Using Job Bookmarks - AWS Glue:

For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.

So it looks like your theory could work!
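One way to check whether an object would look "modified" is to compare its last-modified timestamp with the time of the previous job run. The following is only a minimal sketch for inspecting that timestamp, not Glue's actual bookmark logic. The helper name `modified_since` and the stand-in object are hypothetical; with boto3, `s3.Object(bucket, key)` exposes a timezone-aware `last_modified` attribute that could be passed in instead.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def modified_since(s3_object, since):
    """Return True if the object's last-modified time is later than `since`.

    `s3_object` is anything with a timezone-aware `last_modified` attribute,
    such as a boto3 `s3.Object` resource (hypothetical helper, for illustration).
    """
    return s3_object.last_modified > since


# Stand-in for s3.Object(bucket, key) so the sketch runs without AWS access.
fake_object = SimpleNamespace(
    last_modified=datetime(2020, 1, 2, tzinfo=timezone.utc)
)
last_run = datetime(2020, 1, 1, tzinfo=timezone.utc)  # assumed time of the last job run

print(modified_since(fake_object, last_run))  # True
```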

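However, as the error message states, copying an S3 object to itself is not allowed "without changing the object's metadata, storage class, website redirect location or encryption attributes".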

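Therefore, you can add some metadata as part of the copy and it will succeed. Note that the new metadata is only applied when MetadataDirective='REPLACE' is passed; with the default COPY directive, metadata supplied in the request is ignored. For example:

    s3.Object(bucket,product_key).copy_from(CopySource=bucket + '/' + product_key, Metadata={'foo': 'bar'}, MetadataDirective='REPLACE')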
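To make the self-copy repeatable, the metadata value can change on every run, for example a timestamp. Below is a minimal sketch under a few assumptions: the helper names `build_self_copy_kwargs` and `touch_object` and the metadata key `touched-at` are made up for illustration, and `s3_resource` is expected to be a `boto3.resource('s3')` object (boto3 itself is not imported here, so the helpers can be exercised without AWS access).

```python
from datetime import datetime, timezone


def build_self_copy_kwargs(bucket, key, now=None):
    """Build keyword arguments for Object.copy_from that copy an object onto itself.

    The metadata value changes on every call, and MetadataDirective='REPLACE'
    tells S3 to apply it, which is what makes the self-copy legal.
    """
    now = now or datetime.now(timezone.utc)
    return {
        # boto3 accepts CopySource as a dict or as a "bucket/key" string.
        "CopySource": {"Bucket": bucket, "Key": key},
        # Hypothetical metadata key; any changing value would do.
        "Metadata": {"touched-at": now.isoformat()},
        "MetadataDirective": "REPLACE",
    }


def touch_object(s3_resource, bucket, key):
    """Copy the object onto itself so that its last-modified time is refreshed."""
    kwargs = build_self_copy_kwargs(bucket, key)
    return s3_resource.Object(bucket, key).copy_from(**kwargs)
```

In the Lambda from the question, this could be called as `touch_object(s3, bucket, product_key)`. Because REPLACE substitutes the metadata sent in the request for the existing metadata, any user-defined metadata or content type that matters would likely need to be passed again.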

amazon-s3

amazon-web-services

boto3

aws-glue