Java File Upload to S3 - should multipart speed it up?

Question

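We are using Java 8 and the AWS SDK to upload files to AWS S3 programmatically. For large files (>100 MB), we read that multipart upload is the preferred method. We tried it, but it does not seem to speed anything up: the upload time is almost the same as without multipart upload. Worse, we even ran into out-of-memory errors saying there was not enough heap space.

Questions:

Does multipart upload really speed up the upload? If not, why use it?
Why does multipart upload consume memory faster than a regular upload? Does it upload all the parts at the same time?

The code we use is below: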

private static void uploadFileToS3UsingBase64(String bucketName, String region, String accessKey, String secretKey,
        String fileBase64String, String s3ObjectKeyName) {
    
    byte[] bI = org.apache.commons.codec.binary.Base64.decodeBase64((fileBase64String.substring(fileBase64String.indexOf(",")+1)).getBytes());
    InputStream fis = new ByteArrayInputStream(bI);
    
    long start = System.currentTimeMillis();
    AmazonS3 s3Client = null;
    TransferManager tm = null;

    try {

        s3Client = AmazonS3ClientBuilder.standard().withRegion(region)
                .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKey, secretKey)))
                .build();
        
        tm = TransferManagerBuilder.standard()
                  .withS3Client(s3Client)
                  .withMultipartUploadThreshold((long) (50* 1024 * 1025))
                  .build();

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setHeader(Headers.STORAGE_CLASS, StorageClass.Standard);
        PutObjectRequest putObjectRequest = new PutObjectRequest(bucketName, s3ObjectKeyName,
                fis, metadata).withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams());
        
        Upload upload = tm.upload(putObjectRequest);

        // Optionally, wait for the upload to finish before continuing.
        upload.waitForCompletion();

        long end = System.currentTimeMillis();
        long duration = (end - start)/1000;
        
        // Log status
        System.out.println("Successful upload in S3 multipart. Duration = " + duration);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (s3Client != null)
            s3Client.shutdown();
        if (tm != null)
            tm.shutdownNow();
    }

}

Answer 1

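Multipart only speeds up the upload if multiple parts are uploaded at the same time.

In your code, you set withMultipartUploadThreshold. If the upload is larger than that threshold, you should see the parts being uploaded concurrently; if it is not, only a single upload connection should be used. You say your files are >100 MB, and your code sets the multipart upload threshold to 50 * 1024 * 1025 = 52,480,000 bytes, so the parts of such a file should be uploaded concurrently.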
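To double-check the arithmetic, here is a small sketch. The class and method names are made up for illustration, and it assumes the threshold check is a strict "larger than" comparison, as described above. It also shows that the 1025 in the question (possibly a typo for 1024) barely changes the value, and that the multiplication is done in int arithmetic before the cast to long, which is harmless here only because the result fits in an int.

```java
final class ThresholdArithmetic {

    static final long MIB = 1024L * 1024L;

    // The expression exactly as written in the question: the product is
    // computed as an int and only the final result is widened to long.
    static long thresholdAsWritten() {
        return (long) (50 * 1024 * 1025);
    }

    // 50 MiB computed in long arithmetic (assumed to be the intended value).
    static long fiftyMiB() {
        return 50L * MIB;
    }

    // Assumption: an upload goes multipart when its size is strictly
    // larger than the configured threshold.
    static boolean exceedsThreshold(long objectSizeBytes, long thresholdBytes) {
        return objectSizeBytes > thresholdBytes;
    }

    public static void main(String[] args) {
        System.out.println("as written: " + thresholdAsWritten());
        System.out.println("50 MiB:     " + fiftyMiB());
        System.out.println("100 MiB file exceeds threshold: "
                + exceedsThreshold(100L * MIB, thresholdAsWritten()));
    }
}
```

Writing the constant as 50L * 1024 * 1024 avoids a silent int overflow if the threshold is ever raised to 2 GiB or more.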

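However, if your upload throughput is limited by your network speed anyway, there will be no increase in throughput. That may be why you are not seeing any speedup.

There are other reasons to use multipart: it is also recommended for fault tolerance, and its maximum object size is larger than that of a single upload.

For details, see the documentation: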

Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.

Using multipart upload provides the following advantages:

Improved throughput - You can upload parts in parallel to improve throughput.

Quick recovery from any network issues - Smaller part size minimizes the impact of restarting a failed upload due to a network error.

Pause and resume object uploads - You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.

Begin an upload before you know the final object size - You can upload an object as you are creating it.

We recommend that you use multipart upload in the following ways:

If you're uploading large objects over a stable high-bandwidth network, use multipart upload to maximize the use of your available bandwidth by uploading object parts in parallel for multi-threaded performance.

If you're uploading over a spotty network, use multipart upload to increase resiliency to network errors by avoiding upload restarts. When using multipart upload, you need to retry uploading only parts that are interrupted during the upload. You don't need to restart uploading your object from the beginning.
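The idea described in the quoted documentation can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the SDK's implementation: MultipartSketch, uploadPart and the simulated failure of part 2 are all hypothetical, and the "remote store" is just a map. It shows parts being sent independently on a bounded thread pool, a single failed part being retried without touching the others, and the object being assembled by part number at the end.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class MultipartSketch {

    private static final int MAX_ATTEMPTS = 3;

    // Hypothetical stand-in for the remote store: part number -> part bytes.
    private final Map<Integer, byte[]> storedParts = new ConcurrentHashMap<>();
    private final AtomicInteger attempts = new AtomicInteger();

    int attemptCount() {
        return attempts.get();
    }

    // Simulated part upload. The first attempt for part 2 fails on purpose
    // so that the retry path is exercised.
    private void uploadPart(int partNumber, byte[] data, int attempt) {
        attempts.incrementAndGet();
        if (partNumber == 2 && attempt == 1) {
            throw new IllegalStateException("simulated network error");
        }
        storedParts.put(partNumber, data);
    }

    // Splits the object into parts, uploads them on a pool of at most
    // `threads` workers, then assembles the stored parts in order.
    byte[] upload(byte[] object, int partSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            int partNumber = 1;
            for (int offset = 0; offset < object.length; offset += partSize, partNumber++) {
                final int n = partNumber;
                final byte[] part = Arrays.copyOfRange(
                        object, offset, Math.min(object.length, offset + partSize));
                Callable<Void> task = () -> {
                    for (int attempt = 1; ; attempt++) {
                        try {
                            uploadPart(n, part, attempt);
                            return null;
                        } catch (IllegalStateException e) {
                            // Only this part is retried; other parts are unaffected.
                            if (attempt == MAX_ATTEMPTS) {
                                throw e;
                            }
                        }
                    }
                };
                futures.add(pool.submit(task));
            }
            for (Future<Void> f : futures) {
                f.get(); // wait for every part
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("upload failed", e);
        } finally {
            pool.shutdown();
        }
        // "Complete" step: concatenate the parts by part number.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int n = 1; n <= storedParts.size(); n++) {
            byte[] part = storedParts.get(n);
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }
}
```

In this sketch the number of parts held by in-flight tasks grows with the number of worker threads, which is one reason a bounded pool is worth considering when heap space is tight. The parallelism only helps if the network is not already the bottleneck.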

Answer 2

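eis's answer is good. There are still a few things you should do:

Use String.getBytes(StandardCharsets.US_ASCII) or ISO_8859_1 to avoid a more expensive encoding such as UTF-8. If the platform encoding were UTF-16LE, the data would even be corrupted (0x00 bytes).
The standard Java Base64 class has several decoders/encoders that may work. It can operate on a String directly. But check that it handles your input correctly (line endings).
Use try-with-resources so that resources are closed on exceptions and early returns.
The ByteArrayInputStream is never closed; closing it would be better style (easier garbage collection?).
You can set an ExecutorFactory that is a thread pool factory, to limit the number of threads globally.

So: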

byte[] bI = Base64.getDecoder().decode(
        fileBase64String.substring(fileBase64String.indexOf(',') + 1));
try (InputStream fis = new ByteArrayInputStream(bI)) {
    ...
}
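A slightly fuller, runnable variant of the snippet above, as a sketch only: DataUrlDecoding, decodePayload and countBytes are illustrative names, and the choice of Base64.getMimeDecoder() is an assumption for input that may contain line breaks (the basic decoder from Base64.getDecoder() rejects them with an IllegalArgumentException). The explicit ISO_8859_1 charset keeps the conversion independent of the platform default encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DataUrlDecoding {

    // Decodes everything after the first comma, or the whole string
    // if there is no comma (indexOf returns -1, so substring(0) is used).
    static byte[] decodePayload(String fileBase64String) {
        String payload = fileBase64String.substring(fileBase64String.indexOf(',') + 1);
        // Base64 text is plain ASCII, so a single-byte charset is sufficient.
        byte[] ascii = payload.getBytes(StandardCharsets.ISO_8859_1);
        // The MIME decoder ignores line separators and other characters
        // outside the Base64 alphabet.
        return Base64.getMimeDecoder().decode(ascii);
    }

    // Reads the stream to the end inside try-with-resources, so it is
    // closed on normal completion, early return, and exceptions alike.
    static long countBytes(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            long n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] decoded = decodePayload("data:text/plain;base64,aGVs\r\nbG8=");
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
        System.out.println(countBytes(decoded));
    }
}
```

Note that the Base64 string and the decoded array both live on the heap at the same time, so a large upload passed around as a Base64 string needs considerably more memory than the file itself.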
