How to download a large file from Google Cloud Storage using Java with checksum control

Question

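I want to download large files from Google Cloud Storage using the Java library com.google.cloud.storage provided by Google. I have working code, but I still have one question and one major concern: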

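My major concern is: when is the file content actually downloaded? During (see the code below) storage.get(blobId), blob.reader(), or reader.read(bytes)? This becomes very important when it comes to handling an invalid checksum: what do I need to do to actually trigger fetching the file over the network again?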

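The simpler question is: is there built-in functionality in the Google library to perform an MD5 (or CRC32C) check on the received file? Maybe I don't need to implement it myself.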

Here is my method for downloading large files from Google Cloud Storage:

private static final int MAX_NUMBER_OF_TRIES = 3;
public Path downloadFile(String storageFileName, String bucketName) throws IOException {
    // In my real code, this is a field populated in the constructor.
    Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());

    BlobId blobId = BlobId.of(bucketName, storageFileName);
    Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
    int retryCounter = 1;
    Blob blob;
    boolean checksumOk;
    MessageDigest messageDigest;
    try {
        messageDigest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException ex) {
        throw new RuntimeException(ex);
    }

    do {
        LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
        blob = storage.get(blobId);
        if (null == blob) {
            throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
        }
        if (Files.exists(outputFile)) {
            Files.delete(outputFile);
        }
        try (ReadChannel reader = blob.reader();
             FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
            ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
            int bytesRead = reader.read(bytes);
            while (bytesRead > 0) {
                bytes.flip();
                messageDigest.update(bytes.array(), 0, bytesRead);
                channel.write(bytes);
                bytes.clear();
                bytesRead = reader.read(bytes);
            }
        }
        String checksum = Base64.encodeBase64String(messageDigest.digest());
        checksumOk = checksum.equals(blob.getMd5());
        if (!checksumOk) {
            Files.delete(outputFile);
            messageDigest.reset();
        }
    } while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
    if (!checksumOk) {
        throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
    }
    return outputFile;
}

Answer 1

As the JavaDoc of ReadChannel says:

Implementations of this class may buffer data internally to reduce remote calls.

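So the implementation you get from blob.reader() could cache the whole file, some bytes, or nothing at all and simply fetch byte by byte when you call read(). You will never know, and you shouldn't care.

Since only read() throws an IOException and the other methods you use do not, I would say that only calling read() actually downloads content. You can also see this in the library's sources.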

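By the way, despite the example in the library's JavaDocs, you should check for >= 0, not > 0. A return value of 0 only means that nothing was read, not that the end of the stream was reached. The end of the stream is signaled by a return value of -1.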
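A minimal sketch of a copy loop that uses the >= 0 check. It is written against the standard java.nio channel interfaces, on the assumption that the ReadChannel returned by blob.reader() can be passed in as a ReadableByteChannel; the class and method names (ChannelCopy, copy) are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

final class ChannelCopy {
    /** Copies everything from source to sink and returns the number of bytes written. */
    static long copy(ReadableByteChannel source, WritableByteChannel sink) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(128 * 1024);
        long total = 0;
        // read() returns -1 only at end of stream; 0 just means nothing was read this time.
        while (source.read(buffer) >= 0) {
            buffer.flip();
            // write() may consume only part of the buffer, so drain it completely.
            while (buffer.hasRemaining()) {
                total += sink.write(buffer);
            }
            buffer.clear();
        }
        return total;
    }
}
```

The inner loop is there because a single write() call is not guaranteed to consume the whole buffer.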

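To retry after a failed checksum check, get a new reader from the blob. If anything caches the downloaded data, it is the reader itself. So if you get a new reader from the blob, the file will be downloaded from the remote again.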
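The retry idea could look roughly like the sketch below: every attempt opens a fresh channel, hashes what it reads, and compares the result with the expected Base64-encoded MD5 (the form returned by blob.getMd5()). ChannelOpener, VerifiedDownload, and downloadWithRetry are hypothetical names; with the real library the opener would presumably be something like () -> blob.reader(). Writing the bytes to a file is left out to keep the example short.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class VerifiedDownload {
    /** Hypothetical callback that opens a fresh channel for each attempt. */
    interface ChannelOpener {
        ReadableByteChannel open() throws IOException;
    }

    /** Reads the channel to its end and returns the Base64-encoded MD5 of the content. */
    static String md5Base64(ReadableByteChannel source) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
        while (source.read(buffer) >= 0) {
            buffer.flip();
            digest.update(buffer); // consumes all remaining bytes in the buffer
            buffer.clear();
        }
        return Base64.getEncoder().encodeToString(digest.digest());
    }

    /** Returns the 1-based attempt on which the checksum matched; throws after maxTries. */
    static int downloadWithRetry(ChannelOpener opener, String expectedMd5Base64, int maxTries)
            throws IOException {
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            // A new channel per attempt, so nothing buffered by a previous reader is reused.
            try (ReadableByteChannel source = opener.open()) {
                if (expectedMd5Base64.equals(md5Base64(source))) {
                    return attempt;
                }
            }
        }
        throw new IOException("Checksum mismatch after " + maxTries + " tries");
    }
}
```

Creating the MessageDigest inside each attempt also avoids having to reset a shared digest after a failed try.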

Answer 2

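The google-cloud-java storage library does not validate checksums on its own when reading data, beyond the normal HTTPS/TCP correctness checks. If it compared the MD5 of the received data against the known MD5, it would need to download the entire file before it could return any result from read(), which would be infeasible for very large files.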

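If you need the additional protection of comparing MD5s, then what you are doing is a good idea. If this is a one-off task, you could use the gsutil command-line tool, which performs the same kind of additional check.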
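Since the question also mentions CRC32C, here is a minimal sketch of computing that checksum by hand with the JDK's java.util.zip.CRC32C (Java 9 or later). It assumes the value to compare against, such as the one from blob.getCrc32c(), is the Base64 encoding of the 4-byte big-endian checksum; Crc32cBase64 is a made-up helper name.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.zip.CRC32C;

final class Crc32cBase64 {
    /** Returns the CRC32C of data as Base64 over its 4 big-endian bytes. */
    static String of(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        // ByteBuffer is big-endian by default, so putInt yields the bytes in that order.
        byte[] bigEndian = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
        return Base64.getEncoder().encodeToString(bigEndian);
    }
}
```

For a large file you would call crc.update(...) on each chunk as it is read rather than holding the whole content in memory.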

Tags: java, checksum, google-cloud-storage