Obtain folder size in Azure Data Lake Gen2 using Java

There is some material online about calculating folder size with C#, but I could not find anything for Java.

  1. Is there a simple way to find out the size of a folder in Gen2?
  2. If not, how can it be calculated?

There are a couple of examples online that use C# and PowerShell. Is there any way to do this with Java?

As far as I know, no API in Azure Data Lake Gen2 directly provides the size of a folder.

You can compute it by listing the folder recursively:

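import com.azure.storage.common.StorageSharedKeyCredential;
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;
import com.azure.storage.file.datalake.models.PathItem;

// storageAccountName, secret, endpoint and containerName are supplied by the caller.
DataLakeServiceClient dataLakeServiceClient = new DataLakeServiceClientBuilder()
        .credential(new StorageSharedKeyCredential(storageAccountName, secret))
        .endpoint(endpoint)
        .buildClient();
DataLakeFileSystemClient container = dataLakeServiceClient.getFileSystemClient(containerName);

/**
 * Returns the total size, in bytes, of all files under the given folder.
 *
 * @param folder path of the directory inside the file system
 * @return the sum of the content lengths of every file below the folder
 */
@Beta
public Long getSize(String folder) {
    DataLakeDirectoryClient directoryClient = container.getDirectoryClient(folder);
    if (directoryClient.exists()) {
        // recursive = true, userPrincipleNameReturned = false, default page size, no timeout
        return directoryClient.listPaths(true, false, null, null)
                .stream()
                .filter(x -> !x.isDirectory())           // count files only
                .mapToLong(PathItem::getContentLength)
                .sum();
    }
    throw new RuntimeException("Not a valid folder: " + folder);
}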

This walks the folder recursively and sums the sizes of the files it contains.
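To check the filter-and-sum logic without a storage account, here is a minimal sketch that runs the same stream pipeline over in-memory entries. `FolderSizeSketch` and `Entry` are hypothetical names used only for illustration; `Entry` mimics just the two `PathItem` accessors that `getSize` uses, and nothing here calls Azure.

```java
import java.util.Arrays;
import java.util.List;

class FolderSizeSketch {

    // Hypothetical stand-in for PathItem: only the two accessors used by getSize.
    static final class Entry {
        private final boolean directory;
        private final long contentLength;

        Entry(boolean directory, long contentLength) {
            this.directory = directory;
            this.contentLength = contentLength;
        }

        boolean isDirectory() { return directory; }
        long getContentLength() { return contentLength; }
    }

    // Same pipeline as getSize: skip directory entries, sum the file lengths.
    static long totalSize(List<Entry> entries) {
        return entries.stream()
                .filter(e -> !e.isDirectory())
                .mapToLong(Entry::getContentLength)
                .sum();
    }

    public static void main(String[] args) {
        List<Entry> listing = Arrays.asList(
                new Entry(true, 0),       // a subfolder
                new Entry(false, 1024),   // a file
                new Entry(false, 2048));  // another file
        System.out.println(totalSize(listing)); // prints 3072
    }
}
```

Directory entries are filtered out so that only file lengths contribute to the total.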

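The default page size is 5,000 records. So if there are 12,000 records (folders and files combined), 3 API calls are needed to fetch all the details. From the documentation: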

recursive – Specifies if the call should recursively include all paths.

userPrincipleNameReturned – If "true", the user identity values returned in the x-ms-owner, x-ms-group, and x-ms-acl response headers will be transformed from Azure Active Directory Object IDs to User Principal Names. If "false", the values will be returned as Azure Active Directory Object IDs. The default value is false. Note that group and application Object IDs are not translated because they do not have unique friendly names.

maxResults – Specifies the maximum number of blobs to return per page, including all BlobPrefix elements. If the request does not specify maxResults or specifies a value greater than 5,000, the server will return up to 5,000 items per page. If iterating by page, the page size passed to byPage methods such as PagedIterable.iterableByPage(int) will be preferred over this value.

timeout – An optional timeout value beyond which a RuntimeException will be raised.
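The page arithmetic for the 12,000-record case can be sketched as a ceiling division. `PageCount` and `pagesNeeded` are hypothetical names, not part of the Azure SDK, and the sketch assumes every page except the last is full; if the service returns partially filled pages, the real number of calls could be higher.

```java
class PageCount {

    // Ceiling division: pages needed to return totalItems
    // when each page holds at most pageSize items.
    static long pagesNeeded(long totalItems, long pageSize) {
        if (totalItems < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalItems must be >= 0 and pageSize > 0");
        }
        return (totalItems + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // 12,000 paths at the default page size of 5,000
        System.out.println(pagesNeeded(12_000, 5_000)); // prints 3
    }
}
```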