Download file with DataLakeFileClient and progress bar

I need to download a large file from Azure using DataLakeFileClient and show a tqdm-style progress bar while it downloads. Below is the code I tried with a smaller test file.

from tqdm import tqdm
from azure.storage.filedatalake import DataLakeFileClient

# Download a File
test_file = DataLakeFileClient.from_connection_string(my_conn_str, file_system_name=fs_name, file_path="161263.tmp")

download = test_file.download_file()
blocks = download.chunks()
print(f"File Size = {download.size}, Number of blocks = {len(blocks)}")

with open("./newfile.tmp", "wb") as my_file:
    for block in tqdm(blocks):
        my_file.write(block)

The result in the Jupyter notebook shows that the number of blocks equals the file size in bytes.

How can I get the correct number of blocks so that the progress bar works properly?

Note that when working with chunks, a file is only split if it is larger than 32MB (33554432 bytes). In that case, the remainder (total file size - 32MB) is divided into chunks of 4MB each.

For example, a 39MB file is split into 3 chunks: the first chunk is 32MB, the second is 4MB, and the third is 3MB (39MB - 32MB - 4MB).
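The rule above can be sketched as a small helper. The function name is mine; the 32MB/4MB values come straight from the explanation above (they match the SDK's default `max_single_get_size` and `max_chunk_get_size`):

```python
import math

PRIMARY_CHUNK = 32 * 1024 * 1024  # first chunk: up to 32MB
SUB_CHUNK = 4 * 1024 * 1024       # remaining data: 4MB chunks

def chunk_count(total_size: int) -> int:
    """Number of chunks the downloader yields for a file of total_size bytes."""
    if total_size <= PRIMARY_CHUNK:
        # everything fits in the single primary chunk
        return 1
    # one primary chunk plus however many 4MB chunks cover the rest
    return 1 + math.ceil((total_size - PRIMARY_CHUNK) / SUB_CHUNK)

print(chunk_count(39 * 1024 * 1024))  # 39MB file -> 3 chunks
```

For the 39MB example this gives 1 + ceil(7MB / 4MB) = 3, matching the 32MB + 4MB + 3MB split.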

Here is an example that works well on my side:

from tqdm import tqdm
from azure.storage.filedatalake import DataLakeFileClient
import math

conn_str = "xxxxxxxx"
file_system_name = "xxxx"
file_name = "ccc.txt"

test_file = DataLakeFileClient.from_connection_string(conn_str, file_system_name, file_name)

download = test_file.download_file()

blocks = download.chunks()

# if the file size is larger than 32MB, the first chunk is 32MB
# and the remainder is split into 4MB chunks
if download.size > 33554432:
    number_of_blocks = math.ceil((download.size - 33554432) / 1024 / 1024 / 4) + 1
else:
    number_of_blocks = 1

print(f"File Size = {download.size}, Number of blocks = {number_of_blocks}")

# initialize a tqdm instance sized in bytes, not in blocks
progress_bar = tqdm(total=download.size, unit='iB', unit_scale=True)

# raw string so that "\a" is not interpreted as an escape sequence
with open(r"D:\a11\ccc.txt", "wb") as my_file:
    for block in blocks:
        # advance the bar by the number of bytes in this chunk
        progress_bar.update(len(block))

        my_file.write(block)

progress_bar.close()

print("**completed**")
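As a variant, tqdm's `wrapattr` helper can count bytes as they are written to the output file, which keeps the bar byte-accurate no matter how the download is chunked. A minimal sketch, using an in-memory stand-in iterable in place of `download.chunks()` and `download.size` (the helper name `save_with_progress` is mine):

```python
from io import BytesIO
from tqdm import tqdm

def save_with_progress(chunks, total_size, out_stream):
    """Write an iterable of byte chunks to out_stream with a byte-accurate bar."""
    # wrapattr intercepts every write() call and advances the bar by len(data)
    with tqdm.wrapattr(out_stream, "write", total=total_size,
                       unit='iB', unit_scale=True) as wrapped:
        for chunk in chunks:
            wrapped.write(chunk)

# stand-in chunks of uneven sizes, as the real downloader may produce
fake_chunks = [b"a" * 1000, b"b" * 500, b"c" * 200]
buf = BytesIO()
save_with_progress(fake_chunks, 1700, buf)
```

With a real download you would pass `download.chunks()`, `download.size`, and the opened output file instead; no chunk-count arithmetic is needed at all.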