How to get pandas dataframe by chunks from csv files in huge tar.gz without unzipping and iterating over them?

I have a huge compressed file, and I'd like to read the individual dataframes inside it in pieces so that I don't run out of memory.

Also, because of time and space constraints, I can't decompress the .tar.gz.

This is the code I have so far:

import pandas as pd
# tarfile lets us navigate a compressed archive
# without extracting its contents to disk
import tarfile
import io

tar_file = tarfile.open(r'\path\to\the\tar\file.tar.gz')

# Iterate over the CSVs contained in the compressed file
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
                        header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    ...  # but dataframe is the whole file, which is too big

I tried How to create Panda Dataframe from csv that is compressed in tar.gz? but still couldn't solve it...
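For reference, a small .tar.gz can be generated in a temp directory to reproduce the behaviour of the code above (file names and contents here are made up for illustration):

```python
import io
import os
import tarfile
import tempfile

import pandas as pd

# Build a tiny .tar.gz fixture with two CSV members.
tmpdir = tempfile.mkdtemp()
archive_path = os.path.join(tmpdir, "sample.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    for name, rows in [("a.csv", "1,2\n3,4\n"), ("b.csv", "5,6\n")]:
        data = rows.encode("ascii")
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

def generate_individual_df(tar_file):
    # Yield (member name, full DataFrame) for every regular file.
    return (
        (member.name,
         pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode("ascii")),
                     header=None))
        for member in tar_file
        if member.isreg()
    )

with tarfile.open(archive_path, "r:gz") as tar:
    frames = dict(generate_individual_df(tar))

# frames["a.csv"] now holds the entire member in memory at once,
# which is exactly the problem with a huge archive.
```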

You can use the glob module to grab certain files from a folder by pattern. For example, if I want cv2 to read the images in a folder:

import glob
import cv2

# Collect every file with the given extension under filepath
files = glob.glob(filepath + "/*.extension")
for image_path in files:
    image = cv2.imread(image_path)

Hope it works.
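Note that glob only matches paths on disk, so it can't see inside a .tar.gz. A minimal sketch of the same wildcard idea applied to archive member names, using fnmatch instead (the archive and member names here are invented for illustration):

```python
import fnmatch
import io
import os
import tarfile
import tempfile

# Build a small archive with a mix of CSV and non-CSV members.
tmpdir = tempfile.mkdtemp()
archive_path = os.path.join(tmpdir, "mixed.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    for name in ["a.csv", "b.csv", "readme.txt"]:
        data = b"x"
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Select only the members whose names match the glob-style pattern.
with tarfile.open(archive_path, "r:gz") as tar:
    csv_members = [m.name for m in tar.getmembers()
                   if m.isreg() and fnmatch.fnmatch(m.name, "*.csv")]

print(csv_members)  # ['a.csv', 'b.csv']
```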

You can actually iterate over the compressed files in chunks with the following function:

def generate_individual_df(tar_file, chunksize=10**4):
    # extractfile() returns a file-like object that pd.read_csv can consume
    # directly, so the member never has to be fully decoded into memory;
    # chunksize then yields DataFrames of at most `chunksize` rows each.
    return (
        (member.name, chunk)
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(tar_file.extractfile(member),
                                 header=None, chunksize=chunksize)
    )