What is the fastest way to load data from multiple csv files

Question

I am working with multiple csv files, each of which contains multiple 1D data series. I have about 9000 such files, and the total data size is about 40 GB.

I wrote a data loader like this:

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        
        self.files = files
        my_data = np.genfromtxt('/data/'+files, delimiter=',')
        self.dim = my_data.shape[1]
        self.data = []
        
    def __getitem__(self, i):

        file1 = self.files
        my_data = np.genfromtxt('/data/'+file1, delimiter=',')
        self.dim = my_data.shape[1]

        for j in range(my_data.shape[1]):
            tmp = np.reshape(my_data[:,j],(1,my_data.shape[0]))
            tmp = torch.from_numpy(tmp).float()
            self.data.append(tmp)        
        
        return self.data[i]

    def __len__(self): 
        
        return self.dim

I load the whole dataset through the data loader in a for loop, like this:

for x_train in tqdm(train_files):
    train_dl_spec = data_gen(x_train)
    train_loader = torch.utils.data.DataLoader(
        train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
    for data in train_loader:
        ...  # loop body omitted in the original post

But this is very slow. I was wondering whether I could store all of this data in a single file, but I don't have enough RAM. Is there a way around this?

Please let me know if there is.

Answer 1

I have never used pytorch before, and I admit I don't really know what is going on. Nonetheless, I am almost certain that you are using Dataset incorrectly.

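As I understand it, a Dataset is an abstraction over all of the data, where each index returns one sample. Say each of your 9000 files has 10 rows (samples): index 21 would then refer to the third file and its second row, that is, file 2 and row 1 when counting from 0.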
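The arithmetic behind that example can be sketched in a couple of lines. This assumes, as in the example, that every file has exactly 10 rows; `locate` and `ROWS_PER_FILE` are names made up for illustration.

```python
ROWS_PER_FILE = 10  # assumption from the example: every file has exactly 10 rows


def locate(idx, rows_per_file=ROWS_PER_FILE):
    """Return (file_index, row_index), both counted from 0."""
    return divmod(idx, rows_per_file)


print(locate(21))  # (2, 1): third file, second row
```

Real files rarely all have the same number of rows, which is why the dataset class further down stores a row count per file instead.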

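Because you have so much data, you don't want to load everything into memory. So the Dataset should fetch only a single value, and the DataLoader will build batches of values from it.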

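There are almost certainly optimizations that could be applied to what I did, but maybe this can get you started. I created a directory csvs with these files: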

❯ cat csvs/1.csv
1,2,3
2,3,4
3,4,5

❯ cat csvs/2.csv
21,21,21
34,34,34
66,77,88

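Then I created this dataset class. It takes a directory as input (where all the CSVs are stored). The only things kept in memory are the name of each file and its number of rows. When an item is requested, we work out which file contains that index and then return the tensor for that row.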

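By only iterating over the files, we never store the file contents in memory. One improvement, though, would be to avoid walking the file list to find the relevant file, and to use a generator and some state when consecutive indices are accessed.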

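(When accessing index 8 in a 10-row file, we uselessly iterate over the first 7 rows, which cannot be avoided with this approach. But when index 9 is accessed next, it would be better to work out that we can simply return the next row rather than iterating over the first 8 rows again.)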
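One possible way to address both points is sketched below, using only the standard library. It is not part of the original answer: `CSVRowIndex` and `read_row` are hypothetical names. The sketch assumes it is acceptable to keep one integer (a byte offset) per row in memory, which may or may not hold for 40 GB of data. Cumulative row counts plus `bisect` replace the walk over the file list, and `seek` replaces re-reading the leading rows.

```python
import bisect
import itertools
from pathlib import Path


class CSVRowIndex:
    """Hypothetical helper: maps a global row index to (file, byte offset)."""

    def __init__(self, files):
        self.files = [Path(f) for f in files]
        # offsets[k][r] is the byte position where row r of file k starts.
        self.offsets = [self._line_offsets(f) for f in self.files]
        # ends[k] is the total number of rows in files 0..k,
        # e.g. [3, 6] for two files with 3 rows each.
        self.ends = list(itertools.accumulate(len(o) for o in self.offsets))

    @staticmethod
    def _line_offsets(path):
        offsets, position = [], 0
        with path.open("rb") as f:
            for line in f:
                if line.strip():  # ignore blank lines
                    offsets.append(position)
                position += len(line)
        return offsets

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def read_row(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Binary search for the first file whose cumulative count exceeds idx.
        k = bisect.bisect_right(self.ends, idx)
        row = idx - (self.ends[k - 1] if k else 0)
        with self.files[k].open("rb") as f:
            f.seek(self.offsets[k][row])  # jump straight to the row
            return [float(v) for v in f.readline().decode().split(",")]
```

Inside a Dataset, `__getitem__` could call `read_row(idx)` and convert the resulting list to a tensor. Building the index still reads every file once up front, so this trades startup time and some memory for faster random access.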

import numpy as np
from functools import lru_cache
from pathlib import Path
from pprint import pprint
import torch
from torch.utils.data import Dataset, DataLoader

@lru_cache()
def get_sample_count_by_file(path: Path) -> int:
    c = 0
    with path.open() as f:
        for line in f:
            c += 1
    return c


class CSVDataset(Dataset):
    def __init__(self, csv_directory: str, extension: str = ".csv"):
        self.directory = Path(csv_directory)
        self.files = sorted((f, get_sample_count_by_file(f)) for f in self.directory.iterdir() if f.suffix == extension)
        self._sample_count = sum(f[-1] for f in self.files)

    def __len__(self):
        return self._sample_count

    def __getitem__(self, idx):
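        if not 0 <= idx < self._sample_count:
            # without this check an out-of-range index would silently return None
            raise IndexError(f"index {idx} out of range")
        current_count = 0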
        for file_, sample_count in self.files:
            if current_count <= idx < current_count + sample_count:
                # stop when the index we want is in the range of the sample in this file
                break  # now file_ will be the file we want
            current_count += sample_count

        # now file_ has sample_count samples
        file_idx = idx - current_count  # the index we want to access in file_
        with file_.open() as f:
            for i, line in enumerate(f):
                if i == file_idx:
                    data = np.array([float(v) for v in line.split(",")])
                    return torch.from_numpy(data)

Now we can use the DataLoader as intended:

dataset = CSVDataset("csvs")
loader = DataLoader(dataset, batch_size=4)

pprint(list(enumerate(loader)))

"""
[(0,
  tensor([[ 1.,  2.,  3.],
        [ 2.,  3.,  4.],
        [ 3.,  4.,  5.],
        [21., 21., 21.]], dtype=torch.float64)),
 (1, tensor([[34., 34., 34.],
        [66., 77., 88.]], dtype=torch.float64))]
"""

You can see that this correctly returns batches of data. Instead of printing them, you can process each batch and keep only that batch in memory.
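As a rough illustration of processing one batch at a time, the sketch below accumulates per-column means over a stream of batches. To stay dependency-free it uses plain lists: `fake_loader` is a made-up stand-in that yields the same two batches as the output above, and `column_means` is a hypothetical helper, not a PyTorch function.

```python
def column_means(batches):
    """Accumulate per-column means while holding only one batch at a time."""
    totals, count = None, 0
    for batch in batches:  # batch: a list of rows, each a list of numbers
        for row in batch:
            if totals is None:
                totals = [0.0] * len(row)
            for j, value in enumerate(row):
                totals[j] += value
            count += 1
    return [t / count for t in totals] if count else []


def fake_loader():
    # Stand-in for the DataLoader: yields the two batches shown above.
    yield [[1, 2, 3], [2, 3, 4], [3, 4, 5], [21, 21, 21]]
    yield [[34, 34, 34], [66, 77, 88]]


means = column_means(fake_loader())
print(means)
```

With a real DataLoader the same pattern applies: iterate over the loader, update a small running state, and let each batch be garbage-collected before the next one is read.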

For more information, see the documentation: https://pytorch.org/tutorials/recipes/recipes/custom_dataset_transforms_loader.html#part-3-the-dataloader
