Stratified Shuffle Split for large files

Question

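I have a 35 GB CSV file (expected to grow larger in the future) for a binary classification problem in Keras. To train and test my model, I want to split the data into train/test datasets that each have the same proportion of positive samples. Something like this: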

|Dataset type | Total samples | Negative samples | Positive samples   |
|-------------|---------------|------------------|--------------------|
|Dataset      |    10000      |        8000      |       2000         |
|Train        |    7000       |        6000      |       1000         |
|Test         |    3000       |        2000      |       1000         |

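Because this dataset is too large to fit in memory, I wrote a custom generator that loads the data in batches and trains the model through fit_generator. As a result, I can't apply scikit-learn's StratifiedShuffleSplit, because it needs the whole dataset, not just part of it, to preserve the proportion of positive examples in the train and test datasets.

Edit: my data has the following shape: 11500 x 160000

Does anyone know how I can do this?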

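Solution

I followed Ian Lin's answer step by step. Note that converting a DataFrame to hdf5 may fail if you have a large number of columns, so create the hdf5 file directly from a numpy array instead.

In addition, to append data to the hdf5 file, I had to do the following (set maxshape to None for every dimension of the dataset that should be resizable without limit; in my case, the dataset can grow by an unlimited number of rows while the number of columns stays fixed):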

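import os
import h5py
import numpy as np

path = 'test.h5'
mydata = np.random.rand(11500, 160000)
if not os.path.exists(path):
    # First write: rows are unlimited (None), the column count is fixed.
    with h5py.File(path, 'w') as hf:
        hf.create_dataset('dataset', data=mydata, maxshape=(None, mydata.shape[1]))
else:
    # Later writes: grow the dataset along axis 0, then fill the new rows.
    with h5py.File(path, 'a') as hf:
        hf['dataset'].resize(hf['dataset'].shape[0] + mydata.shape[0], axis=0)
        hf['dataset'][-mydata.shape[0]:, :] = mydata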
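The same append pattern can be driven by reading the CSV in chunks, so the full file never has to sit in memory. The following is only a rough sketch under assumptions that are not in the original post: the function name `csv_to_hdf5`, the dataset names `features` and `labels`, the `label_col` parameter, and the float32 cast are all hypothetical choices, and the CSV is assumed to have a header row and purely numeric feature columns.

```python
import h5py
import numpy as np
import pandas as pd


def csv_to_hdf5(csv_path, h5_path, label_col, chunksize=1000):
    """Stream a CSV into resizable HDF5 datasets, one chunk at a time.

    All names here (datasets "features"/"labels", label_col) are
    illustrative assumptions, not part of the original post.
    """
    with h5py.File(h5_path, "w") as hf:
        features = labels = None
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            y = chunk.pop(label_col).to_numpy()
            x = chunk.to_numpy(dtype="float32")
            if features is None:
                # First chunk: create datasets that can grow along axis 0 only.
                features = hf.create_dataset(
                    "features", data=x, maxshape=(None, x.shape[1]), chunks=True
                )
                labels = hf.create_dataset(
                    "labels", data=y, maxshape=(None,), chunks=True
                )
            else:
                # Later chunks: extend both datasets, then write the new rows.
                start = features.shape[0]
                features.resize(start + x.shape[0], axis=0)
                labels.resize(start + y.shape[0], axis=0)
                features[start:] = x
                labels[start:] = y
```

Keeping the labels in their own small dataset means they can later be loaded in full for a stratified split, while the wide feature matrix stays on disk.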

Answer 1

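I usually do this:

1. Store the data in a file such as a numpy.memmap or an HDF5 dataset (if your dataset has a large number of features, use h5py rather than pandas.DataFrame.to_hdf() or pytables).
2. Generate integer indices with something like range(dataset.shape[0]).
3. Use a split function from sklearn to split the integer indices into train/test.
4. Pass the integer indices to your generator and use them to index into the h5py.Dataset or numpy.memmap.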
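The steps above might be sketched roughly as follows. This assumes the label column alone fits in memory (for 11500 rows it easily does), so only the labels are needed to stratify the row indices. The helper names `stratified_indices` and `batch_generator`, the seed, and the batch size are hypothetical; `features` can be any array-like that supports indexing by a sorted integer array, such as an h5py.Dataset or a numpy.memmap.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_indices(labels, test_size=0.3, seed=0):
    """Split row indices so train and test keep the same positive ratio."""
    indices = np.arange(len(labels))
    train_idx, test_idx = train_test_split(
        indices, test_size=test_size, stratify=labels, random_state=seed
    )
    return np.sort(train_idx), np.sort(test_idx)


def batch_generator(features, labels, indices, batch_size=32, seed=0):
    """Yield (x, y) batches forever, reading only the requested rows."""
    rng = np.random.default_rng(seed)
    while True:
        shuffled = rng.permutation(indices)
        for start in range(0, len(shuffled), batch_size):
            # h5py fancy indexing needs indices in increasing order,
            # so each batch is sorted after shuffling.
            rows = np.sort(shuffled[start:start + batch_size])
            yield features[rows], labels[rows]
```

With 8000 negative and 2000 positive labels and test_size=0.3, the split gives 7000 training rows with 1400 positives and 3000 test rows with 600 positives, so both sets keep the 20% positive rate. Each generator can then be built from its own index array and passed to fit_generator.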

If you use keras.preprocessing.image.ImageDataGenerator.flow() as your generator, you can refer to the helper I wrote here to re-index the data more easily.
