How can I open a large parquet file for Keras?

Question

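I have tried searching for this but found nothing meaningful.

I have a Keras model with multiple inputs. My data is too large for my pandas approach, so I preprocessed it and saved it as a parquet file. I don't know how to open it with Keras.

I looked into tf.datasets, but I still don't know how to read a parquet file in a form that I can pass to my model.

Does anyone know how to work with a parquet file here? I can't figure out how to do this in TensorFlow, and I can't find anything related to it in Keras.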

Answer 1

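You can probably keep the pandas approach, but you will have to break the data into chunks.

If you already split the data up in order to create the parquet file, you should be able to use the same method to open only one subset of the data in pandas at a time.

If you need to pull the data out of the parquet file, here is a link on how to create chunks of data for a pandas DataFrame: How to read a CSV file subset by subset with Pandas?

Once you have a chunk of data, you can call model.fit on that chunk, then move on to the next chunk and call model.fit again.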

Answer 2

You can take a look at TensorFlow I/O, which is a collection of file systems and file formats that are not available in TensorFlow's built-in support. There you can find functionality such as tfio.IODataset.from_parquet and tfio.IOTensor.from_parquet for working with the parquet file format.

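!pip install tensorflow_io -U -q
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_io as tfio

# Build a small example frame and write it to a parquet file
df = pd.DataFrame({"data": tf.random.normal([20], 0, 1, tf.float32).numpy(),
                   "label": np.random.randint(2, size=(20))})
df.to_parquet("df.parquet")
pd.read_parquet("df.parquet")[:2]
# Output (values are random and will differ between runs):
    data    label
0   0.721347    1
1   -1.215225   1

# Open the same parquet file as a dataset
ds = tfio.IODataset.from_parquet("df.parquet")
ds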

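FYI, I think you should also consider using the feather format rather than the parquet file format. As far as I know, parquet files can be quite heavy to load and may slow down your training pipeline, whereas feather is comparatively fast (very fast).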

Tags: parquet, pyspark, keras, tensorflow