Saving a DataFrame to disk loses numpy data types

Question

I have a large DataFrame that I need to save to disk. Its columns have types such as numpy.int32 or numpy.floatXX.

         int32Data  ColumName  ...   float32Data  otherTypeData
0        150294240     4260.0  ...  3.203908e+02         7960.0
1        150294246     4260.0  ...  0.000000e+00         7960.0
2        150294252     4280.0  ...  1.117543e+03         7960.0
3        150294258     4260.0  ...  5.117185e+01         7960.0
4        150294264     4260.0  ...  5.999993e+02         7960.0
...            ...        ...  ...           ...            ...
1839311  161375508    54592.0  ...  8.990022e+05            0.0
1839312  161375514    54624.0  ...  2.097199e+06            0.0
1839313  161375520    54656.0  ...  1.192150e+06            0.0
1839314  161375526    54688.0  ...  1.249997e+06            0.0
1839315  161375532    54592.0  ...  8.949273e+05            0.0

Using the right data types saves a lot of space and processing power.

But when I save the DataFrame df to disk with

np.save(FilePath,df)

and read it back with

ReadData=np.load(FilePath).tolist()
df=DataFrame(ReadData)

all of the data is converted to numpy.float64 (and the column names are dropped).

Is it possible to preserve each column's data type (and the column names) when saving and loading the DataFrame?
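
A minimal reproduction of the behavior described above may help make the cause visible. This is only a sketch: the column names are borrowed from the printout, the two-row values are made up for illustration, and an in-memory buffer stands in for FilePath. np.save first converts the DataFrame to a single array, so mixed int32/float32 columns are promoted to one common dtype and the column labels are not stored.

```python
import io

import numpy as np
import pandas as pd

# Small illustrative frame (values are made up; only the dtypes matter here)
df = pd.DataFrame({
    "int32Data": np.array([150294240, 150294246], dtype=np.int32),
    "float32Data": np.array([320.3908, 0.0], dtype=np.float32),
})

# np.save converts its argument to one homogeneous ndarray before writing
buffer = io.BytesIO()  # stands in for a file on disk
np.save(buffer, df)
buffer.seek(0)

loaded = np.load(buffer)
restored = pd.DataFrame(loaded.tolist())

print(df.dtypes)         # per-column dtypes of the original frame
print(loaded.dtype)      # single dtype shared by the whole saved array
print(restored.columns)  # default integer labels instead of the names
```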

Answer 1

HDF5 storage may be exactly what you are looking for. It lets you store large amounts of data efficiently, preserves data types, and lets you retrieve data very quickly. You can find more details in the documentation.

An example of how to use it:

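import pandas as pd

# file_path, key and df are assumed to be defined already;
# HDFStore needs the optional PyTables ("tables") package
with pd.HDFStore(file_path) as hdf:
    # save the DataFrame to the HDF file under the given key
    hdf.put(key, df)

    # and retrieve it later
    df = hdf.get(key)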
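
To check that the types really survive, a round trip along the following lines could be used. This is a sketch under a few assumptions: the helper name hdf_roundtrip, the default key "df", and the sample frame are hypothetical and not part of the original snippet, and the optional PyTables package must be installed for HDFStore to work.

```python
import numpy as np
import pandas as pd


def hdf_roundtrip(df, file_path, key="df"):
    """Write df to an HDF5 file, then read it back (requires PyTables).

    hdf_roundtrip and its default key are hypothetical names used only
    for this illustration.
    """
    # mode="w" creates the file, overwriting any existing one
    with pd.HDFStore(file_path, mode="w") as hdf:
        hdf.put(key, df)
    # reopen read-only and load the stored frame
    with pd.HDFStore(file_path, mode="r") as hdf:
        return hdf.get(key)


# Made-up sample frame with explicit 32-bit column types
sample = pd.DataFrame({
    "int32Data": np.array([150294240, 150294246], dtype=np.int32),
    "float32Data": np.array([320.3908, 0.0], dtype=np.float32),
})
```

For example, calling restored = hdf_roundtrip(sample, "sample.h5") and then comparing restored.dtypes with sample.dtypes, and restored.columns with sample.columns, is one way to confirm that both the per-column types and the column names come back unchanged.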

Tags: python, numpy, save, dataframe, pandas