Append data to HDF5 file with Pandas, Python

Question

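I have large pandas DataFrames with financial data. I have no problem appending and concatenating additional columns and DataFrames to my .h5 file.

The financial data is updated every minute, and I need to append one row of data to all of the existing tables in my .h5 file every minute.

Here is what I have tried so far, but no matter what I do, it overwrites the .h5 file instead of just appending the data.

The HDFStore way: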

#we open the hdf5 file
save_hdf = HDFStore('test.h5') 

ohlcv_candle.to_hdf('test.h5')

#we give the dataframe a key value
#format=table so we can append data
save_hdf.put('name_of_frame',ohlcv_candle, format='table',  data_columns=True)

#we print our dataframe by calling the hdf file with the key
#just doing this as a test
print(save_hdf['name_of_frame'])

The other way I have tried, to_hdf:

#format=t so we can append data , mode=r+ to specify the file exists and
#we want to append to it
tohlcv_candle.to_hdf('test.h5',key='this_is_a_key', mode='r+', format='t')

#again just printing to check if it worked 
print(pd.read_hdf('test.h5', key='this_is_a_key'))

This is what one of the DataFrames looks like after read_hdf:

           time     open     high      low    close     volume           PP  
0    1505305260  3137.89  3147.15  3121.17  3146.94   6.205397  3138.420000   
1    1505305320  3146.86  3159.99  3130.00  3159.88   8.935962  3149.956667   
2    1505305380  3159.96  3160.00  3159.37  3159.66   4.524017  3159.676667   
3    1505305440  3159.66  3175.51  3151.08  3175.51   8.717610  3167.366667   
4    1505305500  3175.25  3175.53  3170.44  3175.53   3.187453  3173.833333

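The next time I fetch data (every minute), I would like a row of it added at index 5 of all my columns, then 6, then 7, and so on, without having to read and manipulate the entire file in memory, as that would defeat the point of doing this. If there is a better way of solving this, do not be shy about recommending it.

P.S. Sorry about the formatting of the table here.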

Answer 1

pandas.HDFStore.put() has an append parameter (default False), which tells pandas to overwrite the existing data under that key instead of appending to it.

So try this:

store = pd.HDFStore('test.h5')

store.append('name_of_frame', ohlcv_candle, format='t',  data_columns=True)

We can also use store.put(..., append=True), but the table under that key must also have been created in table format:

store.put('name_of_frame', ohlcv_candle, format='t', append=True, data_columns=True)

NOTE: appending works only for the table format (format='t' is an alias for format='table').
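The per-minute workflow from the question could be sketched roughly as follows. This is a minimal sketch, assuming PyTables is installed; the helper name `append_candle` and the candle values are made up for illustration and are not part of pandas.

```python
import os
import tempfile

import pandas as pd


def append_candle(path, key, candle):
    """Append the rows of `candle` to the table stored under `key`.

    Hypothetical helper, not part of pandas. HDFStore opens the file in
    mode='a' by default, so other keys in the file are left alone.
    """
    with pd.HDFStore(path) as store:
        # append() creates the table on the first call and adds rows afterwards
        store.append(key, candle, format="table", data_columns=True)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "test.h5")

    # Made-up one-row frames standing in for two consecutive minutes
    first = pd.DataFrame({"time": [1505305260], "close": [3146.94]})
    second = pd.DataFrame({"time": [1505305320], "close": [3159.88]})

    append_candle(path, "name_of_frame", first)
    append_candle(path, "name_of_frame", second)

    # Prints both rows; only the new row was written on the second call
    print(pd.read_hdf(path, key="name_of_frame"))
```

Opening the store inside a `with` block closes the file after each write, which seems a reasonable choice for a job that wakes up once a minute.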

Answer 2

tohlcv_candle.to_hdf('test.h5',key='this_is_a_key', append=True, mode='r+', format='t')

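You need to pass another argument, append=True, to specify that if data is found under that key, the new data should be appended to the existing data instead of overwriting it.

Without it, the default is False, and if an existing table is found under 'this_is_a_key', it is overwritten.

The mode= argument applies only at the file level; it says whether the file as a whole is to be overwritten or appended to.

One file can hold any number of keys, so a mode='a', append=False setting means that only one key gets overwritten while the other keys stay.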
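The difference between the file-level mode and the key-level append can be checked with a small experiment. The sketch below assumes PyTables is installed; the function name and the keys 'btc' and 'eth' are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def demo_mode_vs_append(path):
    """Return row counts and key lists showing file-level vs key-level effects."""
    row = pd.DataFrame({"time": [1505305260], "close": [3146.94]})

    # Start from a fresh file holding two keys, both in table format
    row.to_hdf(path, key="btc", mode="w", format="table")
    row.to_hdf(path, key="eth", mode="a", format="table")

    # mode='a' with append=True: a row is added under 'btc', 'eth' is untouched
    row.to_hdf(path, key="btc", mode="a", format="table", append=True)
    rows_after_append = len(pd.read_hdf(path, key="btc"))

    # mode='a' with append=False: only 'btc' is replaced, 'eth' survives
    row.to_hdf(path, key="btc", mode="a", format="table", append=False)
    rows_after_overwrite = len(pd.read_hdf(path, key="btc"))
    with pd.HDFStore(path, mode="r") as store:
        keys_after_overwrite = sorted(store.keys())

    # mode='w': the whole file is recreated, so 'eth' is gone
    row.to_hdf(path, key="btc", mode="w", format="table")
    with pd.HDFStore(path, mode="r") as store:
        keys_after_w = sorted(store.keys())

    return rows_after_append, rows_after_overwrite, keys_after_overwrite, keys_after_w


if __name__ == "__main__":
    print(demo_mode_vs_append(os.path.join(tempfile.mkdtemp(), "test.h5")))
```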

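I had a similar experience to yours and found the additional append parameter in the reference documentation. After setting it, it now appends properly for me.

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html

Note: HDF5 will not do anything with the DataFrame's index. We need to sort that out either before putting the data in or when we take it out.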
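One way to handle the index before writing is to renumber each new row so that it continues from the number of rows already stored. This is only a sketch under assumptions: PyTables is installed, the stored table uses a plain 0..n-1 integer index, and `append_with_running_index` is a made-up helper name.

```python
import os
import tempfile

import pandas as pd


def append_with_running_index(path, key, candle):
    """Append `candle`, renumbering its index to continue the stored table.

    Hypothetical helper. Assumes the stored table uses a plain 0..n-1 index.
    """
    with pd.HDFStore(path) as store:
        # nrows comes from the table's metadata, so the stored rows are not loaded
        start = store.get_storer(key).nrows if key in store else 0
        candle = candle.copy()
        candle.index = pd.RangeIndex(start, start + len(candle))
        store.append(key, candle, format="table", data_columns=True)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "test.h5")
    for t in (1505305260, 1505305320, 1505305380):
        # Each fetched frame arrives with its own index starting at 0
        append_with_running_index(path, "name_of_frame", pd.DataFrame({"time": [t]}))
    print(pd.read_hdf(path, key="name_of_frame"))
```

The alternative on the way out is to call `reset_index(drop=True)` on the frame returned by `read_hdf`, which leaves the stored index as it is.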
