How do I combine multiple pandas dataframes into an HDF5 object under one key/group?
I am parsing data from a large csv file, 800 GB in size. For each line of data, I save it as a pandas dataframe.

    import csv
    import pandas as pd

    readcsvfile = csv.reader(csvfile)
    for i, line in enumerate(readcsvfile):
        # parse the line into a dictionary of csv field: value pairs, "dictionary_line"
        # save as pandas dataframe
        df = pd.DataFrame(dictionary_line, index=[i])
Now, I would like to save this in HDF5 format and query the h5 file as if it were the entire csv file.

    import pandas as pd
    store = pd.HDFStore("pathname/file.h5")
    hdf5_key = "single_key"
    csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
My approach so far has been:

    import csv
    import pandas as pd

    store = pd.HDFStore("pathname/file.h5")
    hdf5_key = "single_key"
    csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
    readcsvfile = csv.reader(csvfile)
    for i, line in enumerate(readcsvfile):
        # parse the line into a dictionary of csv field: value pairs, "dictionary_line"
        # save as pandas dataframe
        df = pd.DataFrame(dictionary_line, index=[i])
        # append this single-row dataframe to the table stored under hdf5_key
        store.append(hdf5_key, df, data_columns=csv_columns, index=False)
That is, I try to save each dataframe df into the HDF5 file under one key. However, this fails:

    Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'
So, I could try to save everything into one pandas dataframe first, i.e.

    import csv
    import pandas as pd

    store = pd.HDFStore("pathname/file.h5")
    hdf5_key = "single_key"
    csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
    readcsvfile = csv.reader(csvfile)
    total_df = pd.DataFrame()
    for i, line in enumerate(readcsvfile):
        # parse the line into a dictionary of csv field: value pairs, "dictionary_line"
        # save as pandas dataframe
        df = pd.DataFrame(dictionary_line, index=[i])
        total_df = pd.concat([total_df, df])  # builds one big dataframe holding the whole csv

    # now store in HDF5 format
    store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)
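However, I don't think I have the RAM/storage to hold all the csv rows in total_df before writing it out in HDF5 format.

So, how do I append each "single-line" df to the HDF5 file so that it ends up as one big dataframe (like the original csv)?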
EDIT: Here is a concrete example of a csv file with different data types:

    order    start    end     value
    1        1342     1357    category1
    1        1459     1489    category7
    1        1572     1601    category23
    1        1587     1599    category2
    1        1591     1639    category1
    ....
    15       792      813     category13
    15       892      913     category5
    ....
Your code should work. Can you try the following code:

    import pandas as pd
    import numpy as np

    store = pd.HDFStore("file.h5", "w")  # "w" starts from a fresh, empty file
    hdf5_key = "single_key"
    csv_columns = ["COL%d" % i for i in range(1, 56)]
    for i in range(10):
        # one random single-row dataframe per iteration, with the same 55 columns
        df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
        store.append(hdf5_key, df, data_columns=csv_columns, index=False)
    store.close()
If this code works, then there is something wrong with your data.
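Separately from the error itself, appending one row at a time is likely to be slow for a file this size. Below is a minimal sketch of a chunked alternative, assuming the csv can be parsed directly by `pd.read_csv` (the question's custom per-line parsing may rule that out). The function name `csv_to_hdf` and its parameters are made up for illustration and are not part of pandas. `min_itemsize` is passed through because string columns in an HDF5 table get a fixed width from the first append, so a later, longer string (e.g. `category23` after `category1`) would otherwise be rejected.

```python
import pandas as pd


def csv_to_hdf(csv_path, h5_path, key, chunksize=100_000, sep=",",
               dtype=None, min_itemsize=None):
    """Append a csv file to a single HDF5 table, one chunk of rows at a time.

    Hypothetical helper for illustration. Returns the number of rows written.
    """
    rows_written = 0
    # mode="w" starts from a fresh file, so leftovers from an earlier failed
    # run cannot remain under the same key.
    with pd.HDFStore(h5_path, mode="w") as store:
        # With chunksize set, read_csv yields dataframes of at most
        # `chunksize` rows, so only one chunk is in memory at a time.
        for chunk in pd.read_csv(csv_path, sep=sep, dtype=dtype,
                                 chunksize=chunksize):
            # data_columns=True makes every column queryable via `where`;
            # index=False skips building the on-disk index during the append.
            store.append(key, chunk, data_columns=True, index=False,
                         min_itemsize=min_itemsize)
            rows_written += len(chunk)
    return rows_written
```

With the sample columns from the edit, a call might look like `csv_to_hdf("big.csv", "file.h5", "single_key", min_itemsize={"value": 32})`, where `big.csv` and the width of 32 are placeholders. Rows could then be read back selectively with something like `store.select("single_key", where="order == 15")`. Passing an explicit `dtype` mapping is worth considering, since type inference runs per chunk and a column that looks like integers in one chunk may be inferred as floats in another.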