HDF5 min_itemsize error: ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!
I get the following error after using pandas.HDFStore().append():
ValueError: Trying to store a string with len [150] in [values_block_0] column but this column has a limit of [127]!
Consider using min_itemsize to preset the sizes on these columns
I am creating a pandas DataFrame and appending it to an HDF5 file as follows:
import pandas as pd
store = pd.HDFStore("test1.h5", mode='w')
hdf_key = "one_key"
columns = ["col1", "col2", ... ]
df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
....
store.append(hdf_key, df, data_column=columns, index=False)
This gives the error above: "ValueError: Trying to store a string with len [150] in [values_block_0] column but this column has a limit of [127]!"
Afterwards, I ran:
store.get_storer(hdf_key).table.description
which outputs:
{
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=127, shape=(5,), dflt=b'', pos=1),
"values_block_1": Int64Col(shape=(5,), dflt=0, pos=2),
"col1": StringCol(itemsize=20, shape=(), dflt=b'', pos=3),
"col2": StringCol(itemsize=39, shape=(), dflt=b'', pos=4)}
What are values_block_0 and values_block_1?
So, following the Stack Overflow question "Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex", I tried:
store.append(hdf_key, df, data_column=columns, index=False, min_itemsize={"values_block_0":250})
This does not work either; now I get this error:
ValueError: Trying to store a string with len [250] in [values_block_0] column but this column has a limit of [127]!
Consider using min_itemsize to preset the sizes on these columns
What am I doing wrong?
EDIT: Running the following code from filename.py produces the error
ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column
import pandas as pd
store = pd.HDFStore("test1.h5", mode='w')
hdf_key = "one_key"
my_columns = ["col1", "col2", ... ]
df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
....
store.append(hdf_key, df, data_column=my_columns, index=False, min_itemsize={"values_block_0":350})
Here is the full error:
(python-3) -bash:1008 $ python filename.py
Traceback (most recent call last):
File "filename.py", line 50, in <module>
store.append(hdf_key, dicts_into_df, data_column=my_columns, index=False, min_itemsize={'values_block_0':350})
File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 970, in append
**kwargs)
File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 1315, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 4263, in write
obj=obj, data_columns=data_columns, **kwargs)
File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3853, in write
**kwargs)
File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3535, in create_axes
self.validate_min_itemsize(min_itemsize)
File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3174, in validate_min_itemsize
"data_column" % k)
ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column
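UPDATE:
You misspelled the data_columns parameter: you wrote data_column, but it should be data_columns. As a result, your HDF store has no indexed (data) columns, and the HDF store added values_block_X columns instead: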
In [70]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')
The misspelled parameter is silently ignored:
In [71]: store.append('no_idx_wrong_dc', df, data_column=df.columns, index=False)
In [72]: store.get_storer('no_idx_wrong_dc').table
Out[72]:
/no_idx_wrong_dc/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
"values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
byteorder := 'little'
chunkshape := (1213,)
This is the same as passing no data columns at all:
In [73]: store.append('no_idx_no_dc', df, index=False)
In [74]: store.get_storer('no_idx_no_dc').table
Out[74]:
/no_idx_no_dc/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
"values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
byteorder := 'little'
chunkshape := (1213,)
Now let's spell it correctly:
In [75]: store.append('no_idx_dc', df, data_columns=df.columns, index=False)
In [76]: store.get_storer('no_idx_dc').table
Out[76]:
/no_idx_dc/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"value": Float64Col(shape=(), dflt=0.0, pos=1),
"count": Int64Col(shape=(), dflt=0, pos=2),
"s": StringCol(itemsize=30, shape=(), dflt=b'', pos=3)}
byteorder := 'little'
chunkshape := (1213,)
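Because the misspelled keyword in the demo above is dropped without any warning, one way to catch such a typo early is to check the keyword names before calling store.append(). The sketch below is only an illustration: check_append_kwargs and KNOWN_APPEND_KWARGS are hypothetical names that are not part of pandas, and the keyword list is an assumption that should be compared against the signature of HDFStore.append in the installed pandas version.

```python
import difflib

# Assumed subset of keyword names accepted by HDFStore.append();
# verify it against the installed pandas version before relying on it.
KNOWN_APPEND_KWARGS = {
    "format", "index", "append", "complib", "complevel", "min_itemsize",
    "nan_rep", "chunksize", "expectedrows", "dropna", "data_columns",
    "encoding", "errors",
}


def check_append_kwargs(**kwargs):
    """Return kwargs unchanged, or raise TypeError on an unknown name."""
    for name in kwargs:
        if name not in KNOWN_APPEND_KWARGS:
            # Suggest the closest known keyword, if there is one.
            close = difflib.get_close_matches(
                name, sorted(KNOWN_APPEND_KWARGS), n=1
            )
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            raise TypeError(f"unknown keyword {name!r}{hint}")
    return kwargs
```

It could then be used as store.append(hdf_key, df, **check_append_kwargs(data_columns=columns, index=False)), so that data_column raises immediately instead of producing values_block_X columns.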
OLD ANSWER:
AFAIK you can effectively set the min_itemsize parameter only on the first append.
Demo:
In [33]: df
Out[33]:
num s
0 11 aaaaaaaaaaaaaaaa
1 12 bbbbbbbbbbbbbb
2 13 ccccccccccccc
3 14 ddddddddddd
In [34]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')
In [35]: store.append('test_1', df, data_columns=True)
In [36]: store.get_storer('test_1').table.description
Out[36]:
{
"index": Int64Col(shape=(), dflt=0, pos=0),
"num": Int64Col(shape=(), dflt=0, pos=1),
"s": StringCol(itemsize=16, shape=(), dflt=b'', pos=2)}
In [37]: df.loc[4] = [15, 'X'*200]
In [38]: df
Out[38]:
num s
0 11 aaaaaaaaaaaaaaaa
1 12 bbbbbbbbbbbbbb
2 13 ccccccccccccc
3 14 ddddddddddd
4 15 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
In [39]: store.append('test_1', df, data_columns=True)
...
skipped
...
ValueError: Trying to store a string with len [200] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns
Now let's use min_itemsize, but still append to the existing store object:
In [40]: store.append('test_1', df, data_columns=True, min_itemsize={'s':250})
...
skipped
...
ValueError: Trying to store a string with len [250] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns
The following works if we create a new object in our store:
In [41]: store.append('test_2', df, data_columns=True, min_itemsize={'s':250})
Check the column sizes:
In [42]: store.get_storer('test_2').table.description
Out[42]:
{
"index": Int64Col(shape=(), dflt=0, pos=0),
"num": Int64Col(shape=(), dflt=0, pos=1),
"s": StringCol(itemsize=250, shape=(), dflt=b'', pos=2)}
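Since the width only takes effect on the first append, it may help to compute a generous min_itemsize up front rather than guessing a number. The following is a minimal sketch under stated assumptions: suggest_min_itemsize and its headroom parameter are hypothetical names, and lengths are measured in UTF-8 encoded bytes on the assumption that the store's limit applies to the encoded string.

```python
import math


def suggest_min_itemsize(data, columns, headroom=2.0, encoding="utf-8"):
    """Map each named column to a byte width with room to grow.

    `data` is anything indexable by column name whose columns yield
    their values when iterated (a DataFrame or a dict of lists).
    """
    sizes = {}
    for name in columns:
        # Measure encoded bytes rather than characters.
        longest = max(
            (len(str(value).encode(encoding)) for value in data[name]),
            default=0,
        )
        # Scale by the headroom factor; never return a width below 1.
        sizes[name] = max(1, math.ceil(longest * headroom))
    return sizes
```

On the first append for a new key, the result could be passed along as store.append('test_2', df, data_columns=True, min_itemsize=suggest_min_itemsize(df, ['s'])). The right headroom factor depends on the data and is a judgment call.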
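I started getting this error at about the same time I updated Pandas from 18.1 to 22.0 (although this could be unrelated).
I fixed the error in an existing HDF5 file by manually reading the dataframe in, then writing a new HDF5 file with a larger min_itemsize for the column mentioned in the error: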
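import pandas as pd

filename_hdf5 = r"C:\test.h5"  # raw string, so "\t" is not read as a tab
df = pd.read_hdf(filename_hdf5, 'table_name')
hdf = pd.HDFStore(filename_hdf5)
# Rewrite the table with a wider string column; choose a size that fits future data
hdf.put('table_name', df, format='table', data_columns=True, min_itemsize={'ColumnNameMentionedInError': 10})
hdf.close()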
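Then I updated the existing code to set min_itemsize when the key is created.
Extra for Experts
This error occurs when one tries to append more rows to an existing dataframe whose fixed column width is too narrow for the new data. The fixed column width was originally set from the longest string in the column when the dataframe was first written.
I think pandas should handle this error transparently, rather than leaving what is effectively a time bomb for all future appends. The problem could take weeks or even years to surface.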
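To see the problem coming before an append fails, incoming rows can be compared against the widths already fixed in the table. The sketch below assumes the current limits have been collected into a plain dict, for example by reading them off the output of store.get_storer(key).table.description; find_oversized is a hypothetical helper, not a pandas function, and it measures UTF-8 encoded bytes as an assumption about how the limit is applied.

```python
def find_oversized(data, limits, encoding="utf-8"):
    """Return {column: longest_encoded_length} for columns over their limit.

    `limits` maps a column name to the itemsize already fixed in the
    stored table; `data` is a DataFrame or a dict of lists.
    """
    too_long = {}
    for name, limit in limits.items():
        longest = max(
            (len(str(value).encode(encoding)) for value in data[name]),
            default=0,
        )
        if longest > limit:
            too_long[name] = longest
    return too_long
```

If the returned dict is non-empty, the table would need to be rewritten with a larger min_itemsize (as in the fix above) before the new rows can be appended.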