How to split data among multiple groups in an HDF5 file?
I have some data that looks like this:
Generated by trjconv : P/L=1/400 t= 0.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
11210LI aLI11210 61.61 19.15 3.25
11211LI aLI11211 69.99 64.64 3.17
11212LI aLI11212 70.73 11.64 3.38
11213LI aLI11213 62.67 16.16 3.44
11214LI aLI11214 3.22 9.76 3.39
61.42836 61.42836 8.47704
I have successfully written the data into the required groups, except for the last line. I want to write this line to a group /particles/box. As you can see in the data file here, this particular line repeats in every frame. The way the code is designed so far, it somehow skips this line. I tried a few things but got the following error:
ValueError: Shape tuple is incompatible with data
The last line is time-dependent, i.e., it fluctuates slightly with each time frame. I would like this data to be linked with the step and time datasets already defined in /particles/lipids/positions/step. Here is the code:
import struct
import numpy as np
import h5py
import re

# First part: convert the .gro file -> .h5
csv_file = 'com'
fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from

# Format for footer
fmtstring1 = '1s 1s 5s 7s 7s 7s'
fieldstruct1 = struct.Struct(fmtstring1)
parse1 = fieldstruct1.unpack_from

with open(csv_file, 'r') as f, \
     h5py.File('xaa_trial.h5', 'w') as hdf:
    # open group for position data
    ## Particles group with the attributes
    particles_grp = hdf.require_group('particles/lipids/positions')
    box_grp = particles_grp.create_group('box')
    dim_grp = box_grp.create_group('dimension')
    dim_grp.attrs['dimension'] = 3
    bound_grp = box_grp.create_group('boundary')
    bound_grp.attrs['boundary'] = ['periodic', 'periodic', 'periodic']
    edge_grp = box_grp.create_group('edges')
    edge_ds_time = edge_grp.create_dataset('time', dtype='f', shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    edge_ds_step = edge_grp.create_dataset('step', dtype=np.uint64, shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    edge_ds_value = None

    ## H5MD group with the attributes
    #hdf.attrs['version'] = 1.0 # global attribute
    h5md_grp = hdf.require_group('h5md/version/author/creator')
    h5md_grp.attrs['version'] = 1.0
    h5md_grp.attrs['author'] = 'rohit'
    h5md_grp.attrs['creator'] = 'known'

    # datasets with known sizes
    ds_time = particles_grp.create_dataset('time', dtype="f", shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    ds_step = particles_grp.create_dataset('step', dtype=np.uint64, shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    ds_value = None

    step = 0
    while True:
        header = f.readline()
        m = re.search("t= *(.*)$", header)
        if m:
            time = float(m.group(1))
        else:
            print("End Of File")
            break

        # get number of data rows, i.e., number of particles
        nparticles = int(f.readline())
        # read data lines and store in array
        arr = np.empty(shape=(nparticles, 3), dtype=np.float32)
        for row in range(nparticles):
            fields = parse( f.readline().encode('utf-8') )
            arr[row] = np.array((float(fields[3]), float(fields[4]), float(fields[5])))

        if nparticles > 0:
            # create a resizable dataset upon the first iteration
            if not ds_value:
                ds_value = particles_grp.create_dataset('value', dtype=np.float32,
                        shape=(0, nparticles, 3), maxshape=(None, nparticles, 3),
                        chunks=(1, nparticles, 3), compression='gzip', shuffle=True)
                #edge_data = bound_grp.create_dataset('box_size', dtype=np.float32, shape=(0, nparticles, 3), maxshape=(None, nparticles, 3), compression='gzip', shuffle=True)

            # append this sample to the datasets
            ds_time.resize(step + 1, axis=0)
            ds_step.resize(step + 1, axis=0)
            ds_value.resize(step + 1, axis=0)
            ds_time[step] = time
            ds_step[step] = step
            ds_value[step] = arr

        footer = parse1( f.readline().encode('utf-8') )
        dat = np.array(footer)
        print(dat)
        arr1 = np.empty(shape=(1, 3), dtype=np.float32)
        edge_data = bound_grp.create_dataset('box_size', data=dat, dtype=np.float32, compression='gzip', shuffle=True)
        step += 1
#=============================================================================
Your code has a few small errors in how it reads and converts the "footer" line. I modified the code and got it working... but I'm not sure it does everything you want. I used the same group and dataset definitions, so the footer data is written to this dataset:
/particles/lipids/positions/box/boundary/box_size
which comes from these group and dataset definitions:
particles_grp = hdf.require_group('particles/lipids/positions')
box_grp = particles_grp.create_group('box')
bound_grp = box_grp.create_group('boundary')
edge_data = bound_grp.create_dataset('box_size'....
There are a few places that need to be corrected. First, you need to change the definition of parse1 to match 3 fields:
# Format for footer
# FROM:
fmtstring1 = '1s 1s 5s 7s 7s 7s'
# TO:
fmtstring1 = '10s 10s 10s'
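As a quick standalone check of the new format string (a minimal sketch; the sample line below is padded to three 10-character fields, which is an assumption about the file's exact spacing):

import struct

fmtstring1 = '10s 10s 10s'
parse1 = struct.Struct(fmtstring1).unpack_from

# footer line from the question, padded to 3 fields of 10 characters each
line = "  61.42836  61.42836   8.47704"
fields = parse1(line.encode('utf-8'))
print([float(s) for s in fields])   # [61.42836, 61.42836, 8.47704]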
Next, you need to change where and how the box_size dataset is created. It needs to be created like the others: as an extendable dataset (with the maxshape=() parameter), above the while True: loop. This is what I did:
edge_ds_step = edge_grp.create_dataset('step', dtype=np.uint64, shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
# Create empty 'box_size' dataset here
edge_data = bound_grp.create_dataset('box_size', dtype=np.float32, shape=(0,3), maxshape=(None,3), compression='gzip', shuffle=True)
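With shape=(0,3) and maxshape=(None,3), the dataset starts empty but can grow without limit along the first axis, so one 3-vector can be appended per frame, mirroring the time and step datasets.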
Finally, here is the modified code. It:
1. parses the footer string into a tuple,
2. maps the tuple to a np.array of floats with shape=(1,3),
3. resizes the dataset, and finally
4. loads the array into the dataset.
footer = parse1( f.readline().encode('utf-8') )
dat = np.array(footer).astype(float).reshape(1,3)
new_size = edge_data.shape[0]+1
edge_data.resize(new_size, axis=0)
edge_data[new_size-1:new_size,:] = dat
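To confirm the result, you can read the file back and check that box_size has one row per frame (a minimal sketch against the file written above):

import h5py

with h5py.File('xaa_trial.h5', 'r') as hdf:
    box = hdf['particles/lipids/positions/box/boundary/box_size']
    print(box.shape)    # (n_frames, 3) -- one box vector per frame
    print(box[0])       # first frame, e.g. [61.42836 61.42836  8.47704]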