How to read lines with a specified interval from a data file?

I need to read a test file and store the data in a new HDF5 file. So far I have done this successfully, but now I need to split the data into different groups. Let me explain. The data file looks like this:

Generated by trjconv : P/L=1/400 t=   0.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
    3P2     aP2    3  18.53  -9.69   4.68
    4P2     aP2    4  55.39  74.34   4.60
    5P3     aP3    5  22.11  68.71   3.85
    6P3     aP3    6  -4.13  24.04   3.73
    7P4     aP4    7  40.16   6.39   4.73
    8P4     aP4    8  -5.40  35.73   4.85
    9P5     aP5    9  36.67  22.45   4.08
   10P5     aP5   10  -3.68 -10.66   4.18
   11P6     aP6   11  35.95  36.43   5.15
   12P6     aP6   12  57.17   3.88   5.08
   13P7     aP7   13 -23.64  50.44   4.32
   14P7     aP7   14   6.78   8.24   4.36
   15LI     aLI   15  21.34  50.63   5.21
   16LI     aLI   16  16.29  -1.34   5.28
   17LI     aLI   17  22.26  71.25   5.40
   18LI     aLI   18  19.76  10.38   5.34
   19LI     aLI   19  78.62  11.13   5.69
   20LI     aLI   20  22.14  59.70   4.92
   21LI     aLI   21  15.65  47.28   5.22
   22LI     aLI   22  82.41   2.09   5.24
   23LI     aLI   23  16.87 -11.68   5.35

As you can see from the first column, every row has its own unique ID. Until now I have treated the whole file as one dataset, but now I need to split the rows with *P ids and the rows with LI ids into separate groups. I have working code for the whole dataset, but I am not sure whether it can be modified to solve the current problem. The code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import struct
import numpy as np
import h5py
import re

csv_file = 'com'
fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
# Format for footer
fmtstring1 = '10s 10s 10s'
fieldstruct1 = struct.Struct(fmtstring1)
parse1 = fieldstruct1.unpack_from

with open(csv_file, 'r') as f, \
    h5py.File('test.h5', 'w') as hdf:
    ## Particles group with the attributes
    particles_grp = hdf.require_group('particles/lipids/box')
    particles_grp.attrs['dimension'] = 3
    particles_grp.attrs['boundary'] = ['periodic', 'periodic', 'periodic']
    pos_grp = particles_grp.require_group('positions')
    edge_grp = particles_grp.require_group('edges')
    ## h5md group with the attributes
    h5md_grp = hdf.require_group('h5md')
    h5md_grp.attrs['version'] = 1.0
    author_grp = h5md_grp.require_group('author')
    author_grp.attrs['author'] = 'foo', 'email=foo@googlemail.com'
    creator_grp = h5md_grp.require_group('creator')
    creator_grp.attrs['name'] = 'foo'
    creator_grp.attrs['version'] = 1.0
    # datasets with known sizes
    ds_time = pos_grp.create_dataset('time', dtype="f", shape=(0,),
                                           maxshape=(None,), compression='gzip', 
                                           shuffle=True)
    ds_step = pos_grp.create_dataset('step', dtype=np.uint64, shape=(0,),
                                           maxshape=(None,), compression='gzip',
                                           shuffle=True)
    ds_protein = None
    ds_lipid = None
    # datasets in edge group
    edge_ds_time = edge_grp.create_dataset('time', dtype="f", shape=(0,),
                                           maxshape=(None,), compression='gzip', 
                                           shuffle=True)
    edge_ds_step = edge_grp.create_dataset('step', dtype="f", shape=(0,),
                                           maxshape=(None,), compression='gzip', 
                                           shuffle=True)
    edge_ds_value = None

    edge_data = edge_grp.require_dataset('box_size', dtype=np.float32, shape=(0,3),
                                              maxshape=(None,3), compression='gzip', 
                                              shuffle=True)
    
    step = 0
    while True:
        header = f.readline()
        m = re.search("t= *(.*)$", header)
        if m:
            time = float(m.group(1))
        else:
            print("End Of File")
            break

        # get number of data rows, i.e., number of particles
        nparticles = int(f.readline())
        # read data lines and store in array
        arr = np.empty(shape=(nparticles, 3), dtype=np.float32)
        for row in range(nparticles):
            fields = parse( f.readline().encode('utf-8') )
            arr[row] = np.array((float(fields[3]), float(fields[4]), float(fields[5])))
        if nparticles > 0:
            # create a resizable dataset upon the first iteration
            if not ds_lipid:
## It is reading the whole dataset
                ds_protein = pos_grp.create_dataset('lipid', dtype=np.float32,
                                                        shape=(0, nparticles, 3), maxshape=(None, nparticles, 3),
                                                        chunks=(1, nparticles, 3), compression='gzip', shuffle=True)
                edge_ds_value = edge_grp.create_dataset('value', dtype=np.float32,
                                                        shape=(0, 3), maxshape=(None, 3),chunks=(1, 3), compression='gzip', shuffle=True)
            # append this sample to the datasets
            ds_time.resize(step + 1, axis=0)
            ds_step.resize(step + 1, axis=0)
            ds_protein.resize(step + 1, axis=0)
            # append the datasets in edge group
            edge_ds_time.resize(step + 1, axis=0)
            edge_ds_step.resize(step + 1, axis=0)
            edge_ds_value.resize(step + 1, axis=0)
            
            ds_time[step] = time
            ds_step[step] = step
            ds_protein[step] = arr
            edge_ds_time[step] = time
            edge_ds_step[step] = step

        footer = parse1( f.readline().encode('utf-8') )
        dat = np.array(footer).astype(float).reshape(1,3)
        new_size = edge_data.shape[0] + 1
        edge_data.resize(new_size, axis=0)
        edge_data[new_size-1 : new_size, :] = dat
        step += 1
        #=============================================================================

Let me explain the code a little. It reads the whole file line by line, nparticles rows per frame, and stores everything in ds_protein. One frame has 11214 rows in total, and this repeats over 10 frames, which I have not shown here. To be precise, I want only the rows with a P id to be read into ds_protein, just 14 of them per frame in one dataset, and the remaining 11200 rows to go into ds_lipid. Is there any way to do this with indexing or some condition, since I do not want to split the text file?

First, please notice: you set ds_protein = None near the top of the script, then rebind it when you create the dataset inside the loop, while ds_lipid stays None forever (the dataset you create is named 'lipid' but assigned to ds_protein). I am not sure this will work as you intend. See the comment later about checking whether a dataset already exists.
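To make the failure concrete, here is a minimal, self-contained sketch of that control flow against an in-memory HDF5 file (the names follow your script; the in-memory driver is only there to keep the demo disposable). Because ds_lipid is never rebound, the guard is true on every frame and the second create_dataset('lipid', ...) call fails:

    import h5py
    import numpy as np

    with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as hdf:
        pos_grp = hdf.require_group('particles/lipids/box/positions')
        ds_protein, ds_lipid = None, None
        for step in range(2):
            try:
                if not ds_lipid:  # ds_lipid stays None, so this is always True
                    ds_protein = pos_grp.create_dataset(
                        'lipid', shape=(0, 3), maxshape=(None, 3), dtype=np.float32)
            except ValueError as err:
                print(f'frame {step}: {err}')  # "name already exists" on frame 1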

Currently you add all the data to the array arr, then load ds_protein from that array. Since you do not save the data from the first column, you need to use @white's suggestion: check the value of fields[0] as you read each row. The line fields = parse(f.readline().encode('utf-8')) parses each row into the fields variable, so the first column of each row becomes fields[0]. Check that value: if it contains 'P', add the row to the array for the ds_protein dataset; if it contains 'LI', add it to the array for the ds_lipid dataset.
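For example, here is a quick check of what parse() returns for one protein row and one lipid row from the sample file (testing the raw bytes with b'P' in fields[0] is equivalent to the str(fields[0]) test used in the code below):

    import struct

    fmtstring = '7s 8s 5s 7s 7s 7s'
    parse = struct.Struct(fmtstring).unpack_from

    for line in ('    1P1     aP1    1  80.48  35.36   4.25',
                 '   15LI     aLI   15  21.34  50.63   5.21'):
        fields = parse(line.encode('utf-8'))
        print(fields[0], b'P' in fields[0], b'LI' in fields[0])
    # b'    1P1' True False
    # b'   15LI' False True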

Once you have the protein and lipid data separated, you need to modify the way you create the datasets. Right now you call .create_dataset() inside the while loop, which raises an error on the second pass through the loop. To avoid that, I added a check for each dataset's name in the appropriate group.

The modified code is below.

    # get number of data rows, i.e., number of particles
    nparticles = int(f.readline())
    # read data lines and store in array
    arr_protein = np.empty(shape=(nparticles, 3), dtype=np.float32)
    arr_lipid = np.empty(shape=(nparticles, 3), dtype=np.float32)
    protein_cnt, lipid_cnt = 0, 0
    for row in range(nparticles):
        fields = parse( f.readline().encode('utf-8') )
        if 'P' in str(fields[0]):
            arr_protein[protein_cnt] = np.array((float(fields[3]), float(fields[4]), float(fields[5])))
            protein_cnt += 1
        elif 'LI' in str(fields[0]):
            arr_lipid[lipid_cnt] = np.array((float(fields[3]), float(fields[4]), float(fields[5])))
            lipid_cnt += 1

    arr_protein = arr_protein[:protein_cnt, :]  ## New
    arr_lipid = arr_lipid[:lipid_cnt, :]        ## New

    if nparticles > 0:
        # create resizable datasets upon the first iteration
        if 'protein' not in pos_grp.keys():
            ds_protein = pos_grp.create_dataset('protein', dtype=np.float32,
                                                    shape=(0, protein_cnt, 3), maxshape=(None, protein_cnt, 3),
                                                    chunks=(1, protein_cnt, 3), compression='gzip', shuffle=True)
        if 'lipid' not in pos_grp.keys():
            ds_lipid = pos_grp.create_dataset('lipid', dtype=np.float32,
                                              shape=(0, lipid_cnt, 3), maxshape=(None, lipid_cnt, 3),
                                              chunks=(1, lipid_cnt, 3), compression='gzip', shuffle=True)
        if 'value' not in edge_grp.keys():
            edge_ds_value = edge_grp.create_dataset('value', dtype=np.float32,
                                                    shape=(0, 3), maxshape=(None, 3),
                                                    chunks=(1, 3), compression='gzip', shuffle=True)
        # append this sample to the datasets
        ds_time.resize(step + 1, axis=0)
        ds_step.resize(step + 1, axis=0)
        ds_protein.resize(step + 1, axis=0) ## Modified
        ds_lipid.resize(step + 1, axis=0)   ## Modified
        # append the datasets in edge group
        edge_ds_time.resize(step + 1, axis=0)
        edge_ds_step.resize(step + 1, axis=0)
        edge_ds_value.resize(step + 1, axis=0)
        
        ds_time[step] = time
        ds_step[step] = step
        ds_protein[step] = arr_protein ## Modified
        ds_lipid[step] = arr_lipid     ## Modified
        edge_ds_time[step] = time
        edge_ds_step[step] = step
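After a run you can sanity-check the result. With the file described in the question (10 frames, 14 P rows and 11200 LI rows per frame) the shapes should come out as shown in the comments:

    import h5py

    with h5py.File('test.h5', 'r') as hdf:
        pos_grp = hdf['particles/lipids/box/positions']
        print(pos_grp['protein'].shape)  # expected: (10, 14, 3)
        print(pos_grp['lipid'].shape)    # expected: (10, 11200, 3)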